[00:00:20] now, for an IP whitelist config change, the chances are quite low, but, if tomorrow at 13:00 UTC is fine, that should be used :) [00:01:14] CST as in? [00:01:18] So once I get the IPv4 and IPv6 addresses and create the task in phabricator, will the task automatically be dealt with tomorrow? [00:01:20] I'd prefer UTC offsets [00:01:45] Central Standard Time (GMT -5) [00:02:23] arseny92 - UTC 05 [00:02:25] Vai2fc_: yes, I'll take care of it [00:02:31] okay, great. [00:02:56] Not everyone has all the timezone abbreviations in his head lol. And also IIRC there are more than one tz that is abbreviated CST [00:03:27] still waiting for the IP address. thanks, Dereckson. [00:03:27] Cst is actually utc -6 because of DST [00:03:34] On my IRC client, sessions, camera, etc. I use UTC time. That's indeed more useful. [00:03:44] arseny92: I typed UTC :) [00:04:30] (oh, I see you were replying to Dereckson about CST, /me moves on) [00:06:01] so... 7am -6? [00:06:14] I checked in the configuration log if we've something about Nashville or Vanderbilt, but that seems the first throttle rule request we have for that library. [00:09:53] just submitted it, Dereckson. I actually called the library and the supervisor at the desk said that they could only find the IPv4 address...? [00:11:36] Vai2fc_ , whats the task id? [00:12:16] arseny92: T149063 [00:12:17] T149063: Vanderbilt 2016-10-25 edit-a-thon - https://phabricator.wikimedia.org/T149063 [00:14:37] Dereckson , are you preparing the change or ? [00:14:40] yes I'm [00:15:20] what is CTCP TIME? Sorry for the ignorance. [00:15:37] A way to get your local time, I wanted to be sure it was 19:xx [00:15:41] Tells local time of x user [00:15:51] Ah, yes, Dereckson. 
19:15 [00:16:08] (PS1) Dereckson: Nashville Architecture edit-a-thon (Vanderbilt library) throttle rule [mediawiki-config] - https://gerrit.wikimedia.org/r/317727 (https://phabricator.wikimedia.org/T149063) [00:17:21] (CR) Dereckson: [C: -1] Nashville Architecture edit-a-thon (Vanderbilt library) throttle rule [mediawiki-config] - https://gerrit.wikimedia.org/r/317727 (https://phabricator.wikimedia.org/T149063) (owner: Dereckson) [00:18:58] (PS2) Dereckson: Nashville Architecture edit-a-thon (Vanderbilt library) throttle rule [mediawiki-config] - https://gerrit.wikimedia.org/r/317727 (https://phabricator.wikimedia.org/T149063) [00:19:10] (CR) Filippo Giunchedi: [C: -1] Add mtail program to track thumbor OOM kills (2 comments) [puppet] - https://gerrit.wikimedia.org/r/315272 (https://phabricator.wikimedia.org/T148962) (owner: Gilles) [00:19:25] Thanks, Dereckson! What next? [00:19:45] Have this deployed next morning [00:19:50] Nothing, that will be deployed tomorrow. [00:20:02] Thanks for providing the information. [00:20:05] Thank you, thank you, thank you, Dereckson and arseny92!!! [00:20:20] I'll make sure not to make a habit of this last-minute nonsense. [00:29:32] I see you signed up last Thursday for that event as per your sig on the wiki description. May I ask why the throttle problem wasn't brought up earlier then lol [00:29:32] Vai2fc_: yes, we prefer one week before the event generally [00:29:32] Sorry, Dereckson and arseny92 - I'm new to the job, this is my first Wikipedia event, and a colleague was supposed to do this request last week! [00:29:33] Welcome aboard and have a nice event. [00:29:33] btw about that throttle stuff, isn't the account creator permission supposed to override that throttle if a user has that? [00:29:34] Thank you, Dereckson! Looking forward to getting to know my way around. [00:29:35] arseny92 - is there a better way to create new accounts for edit-a-thons in the future?
[00:29:35] better is subjective [00:29:36] advantages/disadvantages, Reedy? [00:29:36] I realize this is for those who may want to sign up themselves without having an entry "account created by xxx" in the registration log.... [00:29:37] Oh, got it. [00:30:19] Vai2fc_ , these are the only two options: https://meta.wikimedia.org/wiki/Mass_account_creation and currently we go with the second [00:32:18] arseny92: account creator permission for events has some drawbacks: the event organizer is forced to create the accounts serially, participants must give an email address [00:32:31] As accounts created by another user are treated as such "created by" and you usually can't decide on which name a user wants unless you have a list for the account creator to process etc [00:32:56] arseny92 - that's the page I landed on when looking into this. It seems like requesting a temporary suspension of the IP address throttle is preferable for most of our events, since they are relatively infrequent. I'll make sure to get the requests in earlier, though. [00:33:06] (CR) Filippo Giunchedi: Log when HTTP status codes from Mediawiki and Thumbor are different (1 comment) [puppet] - https://gerrit.wikimedia.org/r/315648 (https://phabricator.wikimedia.org/T147918) (owner: Gilles) [00:33:20] Dereckson yes that's the point anyways it's preferred if a user can register by himself [00:34:28] The accountcreator permission is great to create an account for someone legitimately using a blocked IP range, then grant ipexempt to the newly created account. [00:37:37] Thanks again, y'all!
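Editor's aside on the CST/UTC exchange above: the one-hour disagreement between the "UTC-5" and "UTC-6" readings of "CST" is easy to check mechanically. The following is a minimal sketch, not something anyone ran in the channel. The helper names `to_utc` and `offset_from_clocks` are hypothetical, and the 07:00 local start time is an assumption taken from the "7am -6?" remark.

```python
from datetime import datetime, timedelta, timezone


def to_utc(local_naive: datetime, utc_offset_hours: float) -> datetime:
    """Attach a fixed UTC offset to a naive local time and convert it to UTC."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    return local_naive.replace(tzinfo=tz).astimezone(timezone.utc)


def offset_from_clocks(local_naive: datetime, utc_naive: datetime) -> float:
    """Infer the UTC offset in hours from a local clock reading and the UTC time."""
    return (local_naive - utc_naive).total_seconds() / 3600


# Assumed event start: 07:00 local time on 2016-10-25.
start_local = datetime(2016, 10, 25, 7, 0)
print(to_utc(start_local, -6).isoformat())  # reading "CST" as UTC-6
print(to_utc(start_local, -5).isoformat())  # reading "CST" as UTC-5

# The CTCP TIME check: 19:15 local while the log clock showed 00:15 UTC.
print(offset_from_clocks(datetime(2016, 10, 24, 19, 15),
                         datetime(2016, 10, 25, 0, 15)))
```

With explicit offsets, the two readings land on 13:00 UTC and 12:00 UTC respectively, which is why stating the offset rather than the abbreviation avoids scheduling the deployment an hour off.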
[00:42:00] PROBLEM - Disk space on ms-be1005 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdd1 is not accessible: Input/output error [00:45:40] PROBLEM - MegaRAID on ms-be1005 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) [00:47:42] odd, I would have expected a ticket to be opened for ms-be1005 LD failed, I'll leave it as it is in case volans wants to take a look at it tomorrow [01:03:35] RECOVERY - Disk space on ms-be1005 is OK: DISK OK [01:05:49] PROBLEM - puppet last run on ms-be1005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[mkfs-/dev/sdd1] [01:08:27] PROBLEM - Disk space on elastic1018 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 78315 MB (15% inode=99%) [01:18:26] RECOVERY - Disk space on elastic1018 is OK: DISK OK [01:28:28] 06Operations, 10DNS, 10Domains, 10Traffic, and 2 others: Point wikipedia.in to 180.179.52.130 instead of URL forward - https://phabricator.wikimedia.org/T144508#2740896 (10Naveenpf) Hi @CRoslof, Please find my answer inline. Thank you naveenpf >>! In T144508#2644566, @CRoslof wrote: > I'm not sure I... [01:33:22] urandom: i was out and missed that earlier (re: tcpircbot) yea, i had some thoughts on that, it would be nice if we could reuse an existing list of restbase servers in hiera [01:33:36] urandom: let's talk tomorrow morning or so if you like [01:33:47] mutante: sounds good. [01:34:46] urandom: i think hieradata/role/eqiad/restbase/server.yaml looks promising with the "restbase::seeds" [01:34:50] alright [01:35:14] 06Operations, 10DNS, 10Domains, 10Traffic, and 2 others: Point wikipedia.in to 180.179.52.130 instead of URL forward - https://phabricator.wikimedia.org/T144508#2740901 (10Naveenpf) @Aklapper Can you please change title to.... add new IP address ? We have changed to new server for better performance. Our... 
[01:35:18] well, no, restbase::hosts without the -a or -b suffix, i guess, we'll figure it out [01:37:21] 06Operations, 10DNS, 10Domains, 10Traffic, and 2 others: Point wikipedia.in to 205.147.101.160 instead of URL forward - https://phabricator.wikimedia.org/T144508#2740905 (10Dzahn) [01:37:40] 06Operations, 10DNS, 10Domains, 10Traffic, and 2 others: Point wikipedia.in to 205.147.101.160 instead of URL forward - https://phabricator.wikimedia.org/T144508#2602033 (10Dzahn) >>! In T144508#2740901, @Naveenpf wrote: > @Aklapper Can you please change title to.... add new IP address ? Done [01:41:16] (03PS2) 10Dzahn: icinga: default duration for icinga-downtime [puppet] - 10https://gerrit.wikimedia.org/r/317720 (owner: 10Filippo Giunchedi) [01:41:26] (03CR) 10Dzahn: [C: 032] icinga: default duration for icinga-downtime [puppet] - 10https://gerrit.wikimedia.org/r/317720 (owner: 10Filippo Giunchedi) [01:43:34] (03PS2) 10Dzahn: icinga: also schedule host services downtime [puppet] - 10https://gerrit.wikimedia.org/r/317721 (owner: 10Filippo Giunchedi) [01:43:38] ACKNOWLEDGEMENT - Host elastic2020 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T149006 [01:45:41] 06Operations, 10ops-eqiad: ms-be1005 - MegaRAID - CRITICAL: 1 failed LD(s) (Offline) - https://phabricator.wikimedia.org/T149069#2740922 (10Dzahn) [01:46:37] 06Operations, 10ops-eqiad: ms-be1005 - MegaRAID - CRITICAL: 1 failed LD(s) (Offline) - https://phabricator.wikimedia.org/T149069#2740935 (10Dzahn) [01:46:47] ACKNOWLEDGEMENT - MegaRAID on ms-be1005 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) daniel_zahn https://phabricator.wikimedia.org/T149069 [01:47:14] 06Operations, 10ops-eqiad: ms-be1005 - MegaRAID - CRITICAL: 1 failed LD(s) (Offline) - https://phabricator.wikimedia.org/T149069#2740922 (10Dzahn) CRITICAL: Puppet has 1 failures. Last run 15 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[**mkfs-/dev/sdd1**] [01:48:56] (03CR) 10Dzahn: [C: 032] icinga: also schedule host services downtime [puppet] - 10https://gerrit.wikimedia.org/r/317721 (owner: 10Filippo Giunchedi) [01:53:42] (03CR) 10Dzahn: "apparently: "text/javascript is obsolete, and application/x-javascript was experimental (hence the x- prefix) for a transitional period u" [puppet] - 10https://gerrit.wikimedia.org/r/317500 (owner: 10Paladox) [01:54:45] (03CR) 10Dzahn: Phabricator: Add javascript to files.viewable-mime-types (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/317500 (owner: 10Paladox) [01:56:07] (03PS8) 10Dzahn: Strip out branch HEAD in git.wikimedia.org tree link [puppet] - 10https://gerrit.wikimedia.org/r/302747 (https://phabricator.wikimedia.org/T141965) (owner: 10Paladox) [01:56:50] (03PS9) 10Paladox: phab/ex gitblit: Strip out branch HEAD in git.wikimedia.org tree link [puppet] - 10https://gerrit.wikimedia.org/r/302747 (https://phabricator.wikimedia.org/T141965) [01:58:38] (03CR) 10Dzahn: [C: 032] phab/ex gitblit: Strip out branch HEAD in git.wikimedia.org tree link [puppet] - 10https://gerrit.wikimedia.org/r/302747 (https://phabricator.wikimedia.org/T141965) (owner: 10Paladox) [02:20:50] I'm going to deploy a bugfix to some JavaScript in Core (https://gerrit.wikimedia.org/r/#/c/317743/) [02:33:33] !log ori@mira Synchronized php-1.28.0-wmf.22/resources/src/mediawiki/mediawiki.js: I1d61f4dcf: mw.loader: Fix off-by-one error in splitModuleKey() (duration: 02m 15s) [02:33:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:34:41] !log l10nupdate@tin scap sync-l10n completed (1.28.0-wmf.22) (duration: 10m 46s) [02:34:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:39:53] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Oct 25 02:39:53 UTC 2016 (duration 5m 12s) [02:39:59] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:36:58] 06Operations, 10Deployment-Systems, 06Release-Engineering-Team, 13Patch-For-Review: Incomplete /srv/mediawiki-staging state on deployment servers - https://phabricator.wikimedia.org/T148571#2741003 (10demon) 05Open>03Resolved [03:39:19] (03CR) 10Chad: "Can this be abstracted into a library that handles patches? I'd like to just incorparate this into a singular `setup-new-branch` or w/e co" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312013 (https://phabricator.wikimedia.org/T118478) (owner: 1020after4) [03:54:50] ACKNOWLEDGEMENT - puppet last run on ms-be1005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 25 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[mkfs-/dev/sdd1] daniel_zahn https://phabricator.wikimedia.org/T149069 [03:55:47] (03CR) 10Chad: "I'm curious if we should introduce a dependency on pygerrit (https://pypi.python.org/pypi/pygerrit/) into scap core so we can have a nice " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306259 (https://phabricator.wikimedia.org/T142880) (owner: 1020after4) [03:59:32] ACKNOWLEDGEMENT - All Flannel etcd nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/etcd/flannel - 386 bytes in 3.927 second response time daniel_zahn https://phabricator.wikimedia.org/T149072 [04:05:42] (03CR) 10Chad: "Bleh, pygerrit is the better library but doesn't seem to be a debian package. 
gerritlib is packaged but I don't like it :(" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306259 (https://phabricator.wikimedia.org/T142880) (owner: 1020after4) [04:10:23] (03PS5) 10Chad: Added a new commonly typed typo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315743 (owner: 10Zppix) [04:10:28] (03CR) 10Chad: [C: 032] Added a new commonly typed typo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315743 (owner: 10Zppix) [04:11:06] (03Merged) 10jenkins-bot: Added a new commonly typed typo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315743 (owner: 10Zppix) [04:17:05] (03PS1) 10Chad: Add annoying thing we don't check in [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317749 [04:18:57] (03CR) 10Dzahn: [C: 031] Phabricator: Add javascript to files.viewable-mime-types [puppet] - 10https://gerrit.wikimedia.org/r/317500 (owner: 10Paladox) [04:23:34] RECOVERY - All Flannel etcd nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 2.496 second response time [04:25:23] thanks for poking, mutante [04:25:25] PROBLEM - Disk space on ms-be1001 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdf1 is not accessible: Input/output error [04:25:31] (03CR) 10Dzahn: "Join to ##javascript < mutante> MIME types: which one is it .. < olalonde> Hmm, I think that covers it.. < olalonde> https://cs.chromium.o" [puppet] - 10https://gerrit.wikimedia.org/r/317500 (owner: 10Paladox) [04:27:55] (03PS1) 10Lixxx235: enwiki: Create 'patroller' group with 'patrol' permission. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317752 (https://phabricator.wikimedia.org/T149019) [04:28:04] (03CR) 10jenkins-bot: [V: 04-1] enwiki: Create 'patroller' group with 'patrol' permission. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317752 (https://phabricator.wikimedia.org/T149019) (owner: 10Lixxx235) [04:30:13] (03Abandoned) 10Lixxx235: enwiki: Create 'patroller' group with 'patrol' permission. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/317752 (https://phabricator.wikimedia.org/T149019) (owner: 10Lixxx235) [04:36:59] PROBLEM - MegaRAID on ms-be1001 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) [04:37:41] PROBLEM - puppet last run on ms-be1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[mkfs-/dev/sdf1] [05:01:34] RECOVERY - Disk space on ms-be1001 is OK: DISK OK [05:03:19] 06Operations, 10ops-eqiad, 10media-storage: ms-be1001 - disk failure /dev/sdf1 - https://phabricator.wikimedia.org/T149073#2741066 (10Dzahn) [05:06:01] ACKNOWLEDGEMENT - MegaRAID on ms-be1001 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) daniel_zahn https://phabricator.wikimedia.org/T149073 [05:06:01] ACKNOWLEDGEMENT - puppet last run on ms-be1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[mkfs-/dev/sdf1] daniel_zahn https://phabricator.wikimedia.org/T149073 [05:08:17] 06Operations, 10ops-eqiad, 10media-storage: ms-be1001 - disk failure /dev/sdf1 - https://phabricator.wikimedia.org/T149073#2741078 (10Dzahn) though also just minutes later we got 22:02 < icinga-wm> RECOVERY - Disk space on ms-be1001 is OK: DISK OK while the RAID check stays as it is [05:13:43] 06Operations, 10ops-eqiad, 10media-storage: ms-be1001 - disk failure /dev/sdf1 - https://phabricator.wikimedia.org/T149073#2741084 (10Dzahn) also T149069 [05:14:11] 06Operations, 10ops-eqiad, 10media-storage: ms-be1005 - MegaRAID - CRITICAL: 1 failed LD(s) (Offline) - https://phabricator.wikimedia.org/T149069#2740922 (10Dzahn) [05:27:49] PROBLEM - puppet last run on maerlant is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [05:53:08] RECOVERY - puppet last run on maerlant is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [06:40:05] 06Operations, 10ops-eqiad, 10media-storage: ms-be1005 - MegaRAID - CRITICAL: 1 failed LD(s) (Offline) - https://phabricator.wikimedia.org/T149069#2741099 (10Volans) @Dzahn thanks for taking care of this. Yes, it should have been opened automatically **but** there was an error retrieving the RAID status on t... [06:51:58] !log rebooting osmium for kernel update [06:52:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:52:45] !log rebooting elastic1035 for kernel update [06:52:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:59:17] !log rebooting wezen for kernel update [06:59:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:02:49] (03PS39) 10Chad: Scap swat command [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306259 (https://phabricator.wikimedia.org/T142880) (owner: 1020after4) [07:02:51] (03PS1) 10Chad: WIP: Rewrite checkoutMediaWiki as scap3 plugin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317757 [07:04:23] (03CR) 10Chad: "WIP cuz it's completely untested, this was just rewriting the logic and tidying things up." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/317757 (owner: 10Chad) [07:05:29] !log rebooting elasticsearch relforge cluster for kernel update [07:05:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:06:58] !log rebooting tungsten for kernel update [07:07:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:15:52] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS1299/IPv4: Connect, AS1299/IPv6: Connect [07:31:16] (03PS1) 10ArielGlenn: add poincare.acc.umu.se to dumps mirrors [puppet] - 10https://gerrit.wikimedia.org/r/317761 [07:33:38] (03CR) 10ArielGlenn: [C: 032] add poincare.acc.umu.se to dumps mirrors [puppet] - 10https://gerrit.wikimedia.org/r/317761 (owner: 10ArielGlenn) [07:36:50] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 12, down: 0, shutdown: 0 [07:37:46] PROBLEM - puppet last run on ms-be1026 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:40:11] !log rebooting netmon1001 for kernel update [07:40:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:41:51] (03CR) 10Jcrespo: [C: 032] proxysql: Add firewall to labs role [puppet] - 10https://gerrit.wikimedia.org/r/317548 (https://phabricator.wikimedia.org/T148500) (owner: 10Jcrespo) [07:41:56] (03PS4) 10Jcrespo: proxysql: Add firewall to labs role [puppet] - 10https://gerrit.wikimedia.org/r/317548 (https://phabricator.wikimedia.org/T148500) [07:42:03] (03CR) 10Jcrespo: [V: 032] proxysql: Add firewall to labs role [puppet] - 10https://gerrit.wikimedia.org/r/317548 (https://phabricator.wikimedia.org/T148500) (owner: 10Jcrespo) [07:45:08] (03PS14) 10Alexandros Kosiaris: Remove absented /etc/icinga/puppet_hostextinfo.cfg entry [puppet] - 10https://gerrit.wikimedia.org/r/315244 [07:45:13] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Remove absented /etc/icinga/puppet_hostextinfo.cfg entry [puppet] - 
10https://gerrit.wikimedia.org/r/315244 (owner: 10Alexandros Kosiaris) [07:47:33] (03PS17) 10Alexandros Kosiaris: icinga: Remove event_profiling_enabled [puppet] - 10https://gerrit.wikimedia.org/r/315085 [07:47:37] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] icinga: Remove event_profiling_enabled [puppet] - 10https://gerrit.wikimedia.org/r/315085 (owner: 10Alexandros Kosiaris) [07:48:33] (03PS2) 10Alexandros Kosiaris: role::tcpircbot: Add an ensure parameter [puppet] - 10https://gerrit.wikimedia.org/r/317665 [07:50:31] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] role::tcpircbot: Add an ensure parameter [puppet] - 10https://gerrit.wikimedia.org/r/317665 (owner: 10Alexandros Kosiaris) [07:51:52] (03CR) 10Muehlenhoff: proxysql: Add firewall to labs role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/317548 (https://phabricator.wikimedia.org/T148500) (owner: 10Jcrespo) [07:54:14] (03PS2) 10Elukey: logstash: Stop dropping mod_proxy_fcgi warnings [puppet] - 10https://gerrit.wikimedia.org/r/306943 (https://phabricator.wikimedia.org/T73487) (owner: 10BryanDavis) [07:54:50] (03PS1) 10Jcrespo: proxysql: Add notrack to firewall [puppet] - 10https://gerrit.wikimedia.org/r/317762 (https://phabricator.wikimedia.org/T148500) [07:56:25] (03CR) 10Muehlenhoff: [C: 031] proxysql: Add notrack to firewall [puppet] - 10https://gerrit.wikimedia.org/r/317762 (https://phabricator.wikimedia.org/T148500) (owner: 10Jcrespo) [07:56:41] (03CR) 10Jcrespo: [C: 032] proxysql: Add notrack to firewall [puppet] - 10https://gerrit.wikimedia.org/r/317762 (https://phabricator.wikimedia.org/T148500) (owner: 10Jcrespo) [08:04:16] RECOVERY - puppet last run on ms-be1026 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:05:03] !log rebooting chromium for kernel update [08:05:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:05:23] !log elastic@codfw reindexing jawiki, thwiki and zhwiki T147498 (logs in 
terbium:~dcausse/bm25_reindex/cirrus_log) [08:05:25] T147498: reindex search cluster for BM25 test - https://phabricator.wikimedia.org/T147498 [08:05:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:08:40] (03PS14) 10Alexandros Kosiaris: Replace neon with einsteinium where applicable [puppet] - 10https://gerrit.wikimedia.org/r/315257 [08:08:42] (03PS1) 10Alexandros Kosiaris: icinga: Increase max_concurrent_checks [puppet] - 10https://gerrit.wikimedia.org/r/317763 [08:09:56] RECOVERY - salt-minion processes on dbproxy1011 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [08:12:37] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [08:13:13] !log reimaging tegmen [08:13:18] PROBLEM - puppet last run on analytics1049 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:13:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:14:15] Can anyone check the 5xx alert above? [08:14:48] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [08:15:05] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] [08:15:14] seems like a spike in graphite right now [08:15:26] akosiaris: Also one user report at -tech [08:15:46] hmm [08:16:11] I can't see anything related on fluorine at a glance [08:16:28] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] [08:17:06] I don't think you would, those are not 500s [08:17:25] seems like upload [08:17:30] hm [08:17:34] the user report is about api [08:18:23] hm, no they actually are 500s... 
[08:18:37] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] [08:18:44] the spike in https://grafana.wikimedia.org/dashboard/file/varnish-http-errors.json has subsided though [08:18:49] it's back to normal [08:19:01] I do not see api errors on oxygen [08:19:22] whatever it was it lasted 2 mins [08:19:24] tops [08:19:57] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [08:21:11] it is interesting to see a large increase in 404s at 20:45 yesterday: https://grafana.wikimedia.org/dashboard/db/varnish-http-errors [08:21:12] from https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes it does not seem to be coming from a specific D [08:21:15] DC [08:21:30] grep '2016-10-25T08:1[02]:' 5xx.json | jq . |grep uri_host | sort | uniq -c | sort -rn [08:21:30] 626 "uri_host": "upload.wikimedia.org", [08:21:30] 2 "uri_host": "stream.wikimedia.org", [08:21:30] 2 "uri_host": "phab.wmfusercontent.org", [08:21:31] 1 "uri_host": "piwik.wikimedia.org", [08:21:32] 1 "uri_host": "cxserver.wikimedia.org", [08:22:10] hmm [08:22:41] hm I forgot a dash there.. 0-2 [08:22:45] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [08:22:49] but still.. no change in the distribution [08:23:19] (CR) Gilles: Log when HTTP status codes from Mediawiki and Thumbor are different (1 comment) [puppet] - https://gerrit.wikimedia.org/r/315648 (https://phabricator.wikimedia.org/T147918) (owner: Gilles) [08:23:30] well, the top player (upload) has more (1086), the rest are the same more or less, so yes a change in the distribution [08:23:34] but still... [08:23:47] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [08:26:21] akosiaris: from https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes it seems text though..
am I missing something? [08:26:46] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [08:27:18] (03PS1) 10Jcrespo: proxysql: fix line separator [puppet] - 10https://gerrit.wikimedia.org/r/317764 (https://phabricator.wikimedia.org/T148500) [08:27:36] elukey: not sure.. it's puzzling indeed [08:27:39] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (open graph via native scraper) is CRITICAL: Could not fetch url http://citoid.svc.codfw.wmnet:1970/api: Timeout on connection while downloading http://citoid.svc.codfw.wmnet:1970/api [08:28:38] looks like text indeed... but then the 5xx.json logs is fully unhelpful [08:28:38] (03CR) 10Jcrespo: [C: 032] proxysql: fix line separator [puppet] - 10https://gerrit.wikimedia.org/r/317764 (https://phabricator.wikimedia.org/T148500) (owner: 10Jcrespo) [08:29:39] https://grafana.wikimedia.org/dashboard/db/restbase?panelId=12&fullscreen&from=now-3h&to=now shows an increase in restbase calls to api afaics [08:31:11] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy [08:32:01] elukey: maybe some template change ? [08:32:12] 06Operations, 10Cassandra, 06Discovery, 06Maps, 03Interactive-Sprint: increase replication factor for system_auth keyspace on maps / cassandra - https://phabricator.wikimedia.org/T149074#2741142 (10Gehel) [08:32:26] but that should not have caused all these 5xx [08:33:17] 06Operations, 06Multimedia, 10Traffic, 15User-Josve05a, 15User-Urbanecm: Thumbnails failing to render sporadically (ERR_CONNECTION_CLOSED or ERR_SSL_BAD_RECORD_MAC_ALERT) - https://phabricator.wikimedia.org/T148917#2736599 (10Nenntmichruhigip) >>! In T148917#2736954, @Aklapper wrote: > @Paladox: Please s... [08:33:48] jynus: can you check https://logstash.wikimedia.org/app/kibana#/dashboard/mediawiki-errors ? 
I can see a lot of Error connecting to 10.64.48.26: Can't connect to MySQL server on '10.64.48.26' [08:33:52] for the time of the spike [08:35:28] those are info errors about lag from rpcs [08:36:32] ah ok, so nothing really relevant [08:36:35] thanks [08:36:42] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: /api (open graph via native scraper) is CRITICAL: Test open graph via native scraper returned the unexpected status 520 (expecting: 200) [08:36:42] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: /api (open graph via native scraper) is CRITICAL: Test open graph via native scraper returned the unexpected status 520 (expecting: 200) [08:37:01] which is a bit crazy [08:37:02] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] [08:37:27] because they are complaining about 1 second lag, when the measuring error is of +-1 second [08:39:11] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (open graph via native scraper) is CRITICAL: Could not fetch url http://citoid.svc.eqiad.wmnet:1970/api: Timeout on connection while downloading http://citoid.svc.eqiad.wmnet:1970/api [08:39:24] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [08:39:24] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy [08:39:36] actually, I can see an s5 api server going down [08:41:13] RECOVERY - puppet last run on analytics1049 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [08:44:13] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy [08:48:58] !log Stopping replication db2058 s4 - using it to clone another host - T146261 [08:48:59] T146261: Reimage dbstore2001 as jessie - https://phabricator.wikimedia.org/T146261 [08:49:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:51:51] (03CR) 1020after4: "@chad: I don't 
think it's worth using any of the gerrit libs I looked at. It's straightforward to do any gerrit api call with http://docs." [mediawiki-config] - https://gerrit.wikimedia.org/r/306259 (https://phabricator.wikimedia.org/T142880) (owner: 20after4) [08:57:47] (PS1) Gehel: maps/cassandra - ensure that system_auth keyspace is replicated [puppet] - https://gerrit.wikimedia.org/r/317776 (https://phabricator.wikimedia.org/T149074) [09:02:23] gehel: about --^ [09:02:57] I am reading https://docs.datastax.com/en/cassandra/2.1/cassandra/architecture/architectureDataDistributeReplication_c.html and NetworkTopologyStrategy seems rack-aware, whereas SimpleStrategy isn't [09:03:10] now you are replicating the keyspace to the whole cluster so this is not really relevant :D [09:03:23] but maybe for future expansion it might be good to have it by default [09:03:47] also I am trying to think if replication = 4 could be too much [09:04:00] but that keyspace is mostly read with local one iirc [09:04:06] so it shouldn't be an issue [09:04:22] ( I am thinking out loud to gather other feedback) [09:04:41] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (open graph via native scraper) is CRITICAL: Could not fetch url http://citoid.svc.codfw.wmnet:1970/api: Timeout on connection while downloading http://citoid.svc.codfw.wmnet:1970/api [09:04:46] (PS15) Alexandros Kosiaris: Replace neon with einsteinium where applicable [puppet] - https://gerrit.wikimedia.org/r/315257 [09:04:48] (PS2) Alexandros Kosiaris: icinga: Increase max_concurrent_checks [puppet] - https://gerrit.wikimedia.org/r/317763 [09:04:50] (PS1) Alexandros Kosiaris: einsteinium: Assign the icinga box roles [puppet] - https://gerrit.wikimedia.org/r/317780 [09:05:15] elukey: Thanks for those thoughts! My thinking is that this auth keyspace is really small, almost entirely read-only [09:05:54] elukey: I'm actually not up to date at all on replication strategies.
I based this on the Cassandra docs that we have: https://wikitech.wikimedia.org/wiki/Cassandra#Authentication [09:07:11] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy [09:08:24] gehel: let's follow the docs, probably urandom wrote them and they are trustworthy :) [09:08:30] LGTM! [09:09:37] !log rebooting hassaleh for kernel update [09:09:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:10:23] elukey: thanks! [09:10:41] I am also going to answer your email [09:10:56] this happened to me the first time I had to reboot aqs [09:11:07] and we only discovered the problem afterwards [09:11:34] so there might be some gap in propagating information to non-restbase clusters [09:11:55] it could have been avoided and you would have only needed to fix the wrong reboot [09:12:01] not a whole cluster meltdown [09:12:19] elukey: kool! I'm definitely a newbie with Cassandra... [09:14:00] the major difficulty that I've encountered is the philosophy behind its config: there are tons of tunables and the defaults are most of the time not suitable for your use case.. [09:14:24] but you probably realize an issue while it is already causing fire :D [09:14:32] then you discover tunableX [09:14:49] you set it and everything works well [09:15:47] elasticsearch is great for this. It always seems to "just work (tm)" for the normal cases. [09:16:20] Then you start to try to do tricky things and the documentation falls apart...
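Editor's aside on the system_auth replication discussion above: the difference between the two strategies shows up in the replication map that the keyspace is altered with. The sketch below only builds the CQL text and does not talk to a cluster. The helper `alter_replication_cql` and the datacenter name `dc1` are hypothetical, and the log does not say which strategy the merged Puppet change actually uses; only the replication factor of 4 comes from the conversation.

```python
def alter_replication_cql(keyspace: str, strategy: str, factors: dict) -> str:
    """Build an ALTER KEYSPACE statement for the given replication settings.

    For SimpleStrategy, `factors` must be {"replication_factor": N}.
    For NetworkTopologyStrategy, `factors` maps datacenter name -> N.
    """
    if strategy not in ("SimpleStrategy", "NetworkTopologyStrategy"):
        raise ValueError(f"unsupported strategy: {strategy}")
    if strategy == "SimpleStrategy" and set(factors) != {"replication_factor"}:
        raise ValueError("SimpleStrategy takes exactly one key: replication_factor")
    # The replication map always starts with the strategy class.
    parts = [f"'class': '{strategy}'"]
    parts += [f"'{name}': {int(n)}" for name, n in sorted(factors.items())]
    return f"ALTER KEYSPACE {keyspace} WITH replication = {{{', '.join(parts)}}};"


# One cluster-wide factor, not rack- or datacenter-aware.
print(alter_replication_cql("system_auth", "SimpleStrategy",
                            {"replication_factor": 4}))

# Per-datacenter factors; "dc1" is a placeholder name.
print(alter_replication_cql("system_auth", "NetworkTopologyStrategy",
                            {"dc1": 4}))
```

After raising a replication factor, a repair of the affected keyspace (for example `nodetool repair system_auth`) is normally needed so that the new replicas actually receive the data.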
[09:19:38] (03CR) 10Gehel: [C: 032] maps/cassandra - ensure that system_auth keyspace is replicated [puppet] - 10https://gerrit.wikimedia.org/r/317776 (https://phabricator.wikimedia.org/T149074) (owner: 10Gehel) [09:19:38] 06Operations: Upgrade firejail to 0.44 - https://phabricator.wikimedia.org/T149078#2741240 (10MoritzMuehlenhoff) [09:23:00] !log rebooting hassium for kernel update [09:23:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:26:39] 06Operations, 10Cassandra, 06Discovery, 06Maps, and 2 others: increase replication factor for system_auth keyspace on maps / cassandra - https://phabricator.wikimedia.org/T149074#2741255 (10Gehel) 05Open>03Resolved system_auth keyspace is updated with a replication factor of 4 for the maps1*, maps2* an... [09:30:24] (03PS6) 10Gilles: Add mtail program to track thumbor OOM kills [puppet] - 10https://gerrit.wikimedia.org/r/315272 (https://phabricator.wikimedia.org/T148962) [09:38:15] (03PS2) 10Alexandros Kosiaris: einsteinium: Assign the icinga box roles [puppet] - 10https://gerrit.wikimedia.org/r/317780 [09:38:20] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] einsteinium: Assign the icinga box roles [puppet] - 10https://gerrit.wikimedia.org/r/317780 (owner: 10Alexandros Kosiaris) [09:42:33] (03PS2) 10Aklapper: Also list name of acting user for project creations and name changes [puppet] - 10https://gerrit.wikimedia.org/r/317317 [09:42:45] (03PS2) 10Aklapper: Drop "Phabricator workboards with single column only" query [puppet] - 10https://gerrit.wikimedia.org/r/317318 [09:43:03] (03PS2) 10Aklapper: Also display column name when hiding/showing workboard columns [puppet] - 10https://gerrit.wikimedia.org/r/317323 [09:43:07] !log Deploying ALTER table s4 commonswiki.templatelinks - T149079 (db2058 only) [09:43:08] T149079: codfw: Fix S4 commonswiki.templatelinks partitions - https://phabricator.wikimedia.org/T149079 [09:43:14] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:43:27] (03PS2) 10Aklapper: Also list parent project for (sub)project creations and name changes [puppet] - 10https://gerrit.wikimedia.org/r/317321 [09:45:29] !log rebooting achernar for kernel update [09:45:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:46:06] 06Operations, 10Continuous-Integration-Config, 06Operations-Software-Development: Add shell scripts CI validations - https://phabricator.wikimedia.org/T148494#2741300 (10hashar) > My two cents: agreed the "single command" option is better vs a specialized CI job For `operations/puppet.git`, CI currently rel... [09:50:10] (03PS1) 10Alexandros Kosiaris: icinga.wikimedia.org 5M TTL [dns] - 10https://gerrit.wikimedia.org/r/317785 [09:50:12] (03PS1) 10Alexandros Kosiaris: Add IPv6 addresses to einsteinium, tegmen [dns] - 10https://gerrit.wikimedia.org/r/317786 [09:50:14] (03PS1) 10Alexandros Kosiaris: icinga/tendril: Use einsteinium [dns] - 10https://gerrit.wikimedia.org/r/317787 [09:50:53] !log rebooting tin for kernel update [09:50:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:51:22] PROBLEM - puppet last run on mw2093 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [09:52:44] (03CR) 10Alexandros Kosiaris: [C: 032] icinga.wikimedia.org 5M TTL [dns] - 10https://gerrit.wikimedia.org/r/317785 (owner: 10Alexandros Kosiaris) [09:56:12] (03PS3) 10Jcrespo: Install parallel gzip (pigz) and parallel xz (pxz) on all servers [puppet] - 10https://gerrit.wikimedia.org/r/293743 [09:56:25] (03PS1) 10Marostegui: db-eqiad.php: Depool db1059 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317788 (https://phabricator.wikimedia.org/T146261) [09:58:40] !log rearmed keyholder on tin [09:58:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:00:09] (03PS2) 10Alexandros Kosiaris: Add IPv6 addresses to einsteinium, tegmen [dns] - 10https://gerrit.wikimedia.org/r/317786 [10:00:11] (03PS2) 10Alexandros Kosiaris: icinga/tendril: Use einsteinium [dns] - 10https://gerrit.wikimedia.org/r/317787 [10:00:20] (03CR) 10Jcrespo: [C: 04-1] db-eqiad.php: Depool db1059 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317788 (https://phabricator.wikimedia.org/T146261) (owner: 10Marostegui) [10:01:12] (03CR) 10Alexandros Kosiaris: [C: 032] Add IPv6 addresses to einsteinium, tegmen [dns] - 10https://gerrit.wikimedia.org/r/317786 (owner: 10Alexandros Kosiaris) [10:03:14] (03PS2) 10Marostegui: db-eqiad.php: Depool db1059 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317788 (https://phabricator.wikimedia.org/T146261) [10:05:13] (03CR) 10Jcrespo: [C: 031] db-eqiad.php: Depool db1059 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317788 (https://phabricator.wikimedia.org/T146261) (owner: 10Marostegui) [10:06:17] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1059 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317788 (https://phabricator.wikimedia.org/T146261) (owner: 10Marostegui) [10:06:46] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1059 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317788 
(https://phabricator.wikimedia.org/T146261) (owner: 10Marostegui) [10:08:40] (03CR) 10Paladox: "Thankyou." [puppet] - 10https://gerrit.wikimedia.org/r/302747 (https://phabricator.wikimedia.org/T141965) (owner: 10Paladox) [10:09:30] !log marostegui@mira Synchronized wmf-config/db-eqiad.php: Depool db1059 to clone another host from it - T146261 (duration: 01m 36s) [10:09:30] T146261: Reimage dbstore2001 as jessie - https://phabricator.wikimedia.org/T146261 [10:09:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:11:49] !log reimaging mc103[1-6] to Jessie [10:11:51] (03PS1) 10Ema: nginx (1.11.4-1+wmf4) jessie-wikimedia; urgency=medium [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/317790 [10:11:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:13:09] (03PS1) 10Alexandros Kosiaris: icinga: Specify mode for nagios_host, nagios_service [puppet] - 10https://gerrit.wikimedia.org/r/317791 [10:14:39] gehel: just found out I gave a review of your elasticsearch-tool patch on https://gerrit.wikimedia.org/r/#/c/309573/ :D [10:16:30] hashar: yeah I need to go back to that one... [10:16:34] RECOVERY - puppet last run on mw2093 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [10:20:45] PROBLEM - puppet last run on lvs1002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [10:24:16] (03CR) 10Muehlenhoff: [C: 04-1] "We have the package "moreutils" in standard_packages and "parallel" has a Conflicts: on moreutils, so this would cause a puppet loop where" [puppet] - 10https://gerrit.wikimedia.org/r/293743 (owner: 10Jcrespo) [10:29:31] (03CR) 10Muehlenhoff: [C: 031] "Ignore my earlier comment" [puppet] - 10https://gerrit.wikimedia.org/r/293743 (owner: 10Jcrespo) [10:30:43] (03CR) 10Jcrespo: [C: 032] Install parallel gzip (pigz) and parallel xz (pxz) on all servers [puppet] - 10https://gerrit.wikimedia.org/r/293743 (owner: 10Jcrespo) [10:36:33] (03PS1) 10Jcrespo: Revert "Install parallel gzip (pigz) and parallel xz (pxz) on all servers" [puppet] - 10https://gerrit.wikimedia.org/r/317796 [10:36:45] (03CR) 10Jcrespo: [C: 032 V: 032] Revert "Install parallel gzip (pigz) and parallel xz (pxz) on all servers" [puppet] - 10https://gerrit.wikimedia.org/r/317796 (owner: 10Jcrespo) [10:37:40] !log rebooting acamar for kernel update [10:37:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:40:49] PROBLEM - puppet last run on db2052 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:40:49] PROBLEM - puppet last run on db2018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:41:04] PROBLEM - puppet last run on db1092 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:41:05] PROBLEM - puppet last run on pc1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:41:10] PROBLEM - puppet last run on db2040 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:41:24] PROBLEM - puppet last run on db1078 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [10:41:32] RECOVERY - puppet last run on lvs1002 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [10:41:39] PROBLEM - puppet last run on db1087 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:41:40] PROBLEM - puppet last run on db1037 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:41:40] PROBLEM - puppet last run on db1080 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:41:40] PROBLEM - puppet last run on db2017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:41:40] PROBLEM - puppet last run on db2042 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:41:41] PROBLEM - puppet last run on es1015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:42:03] PROBLEM - puppet last run on db1023 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:42:03] PROBLEM - puppet last run on db1035 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:42:09] PROBLEM - puppet last run on labsdb1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:42:21] PROBLEM - puppet last run on db2036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:43:02] PROBLEM - puppet last run on db1052 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:43:20] PROBLEM - puppet last run on db1075 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:43:40] PROBLEM - puppet last run on es1011 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [10:43:55] PROBLEM - puppet last run on db1085 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:43:55] PROBLEM - puppet last run on db2010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:44:11] RECOVERY - puppet last run on db1037 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [10:44:33] RECOVERY - puppet last run on db1023 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:44:33] RECOVERY - puppet last run on db1035 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [10:45:59] PROBLEM - puppet last run on ms-be2012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:48:40] RECOVERY - puppet last run on ms-be2012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:48:43] RECOVERY - puppet last run on pc1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:49:50] RECOVERY - puppet last run on labsdb1003 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [10:51:22] RECOVERY - puppet last run on es1011 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [10:51:42] RECOVERY - puppet last run on db2010 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [10:51:59] RECOVERY - puppet last run on es1015 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:53:21] RECOVERY - puppet last run on db1052 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [10:53:32] RECOVERY - puppet last run on db1075 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [10:54:12] RECOVERY - puppet last run on db1085 is OK: OK: Puppet is currently enabled, last run 2 minutes 
ago with 0 failures [10:55:11] RECOVERY - puppet last run on db2036 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:57:01] RECOVERY - puppet last run on db2042 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [10:58:52] RECOVERY - puppet last run on db2052 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:58:59] RECOVERY - puppet last run on db1092 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:59:52] RECOVERY - puppet last run on db1087 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:00:00] RECOVERY - puppet last run on db2017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:02:04] RECOVERY - puppet last run on db2040 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:02:20] RECOVERY - puppet last run on db1080 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:06:54] RECOVERY - puppet last run on db1078 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:06:55] RECOVERY - puppet last run on db2018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:09:34] PROBLEM - NTP on hassium is CRITICAL: NTP CRITICAL: Offset unknown [11:10:28] (03PS1) 10Marostegui: Maintenance is no longer needed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317802 [11:11:31] (03CR) 10Marostegui: [C: 032] Maintenance is no longer needed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317802 (owner: 10Marostegui) [11:11:59] (03Merged) 10jenkins-bot: Maintenance is no longer needed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317802 (owner: 10Marostegui) [11:14:04] !log marostegui@mira Synchronized wmf-config/db-eqiad.php: Repool db1059 - T146261 (duration: 01m 22s) [11:14:05] T146261: Reimage dbstore2001 as jessie - 
https://phabricator.wikimedia.org/T146261 [11:14:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:14:15] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [11:15:56] PROBLEM - puppet last run on restbase1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:16:38] !log bounced ntp on hassium (stuck in XFAC state) [11:16:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:19:14] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0] [11:19:25] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [1000.0] [11:20:23] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] [11:21:28] (03PS1) 10Muehlenhoff: Assign debdeploy grain for url_downloader via the role [puppet] - 10https://gerrit.wikimedia.org/r/317806 [11:21:55] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0] [11:26:54] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [11:27:08] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [11:28:13] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [11:29:55] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [11:31:56] !log rebooting copper for kernel update [11:32:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:35:44] RECOVERY - NTP on hassium is OK: NTP OK: Offset -0.002498149872 secs [11:42:03] RECOVERY - puppet last run on 
restbase1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:51:37] (03PS2) 10BBlack: Create nginx-{full,light,extras}-dbg by hand. [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/317790 (owner: 10Ema) [11:51:39] (03PS1) 10BBlack: revert potential event pipe breakage from 1.11.4 [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/317811 [11:51:41] (03PS1) 10BBlack: add 3x post-1.11.4 bugfixes [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/317812 [11:51:43] (03PS1) 10BBlack: nginx (1.11.4-1+wmf4) jessie-wikimedia; urgency=medium [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/317813 [11:53:05] !log rolling reboot of mc2* for kernel update [11:53:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:53:14] !log elastic@eqiad reindexing top10 wikis with BM25 from terbium T147508 (logs in ~dcausse/bm25_reindex/cirrus_log) [11:53:16] T147508: BM25: initial limited release into production - https://phabricator.wikimedia.org/T147508 [11:53:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:53:25] (03CR) 1020after4: [C: 031] phabricator: add vcs::listen_addresses for codfw [puppet] - 10https://gerrit.wikimedia.org/r/317295 (https://phabricator.wikimedia.org/T143363) (owner: 10Dzahn) [11:54:17] (03CR) 1020after4: [C: 031] add git-ssh.codfw.wikimedia.org service IP [dns] - 10https://gerrit.wikimedia.org/r/317296 (https://phabricator.wikimedia.org/T143363) (owner: 10Dzahn) [11:54:42] (03CR) 1020after4: [C: 031] rename iridium-vcs to phab1001-vcs [dns] - 10https://gerrit.wikimedia.org/r/317290 (https://phabricator.wikimedia.org/T143363) (owner: 10Dzahn) [11:55:33] PROBLEM - puppet last run on db1055 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [11:56:22] (03CR) 1020after4: [C: 031] "Looks good, just one question:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309363 (owner: 10Chad) [11:56:51] !log rebooting hydrogen for kernel update [11:56:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:57:07] PROBLEM - puppet last run on kafka1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:59:26] (03CR) 1020after4: [C: 031] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/310719 (https://phabricator.wikimedia.org/T135427) (owner: 10Thcipriani) [12:01:26] 06Operations, 06Multimedia, 10Traffic, 15User-Josve05a, 15User-Urbanecm: Thumbnails failing to render sporadically (ERR_CONNECTION_CLOSED or ERR_SSL_BAD_RECORD_MAC_ALERT) - https://phabricator.wikimedia.org/T148917#2741442 (10BBlack) Content-Encoding issues are a separate thing unrelated to this ticket.... [12:03:00] Operation-devs: https://ticket.wikimedia.org/otrs/index.pl?Action=AgentTicketZoom;TicketID=9520163 [12:03:04] (in info-en9 [12:03:05] )* [12:13:45] <_joe_> Josve05a: I don't have access to OTRS AFAIK [12:15:14] _joe_: SLL seems to be "broken" for some Chrome users...NET::ERR_CERT_AUTHORITY_INVALID [12:16:35] PROBLEM - DPKG on fluorine is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:17:26] <_joe_> bblack: ^^ [12:18:08] <_joe_> Josve05a: do you know which OS they were on? [12:18:34] no...just came from Google searches apparently... [12:18:44] _joe_: have you signed any of these https://phabricator.wikimedia.org/legalpad/ [12:19:14] RECOVERY - DPKG on fluorine is OK: All packages OK [12:20:33] PROBLEM - puppet last run on fluorine is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[git-fat] [12:20:36] <_joe_> Josve05a: I do have an NDA, yes, but I'll let you speak with bblack/ema, who know the current situation better than me - I also have to run AFK right now [12:21:48] RECOVERY - puppet last run on db1055 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:22:00] 06Operations: investigate shared inbox options - https://phabricator.wikimedia.org/T146746#2669983 (10mark) Mailing into Phabricator has indeed proven to be pretty hit-and-miss, so let's not do that for now. The Google Shared mailbox seems worth trying, and should be easy to experiment with. @Dzahn - could you... [12:23:14] RECOVERY - puppet last run on kafka1018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:24:03] !log rebooting druid100[123] for kernel upgrades [12:24:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:25:26] Josve05a: I've never had OTRS access either AFAIK [12:26:16] I donno, will try my LDAP, seems I've tried that before though [12:26:49] nope, no access w/ LDAP login [12:28:15] RECOVERY - puppet last run on fluorine is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:30:06] !log rebooting etcd1001 for kernel update [12:30:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:35:33] 06Operations, 06Discovery, 06Maps, 10Maps-data, 03Interactive-Sprint: Configure monitoring / alerting of Postgresql / redis / ... 
cluster for maps - https://phabricator.wikimedia.org/T135647#2741528 (10Gehel) [12:37:36] 06Operations, 06Discovery, 06Maps, 03Interactive-Sprint: Maps - error when doing initial tiles generation: "Error: could not create converter for SQL_ASCII"" - https://phabricator.wikimedia.org/T148031#2741544 (10Gehel) [12:37:55] (03PS2) 10Alexandros Kosiaris: icinga: Specify mode for nagios_host, nagios_service [puppet] - 10https://gerrit.wikimedia.org/r/317791 [12:37:57] (03PS16) 10Alexandros Kosiaris: Replace neon with einsteinium where applicable [puppet] - 10https://gerrit.wikimedia.org/r/315257 [12:37:59] (03PS3) 10Alexandros Kosiaris: icinga: Increase max_concurrent_checks [puppet] - 10https://gerrit.wikimedia.org/r/317763 [12:38:01] (03PS1) 10Alexandros Kosiaris: tegmen/einsteinium: Use the same resolv.conf as neon [puppet] - 10https://gerrit.wikimedia.org/r/317816 [12:38:03] (03PS1) 10Alexandros Kosiaris: switch neon and einsteinium as primary/backup icinga hosts [puppet] - 10https://gerrit.wikimedia.org/r/317817 [12:38:49] 06Operations, 06Discovery, 06Maps, 03Interactive-Sprint: Maps - error when doing initial tiles generation: "Error: could not create converter for SQL_ASCII"" - https://phabricator.wikimedia.org/T148031#2712688 (10Gehel) The issue should be solved now that T148114 is resolved. This will be confirmed once th... 
[12:40:43] 06Operations, 07Puppet, 10Beta-Cluster-Infrastructure: grain-ensure erroneous mismatch with (bool)True vs (str)true - https://phabricator.wikimedia.org/T146914#2674583 (10hashar) [12:42:03] !log rebooting etcd1002 for kernel update [12:42:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:44:28] 06Operations, 06Discovery, 06Maps, 06Services (watching), 15User-mobrovac: Update Node on Maps to v4.6.0 - https://phabricator.wikimedia.org/T148661#2741558 (10Gehel) 05Open>03Resolved a:03Gehel the maps clusters have all been upgraded to nodejs 4.6.0 [12:49:54] 06Operations: Firewall sets not being loaded post-reboot due to a @resolve race - https://phabricator.wikimedia.org/T148986#2741563 (10MoritzMuehlenhoff) I'll create an Icinga check. As a secondary step it's worth looking into backporting the systemd unit from stretch. [12:50:06] 06Operations: Firewall sets not being loaded post-reboot due to a @resolve race - https://phabricator.wikimedia.org/T148986#2741564 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [12:50:46] !log rebooting etcd1003 for kernel update [12:50:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:54:50] !log repooled maerlant (was depooled for some reason, possibly forgotten to repool after maintenance) [12:54:52] (03Abandoned) 10Ori.livneh: Puppet compiler for Tim's redirects.dat DSL [puppet] - 10https://gerrit.wikimedia.org/r/138292 (owner: 10Ori.livneh) [12:54:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:55:27] (03Abandoned) 10Ori.livneh: mediawiki apache config: don't load mod_deflate on HHVMs [puppet] - 10https://gerrit.wikimedia.org/r/222673 (owner: 10Ori.livneh) [12:56:21] (03Abandoned) 10Ori.livneh: Make configuration parsing maximally forgiving of minor errors [debs/pybal] - 10https://gerrit.wikimedia.org/r/233043 (owner: 10Ori.livneh) [12:57:09] !log rebooting etcd1004 for kernel update [12:57:14] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161025T1300). [13:00:04] Dereckson: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [13:00:19] o/ [13:01:02] (03PS3) 10Hashar: Nashville Architecture edit-a-thon (Vanderbilt library) throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317727 (https://phabricator.wikimedia.org/T149063) (owner: 10Dereckson) [13:02:10] (03CR) 10Hashar: [C: 032] Nashville Architecture edit-a-thon (Vanderbilt library) throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317727 (https://phabricator.wikimedia.org/T149063) (owner: 10Dereckson) [13:02:45] (03Merged) 10jenkins-bot: Nashville Architecture edit-a-thon (Vanderbilt library) throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317727 (https://phabricator.wikimedia.org/T149063) (owner: 10Dereckson) [13:03:18] !log rebooting etcd1005 for kernel update [13:03:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:04:26] tested on mw1099 [13:05:41] hashar , how you actually can something that is bound to time [13:06:12] well the throttle system is none to work [13:06:15] known [13:06:22] I have reviewed the code vs the task requests [13:06:23] !log hashar@mira Synchronized wmf-config/throttle.php: Nashville Architecture edit-a-thon (Vanderbilt library) throttle rule - T149063 (duration: 02m 07s) [13:06:24] T149063: Vanderbilt 2016-10-25 edit-a-thon - https://phabricator.wikimedia.org/T149063 [13:06:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:06:31] made sure that the throttle.php does 
not break the website [13:06:32] then pushed [13:07:22] !log European SWAT complete [13:07:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:07:39] Hello. arseny92: there are a lot of things we can't test in such a matter. For example, we can't verify the IP will be the exact one used. We can't verify they didn't forget one, like an IPv6 or another exit node. [13:07:50] hashar: are you done with mw1099? I need to abuse it a little [13:07:59] yeah [13:08:02] thanks [13:08:05] ori: yeah it is complete, be bold!!! :] [13:08:14] arseny92: but as hashar indicated, the priority is to make sure the configuration change that adds the throttle rule doesn't break anything else [13:08:40] Dereckson: thank you for yesterday's SWAT! I completely forgot about it and went to sleep instead :( [13:08:42] i.e. the throttle rule only takes effect at the specified time() so you can't really "test" it beyond just the normal syntax check (lint, xunit etc) which jenkins already does [13:10:03] though the tests only have partial coverage [13:10:18] and we never know what kind of side effects / breakage a patch could cause [13:10:39] so syncing on mw1099 first, then trying out a few pages/actions to confirm the site is not going to explode, then sync!!!!
:D [13:11:14] !log rebooting etcd1006 for kernel update [13:11:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:16:00] (03PS3) 10Gilles: Log when HTTP status codes from Mediawiki and Thumbor are different [puppet] - 10https://gerrit.wikimedia.org/r/315648 (https://phabricator.wikimedia.org/T147918) [13:16:16] (03CR) 10Gilles: Log when HTTP status codes from Mediawiki and Thumbor are different (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/315648 (https://phabricator.wikimedia.org/T147918) (owner: 10Gilles) [13:18:20] PROBLEM - HHVM rendering on mw1099 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.064 second response time [13:18:21] (03PS1) 10BBlack: control: back to openssl-1.0.2 [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/317821 [13:18:23] (03PS1) 10BBlack: nginx (1.11.4-1+wmf5) jessie-wikimedia; urgency=medium [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/317822 [13:19:20] PROBLEM - Apache HTTP on mw1099 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.008 second response time [13:20:31] PROBLEM - HHVM processes on mw1099 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm [13:23:28] 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch, 10hardware-requests: elasticsearch new servers (5x eqiad / 12x codfw) - https://phabricator.wikimedia.org/T149089#2741596 (10Gehel) [13:26:37] poor 1099 [13:26:56] [16:07] hashar: are you done with mw1099? 
I need to abuse it a little [13:26:59] ^ [13:27:03] yes yes [13:27:15] this is why I said "poor 1099" :D [13:27:36] what are you doing with it anyway lol [13:28:29] he is testing a corner case in our apache config that comes up when hhvm is busted [13:28:38] (03PS3) 10Alexandros Kosiaris: icinga: Specify mode for nagios_host, nagios_service [puppet] - 10https://gerrit.wikimedia.org/r/317791 [13:28:40] RECOVERY - HHVM processes on mw1099 is OK: PROCS OK: 6 processes with command name hhvm [13:28:41] (03PS2) 10Alexandros Kosiaris: tegmen/einsteinium: Use the same resolv.conf as neon [puppet] - 10https://gerrit.wikimedia.org/r/317816 [13:28:42] (03PS2) 10Alexandros Kosiaris: switch neon and einsteinium as primary/backup icinga hosts [puppet] - 10https://gerrit.wikimedia.org/r/317817 [13:28:47] (03PS17) 10Alexandros Kosiaris: Replace neon with einsteinium where applicable [puppet] - 10https://gerrit.wikimedia.org/r/315257 [13:28:49] (03PS4) 10Alexandros Kosiaris: icinga: Increase max_concurrent_checks [puppet] - 10https://gerrit.wikimedia.org/r/317763 [13:29:17] elukey: the only questionable thing I can see is that if I insert `trigger_error("Fatal error", E_USER_ERROR);` into /srv/mediawiki/errorpages/404.php below line 2, I get an HTTP 500 with 'Cache-Control: s-maxage=2678400, max-age=2678400' [13:29:42] 06Operations, 06Discovery, 10Elasticsearch, 10hardware-requests, 06Discovery-Search (Current work): elasticsearch new servers (5x eqiad / 12x codfw) - https://phabricator.wikimedia.org/T149089#2741596 (10Gehel) [13:30:03] that doesn't seem related [13:30:16] ori: but does it make sense what I wrote in the task? 
It seems to me that the issue is gone now [13:30:27] yes I think so, I am just replying to that effect [13:30:38] super [13:30:47] 06Operations, 10Traffic, 10Wikimedia-Apache-configuration, 13Patch-For-Review: Sometimes apache error 503s redirect to /503.html and this redirect gets cached - https://phabricator.wikimedia.org/T109226#2741637 (10ori) I suspect @elukey is right and the problem went away due to some change in the interval... [13:31:19] RECOVERY - Apache HTTP on mw1099 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.050 second response time [13:31:47] ori: thanks a lot! [13:31:48] RECOVERY - HHVM rendering on mw1099 is OK: HTTP OK: HTTP/1.1 200 OK - 72545 bytes in 0.173 second response time [13:32:01] yeah, sorry it was a waste of your time :/ [13:32:35] not a waste if we can clear out another long-backlogged ticket! :) [13:32:59] no no it was good to poke around the apache config [13:33:19] the more experience the better to avoid fireworks :D [13:33:51] bblack: is it worth creating some varnishlog instances checking for 503.html or not? [13:34:06] !log rebooting rcstream servers for kernel update [13:34:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:34:13] (03PS3) 10Alexandros Kosiaris: tegmen/einsteinium: Use the same resolv.conf as neon [puppet] - 10https://gerrit.wikimedia.org/r/317816 [13:34:37] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] tegmen/einsteinium: Use the same resolv.conf as neon [puppet] - 10https://gerrit.wikimedia.org/r/317816 (owner: 10Alexandros Kosiaris) [13:35:26] elukey: if we were caching them from isolated random failures, I think we'd know it from user reports. [13:35:42] yeah probably [13:36:16] I think before it wasn't so random/isolated. I think the original testcase there was some deploy error that caused the portal page to 503 maybe, and then after fixing the MW side the error was still cached.
[13:40:28] (03CR) 10Krinkle: "text/x-javascript doesn't exist and is most likely a typo from Paladox." [puppet] - 10https://gerrit.wikimedia.org/r/317500 (owner: 10Paladox) [13:40:32] 06Operations, 10Traffic, 10Wikimedia-Apache-configuration, 13Patch-For-Review: Sometimes apache error 503s redirect to /503.html and this redirect gets cached - https://phabricator.wikimedia.org/T109226#2741647 (10elukey) 05Open>03Resolved From an IRC conversation with BBlack we decided to close this t... [13:40:34] (03CR) 10Krinkle: [C: 04-1] Phabricator: Add javascript to files.viewable-mime-types [puppet] - 10https://gerrit.wikimedia.org/r/317500 (owner: 10Paladox) [13:41:41] (03PS6) 10Paladox: Phabricator: Add javascript to files.viewable-mime-types [puppet] - 10https://gerrit.wikimedia.org/r/317500 [13:41:44] (03PS7) 10Paladox: Phabricator: Add javascript to files.viewable-mime-types [puppet] - 10https://gerrit.wikimedia.org/r/317500 [13:42:27] (03CR) 10Paladox: "> text/x-javascript doesn't exist and is most likely a typo from" [puppet] - 10https://gerrit.wikimedia.org/r/317500 (owner: 10Paladox) [13:47:32] (03PS1) 10Ori.livneh: Drop support for the legacy configuration format [debs/pybal] - 10https://gerrit.wikimedia.org/r/317823 [13:47:37] ^ _joe_ [13:49:02] (03CR) 10Mark Bergsma: [C: 031] Drop support for the legacy configuration format [debs/pybal] - 10https://gerrit.wikimedia.org/r/317823 (owner: 10Ori.livneh) [13:50:50] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [13:51:41] (03PS1) 10Cenarium: Create patroller usergroup for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317824 (https://phabricator.wikimedia.org/T149019) [13:51:45] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [13:52:08] <_joe_> misc? 
[13:52:12] <_joe_> lemme see [13:54:43] <_joe_> there are a few timeouts on rcstream I'd say [13:55:00] (03PS1) 10R4q3NWnUx2CEhVyr: Check the Request Authorization Header for '%u' [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/317825 [13:55:33] (03PS1) 10Giuseppe Lavagetto: kubernetes: add monitoring cluster, debdeploy config [puppet] - 10https://gerrit.wikimedia.org/r/317826 [13:57:29] _joe_: the rcstream servers have been rebooted, should settle on it's own? [13:57:37] <_joe_> moritzm: I guess, yes [13:58:09] (03Abandoned) 10Elukey: Allow the wmde LDAP group to access pivot.w.o [puppet] - 10https://gerrit.wikimedia.org/r/316217 (owner: 10Elukey) [14:01:55] <_joe_> !log refreshing puppet facts [14:02:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:03:32] (03PS3) 10Elukey: logstash: Stop dropping mod_proxy_fcgi warnings [puppet] - 10https://gerrit.wikimedia.org/r/306943 (https://phabricator.wikimedia.org/T73487) (owner: 10BryanDavis) [14:03:55] gehel: would you have 10 minutes to deploy --^ with me? [14:04:01] tested in deployment-prep and it works fine [14:04:13] but better be safe than sorry :) [14:05:00] 06Operations: investigate shared inbox options - https://phabricator.wikimedia.org/T146746#2741695 (10Dzahn) @mark Yep, sounds good. I'll get into that. [14:05:54] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:06:54] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:07:01] (03CR) 10Giuseppe Lavagetto: [C: 032] kubernetes: add monitoring cluster, debdeploy config [puppet] - 10https://gerrit.wikimedia.org/r/317826 (owner: 10Giuseppe Lavagetto) [14:07:37] elukey: sure, I'm available right now... 
[14:07:42] \o/ [14:07:48] all right, merging [14:08:03] (03CR) 10Elukey: [C: 032] logstash: Stop dropping mod_proxy_fcgi warnings [puppet] - 10https://gerrit.wikimedia.org/r/306943 (https://phabricator.wikimedia.org/T73487) (owner: 10BryanDavis) [14:08:09] (03PS4) 10Elukey: logstash: Stop dropping mod_proxy_fcgi warnings [puppet] - 10https://gerrit.wikimedia.org/r/306943 (https://phabricator.wikimedia.org/T73487) (owner: 10BryanDavis) [14:10:25] https://grafana.wikimedia.org/dashboard/db/rcstream moritzm _joe_ [14:10:55] It's been more stable as of ~ 8PM yesterday [14:11:05] And dropped as of ~ 30min ago [14:11:20] gehel: puppet run on logstash1001 was fine [14:11:26] I mean 8PM on the 23rd, not yesterday [14:11:27] logstash reloaded as expected [14:11:38] elukey: the logs look clean [14:11:48] Krinkle: the rcstream servers were rebooted for a kernel security update, I guess clients will reconnect soonish [14:12:00] see https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:32] OK. [14:12:37] What happened on the 23rd though [14:13:15] not seeing anything in SAL for that [14:13:20] gehel: 1002 done, all seems good [14:14:20] Krinkle: hmm, not sure [14:14:55] !log removed logstash filter for Apache (https://logstash.wikimedia.org/app/kibana#/dashboard/apache2log) - T144005 [14:14:56] T144005: Remove apache error log blacklist in Logstash's config - https://phabricator.wikimedia.org/T144005 [14:15:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:15:49] logstash1003 done as well [14:16:59] elukey: still nothing exploding... 
[14:17:20] 06Operations, 06Operations-Software-Development, 07HHVM, 13Patch-For-Review: Upgrade all mw* servers to debian jessie - https://phabricator.wikimedia.org/T143536#2741712 (10elukey) [14:17:22] 06Operations, 10Wikimedia-Apache-configuration: Remove apache error log blacklist in Logstash's config - https://phabricator.wikimedia.org/T144005#2741709 (10elukey) 05Open>03Resolved a:03elukey Will keep an eye on the logstash dashboard but everything looks good. [14:17:30] thanks a lot gehel! [14:17:46] 06Operations, 10Diffusion: svn.wikimedia.org redirects to Diffusion main page, hence hard to find e.g. "flexbisonparse" - https://phabricator.wikimedia.org/T140594#2741713 (10Aklapper) [14:18:14] * gehel did not do anything ... [14:18:41] bd808: logstash filter for apache removed (finally) :) [14:19:25] 06Operations, 10Prod-Kubernetes, 10vm-requests, 05Kubernetes-production-experiment, 15User-Joe: Ganeti VM for docker registry - https://phabricator.wikimedia.org/T148961#2741729 (10Joe) p:05Triage>03High a:03Joe [14:20:21] !log reboot of wdqs cluster codfw for kernel upgrade [14:20:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:23:53] 06Operations, 10Prod-Kubernetes, 10vm-requests, 05Kubernetes-production-experiment, 15User-Joe: Ganeti VM for docker registry - https://phabricator.wikimedia.org/T148961#2741753 (10Joe) @akosiaris I would avoid using a k8s specific name, I'd use an element name as this is the typical support VM doing not... [14:24:08] !log rebooting maerlant for kernel update [14:24:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:24:23] 06Operations, 10Prod-Kubernetes, 10vm-requests, 05Kubernetes-production-experiment, 15User-Joe: Ganeti VM for docker registry - https://phabricator.wikimedia.org/T148961#2741755 (10Joe) a:05Joe>03None [14:25:43] * elukey learns new hostnames from moritzm daily [14:26:23] esams, DNS, right? 
[14:27:24] yes, DNS recursor [14:27:31] and ntp [14:27:38] at least from site.pp [14:29:03] (03Restored) 10Giuseppe Lavagetto: mariadb: split role classes into separate files [puppet] - 10https://gerrit.wikimedia.org/r/315343 (https://phabricator.wikimedia.org/T93645) (owner: 10Dzahn) [14:32:01] 06Operations, 10Gerrit, 06Release-Engineering-Team, 13Patch-For-Review: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 / 21/10/2016 - https://phabricator.wikimedia.org/T148478#2741782 (10ArielGlenn) The %p option exists on cobalt's openjdk install: ariel@cobalt:/usr/lib/jvm/java-1.7.0-openj... [14:32:25] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#2741783 (10Papaul) p:05Triage>03Normal a:03Papaul [14:32:26] !log reboot of wdqs cluster eqiad for kernel upgrade [14:32:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:42:20] (03CR) 10Giuseppe Lavagetto: "@Dzhan what Jaime wanted to say is that, as long as everything in" [puppet] - 10https://gerrit.wikimedia.org/r/315343 (https://phabricator.wikimedia.org/T93645) (owner: 10Dzahn) [14:45:38] (03PS4) 10Alexandros Kosiaris: icinga: Specify mode for nagios_host, nagios_service [puppet] - 10https://gerrit.wikimedia.org/r/317791 [14:45:40] (03PS3) 10Alexandros Kosiaris: switch neon and einsteinium as primary/backup icinga hosts [puppet] - 10https://gerrit.wikimedia.org/r/317817 [14:45:42] (03PS18) 10Alexandros Kosiaris: Replace neon with einsteinium where applicable [puppet] - 10https://gerrit.wikimedia.org/r/315257 [14:45:44] (03PS5) 10Alexandros Kosiaris: icinga: Increase max_concurrent_checks [puppet] - 10https://gerrit.wikimedia.org/r/317763 [14:45:46] (03PS1) 10Alexandros Kosiaris: icinga: Add einstenium/tegmen in some openstack places [puppet] - 10https://gerrit.wikimedia.org/r/317828 [14:50:00] 06Operations, 10ops-eqiad, 13Patch-For-Review: Rack/Setup 
new memcache servers mc1019-36 - https://phabricator.wikimedia.org/T137345#2741818 (10elukey) [14:53:51] 06Operations, 10ops-eqiad, 13Patch-For-Review: Rack/Setup new memcache servers mc1019-36 - https://phabricator.wikimedia.org/T137345#2741829 (10elukey) Hosts reimaged: ``` elukey@neodymium:~$ sudo -i salt -E 'mc10(19|2[0-9]|3[0-6]).eqiad.wmnet' cmd.run 'uname -a' mc1034.... [14:54:50] (03CR) 10Alexandros Kosiaris: "ugly as hell, but PCC is happy at https://puppet-compiler.wmflabs.org/4477/" [puppet] - 10https://gerrit.wikimedia.org/r/317828 (owner: 10Alexandros Kosiaris) [14:54:54] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] icinga: Add einstenium/tegmen in some openstack places [puppet] - 10https://gerrit.wikimedia.org/r/317828 (owner: 10Alexandros Kosiaris) [14:54:58] (03PS2) 10Alexandros Kosiaris: icinga: Add einstenium/tegmen in some openstack places [puppet] - 10https://gerrit.wikimedia.org/r/317828 [14:55:00] (03CR) 10Alexandros Kosiaris: [V: 032] icinga: Add einstenium/tegmen in some openstack places [puppet] - 10https://gerrit.wikimedia.org/r/317828 (owner: 10Alexandros Kosiaris) [14:56:21] 06Operations, 10ops-codfw, 06DC-Ops: wtp2019 issues an uncorrectable memory error - https://phabricator.wikimedia.org/T148710#2741831 (10Papaul) The log shows that the system has a bad memory on DIMM A1. I will call Dell to request a replacement. 
{F4659052} [14:57:17] (03CR) 10Chad: "Helper class should probably go in core then :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306259 (https://phabricator.wikimedia.org/T142880) (owner: 1020after4) [14:58:58] (03PS1) 10Ori.livneh: coal: die if no events in N seconds [puppet] - 10https://gerrit.wikimedia.org/r/317831 [14:59:00] (03PS1) 10Ori.livneh: coal: using stdlib's logging [puppet] - 10https://gerrit.wikimedia.org/r/317832 [15:01:54] (03PS4) 10Alexandros Kosiaris: switch neon and einsteinium as primary/backup icinga hosts [puppet] - 10https://gerrit.wikimedia.org/r/317817 [15:02:10] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] switch neon and einsteinium as primary/backup icinga hosts [puppet] - 10https://gerrit.wikimedia.org/r/317817 (owner: 10Alexandros Kosiaris) [15:06:30] hi all - Dereckson and arseny92 helped me yesterday to submit a request to have a cap on the number of new accounts from our IP address lifted (T149063). I just got to the event location, however, and the IP address I was given last night was incorrect. The event starts in about an hour - can anything be done at this point? [15:06:30] T149063: Vanderbilt 2016-10-25 edit-a-thon - https://phabricator.wikimedia.org/T149063 [15:07:13] vai2fc_: yes. what is the correct IP? [15:07:50] ori: IPv4 is 129.59.122.18 [15:08:17] vai2fc_: OK -- I'll submit and deploy and fix-up. Can you document this in the task, for record-keeping? [15:08:27] absolutely, ori. Thank you so much! [15:08:31] np [15:12:26] welcome logmsgbot and icinga-wm :-) [15:12:34] 06Operations, 10ops-codfw, 06DC-Ops: mw2098 unreachable by mgmt - https://phabricator.wikimedia.org/T148719#2741845 (10Papaul) 05Open>03Resolved Perform an AC power cycle. The sever is back up on-line. 
[15:13:09] <_joe_> akosiaris: eheh [15:13:26] (03PS1) 10Giuseppe Lavagetto: puppet: kill manifests/roles, move under $modulepath [puppet] - 10https://gerrit.wikimedia.org/r/317837 (https://phabricator.wikimedia.org/T119042) [15:14:00] (03CR) 10Alexandros Kosiaris: [C: 032] icinga/tendril: Use einsteinium [dns] - 10https://gerrit.wikimedia.org/r/317787 (owner: 10Alexandros Kosiaris) [15:14:28] elukey: awesome. Thanks for working through all of that apache craziness. :) [15:14:35] (03CR) 10jenkins-bot: [V: 04-1] puppet: kill manifests/roles, move under $modulepath [puppet] - 10https://gerrit.wikimedia.org/r/317837 (https://phabricator.wikimedia.org/T119042) (owner: 10Giuseppe Lavagetto) [15:14:44] (03CR) 10Ori.livneh: [C: 032] coal: die if no events in N seconds [puppet] - 10https://gerrit.wikimedia.org/r/317831 (owner: 10Ori.livneh) [15:14:59] (03PS1) 10Ori.livneh: Use correct IP for Vanderbilt 2016-10-25 edit-a-thon throttle exception [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317838 (https://phabricator.wikimedia.org/T149063) [15:15:10] (03PS2) 10Ori.livneh: coal: die if no events in N seconds [puppet] - 10https://gerrit.wikimedia.org/r/317831 [15:15:15] <_joe_> WAT? [15:15:16] (03CR) 10Ori.livneh: [V: 032] coal: die if no events in N seconds [puppet] - 10https://gerrit.wikimedia.org/r/317831 (owner: 10Ori.livneh) [15:15:37] <_joe_> we have a fucking test that there are no files in modules/manifests/roles? [15:15:41] <_joe_> wtf? 
[15:15:48] (03PS2) 10Ori.livneh: coal: using stdlib's logging [puppet] - 10https://gerrit.wikimedia.org/r/317832 [15:15:55] (03CR) 10Ori.livneh: [C: 032 V: 032] coal: using stdlib's logging [puppet] - 10https://gerrit.wikimedia.org/r/317832 (owner: 10Ori.livneh) [15:17:17] (03CR) 10Ori.livneh: [C: 032] Use correct IP for Vanderbilt 2016-10-25 edit-a-thon throttle exception [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317838 (https://phabricator.wikimedia.org/T149063) (owner: 10Ori.livneh) [15:17:45] (03Merged) 10jenkins-bot: Use correct IP for Vanderbilt 2016-10-25 edit-a-thon throttle exception [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317838 (https://phabricator.wikimedia.org/T149063) (owner: 10Ori.livneh) [15:20:33] akosiaris: I think scap's configuration needs to be updated [15:20:58] I got this when deploying just now: https://dpaste.de/SeqB/raw [15:21:10] ori: I hope that does not mean every single scap.cfg... [15:21:12] the timeout is probably from trying to !log via the old IP [15:21:22] I do have a change ready to merge though [15:21:27] dunno, I'm not sure how it is configured these days [15:21:35] !log Synchronized wmf-config/throttle.php: I049bd463: Use correct IP for Vanderbilt 2016-10-25 edit-a-thon throttle exception (T149063) (duration: 01m 20s) [15:21:35] T149063: Vanderbilt 2016-10-25 edit-a-thon - https://phabricator.wikimedia.org/T149063 [15:21:38] https://gerrit.wikimedia.org/r/#/c/315257/ [15:21:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:21:48] updates dolomsg [15:21:51] merging now [15:21:52] vai2fc_: done [15:21:55] cool, thanks [15:22:16] (03PS19) 10Alexandros Kosiaris: Replace neon with einsteinium where applicable [puppet] - 10https://gerrit.wikimedia.org/r/315257 [15:22:19] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Replace neon with einsteinium where applicable [puppet] - 10https://gerrit.wikimedia.org/r/315257 (owner: 10Alexandros Kosiaris) [15:22:40] _joe_: there 
was (is?) a bug that was triggered repeatedly by role::foo and clashes with the manifests/role/*.pp files [15:23:05] role::foo under modules/role and manifests/role [15:23:08] clash of namespaces [15:23:14] after the last round of it blowing stuff up hashar added a test to barf when a new one was added [15:23:58] blarf ? :D [15:24:24] thanks, ori! Really appreciate the last-minute assist. [15:25:13] "Barf may refer to: Vomit (medically, emesis)" -- https://en.wikipedia.org/wiki/Barf [15:25:34] legoktm has added in puppet.git modules/role/tests/role_test.py which should assert that no .pp be added directly in modules/role/manifests/ [15:26:01] ok neon is not referenced anywhere under /srv/deployment on mira/tin :-) [15:26:05] eg modules/role/manifests/foobar.py is invalid [15:26:40] but one might well have manifests/role/foobar.py and modules/roles/manifests/foobar/init.pp both defining "role::foobar" as I understand puppet [15:30:10] <_joe_> you understand incorrectly [15:30:29] <_joe_> and well, part of my patch should just remove that check :) [15:31:50] (03PS1) 10Thiemo Mättig (WMDE): Enable Wikibase #statements parser function on all test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317840 (https://phabricator.wikimedia.org/T142940) [15:32:23] (03PS2) 10Giuseppe Lavagetto: puppet: kill manifests/roles, move under $modulepath [puppet] - 10https://gerrit.wikimedia.org/r/317837 (https://phabricator.wikimedia.org/T119042) [15:32:39] <_joe_> hashar: this ^^ should fix things once and for all [15:33:36] (03CR) 10jenkins-bot: [V: 04-1] puppet: kill manifests/roles, move under $modulepath [puppet] - 10https://gerrit.wikimedia.org/r/317837 (https://phabricator.wikimedia.org/T119042) (owner: 10Giuseppe Lavagetto) [15:34:25] (03CR) 10Jcrespo: "Giuseppe, what I meant is that of course I know this will be eventually done but:" [puppet] - 10https://gerrit.wikimedia.org/r/315343 (https://phabricator.wikimedia.org/T93645) (owner: 10Dzahn) [15:34:48] 
06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#2741878 (10Papaul) 05Open>03Resolved System is back up on-line. [15:35:01] Hallo. Is there any problem with Gerrit? I am trying to `git review -d I1e16f20ca79436afe17eba711981b2ae43fed9e1`, and I get an error. I tried several patches. [15:36:04] aharoni: And the error is..? :P [15:36:19] aharoni: I didn't realise it worked with the changeid [15:36:26] I use teh numeric number from the browser [15:36:37] _joe_: neat :] [15:37:07] https://etherpad.wikimedia.org/p/0WdKUDRA58 [15:38:15] Reedy: ^ [15:38:37] aharoni , git-review -d 110342 [15:38:43] git review -d 110342 [15:38:44] yeah [15:39:22] (03CR) 10Chad: "Eh, 1 or 2 files already did absolute, so I was going for consistency and it was what I happened to copy+paste. No strong preference tho, " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309363 (owner: 10Chad) [15:39:25] (03CR) 10Hoo man: "Hm… shall we maybe start with beta only (which we could do any time)?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317840 (https://phabricator.wikimedia.org/T142940) (owner: 10Thiemo Mättig (WMDE)) [15:39:42] (03CR) 10Giuseppe Lavagetto: "@Jaime I don't think I said I expect you to do any work on this with any timeline, did I? :)" [puppet] - 10https://gerrit.wikimedia.org/r/315343 (https://phabricator.wikimedia.org/T93645) (owner: 10Dzahn) [15:40:29] aharoni: git remote rm gerrit ; git remote set-url --push origin ssh://gerrit.wikimedia.org:29418/mediawiki/core [15:40:48] aharoni: phuedx|afk had the same issue earlier today with git-review. Maybe that is due to a git-review upgrade [15:43:27] (03CR) 10BryanDavis: [C: 04-1] "Hiding this file will make vendor broken for anyone using PHP 5.6+. See T135161." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/317749 (owner: 10Chad) [15:44:23] (03PS5) 10Alexandros Kosiaris: icinga: Specify mode for nagios_host, nagios_service [puppet] - 10https://gerrit.wikimedia.org/r/317791 [15:44:27] (03PS6) 10Alexandros Kosiaris: icinga: Increase max_concurrent_checks [puppet] - 10https://gerrit.wikimedia.org/r/317763 [15:44:29] (03PS1) 10Alexandros Kosiaris: icinga: Update the icons [puppet] - 10https://gerrit.wikimedia.org/r/317843 [15:47:13] Reedy, arseny92 - it doesn't help [15:47:18] I'll try hashar's suggestion [15:47:38] hashar , Reedy , I also prefer to use the numeric changeid alias, its also easier to identify . Also all docs say to use the num id anyway [15:47:47] aharoni: feel free to paste the output of: git-review --version; git remote -v; git-review --verbose -d 110342 [15:48:20] the heuristic to detect the Gerrit URL for the REST API is wrong somehow [15:48:27] it lacks the /r/ prefix in the path [15:50:07] (03Abandoned) 10Chad: Add annoying thing we don't check in [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317749 (owner: 10Chad) [15:50:17] for the second command i'd prefer git remote show origin  though as it will also show the branches you track [15:50:17] hashar: this happens SO OFTEN and if you manage to resolve it, it'll be amazing. we have some hacks for this bug in our apache configs. [15:50:19] # git-review for some reason sometimes uses [15:50:20] # instead of , except when somebody is [15:50:20] # trying to reproduce this behavior. But people run into this all the time. 
[15:50:20] RewriteRule ^/tools/hooks/commit-msg$ https://<%= @host %>/r/tools/hooks/commit-msg [15:50:20] hashar: git-review version 1.25.0 [15:50:55] aharoni: same that phuedx is using [15:51:35] i also have that version yet downloading by num id works [15:52:40] hashar: https://etherpad.wikimedia.org/p/0WdKUDRA58 [15:52:45] (03PS1) 10Chad: Revert "Gerrit: Have log4j re-read its configuration every 60 seconds" [puppet] - 10https://gerrit.wikimedia.org/r/317845 [15:53:02] (03CR) 10jenkins-bot: [V: 04-1] Revert "Gerrit: Have log4j re-read its configuration every 60 seconds" [puppet] - 10https://gerrit.wikimedia.org/r/317845 (owner: 10Chad) [15:55:27] (03PS2) 10Chad: Revert "Gerrit: Have log4j re-read its configuration every 60 seconds" [puppet] - 10https://gerrit.wikimedia.org/r/317845 [15:56:06] (03PS2) 10Alexandros Kosiaris: icinga: Update the icons [puppet] - 10https://gerrit.wikimedia.org/r/317843 [15:56:10] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] icinga: Update the icons [puppet] - 10https://gerrit.wikimedia.org/r/317843 (owner: 10Alexandros Kosiaris) [15:56:24] MatmaRex: the bug is in get_remote_url() method at https://github.com/openstack-infra/git-review/blob/1.25.0/git_review/cmd.py#L467-L487 and the method just above alias_url() [15:56:43] 06Operations, 10ops-codfw, 06DC-Ops: db2019 has a failed disk - https://phabricator.wikimedia.org/T120073#2741951 (10jcrespo) 05Open>03Resolved [15:57:59] then git-review 1.25.0 is quite old but no new release has been tagged yet :( [15:58:46] (03PS3) 10ArielGlenn: Revert "Gerrit: Have log4j re-read its configuration every 60 seconds" [puppet] - 10https://gerrit.wikimedia.org/r/317845 (owner: 10Chad) [16:00:04] godog, moritzm, and _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161025T1600). Please do the needful. [16:00:04] Pchelolo: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. 
Please be available during the process. [16:00:19] (03CR) 10ArielGlenn: [C: 032] Revert "Gerrit: Have log4j re-read its configuration every 60 seconds" [puppet] - 10https://gerrit.wikimedia.org/r/317845 (owner: 10Chad) [16:00:19] <_joe_> I'm swatting :) [16:00:41] hashar: mmm... so, any other ideas? [16:00:48] Don't use git-review :p [16:00:51] That's my idea! [16:00:54] aharoni: feel free to paste the output of: git-review --version; git remote -v; git-review --verbose -d 110342 [16:00:58] (it's always my idea lol) [16:01:17] hashar: already did at https://etherpad.wikimedia.org/p/0WdKUDRA58 [16:01:21] <_joe_> Pchelolo: what's your patch up for puppetSWAT? [16:01:30] <_joe_> the linked change is wrong [16:02:43] aharoni: well seems you cant ssh to gerrit.wikimedia.org ? :D [16:02:53] aharoni: try: ssh -p 29418 gerrit.wikimedia.org [16:03:06] aharoni: I guess your local username is different from the Gerrit username. [16:03:30] (03PS1) 10Chad: Removing support for DOLOGMSGNOLOG [puppet] - 10https://gerrit.wikimedia.org/r/317848 [16:04:13] !log delete dangling indices on elasticsearch codfw: jawiki_general_first, jawiki_content_first, zhwiki_general_first and zhwiki_content_first [16:04:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:04:46] (03PS6) 10Giuseppe Lavagetto: service::node - support sampled logging [puppet] - 10https://gerrit.wikimedia.org/r/302309 (https://phabricator.wikimedia.org/T139674) (owner: 10Ppchelko) [16:05:12] aharoni: so either add an entry in ~/.ssh/config to have Host gerrit.wikimedia ... User [16:05:25] hashar: that's true, now how do I fix it? 
it started happening after `git remote rm gerrit ; git remote set-url --push origin ssh://gerrit.wikimedia.org:29418/mediawiki/core`
[16:05:34] aharoni: or add your username in the git remote url: git remote set-url --push origin ssh://@gerrit.wikimedia.org:29418/mediawiki/core
[16:05:44] it tries to ssh
[16:05:49] and that gives you a permission denied
[16:06:20] you would get the same error doing: ssh -p 29418 gerrit.wikimedia.org gerrit
[16:07:16] (PS1) Chad: Gerrit: Let log4j.properties be read by anyone [puppet] - https://gerrit.wikimedia.org/r/317850
[16:08:18] (CR) Thiemo Mättig (WMDE): "I'm sorry? What's the argument to not let people test it?" [mediawiki-config] - https://gerrit.wikimedia.org/r/317840 (https://phabricator.wikimedia.org/T142940) (owner: Thiemo Mättig (WMDE))
[16:08:22] !log restarting ferm on elastic2020
[16:08:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:08:57] (CR) ArielGlenn: [C: 2] Gerrit: Let log4j.properties be read by anyone [puppet] - https://gerrit.wikimedia.org/r/317850 (owner: Chad)
[16:08:58] hashar: I understand that part more or less, but I'm mostly an idiot when it comes to ssh. Is there a way to fix that?
[16:09:19] Operations: Firewall sets not being loaded post-reboot due to a @resolve race - https://phabricator.wikimedia.org/T148986#2738609 (Joe) FTR, we already have a parser function called ipresolve() that mostly does what we need here.
[16:09:50] aharoni: have you tried : ssh -p 29418 gerrit.wikimedia.org gerrit
[16:10:00] that is what git-review / git uses under the hood
[16:10:11] which attempt to connect with your local username by default
[16:10:11] hashar: You can also set the Port in your ssh config too, then it's just `ssh gerrit.wikimedia.org` :D
[16:10:20] hashar: "Permission denied (publickey)."
[16:10:23] which might not match the account you need to connect to gerrit
[16:10:35] aharoni: Sounds like a username/key mismatch then!
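The two fixes suggested above (a Host entry in ~/.ssh/config, or a username embedded in the push URL) can be sketched roughly as follows. This is only an illustration, not something taken from the log: the helper names gerrit_ssh_stanza and gerrit_push_url are hypothetical, and "your-gerrit-username" is a placeholder for whatever account name Gerrit actually knows you by.

```shell
#!/bin/sh
# Hypothetical helpers, assumed for illustration only (not part of git-review).

# Print an ssh_config stanza that makes ssh (and therefore git and git-review)
# log in to Gerrit with the given username instead of the local login name.
# The Port line means a plain `ssh gerrit.wikimedia.org` reaches port 29418.
gerrit_ssh_stanza() {
    printf 'Host gerrit.wikimedia.org\n    User %s\n    Port 29418\n' "$1"
}

# Print a push URL with the username embedded, suitable for
# `git remote set-url --push origin <url>`.
gerrit_push_url() {
    printf 'ssh://%s@gerrit.wikimedia.org:29418/%s\n' "$1" "$2"
}

# Placeholder username: replace with the real Gerrit account name.
gerrit_ssh_stanza "your-gerrit-username"
gerrit_push_url "your-gerrit-username" "mediawiki/core"
```

Appending the printed stanza to ~/.ssh/config covers every repository at once, whereas the push URL has to be set per repository; either way, `ssh -p 29418 gerrit.wikimedia.org gerrit` is the quick check that the username and key now match.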
[16:10:36] so from there repeating myself:
[16:10:41] so either add an entry in ~/.ssh/config to have Host gerrit.wikimedia ... User
[16:10:46] OR add your username in the git remote url: git remote set-url --push origin ssh://@gerrit.wikimedia.org:29418/mediawiki/core
[16:12:05] hashar: merci, the ~/.ssh/config thing worked
[16:12:39] magic!
[16:12:48] then it is fixed for all your repo automagically
[16:13:02] not sure why git-review suddenly stopped working though
[16:13:18] Operations, Prod-Kubernetes, Kubernetes-production-experiment, User-Joe: Experiment with Swift as docker registry backend - https://phabricator.wikimedia.org/T149098#2741998 (fgiunchedi)
[16:13:33] _joe_: Lemme know when you're done with puppetswat. I need a quick gerrit reboot to pick up a config change.
[16:15:00] I am off bbl
[16:16:17] akosiaris: re: poolcounter, I can simply remove potassium from mediawiki-config while reimaging it as poolcounter1002 and add it back when done?
[16:17:01] godog: yes
[16:17:31] neat, thanks!
[16:18:02] remember though Murphy's law, so be fast :-)
[16:18:07] (PS3) Giuseppe Lavagetto: puppet: kill manifests/roles, move under $modulepath [puppet] - https://gerrit.wikimedia.org/r/317837 (https://phabricator.wikimedia.org/T119042)
[16:19:32] (PS4) Giuseppe Lavagetto: puppet: kill manifests/roles, move under $modulepath [puppet] - https://gerrit.wikimedia.org/r/317837 (https://phabricator.wikimedia.org/T119042)
[16:23:54] akosiaris: hehe mhhh actually that's a good point, I can do a switcheroo and put helium back in while reimage is in progress
[16:24:04] (CR) Thcipriani: "Looks pretty neat! Lots of comments inline."
(0312 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317757 (owner: 10Chad) [16:24:29] (03PS7) 10Chad: Gerrit: Enable logging for jvm gc [puppet] - 10https://gerrit.wikimedia.org/r/317582 (https://phabricator.wikimedia.org/T148478) (owner: 10Paladox) [16:25:45] (03CR) 10ArielGlenn: [C: 032] Gerrit: Enable logging for jvm gc [puppet] - 10https://gerrit.wikimedia.org/r/317582 (https://phabricator.wikimedia.org/T148478) (owner: 10Paladox) [16:26:46] (03CR) 10Hoo man: "People can still test it in these cases (on beta)." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317840 (https://phabricator.wikimedia.org/T142940) (owner: 10Thiemo Mättig (WMDE)) [16:27:58] godog: sounds like a good idea [16:31:42] (03PS1) 10Alexandros Kosiaris: neon: Warn against neon deprecation [puppet] - 10https://gerrit.wikimedia.org/r/317852 [16:32:30] (03PS2) 10Alexandros Kosiaris: neon: Warn against neon deprecation [puppet] - 10https://gerrit.wikimedia.org/r/317852 [16:32:34] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] neon: Warn against neon deprecation [puppet] - 10https://gerrit.wikimedia.org/r/317852 (owner: 10Alexandros Kosiaris) [16:35:03] 06Operations, 10ops-codfw, 06DC-Ops: wtp2019 issues an uncorrectable memory error - https://phabricator.wikimedia.org/T148710#2742050 (10Papaul) @akosiaris I will need a maintenance window tomorrow between 10:30 am and 11am to replace the memory on the system. Dell Customer Communication Papaul,... [16:35:16] akosiaris: https://phabricator.wikimedia.org/T148710 [16:36:24] papaul: ok cool [16:36:57] akosiaris: thanks [16:39:55] Anyone know how Varnish cached responses are spread across DCs and servers? I imagine that, just like memcache (and everything else) there's redundancy and segmentation or something... Maybe it's just by Varnish server? [16:40:17] bblack: ^ ? ...thx in advance!! 
[16:41:28] AndyRussG, it would help knowing what you want to do/problem you have [16:41:35] that is a very long question [16:44:12] AndyRussG: https://www.infoq.com/fr/presentations/varnishcon-emanuele-rocca-scaling-wikipedia [16:44:39] that's ema's talk from last june [16:47:42] 06Operations, 10ops-codfw, 10DBA: db2011 disk media errors - https://phabricator.wikimedia.org/T149099#2742068 (10jcrespo) [16:48:08] (03CR) 10Chad: WIP: Rewrite checkoutMediaWiki as scap3 plugin (0311 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317757 (owner: 10Chad) [16:48:26] (03PS2) 10Chad: WIP: Rewrite checkoutMediaWiki as scap3 plugin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317757 [16:50:32] (03CR) 10Jcrespo: [C: 031] "+1 for the mariadb part" [puppet] - 10https://gerrit.wikimedia.org/r/317837 (https://phabricator.wikimedia.org/T119042) (owner: 10Giuseppe Lavagetto) [16:51:50] (03CR) 10Chad: WIP: Rewrite checkoutMediaWiki as scap3 plugin (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317757 (owner: 10Chad) [16:52:43] (03CR) 10Lydia Pintscher: "The question is: do we want people to test it already or is it still changing too much?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317840 (https://phabricator.wikimedia.org/T142940) (owner: 10Thiemo Mättig (WMDE)) [16:56:34] 06Operations, 10ops-codfw, 10DBA: db2011 disk media errors - https://phabricator.wikimedia.org/T149099#2742092 (10jcrespo) [16:58:12] (03CR) 10Hoo man: "Missing changes (only including things we probably want to do short term): Commons media output (as image), changes to date output (don't " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317840 (https://phabricator.wikimedia.org/T142940) (owner: 10Thiemo Mättig (WMDE)) [16:59:02] (03CR) 10Chad: WIP: Rewrite checkoutMediaWiki as scap3 plugin (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317757 (owner: 10Chad) [16:59:27] jynus: ori cool, thx! 
Basically, I was just asking myself why a URL that should be cached for 10 min is hitting PHP 10 times in the space of 8 minutes. So, just wanting to understand how a buggy response was working cachewise... [16:59:33] (CR) Lydia Pintscher: "That seems like quite a bit still." [mediawiki-config] - https://gerrit.wikimedia.org/r/317840 (https://phabricator.wikimedia.org/T142940) (owner: Thiemo Mättig (WMDE)) [16:59:44] Operations, ops-codfw, DBA: db2011 disk media errors - https://phabricator.wikimedia.org/T149099#2742093 (jcrespo) @Papaul From the output I would replace disks #4, #7 and #11, which should be the ones with the light on. Disk #1 has some media errors, but I suppose we can live with it for now. [17:00:04] yurik, gwicke, cscott, arlolra, subbu, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161025T1700). [17:00:08] (for details of the php logs, search logstash for "message key centralnotice" (in quotes) over the past few days) [17:02:45] that can be many things, but your suspicions are right: there are multiple varnish backends on separate datacenters, and while there is stickiness, things only get more complex from there; there are rules of what can and cannot be cached, etc. 
[17:02:50] (03PS2) 10Dzahn: Exclude entries with oldValue=null in "Project changes" results [puppet] - 10https://gerrit.wikimedia.org/r/317316 (owner: 10Aklapper) [17:03:35] (03PS3) 10Aklapper: phabricator: exclude entries with oldValue=null in "Project changes" results [puppet] - 10https://gerrit.wikimedia.org/r/317316 [17:04:16] (03CR) 10Dzahn: [C: 032] phabricator: exclude entries with oldValue=null in "Project changes" results [puppet] - 10https://gerrit.wikimedia.org/r/317316 (owner: 10Aklapper) [17:04:36] (03PS1) 10Filippo Giunchedi: Put helium back in service during potassium reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317853 (https://phabricator.wikimedia.org/T123734) [17:06:08] <_joe_> mutante: https://gerrit.wikimedia.org/r/#/c/317837/ [17:06:26] <_joe_> I think I'll merge it in a few minutes, it is a noop according to the compiler [17:07:32] 06Operations, 06Performance-Team, 10Wikimedia-Apache-configuration, 07HHVM, and 2 others: Fix Apache proxy_fcgi error "Invalid argument: AH01075: Error dispatching request to" (Causing HTTP 503) - https://phabricator.wikimedia.org/T73487#2742105 (10hashar) Based on https://logstash.wikimedia.org/app/kibana... [17:07:33] _joe_: it's very cool if that works now [17:07:54] _joe_: but does it, in manifests/ directly? [17:08:05] <_joe_> mutante: it does, according to the compiler [17:08:09] just because there wasn't a single one like that before [17:08:23] well.. cool then :) [17:08:34] <_joe_> mutante: as long as it's in semi-autoload layout, it works [17:08:35] at some point in the past i tried that [17:08:38] <_joe_> what a mess [17:08:52] <_joe_> mutante: everything should be moved at once for that to work ofc [17:08:59] <_joe_> you can't move just one file [17:09:18] jynus: hmmm K... Yes I have seen the beautiful vlc language... Yeah the URL I'm looking at does get cached... So more than one backend/cache per DC, I guess? 
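A toy model of the caching question above, not a description of the real Varnish topology: it assumes several independent cache backends that each keep their own copy of a response for its TTL, and a hypothetical round-robin router (`origin_hits` and `Cache` are illustrative names). Under those assumptions, a URL with a 10-minute TTL can still reach the origin once per backend inside a single TTL window.

```python
from dataclasses import dataclass, field


@dataclass
class Cache:
    """One independent cache backend with its own expiry table."""
    ttl: int
    expires: dict = field(default_factory=dict)  # url -> expiry time in seconds

    def fetch(self, url: str, now: int) -> bool:
        """Return True if the request had to go to the origin (a miss)."""
        if self.expires.get(url, -1) > now:
            return False  # still fresh in this backend
        self.expires[url] = now + self.ttl
        return True


def origin_hits(num_caches: int, request_times: list, ttl: int = 600) -> int:
    """Count origin fetches when requests are spread round-robin (an assumption)."""
    caches = [Cache(ttl) for _ in range(num_caches)]
    hits = 0
    for i, now in enumerate(request_times):
        if caches[i % num_caches].fetch("/some/url", now):
            hits += 1
    return hits


# Ten requests spread over roughly eight minutes (0 s to 432 s).
times = [i * 48 for i in range(10)]
single = origin_hits(1, times)   # one shared cache
spread = origin_hits(10, times)  # ten independent caches
print(single, spread)
```

With real stickiness (the same URL usually routed to the same backend) the count would sit somewhere between these two extremes.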
[17:09:24] and you are removing the "import" in site.pp at the same time, i see [17:09:41] _joe_: ack! thanks for it !:) [17:09:59] <_joe_> mutante: thank you for all the work you've done for this :) [17:10:30] :) quite welcome [17:11:19] (03CR) 10Dzahn: [C: 031] "really nice, if the compiler says this is noop, yes please" [puppet] - 10https://gerrit.wikimedia.org/r/317837 (https://phabricator.wikimedia.org/T119042) (owner: 10Giuseppe Lavagetto) [17:17:12] (03CR) 10Filippo Giunchedi: [C: 032] Put helium back in service during potassium reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317853 (https://phabricator.wikimedia.org/T123734) (owner: 10Filippo Giunchedi) [17:17:58] (03CR) 10Giuseppe Lavagetto: [C: 032] "https://puppet-compiler.wmflabs.org/4480/ says this is a noop" [puppet] - 10https://gerrit.wikimedia.org/r/317837 (https://phabricator.wikimedia.org/T119042) (owner: 10Giuseppe Lavagetto) [17:18:04] (03PS5) 10Giuseppe Lavagetto: puppet: kill manifests/roles, move under $modulepath [puppet] - 10https://gerrit.wikimedia.org/r/317837 (https://phabricator.wikimedia.org/T119042) [17:19:44] !log filippo@mira Synchronized wmf-config/ProductionServices.php: Put helium back in service during potassium reimage (duration: 01m 34s) [17:19:49] <_joe_> submitted ^^ let's hope it doesn't kill everything [17:19:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:20:08] <_joe_> s/everything/every puppet run/ [17:20:43] oh wow, the final big move. congrats! 
[17:21:11] <_joe_> congratulate once this didn't blow up [17:21:13] <_joe_> :P [17:21:28] heh [17:22:16] 06Operations, 06Discovery, 10Elasticsearch, 10hardware-requests, 06Discovery-Search (Current work): elasticsearch new servers (5x eqiad / 12x codfw) - https://phabricator.wikimedia.org/T149089#2742139 (10debt) p:05Triage>03High [17:22:30] <_joe_> for now, everything seems ok [17:22:35] \o/ [17:22:35] !log gerrit: quick reboot, picking up logging config changes for jvm [17:22:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:23:28] !log powercycling db2015, unresponsive [17:23:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:24:06] !log rebooting notebook1002 for kernel update [17:24:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:24:14] 06Operations, 07Puppet, 07Epic, 07Need-volunteer, 13Patch-For-Review: align puppet-lint config with coding style - https://phabricator.wikimedia.org/T93645#2742155 (10Joe) [17:24:16] 06Operations, 07Puppet, 13Patch-For-Review, 15User-Joe: Import vs autoload: the puppet parser is a bad joke that stopped being funny years ago. - https://phabricator.wikimedia.org/T119042#2742153 (10Joe) 05stalled>03Resolved a:03Joe [17:24:54] 06Operations, 06Discovery, 10Elasticsearch, 06Discovery-Search (Current work): elasticsearch new servers (5x eqiad / 12x codfw) - https://phabricator.wikimedia.org/T149089#2742159 (10Gehel) We're still sorting out budget on this request and the slightly related T148747. [17:25:08] apergos: We're back, and we've got jvm logs [17:25:18] jvm_gc.pid143452.log.0.current being the first [17:25:25] yep I see em! 
[17:25:41] Other logs back to that directory too [17:25:47] yep [17:26:19] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Install and configure new WDQS nodes on codfw - https://phabricator.wikimedia.org/T144380#2742170 (10Gehel) [17:26:22] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service: Configure varnish to include wdqs nodes in codfw - https://phabricator.wikimedia.org/T146158#2742168 (10Gehel) 05Open>03Resolved This has been resolved by implementing LVS as described on T132457. [17:26:34] 83gb free on /, oughta hold us for awhile [17:27:32] !log rebooting phab2001 for kernel update [17:27:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:28:08] apergos: And git data resides on a different partition so even if we eat things up, we won't lose Important Stuff :) [17:28:24] on /srv right? perfect [17:28:27] Yep [17:33:15] 06Operations, 10ops-codfw, 10hardware-requests: decomission db2015 - https://phabricator.wikimedia.org/T149102#2742190 (10jcrespo) [17:34:20] 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch: Decrease time required to fully restart the Cirrus elasticsearch clusters - https://phabricator.wikimedia.org/T145065#2742204 (10debt) [17:34:47] (03PS1) 10Filippo Giunchedi: Rename potassium as poolcounter1002 [puppet] - 10https://gerrit.wikimedia.org/r/317855 (https://phabricator.wikimedia.org/T123734) [17:37:30] (03PS2) 10Filippo Giunchedi: Rename potassium as poolcounter1002 [dns] - 10https://gerrit.wikimedia.org/r/317854 (https://phabricator.wikimedia.org/T123734) [17:38:21] (03CR) 10Filippo Giunchedi: [C: 032] Rename potassium as poolcounter1002 [dns] - 10https://gerrit.wikimedia.org/r/317854 (https://phabricator.wikimedia.org/T123734) (owner: 10Filippo Giunchedi) [17:38:58] (03PS1) 10Ori.livneh: Fix-up for Iff996299 [puppet] - 10https://gerrit.wikimedia.org/r/317856 [17:39:10] (03CR) 10Ori.livneh: [C: 032 V: 032] Fix-up for Iff996299 [puppet] - 
10https://gerrit.wikimedia.org/r/317856 (owner: 10Ori.livneh) [17:39:35] 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch: Decrease time required to fully restart the Cirrus elasticsearch clusters - https://phabricator.wikimedia.org/T145065#2742217 (10Gehel) 05Open>03declined Improvements seems unlikely: * delayed allocation does not seem to work as expected.... [17:42:22] (03CR) 10Filippo Giunchedi: [C: 032] Rename potassium as poolcounter1002 [puppet] - 10https://gerrit.wikimedia.org/r/317855 (https://phabricator.wikimedia.org/T123734) (owner: 10Filippo Giunchedi) [17:42:29] (03PS2) 10Filippo Giunchedi: Rename potassium as poolcounter1002 [puppet] - 10https://gerrit.wikimedia.org/r/317855 (https://phabricator.wikimedia.org/T123734) [17:46:06] 06Operations, 06Discovery-Search (Current work), 07Wikimedia-Incident: Enable GC (garbage collection) logs on Elasticsearch JVM - https://phabricator.wikimedia.org/T134853#2742252 (10debt) a:03Gehel [17:55:54] (03CR) 10Alex Monk: [C: 04-1] "see my comment above" [puppet] - 10https://gerrit.wikimedia.org/r/317316 (owner: 10Aklapper) [17:55:58] (03PS4) 10Dzahn: phabricator: exclude entries with oldValue=null in "Project changes" results [puppet] - 10https://gerrit.wikimedia.org/r/317316 (owner: 10Aklapper) [17:56:05] (03PS1) 10Chad: group0 to wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317857 [17:56:46] (03CR) 10Chad: [C: 04-2] "I'll merge l8r :D" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317857 (owner: 10Chad) [17:56:59] Krenair: yes, we saw the comment but would like to address that in a separate change please [17:57:17] (03CR) 10Thcipriani: "Couple minor things and should be good to go" (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317757 (owner: 10Chad) [17:59:25] (03CR) 10Dzahn: "@Alex Monk, we saw that comment and briefly talked about it, it's not being ignored but we would like to do that in a follow-up please" [puppet] - 
10https://gerrit.wikimedia.org/r/317316 (owner: 10Aklapper) [18:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161025T1800). Please do the needful. [18:01:15] (03PS3) 10Aklapper: phabricator: Also list name of acting user for project creations and name changes [puppet] - 10https://gerrit.wikimedia.org/r/317317 [18:01:18] (03PS4) 10Dzahn: phabricator: Also list name of acting user for project creations and name changes [puppet] - 10https://gerrit.wikimedia.org/r/317317 (owner: 10Aklapper) [18:03:30] !log demon@mira Started scap: Moving testwiki to wmf.23 for l10n bootstrap [18:03:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:05:36] Good thing there's no swat since I stole scap :p [18:12:24] (03PS3) 10Aklapper: phabricator: Also display column name when hiding/showing workboard columns [puppet] - 10https://gerrit.wikimedia.org/r/317323 [18:12:30] ostriches: actually i have a patch i just didn't manage to get it into the list :P i can ship it after your scap is done though [18:12:35] !log rebooting labvirt1001 for kernel update [18:14:32] (03PS4) 10Dzahn: phabricator: Also display column name when hiding/showing workboard columns [puppet] - 10https://gerrit.wikimedia.org/r/317323 (owner: 10Aklapper) [18:14:43] (03CR) 10Dzahn: [C: 032] "query tested" [puppet] - 10https://gerrit.wikimedia.org/r/317323 (owner: 10Aklapper) [18:15:05] !log restbase eqiad rolling reboot for kernel update [18:15:29] morebots :( [18:17:03] gone already? 
[18:17:17] yeah I think because of labvirt [18:19:24] meh [18:20:49] ebernhardson: On the upside it'll be faster since everything will be nice and up to date :) [18:23:56] (03PS3) 10Dzahn: Also list parent project for (sub)project creations and name changes [puppet] - 10https://gerrit.wikimedia.org/r/317321 (owner: 10Aklapper) [18:23:58] (03PS4) 10Aklapper: phabricator: Also list parent project for (sub)project creations and name changes [puppet] - 10https://gerrit.wikimedia.org/r/317321 [18:28:33] (03CR) 10Dzahn: [C: 032] "the query works, just that parentProject is NULL for the 5 rows here" [puppet] - 10https://gerrit.wikimedia.org/r/317321 (owner: 10Aklapper) [18:31:39] 06Operations, 10ops-codfw, 10hardware-requests: decomission db2015 - https://phabricator.wikimedia.org/T149102#2742406 (10RobH) a:03RobH So this host had its warranty expire on 2015-04-10. Normally I'd check with our DBAs if they need a replacement, but since @jcrespo filed the actual task, he is well awa... [18:32:37] 06Operations, 10ops-eqiad: Rename potassium / WMF3287 as poolcounter1002 - https://phabricator.wikimedia.org/T149106#2742411 (10fgiunchedi) [18:34:40] (03PS5) 10Dzahn: phabricator: Also display column name when hiding/showing workboard columns [puppet] - 10https://gerrit.wikimedia.org/r/317323 (owner: 10Aklapper) [18:36:23] (03PS3) 10Dzahn: Drop "Phabricator workboards with single column only" query [puppet] - 10https://gerrit.wikimedia.org/r/317318 (owner: 10Aklapper) [18:38:07] (03PS4) 10Aklapper: phabricator: Drop "Phabricator workboards with single column only" query [puppet] - 10https://gerrit.wikimedia.org/r/317318 [18:38:49] (03CR) 10Dzahn: [C: 032] phabricator: Drop "Phabricator workboards with single column only" query [puppet] - 10https://gerrit.wikimedia.org/r/317318 (owner: 10Aklapper) [18:43:45] !log 18<godog18> !log restbase eqiad rolling reboot for kernel update [18:43:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Alex [18:44:11] 
indeed, thanks [18:44:20] it didn't like the colour [18:44:21] 06Operations, 10ops-codfw, 10hardware-requests: decomission db2015 - https://phabricator.wikimedia.org/T149102#2742464 (10RobH) a:05RobH>03mark I've left him a PM about this. I'm going to assign it to him for his approval on decommission. Once he approves, please assign back to me, thanks! [18:44:23] oh well [18:44:55] !log restbase eqiad rolling reboot for kernel update [18:44:57] one more time [18:45:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:45:51] ah no it worked on wikitech, but not on https://tools.wmflabs.org/sal [18:46:15] !log demon@mira Finished scap: Moving testwiki to wmf.23 for l10n bootstrap (duration: 42m 45s) [18:46:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:46:51] ebernhardson: K, I'm done if you wanna sync your thing before I do group0 at noon [18:56:21] ostriches: ok doing it now [18:58:18] morebots, everything ok? [18:58:19] I am a logbot running on tools-exec-1410. [18:58:19] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [18:58:19] To log a message, type !log . [19:00:04] ostriches: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161025T1900). [19:03:52] !log ebernhardson@mira Synchronized php-1.28.0-wmf.22/extensions/WikimediaEvents/modules/ext.wikimediaEvents.searchSatisfaction.js: (no message) (duration: 00m 47s) [19:03:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:04:27] ebernhardson: That already made it into .23? [19:05:29] ostriches: right, and thats shipping it in .22 [19:05:45] Okie dokie [19:05:49] * ostriches blows whistle [19:05:51] all aboard! 
[19:10:03] (03CR) 10Chad: [C: 032] group0 to wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317857 (owner: 10Chad) [19:10:33] (03Merged) 10jenkins-bot: group0 to wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317857 (owner: 10Chad) [19:11:16] !log demon@mira rebuilt wikiversions.php and synchronized wikiversions files: group0 to wmf.23 [19:11:21] Phabricator will be down momentarily for an unscheduled maintenance (simple reboot, should be back in a flash) [19:11:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:11:40] !log rebooting iridium (phabricator) in ~ 3 minutes [19:11:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:19:19] (03CR) 10Aklapper: "Hmm; does not work as reliably in practice as it did when testing locally (first row of today's test email does not list "Wikispeech" as t" [puppet] - 10https://gerrit.wikimedia.org/r/317321 (owner: 10Aklapper) [19:19:31] !log twentyafterfour@iridium: The system is going down for reboot NOW! [19:19:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:22:15] !log phabricator is back from reboot and it appears that all is well [19:22:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:23:06] !log Python PyPi mirror has some issue. 
Impacts all CI jobs relying on tox https://status.python.org/ [19:23:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:31:45] (CR) Yuvipanda: [C: -1] tools proxy: Add health check and icinga monitoring (1 comment) [puppet] - https://gerrit.wikimedia.org/r/314707 (https://phabricator.wikimedia.org/T143638) (owner: Madhuvishy) [19:32:21] Operations, Cassandra, Services (doing): some cassandra instances restarted with java.lang.OutOfMemoryError: Java heap space - https://phabricator.wikimedia.org/T148516#2742544 (Eevans) Some more findings: Looking at the dominator tree, the 3.4G result is https://www.wikidata.org/wiki/Q27335792 [19:37:47] Is the phabricator search broken? [19:38:18] i don't think so [19:38:19] hoo: WFM [19:38:32] Weird [19:38:43] so we have no task that even mentions "id" [19:38:49] Too short. [19:39:15] Can't search for strings <= 3 chars [19:39:21] mysql search is dumb, we should use elastic. [19:39:39] moritzm: hi, got a minute? in https://phabricator.wikimedia.org/T140419#2698802 you said we're running HHVM 3.12.8 in production, but https://en.wikipedia.org/wiki/Special:Version says 3.12.7. what is the version we are actually running? [19:40:00] ostriches i tested elasticsearch with phabricator on phab-01 and it was very [19:40:01] good [19:40:10] ostriches: Ouch [19:40:22] paladox: Yeah I played with it a long time ago but never tried again [19:40:29] I wanna rig up something to let us test them side by side. [19:40:35] oh. 
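The "too short" explanation above can be sketched as follows. The minimum length of 4 is inferred only from the "<= 3 chars" remark in the chat, not from Phabricator's or MySQL's actual configuration, and `searchable_terms` is a hypothetical helper rather than anything from either code base.

```python
MIN_TOKEN_LENGTH = 4  # assumption taken from the "<= 3 chars" remark above


def searchable_terms(query: str, min_length: int = MIN_TOKEN_LENGTH) -> list:
    """Return the words of a query that are long enough to be indexed."""
    return [word for word in query.split() if len(word) >= min_length]


short = searchable_terms("id")
mixed = searchable_terms("wikidata id property")
print(short, mixed)
```

A query made up only of short words leaves nothing to match against, which would look like "no task even mentions id" from the user's side.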
[19:40:36] The results were...mixed...last time I tried [19:41:03] twentyafterfour me and ebernhardson fixed elasticsearch 2.x support [19:41:05] MatmaRex: our 3.12.7 package is identical to 3.12.8 (we got the security fixes a few days ahead, so I applied them before the 3.12.8 release was cut) [19:41:06] in phabricator [19:41:07] !log elastic@eqiad reindexing enwiki with BM25 from terbium T147508 (logs in ~dcausse/bm25_reindex/cirrus_log) [19:41:09] T147508: BM25: initial limited release into production - https://phabricator.wikimedia.org/T147508 [19:41:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:41:15] paladox: awesome [19:41:40] yep, i switched phab-01 over to mysql for that test [19:41:51] to improve searching in mysql [19:42:09] moritzm: my patch in that task is not a security patch, and it seems to be in 3.12.8 but not 3.12.8. can you confirm that it's actually deployed? [19:42:19] in 3.12.8 but not 3.12.7* [19:43:37] (03PS1) 10Jdlrobson: Cleanup unused config variables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317868 (https://phabricator.wikimedia.org/T148853) [19:44:11] moritzm: (i have another HHVM image-related patch in progress - https://phabricator.wikimedia.org/T148606 - which is why i looked at this again. i'll be bothering you about this stuff again soon, probably.) [19:45:17] twentyafterfour: I was able to jury-rig a patch that lets us test it via a query string param, I just wanted to confirm that all search backends get updated, not just the "active" one. It *appears* so (https://phabricator.wikimedia.org/applications/view/PhabricatorSearchApplication/), but I haven't confirmed. [19:45:21] cc paladox ^ [19:45:40] :) [19:46:06] Are you doing this on production or on the test instance? [19:46:49] Want on the production instance once we have something working so we can test with real data. 
[19:47:09] :) [19:47:35] That's a good idea having both installed and play with them :) [19:54:50] 06Operations, 10Electron-PDFs, 06Services (blocked), 15User-mobrovac: Productize the Electron PDF render service & create a REST API end point - https://phabricator.wikimedia.org/T142226#2742577 (10dpatrick) [19:55:21] 06Operations, 10Parsoid: wtp2019.codfw.wmnet is down - https://phabricator.wikimedia.org/T149110#2742578 (10Arlolra) [19:57:32] (03PS3) 10Chad: Rewrite checkoutMediaWiki as scap3 plugin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317757 [19:57:44] (03CR) 10Chad: Rewrite checkoutMediaWiki as scap3 plugin (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317757 (owner: 10Chad) [19:59:11] 06Operations, 10Electron-PDFs, 10Security-Reviews, 06Services (blocked), 15User-mobrovac: Productize the Electron PDF render service & create a REST API end point - https://phabricator.wikimedia.org/T142226#2742624 (10dpatrick) [20:06:11] greg-g: Is wmf.23 being rolled back from group 0 or just blocked from group 1? [20:10:45] Just confused about the blocking notice on the deployment calendar and https://phabricator.wikimedia.org/T147517, although it looks like wmf.23 is already on group 0. [20:10:51] (03PS1) 10Filippo Giunchedi: Put back potassium as poolcounter1002 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317873 (https://phabricator.wikimedia.org/T123734) [20:11:28] ostriches: maybe you can elucidate? 
[20:12:03] (03CR) 10Filippo Giunchedi: [C: 032] Put back potassium as poolcounter1002 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317873 (https://phabricator.wikimedia.org/T123734) (owner: 10Filippo Giunchedi) [20:12:11] kaldari: Fix for that going out in a min [20:12:18] oh good :) [20:14:13] ostriches: doh, I've already pulled my change above, good to be synced though, I see you are still holding the scap lock [20:14:24] I'm doing a single sync-file [20:14:28] I'll be done in a sec [20:14:56] !log demon@mira Synchronized php-1.28.0-wmf.23/extensions/CiteThisPage/SpecialCiteThisPage.php: T149112 (duration: 01m 39s) [20:14:57] T149112: [bug] SpecialCiteThisPage fatal in 1.28.0-wmf.23 - - https://phabricator.wikimedia.org/T149112 [20:15:00] legoktm: Deployed. Thx for quick fix. [20:15:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:15:47] ack, thanks, I'll sync my change now [20:16:15] 06Operations, 10Cassandra, 06Services (doing): some cassandra instances restarted with java.lang.OutOfMemoryError: Java heap space - https://phabricator.wikimedia.org/T148516#2742669 (10Eevans) >>! In T148516#2742544, @Eevans wrote: > Some more findings: > > Looking at the dominator tree, the 3.4G result is... [20:16:49] !log filippo@mira Synchronized wmf-config/ProductionServices.php: Put back potassium as poolcounter1002 (duration: 00m 51s) [20:16:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:18:20] 06Operations, 05Goal: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#2742677 (10fgiunchedi) [20:18:22] 06Operations, 13Patch-For-Review: Migrate pool counters to trusty/jessie - https://phabricator.wikimedia.org/T123734#2742672 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi Potassium reimaged as poolcounter1002, resolving. 
Followup for wrap up the rename is at T149106 [20:29:05] (03PS1) 10Chad: Gerrit: Move jvm logs outside of gerrit logging folder [puppet] - 10https://gerrit.wikimedia.org/r/317877 [20:30:17] (03CR) 10Chad: "I chose /srv/ instead of /var/log because GC/JVM logs have the potential to get massive, and I don't want to take down the whole OS if the" [puppet] - 10https://gerrit.wikimedia.org/r/317877 (owner: 10Chad) [20:30:29] (03PS1) 10Yurik: Ensure wgJsonConfigs is set properly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317879 [20:31:19] (03CR) 10ArielGlenn: [C: 032] Gerrit: Move jvm logs outside of gerrit logging folder [puppet] - 10https://gerrit.wikimedia.org/r/317877 (owner: 10Chad) [20:31:25] looks like Zero is semidead, working on it. CC: ostriches, Reedy, MaxSem [20:31:33] patch ^^ [20:32:43] !log gerrit doing a quick reboot, config pick up [20:32:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:37:58] MatmaRex: I just doublechecked via salt; your patch is present in the 3.12.7+dfsg-1+wmf1 package onwards and this is installed on all hosts with hhvm in production [20:38:28] 06Operations, 06Commons, 06Multimedia: Deploy a PHP and HHVM patch (Exif values retrieved incorrectly if they appear before IFD) - https://phabricator.wikimedia.org/T140419#2742776 (10MoritzMuehlenhoff) The patch is present in the 3.12.7+dfsg-1+wmf1 package onwards and this is version installed on all hosts... 
[20:42:47] !log maxsem@mira Synchronized wmf-config/mobile.php: https://gerrit.wikimedia.org/r/#/c/317879/ (duration: 00m 47s) [20:42:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:53:43] 06Operations, 10Cassandra, 06Services (doing): some cassandra instances restarted with java.lang.OutOfMemoryError: Java heap space - https://phabricator.wikimedia.org/T148516#2742821 (10Eevans) Since it proved fairly difficult to get setup to analyze this dump, I've written up some notes on the process I use... [20:58:45] 06Operations, 10Cassandra, 06Services (doing): some cassandra instances restarted with java.lang.OutOfMemoryError: Java heap space - https://phabricator.wikimedia.org/T148516#2742852 (10Eevans) >>! In T148516#2742669, @Eevans wrote: >>>! In T148516#2742544, @Eevans wrote: >> Some more findings: >> >> Lookin... [20:59:07] 06Operations, 10Cassandra, 06Services (doing): some cassandra instances restarted with java.lang.OutOfMemoryError: Java heap space - https://phabricator.wikimedia.org/T148516#2742854 (10Eevans) [21:00:36] 06Operations, 10ops-codfw, 10DBA: db2011 disk media errors - https://phabricator.wikimedia.org/T149099#2742860 (10RobH) 05Open>03stalled So these are 300GB SEAGATE ST3300657SS. 3.5" 15K SAS disks, and we don't keep any of these spare. (We've moved on to SSDs in new databases.) I'll create a sub-task i... [21:03:30] 06Operations, 06Services (next), 15User-mobrovac: Investigate better protection modes for electron render service (xvfb setuid) - https://phabricator.wikimedia.org/T143336#2742874 (10GWicke) a:03GWicke [21:05:43] 06Operations, 10Cassandra, 06Services (doing): some cassandra instances restarted with java.lang.OutOfMemoryError: Java heap space - https://phabricator.wikimedia.org/T148516#2742878 (10Eevans) I think this investigation has reached a point where the issue could be closed. I will however leave this open for... 
[21:15:05] ostriches, Reedy - MaxSem and I are still working on zero configuration issues with the wmf23. Will extend the hour [21:29:41] jouncebot, update [21:40:24] @now [21:40:35] jouncebot, @now [21:40:40] jouncebot, now [21:40:41] For the next 0 hour(s) and 19 minute(s): Fixing Zero config issues (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161025T2100) [21:50:05] 06Operations, 10Pybal, 06Services, 13Patch-For-Review, and 2 others: Depool / repool scripts execute successfully even when the host has not been (r|d)epooled - https://phabricator.wikimedia.org/T145518#2743123 (10Arlolra) > However, in certain occasions the hosts are not actually (r|d)epooled, in spite of... [21:51:05] !log maxsem@mira Synchronized php-1.28.0-wmf.23/extensions/Graph/: https://gerrit.wikimedia.org/r/#/c/317989/ (duration: 01m 23s) [21:51:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:51:14] 06Operations, 10Parsoid: Deploy failed on wtp2017.codfw.wmnet - https://phabricator.wikimedia.org/T149115#2742709 (10Arlolra) [21:53:07] heads up. I'm going to deploy RESTBase and it might create some monitoring alerts in the meantime. It's expected [21:55:41] !log RESTBase update to 3e53f00e - staging [21:55:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:02:41] !log RESTBase update to 3e53f00e [22:02:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:14:43] legoktm, around? we are having tons of weird config issues [22:14:52] kind of, what's up? [22:14:53] in prod :( [22:15:48] legoktm, for some reason, only one extension is showing up in config's merged values [22:15:58] legoktm, could you join video chat plz? [22:16:02] i will send you invite [22:16:14] uh, I can't really hangout right now [22:16:32] is there a bug or logs about what's wrong? 
[22:16:41] legoktm, debugging production atm :) [22:17:16] well I can't really help without background, like what's expected and what's not happening [22:17:22] or which extension and config setting this is about [22:17:25] sec, getting the info together [22:17:46] prod issues should be debugged in public channels unless there's something private about them - is there something private about this? [22:18:44] not exactly, but trying to avoid typing [22:18:50] ok: first element: https://github.com/wikimedia/mediawiki-extensions-JsonConfig/blob/master/extension.json#L122 [22:18:53] second: https://github.com/wikimedia/mediawiki-extensions-ZeroBanner/blob/master/extension.json#L146 [22:18:57] third: https://github.com/wikimedia/mediawiki-extensions-ZeroPortal/blob/master/extension.json#L61 [22:19:02] they should all merge [22:19:11] result - only the last element is there [22:19:23] legoktm, ^ [22:20:21] okay well, that's not how extension.json is supposed to work [22:20:37] "config" is owned by one extension, and for things that are configured by LocalSettings.php [22:21:06] if an extension needs to configure another extension, then that should be an attribute (top level arbitrary keys) [22:23:05] so it looks like it's also set in mobile.php [22:23:25] not exactly - mobile.php is set when portal is OFF [22:23:32] right now i'm only debugging the ZeroPortal=true case [22:23:59] legoktm, before extension.json, multiple extensions added parts of the shared config. How can we port that? [22:24:28] there was a $wgJsonConfigs, and each extension would add subkeys [22:24:42] or modify keys that other extensions would add to it [22:25:01] attributes [22:25:18] https://www.mediawiki.org/wiki/Manual:Extension_registration#Attributes [22:25:47] legoktm, but you said attributes are top-level only? 
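A toy model of the symptom described above ("they should all merge", "only the last element is there"). It does not reproduce MediaWiki's actual extension loader: the extension names come from the chat, while the sub-keys and both loader functions are hypothetical and exist only to contrast overwriting a shared setting with merging into it.

```python
def load_last_writer_wins(extensions: list) -> dict:
    """Each extension assigns the whole setting, so earlier values are lost."""
    settings = {}
    for ext in extensions:
        settings["JsonConfigs"] = dict(ext["contributes"])
    return settings


def load_merged(extensions: list) -> dict:
    """Each extension adds its sub-keys to one shared setting."""
    settings = {"JsonConfigs": {}}
    for ext in extensions:
        settings["JsonConfigs"].update(ext["contributes"])
    return settings


# Sub-key names are made up for illustration.
extensions = [
    {"name": "JsonConfig", "contributes": {"from_jsonconfig": True}},
    {"name": "ZeroBanner", "contributes": {"from_zerobanner": True}},
    {"name": "ZeroPortal", "contributes": {"from_zeroportal": True}},
]

overwritten = load_last_writer_wins(extensions)["JsonConfigs"]
merged = load_merged(extensions)["JsonConfigs"]
print(sorted(overwritten), sorted(merged))
```

The first loader matches the reported result, where only the last extension's entry survives; the second is the behaviour the extensions were relying on before the port.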
[22:25:49] I thought I even added that to JsonConfig last time [22:25:53] yes, they don't go under "config" [22:27:46] paladox: hello, could you kick grrrit-wm please [22:27:55] Ok [22:27:56] Hi [22:27:59] looks like gerrit got restarted about 2 hours ago [22:28:05] from a merge related to logging [22:28:10] hence no bot [22:28:21] Oh whoops [22:28:27] Yeah, we rebooted to pick up the jvm gc debuggin' stuff [22:28:27] i forgot to restart it then too [22:28:50] ok, thanks! i just got back [22:29:35] mutante ^^ done :) [22:29:35] paladox: i amended your change about javascript MIME type [22:29:40] thanks [22:29:48] so if you look at 7.1 and 7.2 in https://www.rfc-editor.org/rfc/rfc4329.txt [22:29:56] Yep [22:30:03] it says which one is the standardized one now [22:30:19] linked from that stackoverflow [22:30:32] Oh, i just added all three just in case a browser didn't support the standardized one [22:30:42] due to so many changes in recent years :) [22:31:38] paladox: yes, but then i read the comment from Krinkle [22:31:52] Oh [22:31:53] Operations, Services (next), User-mobrovac: Investigate better protection modes for electron render service (xvfb setuid) - https://phabricator.wikimedia.org/T143336#2743307 (GWicke) I got the electron render service running in firejail on the pdf.services labs instance. Notes: - `apt-get install xp... [22:31:56] per "there shouldn't be any need to list more than one. We obviously control the server phab runs on and should be able to determine which one Phabricator considers it as when it reads a js file. Most likely based on code in Phabricator itself," [22:32:07] So you want me to use the standard one? [22:32:18] yurik: this is pretty confusing. It looks like both extensions are required from CommonSettings and then also in mobile.php?!? [22:32:20] at least that's what i did in PS8 now [22:32:25] yurik: what is the actual production impact of this? 
[22:32:29] to add only application/javascript [22:32:30] and can we just revert? [22:32:40] (PS9) Paladox: Phabricator: Add javascript to files.viewable-mime-types [puppet] - https://gerrit.wikimedia.org/r/317500 [22:32:45] legoktm, at the moment, zero is down [22:32:46] (PS10) Paladox: Phabricator: Add javascript to files.viewable-mime-types [puppet] - https://gerrit.wikimedia.org/r/317500 [22:33:01] oh woops [22:33:02] legoktm, MaxSem is rolling back zerobanner & zeroportal to 22 :( [22:33:14] mutante i didn't see your change ^^, reverting back to yours [22:33:15] LOL [22:33:27] ok [22:33:32] legoktm, when can we chat about this? [22:33:34] (PS11) Paladox: Phabricator: Add javascript to files.viewable-mime-types [puppet] - https://gerrit.wikimedia.org/r/317500 [22:33:34] Done [22:33:35] yurik: the wiki is down? or anyone trying to access over zero? [22:34:08] legoktm, if any zero config in memcached expires, no wiki will be able to get zeroconfiguration [22:34:15] which means it won't show the banner [22:34:16] !log revert RESTBase to f9017adc [22:34:16] (CR) Dzahn: [C: 2] Phabricator: Add javascript to files.viewable-mime-types [puppet] - https://gerrit.wikimedia.org/r/317500 (owner: Paladox) [22:34:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:34:41] ^^ thanks [22:34:44] mutante ^^ [22:36:59] legoktm, the goal is : ZeroBanner ext should set initial wgJsonConfigs.JsonZeroConfig={...}, ZeroPortal alters that with some values. And mobile.php also alters it with other values. both mobile and zeroportal will never run at the same time. [22:37:29] mobile alters it for all wikis except zerowiki. ZeroPortal ext runs only on zerowiki [22:37:41] yurik: can you file a bug for this please?
I'm about to go afk, can look tonight [22:37:56] legoktm, it's already open for a while :) [22:38:12] and sorry for not reviewing/looking at this stuff earlier, being at offsites for 2 weeks really made me behind on everything [22:38:15] link? [22:38:15] ostriches: paladox: also, very nice that we have the gc logging now, that should get us some more info whether it's really the culprit and using G1 makes sense or not [22:38:32] paladox: feel free to test that last one now [22:38:46] Ok thanks [22:39:14] mutante nope still doesn't allow us to view js in raw in browser [22:39:24] I think this is the same thing that happened with php [22:39:34] and we had to apply a custom patch to phabricator core [22:39:40] ^^ twentyafterfour [22:40:11] mutante only certain js files will work in the raw, ie small js files. Big ones such as https://phabricator.wikimedia.org/diffusion/EPFM/browse/master/libs/ext.pf.select2.base.js don't [22:40:23] paladox: how about the "should be able to determine which one Phabricator considers it as when it reads a js file. Most likely based on code in Phabricator itself" [22:40:33] custom patch in core because of that? [22:40:42] Yeh we have a custom patch [22:40:47] for viewing phpp [22:40:48] ok [22:40:50] php [22:41:30] I think the custom patch for php is just config ... [22:41:34] in ops/puppet [22:42:07] but more than files.viewable-mime-types? [22:42:26] I can't remember, we might have changed something slightly [22:42:39] alright [22:43:15] twentyafterfour you did a change in the phabricator [22:43:15] core [22:43:27] since changing ops/puppet wasn't enough to work. [22:43:36] So i think this may be a bug in phabricator core [22:46:39] paladox: something unrelated, still interested in the "Update all on-wiki references to git.wikimedia.org and replace them with the Phabricator equivalent" [22:46:52] oh [22:47:10] I think the deal with js is different from php.
[22:47:14] Oh [22:47:30] with js we don't want to allow js to be executed (resulting in potential XSS vulnerabilities) [22:48:03] Oh, but some can [22:48:11] so allowing arbitrary js to be served in raw form from the phabricator application domain (that is, https://phabricator.wikimedia.org rather than wmfusercontent.org) [22:48:15] twentyafterfour some js files can be viewed raw in the browser [22:48:29] why just some? which ones can't? [22:48:30] as long as they're small files, i can view them, but big ones i can't [22:48:41] twentyafterfour https://phabricator.wikimedia.org/diffusion/EPFM/browse/master/libs/ext.pf.select2.base.js [22:49:00] compared to https://phabricator.wikimedia.org/diffusion/EPFM/browse/master/Gruntfile.js [22:49:03] (CR) Andy M. Wang: "This appears to be a cumulative patch that does not give any time for which non-admins have any ability to patrol new pages. I suggest spl" [mediawiki-config] - https://gerrit.wikimedia.org/r/317824 (https://phabricator.wikimedia.org/T149019) (owner: Cenarium) [22:50:12] paladox: that seems to be the same for me [22:50:21] Yep [22:52:21] oh, i see what you mean, yea [22:52:28] :) [22:52:37] depending on random timing and such, we might get some spurious ipsec alert spam in here soon [22:52:50] if so, ignore it and apologies in advance (it's hard to usefully suppress) [22:53:31] twentyafterfour https://phabricator.wikimedia.org/rPHAB6b9b0cc5b26a574dcac6be4b27f72860458897d7 [22:54:56] bblack: thanks for the heads-up [22:55:36] paladox: would you feel like doing this https://phabricator.wikimedia.org/T137353#2743071 or are we good enough now [22:56:03] I could possibly do it [22:56:12] i know we definitely fixed the low hanging fruit [22:56:20] But can't tonight, i can try tonight or possibly try doing it on thursday [22:56:27] it's probably good enough [22:56:41] Do i use that special search tool on mw to check for the references? [22:56:44] mutante ^^ [22:57:12] i..
eh..just used normal search but then yurik had something better [22:57:19] that was that special tool you mean i guess [22:57:27] yep [22:59:20] mutante, eh? special search tool? [23:00:05] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161025T2300). [23:00:05] ebernhardson and arseny92: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:06] https://www.mediawiki.org/wiki/Special:LinkSearch [23:00:12] . [23:00:17] !log maxsem@mira Synchronized php-1.28.0-wmf.23/extensions/ZeroPortal/: https://gerrit.wikimedia.org/r/#/c/318004/ (duration: 01m 32s) [23:00:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:00:23] hey, we're still fixing Zero [23:00:33] (PS1) Filippo Giunchedi: esams: introduce svc records for swift [dns] - https://gerrit.wikimedia.org/r/318010 (https://phabricator.wikimedia.org/T149098) [23:00:36] uh [23:00:48] twentyafterfour mutante https://phabricator.wikimedia.org/D428 :) [23:00:51] yurik: it's about the best way to check the links to git.wm https://phabricator.wikimedia.org/T137353#2743071 [23:01:55] Or using insource: searches in cirrus.
[23:02:15] Since linksearch can't wildcard the protocol [23:02:34] !log maxsem@mira Synchronized php-1.28.0-wmf.23/extensions/ZeroBanner/: https://gerrit.wikimedia.org/r/#/c/318004/ (duration: 00m 50s) [23:02:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:02:58] oh [23:03:03] oh my patch doesn't need to be deployed, i managed to get it into the swat earlier today [23:04:04] I'm done, SWAT can begin [23:04:05] paladox: here is the patch we did for php files back in april: https://phabricator.wikimedia.org/rPHAB6b9b0cc5b26a574dcac6be4b27f72860458897d7 [23:04:19] ostriches: hmm, yea ought to be able to throw together a quick adjustment to mwgrep and run that search everywhere... [23:04:20] twentyafterfour twentyafterfour https://phabricator.wikimedia.org/rPHAB6b9b0cc5b26a574dcac6be4b27f72860458897d7 [23:04:23] LOL [23:04:54] twentyafterfour I've submitted a patch [23:04:55] ah I didn't see that [23:05:07] (PS2) Filippo Giunchedi: standard: deploy prometheus-node-exporter to eqiad [puppet] - https://gerrit.wikimedia.org/r/317552 (https://phabricator.wikimedia.org/T140646) [23:05:21] MaxSem , you SWATing? [23:05:32] twentyafterfour it seems someone removed the differential patches from homepage [23:05:41] oh never mind found it [23:06:05] twentyafterfour https://phabricator.wikimedia.org/D428 :) [23:06:41] Operations, Traffic: reimage cp1047 - https://phabricator.wikimedia.org/T148723#2731084 (ops-monitoring-bot) Script wmf_auto_reimage was launched by bblack on neodymium.eqiad.wmnet for hosts: ``` ['cp1047.eqiad.wmnet'] ``` The log can be found in `/var/log/wmf-auto-reimage/201610252306_bblack_5171.log`.
[23:07:51] arseny92, nope - had a long debugging session, not prepared to continue deploying [23:07:55] Operations, Traffic: reimage cp1047 - https://phabricator.wikimedia.org/T148723#2743386 (ops-monitoring-bot) Script wmf_auto_reimage was launched by bblack on neodymium.eqiad.wmnet for hosts: ``` ['cp1047.eqiad.wmnet'] ``` The log can be found in `/var/log/wmf-auto-reimage/201610252307_bblack_5594.log`. [23:08:43] (PS1) Dzahn: Revert "Phabricator: Add javascript to files.viewable-mime-types" [puppet] - https://gerrit.wikimedia.org/r/318011 [23:09:18] mutante ^^ we can keep that applied [23:09:23] as it doesn't break anything [23:09:34] (CR) Dzahn: [C: 2] "It did not achieve what was expected, might need upstream and security review." [puppet] - https://gerrit.wikimedia.org/r/318011 (owner: Dzahn) [23:09:43] oh [23:10:19] MaxSem , then who can SWAT? [23:11:17] arseny92 any one of these addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) [23:11:19] (CR) Filippo Giunchedi: [C: 2] standard: deploy prometheus-node-exporter to eqiad [puppet] - https://gerrit.wikimedia.org/r/317552 (https://phabricator.wikimedia.org/T140646) (owner: Filippo Giunchedi) [23:11:24] (PS3) Filippo Giunchedi: standard: deploy prometheus-node-exporter to eqiad [puppet] - https://gerrit.wikimedia.org/r/317552 (https://phabricator.wikimedia.org/T140646) [23:12:06] paladox , I got that, but who else of them is around? [23:12:19] not sure [23:12:27] ebernhardson: do you have time to handle the SWAT? Looks like you've got a patch in it [23:12:45] That was already merged [23:12:59] bd808: A different patch?
[19:03] oh my patch doesn't need to be deployed, i managed to get it into the swat earlier today [23:13:09] ah [23:13:16] there you go then [23:13:23] :-) [23:14:14] (PS1) Yurik: Enable static maps on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/318013 (https://phabricator.wikimedia.org/T149071) [23:15:04] If no one else can swat I can perhaps swat. [23:15:16] have at it Adam [23:15:18] (CR) Dereckson: [C: -1] "We'll wait and deploy early this next week, as there are some ongoing discussions on the task about naming or bureaucrats." [mediawiki-config] - https://gerrit.wikimedia.org/r/317824 (https://phabricator.wikimedia.org/T149019) (owner: Cenarium) [23:15:39] *logs into a few places* [23:15:51] (CR) Dzahn: [C: 2] decom palladium from puppet, install_server, network constants [puppet] - https://gerrit.wikimedia.org/r/315891 (https://phabricator.wikimedia.org/T147320) (owner: Dzahn) [23:15:56] goodbye old puppet master [23:16:26] addshore, i am about to add one more patch [23:16:32] yurik: okay [23:17:19] (PS3) Dzahn: decom palladium from puppet, install_server, network constants [puppet] - https://gerrit.wikimedia.org/r/315891 (https://phabricator.wikimedia.org/T147320) [23:17:21] addshore, done [23:17:36] meh, tin is the active deployment server *looks around* [23:17:55] no I don't think so, mira still is [23:18:14] addshore: ^ [23:18:16] *goes to deployment.eqiad.wmnet .... [23:19:05] okay arseny92 you're here?
:) [23:19:17] yes [23:19:35] (PS2) Addshore: Raise abusefilter condition limit for Meta-Wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/313601 (https://phabricator.wikimedia.org/T147063) (owner: MarcoAurelio) [23:19:40] (CR) Addshore: [C: 2] Raise abusefilter condition limit for Meta-Wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/313601 (https://phabricator.wikimedia.org/T147063) (owner: MarcoAurelio) [23:19:49] yurik , https://wikitech.wikimedia.org/w/index.php?title=Deployments&type=revision&diff=928243&oldid=928237 [23:20:10] (Merged) jenkins-bot: Raise abusefilter condition limit for Meta-Wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/313601 (https://phabricator.wikimedia.org/T147063) (owner: MarcoAurelio) [23:20:23] arseny92, yes [23:20:38] arseny92, but it doesn't enable it on he/ca/mk wikis just yet [23:20:45] arseny92: is this something testable on mw1099? [23:20:46] i want to test it first [23:21:01] https://gerrit.wikimedia.org/r/313601 is now on mw1099 [23:21:25] !log removed palladium from puppet (T147320). puppet node clean [23:21:26] T147320: Decomission palladium - https://phabricator.wikimedia.org/T147320 [23:21:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:22:02] addshore: wait [23:22:08] ack [23:22:39] addshore: sorry, just saw the ticket. Has there been a performance discussion about meta-wiki's abuse filters? I don't think we should be raising the limit just because they asked for it [23:23:29] looks like nothing exploded [23:23:40] e.g. compare the analysis on https://phabricator.wikimedia.org/T132048 versus the meta-wiki ticket [23:25:07] addshore: I'll comment on the ticket, but I don't think it should be raised right now. [23:25:10] There should be less of an impact for meta, but if we think we want to look at this more I can revert it for now [23:25:21] ack, will revert it for now.
[23:25:46] (PS1) Addshore: Revert "Raise abusefilter condition limit for Meta-Wiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/318014 [23:25:50] (CR) Addshore: [C: 2] Revert "Raise abusefilter condition limit for Meta-Wiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/318014 (owner: Addshore) [23:26:11] !log Submitted 'deactivate node' for palladium.eqiad.wmnet [23:26:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:26:23] (Merged) jenkins-bot: Revert "Raise abusefilter condition limit for Meta-Wiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/318014 (owner: Addshore) [23:26:56] 1099 now has the revert too. [23:27:36] arseny92: going ahead with https://gerrit.wikimedia.org/r/#/c/315121/5 now [23:28:13] (PS6) Addshore: Stop adding "Category:Uploaded with UploadWizard" [mediawiki-config] - https://gerrit.wikimedia.org/r/315121 (https://phabricator.wikimedia.org/T147799) (owner: MarcoAurelio) [23:28:24] (CR) Addshore: [C: 2] Stop adding "Category:Uploaded with UploadWizard" [mediawiki-config] - https://gerrit.wikimedia.org/r/315121 (https://phabricator.wikimedia.org/T147799) (owner: MarcoAurelio) [23:28:36] addshore: thanks [23:28:41] legoktm: np [23:29:01] (Merged) jenkins-bot: Stop adding "Category:Uploaded with UploadWizard" [mediawiki-config] - https://gerrit.wikimedia.org/r/315121 (https://phabricator.wikimedia.org/T147799) (owner: MarcoAurelio) [23:29:04] /3/3 [23:29:43] arseny92: https://gerrit.wikimedia.org/r/315121 is now on mw1099, please test :) [23:31:51] will take some mins [23:32:14] okay! [23:32:44] Operations, Traffic: reimage cp1047 - https://phabricator.wikimedia.org/T148723#2743429 (ops-monitoring-bot) Script wmf_auto_reimage was launched by bblack on neodymium.eqiad.wmnet for hosts: ``` ['cp1047.eqiad.wmnet'] ``` The log can be found in `/var/log/wmf-auto-reimage/201610252332_bblack_14641.log`.
[23:33:03] (CR) Addshore: [C: 2] Enable static maps on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/318013 (https://phabricator.wikimedia.org/T149071) (owner: Yurik) [23:33:14] yurik: are you able to test yours on mw1099? [23:33:24] addshore, i hope so... looking [23:33:34] it's not there quite yet, give me 1 more min! [23:35:00] (PS2) Addshore: Enable static maps on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/318013 (https://phabricator.wikimedia.org/T149071) (owner: Yurik) [23:35:06] (CR) Addshore: Enable static maps on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/318013 (https://phabricator.wikimedia.org/T149071) (owner: Yurik) [23:35:08] (CR) Addshore: [C: 2] Enable static maps on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/318013 (https://phabricator.wikimedia.org/T149071) (owner: Yurik) [23:35:36] (Merged) jenkins-bot: Enable static maps on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/318013 (https://phabricator.wikimedia.org/T149071) (owner: Yurik) [23:35:56] yurik: your patch is now also on mw1099! [23:36:04] testing... [23:37:48] addshore, all's good [23:37:57] yurik: ack, syncing [23:39:07] !log addshore@mira Synchronized wmf-config/InitialiseSettings.php: [[gerrit:318013]] Enable static maps on testwiki (duration: 00m 48s) [23:39:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:39:19] yurik: ^^ please make sure the world hasn't broken! [23:39:33] arseny92: any joy testing your patch? :) [23:39:48] not exploding [23:40:27] all's good :) [23:40:43] arseny92: syncing yours now. [23:41:11] Dereckson: I see you added another patch to the window, do you want to deploy that one?
[23:41:38] !log addshore@mira Synchronized wmf-config/CommonSettings.php: [[gerrit:315121]] Stop adding Category:Uploaded_with_UploadWizard (duration: 00m 47s) [23:41:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:41:44] arseny92: ^^ please check [23:41:44] (PS2) Dereckson: Add a project namespace on tg.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/293243 (https://phabricator.wikimedia.org/T137200) [23:41:48] Hello. [23:41:55] addshore: could you rebase it? I'm not logged on mira etc. [23:41:59] deploy it [23:42:05] Dereckson: yup! [23:42:07] thanks :) [23:42:24] (CR) Addshore: [C: 2] Add a project namespace on tg.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/293243 (https://phabricator.wikimedia.org/T137200) (owner: Dereckson) [23:42:55] (Merged) jenkins-bot: Add a project namespace on tg.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/293243 (https://phabricator.wikimedia.org/T137200) (owner: Dereckson) [23:43:35] Dereckson: it's on mw1099! [23:43:45] testing [23:44:06] Works. [23:44:15] syncing [23:44:59] !log addshore@mira Synchronized wmf-config/InitialiseSettings.php: [[gerrit:293243]] Add a project namespace on tg.wikipedia (duration: 00m 47s) [23:45:01] Dereckson: ^^ :) - An that concludes this late starting early finishing SWAT window! [23:45:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:45:12] *and [23:45:22] Thanks for the deployment addshore. [23:45:39] *goes back to watching his film* ;) [23:50:18] icinga-wm: :P [23:51:13] (PS2) Cenarium: Create patroller usergroup for enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/317824 (https://phabricator.wikimedia.org/T149019) [23:56:55] (CR) Cenarium: "It now only creates the group. Reviewer is already taken by pending changes reviewer so we can only call this patroller, in the code."
[mediawiki-config] - https://gerrit.wikimedia.org/r/317824 (https://phabricator.wikimedia.org/T149019) (owner: Cenarium)