[00:01:04] !log ori@tin Synchronized php-1.27.0-wmf.16/extensions/Wikidata/extensions/Wikibase/client/includes/Hooks/DataUpdateHookHandlers.php: Briefly disable Wikibase\Client\Hooks\DataUpdateHookHandlers::onParserCacheSaveComplete (duration: 00m 26s) [00:01:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:03:11] (03CR) 10Dzahn: [C: 032] DNS: Adding production entries for wasat Bug:T129930 [dns] - 10https://gerrit.wikimedia.org/r/277680 (https://phabricator.wikimedia.org/T129930) (owner: 10Papaul) [00:15:43] !log ori@tin Synchronized php-1.27.0-wmf.16/extensions/Wikidata/extensions/Wikibase/client/includes/Hooks/DataUpdateHookHandlers.php: Revert briefly disable Wikibase\Client\Hooks\DataUpdateHookHandlers::onParserCacheSaveComplete (duration: 00m 31s) [00:17:40] !log maxsem@tin Synchronized php-1.27.0-wmf.16/includes/jobqueue/JobQueueGroup.php: Log REQUEST_URI (duration: 00m 28s) [00:17:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:28:40] https://meta.wikimedia.org/wiki/Hardware_ordered_January_2004 [00:29:32] Look, we already ordered secondary servers and switched over. [00:38:27] (03PS1) 10Thcipriani: Use scap subcommands [puppet] - 10https://gerrit.wikimedia.org/r/277700 [00:44:13] (03CR) 10Thcipriani: [C: 04-1] "This can merge pending the merge and deployment of D145" [puppet] - 10https://gerrit.wikimedia.org/r/277700 (owner: 10Thcipriani) [00:46:10] !log maxsem@tin Synchronized php-1.27.0-wmf.16/includes/jobqueue/JobQueueGroup.php: Debug (duration: 00m 28s) [00:46:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:47:39] 6Operations, 10MediaWiki-JobQueue, 13Patch-For-Review: The refreshLinks jobs enqueue rate is 10 times the normal rate - https://phabricator.wikimedia.org/T129517#2125116 (10ori) One possible loop: `RefreshLinksJob::runForTitle()` → `ParserCache::singleton()->save()` → `Hooks::run()` → `Wikibase\Client\Hooks... 
[00:48:59] !log ori@tin Synchronized php-1.27.0-wmf.16/includes/jobqueue/utils/BacklinkJobUtils.php: I32ec0 (duration: 00m 30s) [00:49:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:57:35] !log maxsem@tin Synchronized php-1.27.0-wmf.16/includes/jobqueue/JobQueueGroup.php: Debug (duration: 00m 24s) [00:57:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:01:11] (03CR) 10Dzahn: "re: the comment on precise. when i recently merged a change to font packages and thought the same, it turned out there is still precise in" [puppet] - 10https://gerrit.wikimedia.org/r/218640 (https://phabricator.wikimedia.org/T102623) (owner: 10Dzahn) [01:02:33] (03CR) 10Dzahn: "..and then all this https://phabricator.wikimedia.org/T129500#2112727 and i had to revert , also https://phabricator.wikimedia.org/T1026" [puppet] - 10https://gerrit.wikimedia.org/r/218640 (https://phabricator.wikimedia.org/T102623) (owner: 10Dzahn) [01:06:02] PROBLEM - MariaDB Slave Lag: m3 on db1048 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 350.07 seconds [01:09:07] (03PS5) 10Madhuvishy: [WIP] ifttt: Set up Wikimedia IFTTT channel service using puppet on labs [puppet] - 10https://gerrit.wikimedia.org/r/277189 [01:11:30] oh, that sounds interesting madhuvishy [01:11:59] mutante: the ifttt channel setup? [01:12:29] yea, just the "ifttt" keyword got my attention because i used that before but it's been a long time [01:12:49] and a wikimedia channel for it sounds interesting [01:13:01] mutante: aah - stephen laporte has an ifttt channel on tool labs running since last wikimania :) [01:13:28] it's been going down a lot due to tool labs nfs etc - so i'm just helping to move it to labs [01:13:31] oh .. TIL [01:13:34] nice [01:13:49] https://github.com/wikimedia/ifttt [01:14:35] mutante: I noticed that you moved some of the role files from the top level manifests/role in ops/puppet to modules/role/... 
[01:14:57] i was wondering what the motivation behind that was - just for me to know [01:15:00] "if picture of the day is a cat .. then ... make the bot post it the kawai channel" [01:15:36] mutante: :D there are loads of recipes folks have made. https://ifttt.com/wikipedia [01:16:38] madhuvishy: the motivation is to move everything into the "autoloader" layout and into a module and so far the role classes have been treated special [01:16:45] http://puppet-lint.com/checks/autoloader_layout/ [01:16:56] for example that means this check can never be green [01:17:01] aah [01:17:11] right [01:17:41] and then things also dont have to be include anymore [01:17:46] currently in site.pp you have this: [01:18:02] 7 import 'role/*.pp' [01:18:10] 5 import 'misc/*.pp' [01:18:11] etc [01:18:12] !log maxsem@tin Synchronized php-1.27.0-wmf.16/includes/jobqueue/JobQueueGroup.php: Debug (duration: 00m 26s) [01:18:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:18:18] we would like to kill all of that [01:18:24] nice. makes sense [01:18:25] and have only the clean module structure [01:18:40] but first they all have to be moved.. so yep [01:18:57] let me know if any of the analytics/research-y things are blocking - i can help [01:19:38] thank you! ok, i will [01:20:00] i think recently many analytics classes were done [01:20:04] i guess there's eventlogging - i can do that sometime [01:20:14] there was always the question of limn [01:20:30] ya i found out by panicking - my role is gone but my service is up howww [01:21:33] oops, sorry, heh. 
i test them all in compiler to make sure nothing breaks [01:21:44] so there is a thing with this: [01:22:11] it's long to explain, but basically if the roles are like this: role::foo::server then there is no problem [01:22:42] but if they were like this: role::baz then i had a problem moving them [01:23:10] that is why i moved some things a while ago but you still see the current ones [01:23:25] because of baz referring to both the role and the actual module? [01:23:27] there is a longish ticket about it [01:24:15] * mutante tries to find something [01:26:23] so first, when you see me moving things around and lint fixes it's all for that one epic ticket https://phabricator.wikimedia.org/T93645 and it shows the history of how the code gets better over time [01:26:29] and then this is the specific one [01:26:33] https://phabricator.wikimedia.org/T119042 [01:26:44] about the layout problem with the roles [01:26:52] also see comments from _joe_ [01:27:12] mutante: thanks I'll read these [01:27:13] "the problem is "import" is implemented awkwardly.. 
there is no way to fix it,"" :/ [01:27:26] you're welcome [01:27:34] hmmm [01:27:49] that wasn't the end of it :) [01:27:55] it continued after that [01:28:19] the very last comment shows you how i could move mediawiki classes without a problem despite that [01:28:34] but try moving a remaining one and run it in compiler [01:29:01] one "fix" would be the rename all the role classes to foo::bar [01:29:43] joe suggested to put a ::fixme:: or so in there to make obvious it's a hack because of puppet [01:30:17] and/or we can maybe move them if we move them all at the same time [01:32:05] !log rebooting elastic2011.codfw.wmnet for kernel upgrade [01:32:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:34:28] PROBLEM - Host elastic2011 is DOWN: PING CRITICAL - Packet loss = 100% [01:35:37] RECOVERY - Host elastic2011 is UP: PING OK - Packet loss = 0%, RTA = 37.04 ms [01:40:16] mutante: just finished reading the thread. interesting [01:44:21] (03CR) 10Dzahn: "yea, the role class structure should work fine like this. just 2 inline comments" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/277189 (owner: 10Madhuvishy) [01:47:17] madhuvishy: yea, right ? additional comments welcome. i see one that is kind of analytics related, maybe. role::webperf [01:47:21] RECOVERY - MariaDB Slave Lag: m3 on db1048 is OK: OK slave_sql_lag Replication lag: 0.09 seconds [01:48:01] i dont know much about what it does [01:48:29] maybe it could be role::performance::web or whatever.. and then moving it would be no problem [01:48:35] gotta run. bbl [01:49:29] (uses away nick only when on duty) [01:50:00] (03CR) 10Alex Monk: "Krinkle: Shall we decide on a date for this and announce?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/217858 (https://phabricator.wikimedia.org/T122933) (owner: 10Faidon Liambotis) [02:17:47] !log rebooting elastic2012.codfw.wmnet for kernel upgrade [02:17:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:19:28] PROBLEM - Host elastic2012 is DOWN: PING CRITICAL - Packet loss = 100% [02:20:57] RECOVERY - Host elastic2012 is UP: PING OK - Packet loss = 0%, RTA = 37.42 ms [02:34:37] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.16) (duration: 13m 44s) [02:34:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:42:24] (03PS1) 10Papaul: params: New recipe for labstore200[3-4] Bug:T128764 [puppet] - 10https://gerrit.wikimedia.org/r/277709 (https://phabricator.wikimedia.org/T128764) [02:45:56] !log rebooting elastic2013.codfw.wmnet for kernel upgrade [02:46:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:46:57] PROBLEM - Host elastic2013 is DOWN: PING CRITICAL - Packet loss = 100% [02:49:46] RECOVERY - Host elastic2013 is UP: PING OK - Packet loss = 0%, RTA = 36.23 ms [02:50:06] (03PS2) 10Papaul: params: New recipe for labstore200[3-4] Bug:T128764 [puppet] - 10https://gerrit.wikimedia.org/r/277709 (https://phabricator.wikimedia.org/T128764) [03:09:45] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.17) (duration: 17m 36s) [03:09:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:12:59] 6Operations, 10Wikimedia-Mailing-lists: Remove/ archive inspire@lists.wikimedia.org - https://phabricator.wikimedia.org/T126640#2125231 (10Capt_Swing) Hey @Krenair, @Aklapper, anything I can do to help resolve this request? [03:18:27] 6Operations, 10Wikimedia-Mailing-lists: Remove/ archive inspire@lists.wikimedia.org - https://phabricator.wikimedia.org/T126640#2125233 (10Krenair) Not sure if I have anything useful to add. 
It is in #operations so theoretically someone with ops rights should have read it by now. Some of those people do someth... [03:19:26] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Mar 16 03:19:26 UTC 2016 (duration 9m 41s) [03:19:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:41:46] !log rebooting elastic2014.codfw.wmnet for kernel upgrade [03:41:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:44:38] PROBLEM - Host elastic2014 is DOWN: PING CRITICAL - Packet loss = 100% [03:45:16] RECOVERY - Host elastic2014 is UP: PING OK - Packet loss = 0%, RTA = 36.28 ms [04:09:16] mutante: ping [04:16:37] PROBLEM - puppet last run on mw2105 is CRITICAL: CRITICAL: puppet fail [04:33:55] papual: 64 bytes from mutante [04:37:22] (03PS3) 10Dzahn: partman: New recipe for labstore200[3-4] Bug:T128764 [puppet] - 10https://gerrit.wikimedia.org/r/277709 (https://phabricator.wikimedia.org/T128764) (owner: 10Papaul) [04:37:37] (03PS4) 10Dzahn: partman: New recipe for labstore200[3-4] Bug:T128764 [puppet] - 10https://gerrit.wikimedia.org/r/277709 (https://phabricator.wikimedia.org/T128764) (owner: 10Papaul) [04:38:48] (03CR) 10Dzahn: [C: 032] partman: New recipe for labstore200[3-4] Bug:T128764 [puppet] - 10https://gerrit.wikimedia.org/r/277709 (https://phabricator.wikimedia.org/T128764) (owner: 10Papaul) [04:40:11] 6Operations, 10ops-codfw, 13Patch-For-Review: labstore2003-labstore2004 onsite setup task - https://phabricator.wikimedia.org/T128764#2125254 (10Dzahn) >>! In T128764#2123686, @RobH wrote: > Ok, we need to write a new partman recipe for this @Papaul has done that and i merged it. https://gerrit.wikimedia.o... [04:40:37] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. 
[04:42:25] !log rebooting elastic2015.codfw.wmnet for kernel upgrade [04:42:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:44:27] PROBLEM - Host elastic2015 is DOWN: PING CRITICAL - Packet loss = 100% [04:45:36] RECOVERY - puppet last run on mw2105 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [04:46:17] RECOVERY - Host elastic2015 is UP: PING OK - Packet loss = 0%, RTA = 36.30 ms [04:48:59] (03PS1) 10Dzahn: partman: use lvm-ext-srv.cfg for labstore200[3-4] [puppet] - 10https://gerrit.wikimedia.org/r/277715 (https://phabricator.wikimedia.org/T128764) [04:49:31] (03PS2) 10Dzahn: partman: use lvm-ext-srv.cfg for labstore200[3-4] [puppet] - 10https://gerrit.wikimedia.org/r/277715 (https://phabricator.wikimedia.org/T128764) [04:51:56] (03PS3) 10Dzahn: partman: use lvm-ext-srv.cfg for labstore200[3-4] [puppet] - 10https://gerrit.wikimedia.org/r/277715 (https://phabricator.wikimedia.org/T128764) [04:54:43] (03PS4) 10Dzahn: partman: use lvm-ext-srv.cfg for labstore200[3-4] [puppet] - 10https://gerrit.wikimedia.org/r/277715 (https://phabricator.wikimedia.org/T128764) [04:56:01] (03CR) 10Dzahn: [C: 032] partman: use lvm-ext-srv.cfg for labstore200[3-4] [puppet] - 10https://gerrit.wikimedia.org/r/277715 (https://phabricator.wikimedia.org/T128764) (owner: 10Dzahn) [05:01:22] (03PS1) 10Dzahn: add labstore200[3-4] to site.pp, like existing [1-2] [puppet] - 10https://gerrit.wikimedia.org/r/277716 (https://phabricator.wikimedia.org/T128764) [05:03:24] (03PS2) 10Dzahn: add labstore200[3-4] to site.pp, like existing [1-2] [puppet] - 10https://gerrit.wikimedia.org/r/277716 (https://phabricator.wikimedia.org/T128764) [05:04:21] (03PS3) 10Dzahn: add labstore200[3-4] to site.pp, like existing [1-2] [puppet] - 10https://gerrit.wikimedia.org/r/277716 (https://phabricator.wikimedia.org/T128764) [05:04:38] PROBLEM - puppet last run on mw1013 is CRITICAL: CRITICAL: Puppet has 1 failures [05:04:42] (03CR) 10Papaul: 
[C: 031 V: 031] add labstore200[3-4] to site.pp, like existing [1-2] [puppet] - 10https://gerrit.wikimedia.org/r/277716 (https://phabricator.wikimedia.org/T128764) (owner: 10Dzahn) [05:05:29] 6Operations, 10ops-codfw, 13Patch-For-Review: labstore2003-labstore2004 onsite setup task - https://phabricator.wikimedia.org/T128764#2125261 (10Dzahn) [05:11:44] (03CR) 10Dzahn: [C: 032] "going ahead, see the comment below, make a server capable of serving but doesn't activate, perfect." [puppet] - 10https://gerrit.wikimedia.org/r/277716 (https://phabricator.wikimedia.org/T128764) (owner: 10Dzahn) [05:31:57] RECOVERY - puppet last run on mw1013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [05:39:38] !log rebooting elastic2016.codfw.wmnet for kernel upgrade [05:39:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [05:42:28] PROBLEM - Host elastic2016 is DOWN: PING CRITICAL - Packet loss = 100% [05:42:57] RECOVERY - Host elastic2016 is UP: PING OK - Packet loss = 0%, RTA = 36.94 ms [06:03:37] AaronSchulz, division=54 :P [06:03:53] tail -n 10000 runJobs.log | grep division | wc -l [06:03:53] 27 [06:04:04] IT MULTIPLIES [06:07:27] PROBLEM - puppet last run on hassaleh is CRITICAL: CRITICAL: puppet fail [06:09:29] MaxSem: what's division mean? :) [06:10:30] !log rebooting elastic2017.codfw.wmnet for kernel upgrade [06:10:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:11:51] ebernhardson, https://gerrit.wikimedia.org/r/#/c/277681/1/includes/jobqueue/utils/BacklinkJobUtils.php [06:13:18] PROBLEM - Host elastic2017 is DOWN: PING CRITICAL - Packet loss = 100% [06:14:06] RECOVERY - Host elastic2017 is UP: PING OK - Packet loss = 0%, RTA = 36.57 ms [06:14:31] MaxSem: should grow large i imagine? iiuc how that works, basically you take the 7M backlinks to Module:Coordinates, and then it inserts jobs in batches with each 'continue' job being a division? 
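The batching described in the last message can be sketched as a toy model. This is only an illustration of the idea, not the MediaWiki implementation: the `Job` class, the `partition` and `run_to_completion` helpers, the batch size, and the rule that `division` is incremented on each derived range job are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Job:
    kind: str                       # "base", "range" or "leaf" (hypothetical labels)
    pages: Tuple[int, int]          # half-open range [start, end) of backlink indexes
    division: Optional[int] = None  # assumed: set only on derived range jobs


def partition(job: Job, batch_size: int) -> List[Job]:
    """Split a base/range job into leaf jobs for the first batch
    plus at most one range job covering the remainder."""
    start, end = job.pages
    cut = min(start + batch_size, end)
    derived = [Job("leaf", (i, i + 1)) for i in range(start, cut)]
    if cut < end:
        # The remainder becomes a new range job, one division deeper.
        derived.append(Job("range", (cut, end), division=(job.division or 0) + 1))
    return derived


def run_to_completion(total_backlinks: int, batch_size: int) -> Tuple[Dict[str, int], int]:
    """Run the toy queue until empty; return job counts per kind
    and the highest division number reached."""
    queue = [Job("base", (0, total_backlinks))]
    counts = {"base": 0, "range": 0, "leaf": 0}
    max_division = 0
    while queue:
        job = queue.pop()
        counts[job.kind] += 1
        if job.kind == "leaf":
            continue  # leaf jobs do the actual work and enqueue nothing
        max_division = max(max_division, job.division or 0)
        queue.extend(partition(job, batch_size))
    return counts, max_division


if __name__ == "__main__":
    print(run_to_completion(10, 3))
```

In this model a single base job yields one leaf job per backlink and roughly `total_backlinks / batch_size` range jobs, so the division number grows as the chain of range jobs advances. If the same page is enqueued as a base job many times, every one of those chains is repeated.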
[06:16:02] that would also be why we run 30M jobs/hr and not decrease the size of the queue. i suspect though that we somehow have multiple base jobs or something. I checked on fluorine and we've run refreshLinks for Module:Coordinates more than 100M times in the last 5 days [06:16:36] i wrote a script that's now iterating over one wiki's refreshLinks queue to classify the existing jobs per title into leaf/range/base [06:18:29] also, the extremely low number of jobs with division set indicates just how insane our backlog is [06:19:37] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 51.72% of data above the critical threshold [5000000.0] [06:19:41] will division also be set on the leaf jobs? [06:20:05] no, only for derived ones [06:27:50] well, at least for enwiki each job only has 1 or 0 base jobs, which is good. quasi-interestingly over half the enwiki jobs are for 4 pages [06:29:47] RECOVERY - Kafka Broker Replica Max Lag on kafka1014 is OK: OK: Less than 50.00% above the threshold [1000000.0] [06:31:57] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:38] RECOVERY - puppet last run on hassaleh is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [06:47:37] !log rebooting elastic2018.codfw.wmnet for kernel upgrade [06:47:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:49:58] PROBLEM - Host elastic2018 is DOWN: PING CRITICAL - Packet loss = 100% [06:50:57] RECOVERY - Host elastic2018 is UP: PING OK - Packet loss = 0%, RTA = 36.73 ms [06:57:36] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:08:30] 6Operations, 6Performance-Team, 13Patch-For-Review, 7Performance: Update HHVM package to recent release - https://phabricator.wikimedia.org/T119637#2125411 (10Joe) [07:08:32] 6Operations, 6Performance-Team, 7Performance: HHVM 3.12 has a race-condition when starting up 
- https://phabricator.wikimedia.org/T129467#2125410 (10Joe) 5Open>3Resolved [07:08:40] 6Operations, 6Performance-Team, 13Patch-For-Review, 7Performance: Update HHVM package to recent release - https://phabricator.wikimedia.org/T119637#1831764 (10Joe) 5Open>3Resolved [07:08:42] 6Operations, 10Deployment-Systems, 6Performance-Team, 6Release-Engineering-Team, 7HHVM: Translation cache exhaustion caused by changes to PHP code in file scope - https://phabricator.wikimedia.org/T103886#2125414 (10Joe) [07:10:10] 6Operations, 10Deployment-Systems, 6Performance-Team, 6Release-Engineering-Team, 7HHVM: Translation cache exhaustion caused by changes to PHP code in file scope - https://phabricator.wikimedia.org/T103886#1401646 (10Joe) Although we upgraded fleet-wide, I think the tests @ori did showed a great deal of in... [07:10:51] 6Operations, 10Deployment-Systems, 6Performance-Team, 6Release-Engineering-Team, 7HHVM: Translation cache exhaustion caused by changes to PHP code in file scope - https://phabricator.wikimedia.org/T103886#2125417 (10ori) https://github.com/facebook/hhvm/issues/6911 [07:27:37] (03CR) 10KartikMistry: "I agree that these are too many lines. However, we already fixed issue at https://phabricator.wikimedia.org/rOPUPca4af1cfdef53f4892b8f5458" [puppet] - 10https://gerrit.wikimedia.org/r/277463 (https://phabricator.wikimedia.org/T129849) (owner: 10KartikMistry) [07:48:00] 6Operations, 13Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Have a strategy to switch restbase to use services in the appropriate datacenter - https://phabricator.wikimedia.org/T126235#2125429 (10Joe) We've set up restbase to talk to all the local services in each datacenter for everythi... 
[07:48:11] 6Operations, 13Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Have a strategy to switch restbase to use services in the appropriate datacenter - https://phabricator.wikimedia.org/T126235#2125430 (10Joe) a:3Joe [07:48:18] 6Operations, 13Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Have a strategy to switch restbase to use services in the appropriate datacenter - https://phabricator.wikimedia.org/T126235#2008250 (10Joe) 5Open>3Resolved [07:48:20] 6Operations, 13Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Figure out and document the datacenter switchover process - https://phabricator.wikimedia.org/T124670#2125433 (10Joe) [07:49:27] RECOVERY - Apache HTTP on mw1129 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 615 bytes in 0.248 second response time [07:49:27] RECOVERY - HHVM rendering on mw1129 is OK: HTTP OK: HTTP/1.1 200 OK - 67495 bytes in 0.931 second response time [07:50:26] 6Operations, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Prepare citoid for the codfw switchover - https://phabricator.wikimedia.org/T125057#2125434 (10Joe) The cluster was installed in codfw, and the application is now properly configured there [07:50:44] 6Operations, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Prepare citoid for the codfw switchover - https://phabricator.wikimedia.org/T125057#2125435 (10Joe) 5Open>3Resolved a:3Joe [07:51:46] RECOVERY - HHVM jobrunner on mw1002 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.029 second response time [07:53:04] 6Operations, 10Mathoid, 6Services, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Prepare mathoid for the codfw switchover - https://phabricator.wikimedia.org/T125058#2125438 (10Joe) Mathoid is now installed and working correctly in codfw as well. 
[07:53:16] (03PS1) 10Jcrespo: Return s2 slaves to normal weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277718 [07:53:17] 6Operations, 10Mathoid, 6Services, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Prepare mathoid for the codfw switchover - https://phabricator.wikimedia.org/T125058#2125439 (10Joe) 5Open>3Resolved a:3Joe [07:54:05] (03PS2) 10Jcrespo: Return s2 slaves to normal weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277718 [07:54:40] 6Operations, 10Graphoid, 6Services, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Prepare graphoid for the codfw switchover - https://phabricator.wikimedia.org/T125060#2125442 (10Joe) Graphoid is now correctly configured to use local resources and to be used locally by restbase. [07:54:54] 6Operations, 10Graphoid, 6Services, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Prepare graphoid for the codfw switchover - https://phabricator.wikimedia.org/T125060#2125443 (10Joe) 5Open>3Resolved a:3Joe [07:55:28] 6Operations, 6Services, 3Mobile-Content-Service, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Prepare mobileapps for the codfw switchover - https://phabricator.wikimedia.org/T125061#2125446 (10Joe) a:3Joe [07:55:48] <_joe_> sorry for the spam, phab spring cleaning [08:01:25] (03CR) 10Jcrespo: "xtrabackups does a controlled recovery making sure it has everything on its log to redo all committed transactions and undo all uncommitte" [puppet] - 10https://gerrit.wikimedia.org/r/277640 (https://phabricator.wikimedia.org/T127991) (owner: 10Ottomata) [08:03:49] (03PS4) 10Giuseppe Lavagetto: mobileapps: point to $rb_route, not to the local restbase cluster [puppet] - 10https://gerrit.wikimedia.org/r/275538 [08:04:32] !log restarted mw1129/1002 (HHVM hung) [08:04:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:05:04] (03CR) 10Giuseppe Lavagetto: [C: 032] mobileapps: point to $rb_route, not to the local restbase cluster [puppet] - 
10https://gerrit.wikimedia.org/r/275538 (owner: 10Giuseppe Lavagetto) [08:17:09] (03PS5) 10Giuseppe Lavagetto: iegreview: use $::parsoid_site [puppet] - 10https://gerrit.wikimedia.org/r/275539 (https://phabricator.wikimedia.org/T125673) [08:18:18] (03CR) 10Giuseppe Lavagetto: [C: 032] iegreview: use $::parsoid_site [puppet] - 10https://gerrit.wikimedia.org/r/275539 (https://phabricator.wikimedia.org/T125673) (owner: 10Giuseppe Lavagetto) [08:35:38] PROBLEM - Host 208.80.153.51 is DOWN: PING CRITICAL - Packet loss = 100% [08:36:17] RECOVERY - Host 208.80.153.51 is UP: PING OK - Packet loss = 0%, RTA = 37.06 ms [08:38:17] PROBLEM - puppet last run on baham is CRITICAL: CRITICAL: puppet fail [08:41:19] 6Operations, 10DBA: Reduce memory commitment on database hosts with many objects, specially s3, dbstore/research and labs - https://phabricator.wikimedia.org/T107282#2125465 (10jcrespo) I've reduced, but not puppetized, because I am not 100% sure it will be ok, both innodb buffer pool and tokudb cache, because... 
[08:43:23] (03CR) 10Jcrespo: [C: 032] Return s2 slaves to normal weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277718 (owner: 10Jcrespo) [08:45:16] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Return s2 slaves to normal weight (duration: 00m 35s) [08:45:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:45:34] ^this will help with load issues due to queuing [08:47:16] PROBLEM - HHVM jobrunner on mw1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:52:24] <_joe_> uhm [08:52:37] <_joe_> this is happening a bit too often, let me look [08:53:19] !log rebooting labtest*2001 for kernel upgrade [08:53:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:53:59] <_joe_> !log restarting hhvm on mw1007, stuck in a deadlock, apparently on HPHP::Treadmill::getAgeOldestRequest [08:54:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:54:13] 6Operations, 10Wikimedia-Stream, 7user-notice: reboot of rcs servers (stream.wikimedia.org) - https://phabricator.wikimedia.org/T130024#2122965 (10Johan) When will this happen? [08:55:37] RECOVERY - HHVM jobrunner on mw1007 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.019 second response time [08:58:04] 6Operations, 10Wikimedia-Stream, 7user-notice: reboot of rcs servers (stream.wikimedia.org) - https://phabricator.wikimedia.org/T130024#2122965 (10Joe) I don't really get why we should keep being impeded in a normal course of operations that doesn't affect the operativity or uptime of rcstream because the cl... [09:04:12] !log rebooting stat1002/stat1003 for kernel upgrade [09:04:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:06:07] RECOVERY - puppet last run on baham is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:10:53] (03PS9) 10Elukey: Varnish 4 API porting. 
[software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/276439 (https://phabricator.wikimedia.org/T124278) [09:11:16] (03PS10) 10Elukey: Varnish 4 API porting. [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/276439 (https://phabricator.wikimedia.org/T124278) [09:16:12] (03CR) 10Elukey: "Added some checks after each VSLQ_Dispatch call to exit gracefully in case the shm handle is not valid anymore (for example when Varnish r" [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/276439 (https://phabricator.wikimedia.org/T124278) (owner: 10Elukey) [09:18:25] !log setting up pending cross-datacenter master-master database links [09:18:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:27:22] Hey mutante|away, sorry for yesterday, my status was not up-to-date [09:28:06] About deploy for aqs, I think I need to be able to do it, but I have not followed the deploy-with-scap3 thing closely enough, so I don't know what it involves [09:30:17] (03PS1) 10Alexandros Kosiaris: Introduce ores.svc.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/277725 (https://phabricator.wikimedia.org/T124202) [09:30:55] (03PS3) 10Giuseppe Lavagetto: parsoid::testing: use master_dc variables [puppet] - 10https://gerrit.wikimedia.org/r/275814 (https://phabricator.wikimedia.org/T124670) [09:31:28] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] parsoid::testing: use master_dc variables [puppet] - 10https://gerrit.wikimedia.org/r/275814 (https://phabricator.wikimedia.org/T124670) (owner: 10Giuseppe Lavagetto) [09:33:31] !log puppet disabled on analytics1027 as preparation step for the Hadoop nodes reboots [09:33:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:33:50] <_joe_> meh [09:36:19] (03PS1) 10Giuseppe Lavagetto: parsoid::rt_client: fix scope.lookupvar [puppet] - 10https://gerrit.wikimedia.org/r/277726 [09:37:08] PROBLEM - puppet last run on ruthenium is CRITICAL: 
CRITICAL: puppet fail [09:38:47] <_joe_> that's me ^^ [09:39:01] (03CR) 10Giuseppe Lavagetto: [C: 032] parsoid::rt_client: fix scope.lookupvar [puppet] - 10https://gerrit.wikimedia.org/r/277726 (owner: 10Giuseppe Lavagetto) [09:40:38] RECOVERY - puppet last run on ruthenium is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [09:44:17] 6Operations: setup/deploy oresrdb1001-oresrdb1002 - https://phabricator.wikimedia.org/T125562#2125571 (10akosiaris) [09:44:22] (03PS1) 10Giuseppe Lavagetto: role::apertium: check the local url, not the eqiad one [puppet] - 10https://gerrit.wikimedia.org/r/277727 [09:44:31] <_joe_> akosiaris: ^^ [09:47:06] 6Operations, 10Ops-Access-Requests: Grant Joseph and Dan deploy permissions on aqs100[1-3] - https://phabricator.wikimedia.org/T116169#2125579 (10JAllemandou) I don't know scap3 enough (or at all to be precise), but if no special right is needed for deploy with it, then I don't need any :) Thanks @Dzahn [10:02:08] 6Operations, 7Puppet, 10Salt, 13Patch-For-Review: Make it possible for wmf-reimage to work seamlessly with a non-local salt master - https://phabricator.wikimedia.org/T124761#2125627 (10Joe) I tested this, finally, yesterday: while the key gets correctly deleted, it seemed not able to find the key to sign... [10:03:20] 6Operations, 10ops-codfw: mw2066 to mw2074 don't reboot cleanly - https://phabricator.wikimedia.org/T130008#2125631 (10MoritzMuehlenhoff) Papaul updated the idrac firmware on mw2066 and I tested a reboot after that, which worked just fine. So that seems to fix it. [10:05:42] (03CR) 10Alexandros Kosiaris: [C: 032] role::apertium: check the local url, not the eqiad one [puppet] - 10https://gerrit.wikimedia.org/r/277727 (owner: 10Giuseppe Lavagetto) [10:08:44] 6Operations, 10ops-codfw: mw2066 to mw2074 don't reboot cleanly - https://phabricator.wikimedia.org/T130008#2125637 (10MoritzMuehlenhoff) mw2166 also hung on reboot and appears to need an idrac update. 
[10:13:15] 6Operations, 15User-mobrovac, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Create a service location / discovery system for locating local/master resources easily across all WMF applications - https://phabricator.wikimedia.org/T125069#2125639 (10Joe) In my opinion, we should create a system that can integr... [10:13:30] <_joe_> User-mobrovac? [10:13:43] :P [10:17:56] _joe_: i verified that rb is talking to parsoid, graphoid, mathoid and mobileapp in codfw correctly [10:21:47] !log Rebooting analytics103* for kernel upgrade [10:21:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:23:03] <_joe_> mobrovac: great! [10:27:27] 6Operations, 10DBA: Investigate/decom db2001-db2008 - https://phabricator.wikimedia.org/T125827#1998016 (10Volans) I'm editing this task because I'm taking db2008.codfw.wmnet back in usage for T130098 so it must not be decommissioned. [10:27:57] 6Operations, 10DBA: Investigate/decom db2001-db2007 - https://phabricator.wikimedia.org/T125827#2125672 (10Volans) [10:28:42] (03PS3) 10WMDE-leszek: Whitelist feeds included on Wikimedia Germany Engineering page on mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275815 (https://phabricator.wikimedia.org/T127176) [10:33:59] (03CR) 10Addshore: [C: 031] "https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=369298&oldid=369191" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275815 (https://phabricator.wikimedia.org/T127176) (owner: 10WMDE-leszek) [10:39:29] 6Operations, 10MediaWiki-JobQueue, 13Patch-For-Review: The refreshLinks jobs enqueue rate is 10 times the normal rate - https://phabricator.wikimedia.org/T129517#2125708 (10jcrespo) [10:41:40] 6Operations, 10MediaWiki-JobQueue, 13Patch-For-Review: The refreshLinks jobs enqueue rate is 10 times the normal rate - https://phabricator.wikimedia.org/T129517#2107949 (10hoo) >>! 
In T129517#2125116, @ori wrote: > One possible loop: > > `RefreshLinksJob::runForTitle()` → `ParserCache::singleton()->save()`... [10:49:13] 6Operations, 6Performance-Team, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Figure out how to migrate the jobqueues - https://phabricator.wikimedia.org/T124673#2125727 (10Joe) [10:49:15] 6Operations, 6Performance-Team, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Dedicate 1/2 codfw jobrunners to gwtoolset jobs - https://phabricator.wikimedia.org/T129317#2125725 (10Joe) 5stalled>3Open a:3Joe [10:49:56] 6Operations, 6Performance-Team, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Dedicate 1/2 codfw jobrunners to gwtoolset jobs - https://phabricator.wikimedia.org/T129317#2102170 (10Joe) On second thoughts, it's better not to risk shaking up the eqiad jobrunner config further at the moment, so I'm going to d... [10:50:20] (03PS1) 10Giuseppe Lavagetto: jobrunner: sync codfw configuration and use hiera regexes [puppet] - 10https://gerrit.wikimedia.org/r/277736 (https://phabricator.wikimedia.org/T129317) [10:50:33] 6Operations, 7Puppet, 10Salt, 13Patch-For-Review: Make it possible for wmf-reimage to work seamlessly with a non-local salt master - https://phabricator.wikimedia.org/T124761#2125734 (10ArielGlenn) OK, I'll double-check my end of the code again just in case. 
[10:50:59] !log rolling reboot of mw* servers in eqiad (except job runners to not interfere with T129517) [10:51:01] T129517: The refreshLinks jobs enqueue rate is 10 times the normal rate - https://phabricator.wikimedia.org/T129517 [10:51:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:51:49] !log rebooting analytics102[89] and analtics104* for kernel upgrades [10:51:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:59:56] 6Operations, 10MediaWiki-JobQueue, 13Patch-For-Review: The refreshLinks jobs enqueue rate is 10 times the normal rate - https://phabricator.wikimedia.org/T129517#2125762 (10hashar) Been looking on `frwiki` for uuid of claimed jobs via: ``` mwscript showJobs.php --wiki=frwiki --type=refreshLinks --status=clai... [11:02:35] !log Revoking puppet key for db2008.codfw.wmnet (puppet wasn't running already) T130098 [11:02:36] T130098: Create a new x1 slave in codfw (just in case) - https://phabricator.wikimedia.org/T130098 [11:02:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:08:04] (03CR) 10Giuseppe Lavagetto: [C: 032] jobrunner: sync codfw configuration and use hiera regexes [puppet] - 10https://gerrit.wikimedia.org/r/277736 (https://phabricator.wikimedia.org/T129317) (owner: 10Giuseppe Lavagetto) [11:11:49] (03PS1) 10Volans: DHCP: Add Jessie PXE for db2008 [puppet] - 10https://gerrit.wikimedia.org/r/277741 (https://phabricator.wikimedia.org/T130098) [11:12:21] (03PS2) 10Filippo Giunchedi: ganglia: remove aggregator from alsafi [puppet] - 10https://gerrit.wikimedia.org/r/277572 (owner: 10Dzahn) [11:12:27] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] ganglia: remove aggregator from alsafi [puppet] - 10https://gerrit.wikimedia.org/r/277572 (owner: 10Dzahn) [11:15:31] 6Operations, 10Ops-Access-Requests: Grant Joseph and Dan deploy permissions on aqs100[1-3] - https://phabricator.wikimedia.org/T116169#2125782 (10mobrovac) I would 
say you're good now if you are able to restart AQS and Cassandra and use `cqlsh`. [11:22:46] 6Operations, 6Performance-Team, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Figure out how to migrate the jobqueues - https://phabricator.wikimedia.org/T124673#2125795 (10Joe) [11:22:48] 6Operations, 6Performance-Team, 13Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Dedicate 1/2 codfw jobrunners to gwtoolset jobs - https://phabricator.wikimedia.org/T129317#2125794 (10Joe) 5Open>3Resolved [11:24:04] !log rebooting analytics105* for kernel upgrade [11:24:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:26:09] 6Operations: Make services manageable by systemd (tracking) - https://phabricator.wikimedia.org/T97402#1241261 (10fgiunchedi) one takeaway from {T124197} is how to handle multiple instances of the same service, like ganglia aggregator. Using systemd instances seems to work well via `base::service_unit`, one othe... [11:27:45] 6Operations, 6Services, 3Mobile-Content-Service, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Prepare mobileapps for the codfw switchover - https://phabricator.wikimedia.org/T125061#2125813 (10Joe) 5Open>3Resolved [11:36:08] (03CR) 10Mobrovac: [C: 04-1] "One minor nit left." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/277423 (owner: 10Thcipriani) [11:37:01] 6Operations, 10DBA, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Create a Master-master topology between datacenters for easier failover (setup circular replication dallas -> eqiad for mysql databases) - https://phabricator.wikimedia.org/T119642#2125844 (10jcrespo) 5Open>3Resolved All production machin... 
[11:37:48] (03CR) 10Mobrovac: [C: 031] Filter StatusLogger messages from UDP appender [puppet] - 10https://gerrit.wikimedia.org/r/277265 (https://phabricator.wikimedia.org/T128787) (owner: 10Eevans) [11:42:53] (03PS1) 10Giuseppe Lavagetto: zotero: simplify url-downloader url lookup in hiera [puppet] - 10https://gerrit.wikimedia.org/r/277749 (https://phabricator.wikimedia.org/T125065) [11:43:18] (03PS2) 10Giuseppe Lavagetto: zotero: simplify url-downloader url lookup in hiera [puppet] - 10https://gerrit.wikimedia.org/r/277749 (https://phabricator.wikimedia.org/T125065) [11:44:57] !log rebooting analytics1001/1002 (Yarn/HDFS master nodes) for kernel upgrade [11:45:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:46:42] (03CR) 10Giuseppe Lavagetto: [C: 032] "noop according to the compiler" [puppet] - 10https://gerrit.wikimedia.org/r/277749 (https://phabricator.wikimedia.org/T125065) (owner: 10Giuseppe Lavagetto) [11:48:41] (03PS1) 10Filippo Giunchedi: cassandra: bootstrap restbase1012-a [puppet] - 10https://gerrit.wikimedia.org/r/277750 [11:49:01] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: bootstrap restbase1012-a [puppet] - 10https://gerrit.wikimedia.org/r/277750 (owner: 10Filippo Giunchedi) [11:51:16] (03PS2) 10Volans: DHCP: Add Jessie PXE for db2008 [puppet] - 10https://gerrit.wikimedia.org/r/277741 (https://phabricator.wikimedia.org/T130098) [11:53:16] (03CR) 10Volans: [C: 032] DHCP: Add Jessie PXE for db2008 [puppet] - 10https://gerrit.wikimedia.org/r/277741 (https://phabricator.wikimedia.org/T130098) (owner: 10Volans) [11:55:01] <_joe_> hey you stop merging all those patches, I keep needing to rebase mines! 
[11:55:22] (03PS1) 10Giuseppe Lavagetto: citoid: simplify hiera lookup for the zotero host [puppet] - 10https://gerrit.wikimedia.org/r/277753 [11:56:47] _joe_: merge faster :P [11:57:15] godog: +1 :) [11:57:16] merge fast, break things [11:57:49] You need the oprah of merges, you get to merge, and you get to merge. Everybody gets to merge! [11:57:53] <_joe_> ok I'll merge $mw_primary eqiad => codfw and call it a day [11:58:28] (03CR) 10Giuseppe Lavagetto: [C: 032] "noop according to the compiler" [puppet] - 10https://gerrit.wikimedia.org/r/277753 (owner: 10Giuseppe Lavagetto) [11:58:32] now we're talking! [12:00:23] or I can merge mine: [12:01:16] (03PS1) 10Jcrespo: Make codfw db masters as the masters of all datacenters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277754 [12:02:33] fun fact! This change is a noop that only changes the comments^ [12:03:16] (03PS2) 10Jcrespo: Make codfw db masters as the masters of all datacenters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277754 (https://phabricator.wikimedia.org/T124699) [12:04:31] PROBLEM - Restbase root url on restbase1012 is CRITICAL: Connection refused [12:05:12] PROBLEM - cassandra-a CQL 10.64.32.202:9042 on restbase1012 is CRITICAL: Connection refused [12:05:32] PROBLEM - cassandra-a service on restbase1012 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [12:06:21] PROBLEM - puppet last run on restbase1012 is CRITICAL: CRITICAL: Puppet has 3 failures [12:06:41] PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.32.79, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [12:07:47] <_joe_> godog: that's you reimaging, right? [12:12:31] _joe_: yeah, damn icinga race.. 
silencing [12:12:57] 6Operations, 6Language-Engineering, 6Services, 13Patch-For-Review, and 2 others: Prepare cxserver/zotero for the codfw switchover - https://phabricator.wikimedia.org/T125065#2125919 (10Joe) 5Open>3Resolved [12:13:01] thanks [12:13:59] (03PS3) 10Jcrespo: Make codfw db masters as the masters of all datacenters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277754 (https://phabricator.wikimedia.org/T124699) [12:14:12] (03CR) 10Jcrespo: [C: 04-2] "Do not apply until actual codfw failover." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277754 (https://phabricator.wikimedia.org/T124699) (owner: 10Jcrespo) [12:14:22] RECOVERY - cassandra-a service on restbase1012 is OK: OK - cassandra-a is active [12:15:21] RECOVERY - puppet last run on restbase1012 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [12:19:37] 6Operations, 13Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Figure out and document the datacenter switchover process - https://phabricator.wikimedia.org/T124670#1962060 (10jcrespo) This is a duplicate, but I would merge T114398 into it, as this one has activity. 
[12:21:51] 6Operations, 10DBA, 13Patch-For-Review, 7Performance, and 2 others: Stress-test mediawiki application servers at codfw (specially to figure out db weights configuration) and basic buffer warming - https://phabricator.wikimedia.org/T124697#2125938 (10jcrespo) a:3jcrespo [12:22:10] 6Operations, 10DBA, 13Patch-For-Review, 7Performance, and 2 others: Stress-test mediawiki application servers at codfw (specially to figure out db weights configuration) and basic buffer warming - https://phabricator.wikimedia.org/T124697#1963095 (10jcrespo) p:5Triage>3High [12:36:54] !log bootstrapping restbase1012-a T125842 [12:36:56] T125842: normalize eqiad restbase cluster - replace restbase1001-1006 - https://phabricator.wikimedia.org/T125842 [12:36:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:39:04] 6Operations, 10CirrusSearch, 6Discovery, 3Discovery-Search-Sprint, and 4 others: Look into encrypting Elasticsearch traffic - https://phabricator.wikimedia.org/T124444#2125958 (10faidon) >>! In T124444#2124270, @EBernhardson wrote: > curl_init_pooled looks very interesting. Unfortunately it is new as of 3.... [12:39:24] !log restarting elasticsearch server elastic2019.codfw.wmnet [12:39:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:15:47] !log puppet re-enabled on analytics1027 [13:15:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:23:34] !log restarting elasticsearch server elastic2020.codfw.wmnet [13:23:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:36:40] Hello, do you remember my advice concerning the upgrade to git 2.7.1. 
I was wrong, It was big security mistake http://www.openwall.com/lists/oss-security/2016/03/16/9 [13:38:07] (03PS1) 10DCausse: Enable completion suggester as default on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277767 (https://phabricator.wikimedia.org/T128776) [13:38:51] ytrezq: don't worry, we've backported the fixes instead of upgrading to 2.7.1 [13:39:04] and thanks again for reaching out to us [13:53:42] 6Operations, 10Ops-Access-Requests: Requesting access to to analytics-search-user for Mikhail Popov and Oliver Keyes - https://phabricator.wikimedia.org/T129260#2126061 (10Ottomata) > cron jobs should be created by puppet rather than humans @dzahn, usually I agree with this. However, on stat1002/stat1003, thi... [13:58:49] 6Operations, 6Performance-Team: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2126064 (10elukey) ori: I would be super interested in working on it, but before starting it would be great to discuss what are the key metrics to check for the test (and the related e... [14:00:03] (03CR) 10Gehel: [C: 031] Enable completion suggester as default on dewiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277767 (https://phabricator.wikimedia.org/T128776) (owner: 10DCausse) [14:02:03] !log Reimaging db2008 to jessie T130098 [14:02:04] T130098: Create a new x1 slave in codfw (just in case) - https://phabricator.wikimedia.org/T130098 [14:02:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:02:20] 6Operations, 10Ops-Access-Requests: Requesting access to to analytics-search-user for Mikhail Popov and Oliver Keyes - https://phabricator.wikimedia.org/T129260#2126076 (10Ottomata) Also, for puppetization help, Discovery has an embedded opsen now! Should be a little easier! 
[14:03:28] !log cleanup snapshots on labstore1001 [14:03:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:04:42] !log restarting elasticsearch server elastic2021.codfw.wmnet [14:04:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:06:35] !log wasat signing puppet certs, salt-key, initial run [14:06:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:10:08] (03CR) 10Ottomata: [C: 031] "+1, I think bblack should give the final approval though." [puppet] - 10https://gerrit.wikimedia.org/r/277058 (https://phabricator.wikimedia.org/T129270) (owner: 10Alex Monk) [14:10:29] 6Operations, 10ops-codfw: mw2066 to mw2074 don't reboot cleanly - https://phabricator.wikimedia.org/T130008#2126094 (10Papaul) Ok Moritz I will work on updating the others systems. Thanks [14:10:53] 6Operations, 10ops-codfw: mw2066 to mw2074 don't reboot cleanly - https://phabricator.wikimedia.org/T130008#2126095 (10Papaul) p:5Triage>3Normal [14:11:53] 6Operations, 10Ops-Access-Requests: Requesting access to to analytics-search-user for Mikhail Popov and Oliver Keyes - https://phabricator.wikimedia.org/T129260#2099926 (10JanZerebecki) >>! In T129260#2126061, @Ottomata wrote: >> cron jobs should be created by puppet rather than humans > @dzahn, usually I agre... [14:13:23] 6Operations, 10Ops-Access-Requests: Requesting access to to analytics-search-user for Mikhail Popov and Oliver Keyes - https://phabricator.wikimedia.org/T129260#2126099 (10Ottomata) For the most part, but not if they are just making short term reports or experiments. Development of this type of stuff is actua... 
[14:15:53] 6Operations, 10ops-codfw: rack new mw maint host - wasat - https://phabricator.wikimedia.org/T129930#2126113 (10Papaul) [14:17:21] 6Operations, 10ops-codfw: rack new mw maint host - wasat - https://phabricator.wikimedia.org/T129930#2120304 (10Papaul) a:5Papaul>3Joe @Joe the installation is complete. [14:18:11] PROBLEM - MariaDB Slave SQL: s3 on dbstore2002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1030, Errmsg: Error Got error 122 Internal (unspecified) error in handler from storage engine Aria on query. Default database: bnwikisource. Query: INSERT /* ShortUrlUtils::encodeTitle 66.249.74.113 */ IGNORE INTO shorturls (su_id,su_namespace,su_title) VALUES (NULL,104,জ্ঞানযোগ_-_চতুর্থ_সং [14:18:42] !log labstore200[3-4}] - signing puppet certs, salt-key, initial run [14:18:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:19:11] from storage engine Aria? [14:19:15] that is new [14:22:50] 6Operations, 6Labs: overhaul labstore setup [tracking] - https://phabricator.wikimedia.org/T126083#2126125 (10Papaul) [14:22:53] 6Operations, 10ops-codfw, 6Labs: Figure out what labstore hardware is viable in codfw - https://phabricator.wikimedia.org/T128083#2126121 (10Papaul) 5Open>3Resolved Closing this since this is resolved in T128764 [14:24:10] PROBLEM - MariaDB Slave Lag: s3 on dbstore2002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 616.39 seconds [14:28:45] (03PS1) 10Faidon Liambotis: exim: add add/keep_environment to all mail hubs [puppet] - 10https://gerrit.wikimedia.org/r/277776 [14:29:12] 6Operations, 10ops-codfw, 13Patch-For-Review: labstore2003-labstore2004 onsite setup task - https://phabricator.wikimedia.org/T128764#2085044 (10Papaul) 5Open>3Resolved a:5Papaul>3chasemp @ Chase the installation is complete. 
[14:31:44] RECOVERY - puppet last run on analytics1026 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [14:35:13] PROBLEM - puppet last run on labstore2004 is CRITICAL: CRITICAL: Puppet has 1 failures [14:35:33] PROBLEM - puppet last run on labstore2003 is CRITICAL: CRITICAL: Puppet has 1 failures [14:35:53] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me. I'll prepare a similar patch for the -light exims in production after having updated them." [puppet] - 10https://gerrit.wikimedia.org/r/277776 (owner: 10Faidon Liambotis) [14:36:09] !log REPAIR NO_WRITE_TO_BINLOG TABLE bnwikisource.shorturls replica failed for corrupted table [14:36:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:38:09] 6Operations, 10ops-codfw: es2004 doesn't come back up after reboot - https://phabricator.wikimedia.org/T126203#2126174 (10Papaul) p:5Normal>3Low [14:48:13] RECOVERY - puppet last run on labstore2003 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [14:49:27] papaul: ^ nice :) [14:51:18] !log restarting elasticsearch server elastic2022.codfw.wmnet [14:51:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:52:21] mutante; thanks [14:52:29] 6Operations, 10CirrusSearch, 6Discovery, 3Discovery-Search-Sprint, and 4 others: Look into encrypting Elasticsearch traffic - https://phabricator.wikimedia.org/T124444#2126204 (10EBernhardson) oh sweet! 
[14:52:41] !log Reimporting table bnwikisource.shorturls into dbstore2002 as InnoDB [14:52:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:53:45] (03CR) 10JanZerebecki: [C: 031] "good for SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275815 (https://phabricator.wikimedia.org/T127176) (owner: 10WMDE-leszek) [14:55:17] RECOVERY - puppet last run on labstore2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:55:17] 6Operations, 6Services, 13Patch-For-Review, 7Service-Architecture: Set up monitoring automation for services - https://phabricator.wikimedia.org/T94821#2126206 (10Joe) p:5Normal>3Lowest [14:55:51] 6Operations, 6Language-Engineering, 6Services, 13Patch-For-Review, and 2 others: Prepare cxserver/zotero for the codfw switchover - https://phabricator.wikimedia.org/T125065#2126207 (10Joe) 5Open>3Resolved [15:00:04] anomie ostriches thcipriani marktraceur Krenair aude: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160316T1500). [15:00:04] Nikerabbit Addshore: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [15:00:24] i will do the swat [15:00:57] jzerebecki: if you want to .... [15:01:16] *waves* [15:01:37] aude: any idea if https://gerrit.wikimedia.org/r/#/c/277519/ is good to swat? [15:01:47] hi addshore :) [15:01:53] * aude looks [15:01:58] Nikerabbit, hoo: ^^ [15:02:12] jzerebecki: yes? [15:02:45] Nikerabbit: Did you check the implications of that? [15:03:05] (03CR) 10JanZerebecki: [C: 032] Whitelist feeds included on Wikimedia Germany Engineering page on mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275815 (https://phabricator.wikimedia.org/T127176) (owner: 10WMDE-leszek) [15:03:09] leszek_wmde its patch time :P [15:03:50] hoo: as far as I know it only fixes the bug. 
I don't see any way I could test this beforehand, but given nowiki already uses nb, it should be safe. Mostly worried if there are some unknown migration/caching issues. [15:03:54] (03Merged) 10jenkins-bot: Whitelist feeds included on Wikimedia Germany Engineering page on mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275815 (https://phabricator.wikimedia.org/T127176) (owner: 10WMDE-leszek) [15:03:58] jzerebecki: i think looks ok... this makes the sister projects consistent with nowiki [15:04:15] so think it would be ok [15:04:17] we can delay if you are unsure, but I have no idea how we could become more sure other than just doing [15:04:55] Well, you could look/ test what exactly changes due to that change [15:05:00] and then see whatever is using that [15:05:14] That may or may not be time consuming though [15:05:41] I can'T think of any obvious problems, but it's certainly not something we do often [15:05:56] anything checking wgLanguageCode directly (ie. what is seen in the bug) as opposed to $wgContLAng->getCode() which is normalised to nb already [15:07:20] 6Operations, 10Ops-Access-Requests: Grant Joseph and Dan deploy permissions on aqs100[1-3] - https://phabricator.wikimedia.org/T116169#2126241 (10Dzahn) 5Open>3Resolved Thank you all for clarification. I'm closing the ticket as resolved then. 
[15:07:29] !log jzerebecki@tin Synchronized wmf-config/InitialiseSettings.php: Whitelist feeds on mediawiki.org 622c186b2712c923fbcc48c27f65ebf396176f3e T127176 (duration: 00m 31s) [15:07:30] T127176: Include ATOM feed on WMDE Engineering page on mediawiki.org - https://phabricator.wikimedia.org/T127176 [15:08:01] addshore, leszek_wmde: please test [15:08:48] !log rebooting lithium for kernel upgrade [15:08:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:09:06] looks good [15:09:13] thx [15:09:46] !log restarting mysql on dbstore2002 Aria engine crashed [15:09:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:11:50] (03PS3) 10Thcipriani: Pass deploy user from service::node [puppet] - 10https://gerrit.wikimedia.org/r/277423 [15:12:24] (03CR) 10Faidon Liambotis: [C: 032] exim: add add/keep_environment to all mail hubs [puppet] - 10https://gerrit.wikimedia.org/r/277776 (owner: 10Faidon Liambotis) [15:13:40] jzerebecki: also confirming RSS whitelist patch works fine! [15:16:12] !log rebooting graphite2001 for kernel upgrade [15:16:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:17:08] (03PS11) 10Elukey: Varnish 4 API porting. [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/276439 (https://phabricator.wikimedia.org/T124278) [15:19:14] soo did we reach a conclusion? [15:21:28] Nikerabbit: I think we need a sites table change on every wiki to go with that change [15:22:40] jzerebecki: okay, let's postpone then, but could you add those to the task and CC people who need to be involved in that? We can then also announce in technews. [15:25:16] 6Operations, 10ops-eqiad, 10Dumps-Generation: Rack and setup snapshot1005-1007 - https://phabricator.wikimedia.org/T129553#2126316 (10ArielGlenn) I don't see any disks whatsoever detected on bootup. mdadm complains it can't find /dev/sda1 and I can't find /dev/sda with or without partitions. 
Are the disks... [15:25:42] !log rebooting graphite1001 for kernel upgrade [15:25:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:26:57] (03CR) 10JanZerebecki: [C: 04-1] "Needs DB change to sites and sites_identifiers to go with this." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277519 (https://phabricator.wikimedia.org/T126146) (owner: 10Nikerabbit) [15:27:24] PROBLEM - Host graphite1001 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:52] RECOVERY - Host graphite1001 is UP: PING OK - Packet loss = 0%, RTA = 0.57 ms [15:29:45] SWAT is done [15:30:46] thanks jzerebecki [15:34:45] greg-g: too bad I did not get to enjoy having SWAT window within my regular working hours thanks to DST layover ;) [15:35:15] that is, currently it is within my regular hours, but my patch did not get pushed [15:40:44] ah, I see [15:51:12] (03PS1) 10Giuseppe Lavagetto: Use wmfLocalServices for wgUploadThumbnailRenderHttpCustomDomain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277786 [15:54:26] 6Operations, 10Ops-Access-Requests: Requesting access to to analytics-search-user for Mikhail Popov and Oliver Keyes - https://phabricator.wikimedia.org/T129260#2126420 (10JanZerebecki) >>! In T129260#2126099, @Ottomata wrote: > For the most part, but not if they are just making short term reports or experimen... 
[15:54:54] PROBLEM - puppet last run on krypton is CRITICAL: CRITICAL: Puppet has 1 failures [15:55:10] (03CR) 10Ottomata: "I just tested using this to take an initial and several incremental backups, and then launched a mysql instance on another host using the " [puppet] - 10https://gerrit.wikimedia.org/r/277640 (https://phabricator.wikimedia.org/T127991) (owner: 10Ottomata) [15:56:03] PROBLEM - puppet last run on magnesium is CRITICAL: CRITICAL: Puppet has 1 failures [16:01:06] 6Operations, 10Ops-Access-Requests: Requesting access to to analytics-search-user for Mikhail Popov and Oliver Keyes - https://phabricator.wikimedia.org/T129260#2126433 (10Ironholds) Okay, I quit Friday. Could we please save discussion about long-term solutions for, you know, the long-term? Because not having... [16:01:18] 6Operations, 10Ops-Access-Requests: Requesting access to to analytics-search-user for Mikhail Popov and Oliver Keyes - https://phabricator.wikimedia.org/T129260#2126434 (10Ironholds) 5Open>3Resolved a:3Ironholds [16:01:26] mark: bluejeans links is dead [16:01:50] just put a new one in the calendar invite [16:01:55] ah, just updated, I see [16:01:58] AaronSchulz: query [16:02:15] !log restarting elasticsearch server elastic2023.codfw.wmnet [16:02:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:03:13] (03CR) 10EBernhardson: [C: 032] Enable completion suggester as default on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277767 (https://phabricator.wikimedia.org/T128776) (owner: 10DCausse) [16:03:44] (03Merged) 10jenkins-bot: Enable completion suggester as default on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277767 (https://phabricator.wikimedia.org/T128776) (owner: 10DCausse) [16:06:56] !log upgraded remaining exim binary packages on krypton to 4.84-8+deb8u2 (only exim4-daemon-heavy was at the version, resulting in the new add_environment variable being unknown and exim failing to start) [16:07:00] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:07:38] RECOVERY - puppet last run on krypton is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:11:08] RECOVERY - puppet last run on magnesium is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [16:11:10] !log upgraded exim packages on magnesium to the version from USN 2933, without it add_environment can't be set exim failed to start [16:11:13] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: Enable completion suggester as default on dewiki (duration: 00m 44s) [16:11:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:11:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:13:14] 6Operations, 10Ops-Access-Requests: Requesting access to to analytics-search-user for Mikhail Popov and Oliver Keyes - https://phabricator.wikimedia.org/T129260#2126451 (10Dzahn) Fair enough. Well.. the ticket already says that most requirements are met with the standard solution. And for the rest, it seems we... [16:13:45] (03CR) 1020after4: [C: 031] Use scap subcommands [puppet] - 10https://gerrit.wikimedia.org/r/277700 (owner: 10Thcipriani) [16:21:09] (03PS1) 10Ema: Port varnishreqstats and varnishstatsd to new VSL API [puppet] - 10https://gerrit.wikimedia.org/r/277790 (https://phabricator.wikimedia.org/T128788) [16:21:44] (03PS1) 10Krinkle: labs: Remove $wgPreloadJavaScriptMwUtil [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277791 [16:23:13] (03CR) 10Krinkle: [C: 032] labs: Remove $wgPreloadJavaScriptMwUtil [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277791 (owner: 10Krinkle) [16:24:17] 6Operations, 10Ops-Access-Requests: Grant Joseph and Dan deploy permissions on aqs100[1-3] - https://phabricator.wikimedia.org/T116169#2126469 (10Dzahn) >>! 
In T116169#2125782, @mobrovac wrote: > I would say you're good now if you are able to restart AQS and Cassandra and use `cqlsh`. joal is in the group aqs... [16:24:42] (03Merged) 10jenkins-bot: labs: Remove $wgPreloadJavaScriptMwUtil [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277791 (owner: 10Krinkle) [16:25:18] (03CR) 10Giuseppe Lavagetto: "AIUI, this would make the jobrunners submit thumb rendering requests to the local datacenter." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277786 (owner: 10Giuseppe Lavagetto) [16:25:50] (03PS1) 10Ema: varnishprocessor.py: remove unused import (varnishlog) [puppet] - 10https://gerrit.wikimedia.org/r/277794 [16:26:35] (03PS1) 10Volans: DB: Add db2008 to the x1 shard group [puppet] - 10https://gerrit.wikimedia.org/r/277795 (https://phabricator.wikimedia.org/T130098) [16:27:45] (03CR) 10Krinkle: "Yeah, I think it'd be easier to understand if we don't keep remote masters in comments. Especially since the policy is to point to local m" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277754 (https://phabricator.wikimedia.org/T124699) (owner: 10Jcrespo) [16:29:00] 6Operations, 10Ops-Access-Requests: Grant Joseph and Dan deploy permissions on aqs100[1-3] - https://phabricator.wikimedia.org/T116169#2126476 (10Dzahn) So it definitely let's you restart the services which this ticket was originally about. Maybe best if you try a deploy and if you find you need more permissi... 
[16:29:59] (03PS1) 10DCausse: Enable completion suggester as default on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277796 (https://phabricator.wikimedia.org/T128776) [16:30:33] 6Operations, 10Ops-Access-Requests, 15User-greg: Requesting access to production for SWAT deploy for dereckson - https://phabricator.wikimedia.org/T129365#2126505 (10Dzahn) 5Open>3stalled [16:30:39] (03PS6) 10KartikMistry: Enable non-default MT for some languages [puppet] - 10https://gerrit.wikimedia.org/r/277463 (https://phabricator.wikimedia.org/T129849) [16:31:08] 6Operations, 6Performance-Team, 7Availability, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Dig through logs from 15 Mar 2016 read-only test and file bugs - https://phabricator.wikimedia.org/T129973#2121627 (10greg) @ori: sure, I'll ask for some help (just fyi, I/we didn't get pinged since Phab doesn't a... [16:31:38] 6Operations, 6Performance-Team, 6Release-Engineering-Team, 7Availability, and 2 others: Dig through logs from 15 Mar 2016 read-only test and file bugs - https://phabricator.wikimedia.org/T129973#2126511 (10greg) [16:32:24] !log krinkle@tin Synchronized wmf-config/CommonSettings-labs.php: (no message) (duration: 00m 31s) [16:32:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:33:16] (03CR) 10EBernhardson: [C: 032] Enable completion suggester as default on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277796 (https://phabricator.wikimedia.org/T128776) (owner: 10DCausse) [16:33:45] (03Merged) 10jenkins-bot: Enable completion suggester as default on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277796 (https://phabricator.wikimedia.org/T128776) (owner: 10DCausse) [16:33:50] (03Abandoned) 10Ema: Preiliminary port to new VSL API [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/274135 (https://phabricator.wikimedia.org/T124278) (owner: 10Ema) [16:34:01] (03CR) 10Jcrespo: ""Policy" sounds too serious, I used it 
wrongly. :-) "Current agreed solution for now"." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277754 (https://phabricator.wikimedia.org/T124699) (owner: 10Jcrespo) [16:35:58] (03CR) 10Gilles: [C: 031] varnishprocessor.py: remove unused import (varnishlog) [puppet] - 10https://gerrit.wikimedia.org/r/277794 (owner: 10Ema) [16:36:10] 6Operations, 10Ops-Access-Requests: Grant Joseph and Dan deploy permissions on aqs100[1-3] - https://phabricator.wikimedia.org/T116169#2126526 (10JAllemandou) Sounds good, thanks @Dzahn. [16:36:16] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: Enable completion suggester as default on enwiki (duration: 00m 30s) [16:36:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:36:26] (03CR) 10Jcrespo: [C: 031] "Offtopc comment and //TODO. I set all of codfw to ROW for testing, we need to revert the master to STATEMENT on all dallas masters due to " [puppet] - 10https://gerrit.wikimedia.org/r/277795 (https://phabricator.wikimedia.org/T130098) (owner: 10Volans) [16:36:48] (03CR) 10Aaron Schulz: [C: 031] Use wmfLocalServices for wgUploadThumbnailRenderHttpCustomDomain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277786 (owner: 10Giuseppe Lavagetto) [16:37:23] (03CR) 10Jcrespo: [C: 04-1] "-1 because this requires extensive testing and gradual deploy." 
[puppet] - 10https://gerrit.wikimedia.org/r/272639 (https://phabricator.wikimedia.org/T127636) (owner: 10Volans) [16:39:07] (03CR) 10Volans: "I'll abandon for now, is literally 3 lines, we can redo it at anytime" [puppet] - 10https://gerrit.wikimedia.org/r/272639 (https://phabricator.wikimedia.org/T127636) (owner: 10Volans) [16:39:16] (03Abandoned) 10Volans: mariadb: Moved error logs to syslog [puppet] - 10https://gerrit.wikimedia.org/r/272639 (https://phabricator.wikimedia.org/T127636) (owner: 10Volans) [16:44:52] (03PS1) 10Giuseppe Lavagetto: cache::text: route all restbase traffic to codfw [puppet] - 10https://gerrit.wikimedia.org/r/277798 (https://phabricator.wikimedia.org/T127974) [16:45:35] <_joe_> gwicke, mobrovac ^^ [16:45:49] (03CR) 10Volans: [C: 032] DB: Add db2008 to the x1 shard group [puppet] - 10https://gerrit.wikimedia.org/r/277795 (https://phabricator.wikimedia.org/T130098) (owner: 10Volans) [16:46:15] _joe_: nicely simple ;) [16:46:16] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "To be merged and applied for the planned swictchover on 10:00 UTC, Mar 17th" [puppet] - 10https://gerrit.wikimedia.org/r/277798 (https://phabricator.wikimedia.org/T127974) (owner: 10Giuseppe Lavagetto) [16:46:23] <_joe_> gwicke: kudos to bblack for that [16:46:33] !log restarting elasticsearch server elastic2024.codfw.wmnet [16:46:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:46:38] indeed, he is missing part of the party [16:46:54] <_joe_> well, ema will be around for the switch :) [16:52:23] 6Operations, 10Wikimedia-Stream, 7user-notice: reboot of rcs servers (stream.wikimedia.org) - https://phabricator.wikimedia.org/T130024#2126611 (10Dzahn) @Joe Yea, i don't disagree, i have just been asked to announce it like last time. You are probably right about setting bad expectations. 
[16:52:30] (03PS1) 10Giuseppe Lavagetto: Use the local restbase cluster in codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277803 (https://phabricator.wikimedia.org/T127974) [16:52:32] (03PS1) 10Giuseppe Lavagetto: Switch temporarily eqiad to use the codfw restbase cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277804 (https://phabricator.wikimedia.org/T127974) [16:52:46] <_joe_> gwicke: here you are the mediawiki changes, sir! [16:54:12] interesting, things seem to have changed a bit in that config [16:54:20] also: thank you, sir ;) [16:56:26] (03PS3) 10Alex Monk: varnish: Fix puppet in deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/277058 (https://phabricator.wikimedia.org/T129270) [16:56:41] <_joe_> yeah in this case, it's on me :) [16:57:05] 'wmgRestbaseServer' => array( [16:57:06] 'default' => $wmfLocalServices['restbase'] [16:57:19] nice ;) [16:58:06] (03CR) 10GWicke: [C: 031] Switch temporarily eqiad to use the codfw restbase cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277804 (https://phabricator.wikimedia.org/T127974) (owner: 10Giuseppe Lavagetto) [16:59:55] 6Operations, 10Traffic, 6WMF-Communications, 7HTTPS, 7Security-Other: Server certificate is classified as invalid on government computers - https://phabricator.wikimedia.org/T128182#2126657 (10Florian) [17:00:18] (03PS1) 10ArielGlenn: Don't tar up deleted images for wikitech backups [puppet] - 10https://gerrit.wikimedia.org/r/277805 (https://phabricator.wikimedia.org/T129440) [17:00:56] (03CR) 10Jcrespo: "I'm ok with it, I just will not use it for production for time and performance issues (plus no check on potential corruption). But I am ok" [puppet] - 10https://gerrit.wikimedia.org/r/277640 (https://phabricator.wikimedia.org/T127991) (owner: 10Ottomata) [17:02:01] are there any known problems when an user wants to login? 
(I have an user who reports, that he get's an exception when he tries to login :/) [17:02:17] <_joe_> FlorianSW: no [17:02:35] <_joe_> or better, not that I know of [17:03:05] _joe_: hmm, thanks. can you look into the exception logs (maybe I can provide some information that helps you to find somethind?)? [17:03:37] <_joe_> FlorianSW: if no one else is around... it's 6 PM and I might use a break :P [17:04:27] <_joe_> mutante: around? [17:04:28] _joe_: oops, sorry :P I'll wait for someone other then :) Nevertherless, thanks for the help :) (Btw.: We're in the same time zone, it seems :P) [17:04:33] halfak: do you mind if i go ahead with Tim Landscheidt's patch and merge something that moves the ores role classes around? https://gerrit.wikimedia.org/r/#/c/270102/ i will make sure there are no changes, you will just see that instead of one file with all roles you now have them in a new place, one file per role [17:04:38] _joe_: yes [17:04:46] <_joe_> mutante: can you help FlorianSW ? [17:04:47] ok [17:04:51] <_joe_> thanks :) [17:05:45] mutante, _joe_ oha, if I open the CentralAuth page of the user, I get an exception, too :/ [17:06:02] <_joe_> FlorianSW: can you post it to me in query? [17:06:03] oh, eh, do you want to PM me the user or something? 
feel free [17:06:18] looks for the right place for the logs [17:06:33] FlorianSW: looks like a known bug (that was closed recently) [17:06:41] but maybe it's back again [17:07:02] (03PS1) 10DCausse: Enable completion suggester as default on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277806 (https://phabricator.wikimedia.org/T128776) [17:07:03] https://phabricator.wikimedia.org/T119736 [17:07:05] <_joe_> yeah sounds like a bug indeed :P [17:07:22] <_joe_> Glaisher: let me verify [17:07:36] (03CR) 10EBernhardson: [C: 032] Enable completion suggester as default on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277806 (https://phabricator.wikimedia.org/T128776) (owner: 10DCausse) [17:08:11] FlorianSW: but the exception isn't supposed to occur anymore when attempting to login [17:08:22] (03Merged) 10jenkins-bot: Enable completion suggester as default on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277806 (https://phabricator.wikimedia.org/T128776) (owner: 10DCausse) [17:09:12] Glaisher: hmm, ok, thanks for the info! The user changed the password 1 or 2 days ago and gets the exception since "today" [17:09:12] ok, so that "Exception encountered, of type "Exception" " is all over older phabricator tickets [17:09:19] !log krinkle@tin Synchronized php-1.27.0-wmf.17/extensions/Kartographer/extension.json: (no message) (duration: 00m 25s) [17:09:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:09:31] mutante: :P [17:09:53] maybe it's another bug but hard to know without the stack trace [17:09:55] is it https://phabricator.wikimedia.org/T103841 ? 
[17:10:04] <_joe_> Glaisher: it's that precisely [17:10:07] <_joe_> Could not find local user data for Sirlanz@dewikiquote [17:10:17] ah [17:10:18] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: Enable completion suggester as default on eswiki (duration: 00m 25s) [17:10:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:10:27] <_joe_> should we reopen it I guess? [17:10:37] so the root cause is probably still not fixed [17:10:39] yeah [17:10:42] _joe_: I can do and link the ticket [17:10:44] <_joe_> FlorianSW: do you already have a ticket? [17:10:53] We thought it was leftovers from the pre-SUL finalization [17:10:53] _joe_: phab or otrs? [17:10:54] <_joe_> I mean a phab task [17:11:08] no, not yet :) [17:11:18] <_joe_> so yeah just link it [17:11:22] <_joe_> and reopen the ticket [17:11:26] <_joe_> Glaisher: good find :) [17:11:41] I'll do, thanks mutante, _joe_ and Glaisher! :D [17:12:50] (03PS2) 10ArielGlenn: Don't tar up deleted images for wikitech backups [puppet] - 10https://gerrit.wikimedia.org/r/277805 (https://phabricator.wikimedia.org/T129440) [17:13:12] _joe_: what is your phab name? :/ [17:13:19] <_joe_> FlorianSW: Joe [17:13:29] ah, he is it :P Thanks :) [17:14:05] x1 is a new server setup, do not worry [17:15:01] (03CR) 10ArielGlenn: [C: 032] Don't tar up deleted images for wikitech backups [puppet] - 10https://gerrit.wikimedia.org/r/277805 (https://phabricator.wikimedia.org/T129440) (owner: 10ArielGlenn) [17:17:00] 6Operations, 10Wikimedia-Mailing-lists: Remove/ archive inspire@lists.wikimedia.org - https://phabricator.wikimedia.org/T126640#2126708 (10Capt_Swing) Cool. Thanks! I'll give it another week or two. 
[17:17:35] mutante, ^ [17:18:21] Krenair: ok, thankfully that is nowadays easy and not so horrible anymore [17:18:29] just runs a script [17:18:52] 6Operations, 10Wikimedia-Mailing-lists: Remove/ archive inspire@lists.wikimedia.org - https://phabricator.wikimedia.org/T126640#2126719 (10Dzahn) a:3Dzahn [17:20:18] (03CR) 10Filippo Giunchedi: [C: 031] Use wmfLocalServices for wgUploadThumbnailRenderHttpCustomDomain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277786 (owner: 10Giuseppe Lavagetto) [17:24:16] (03PS1) 10DCausse: Enable completion suggester as default on ru, fr, pt and itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277808 (https://phabricator.wikimedia.org/T128776) [17:25:06] (03CR) 10EBernhardson: [C: 032] Enable completion suggester as default on ru, fr, pt and itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277808 (https://phabricator.wikimedia.org/T128776) (owner: 10DCausse) [17:25:24] 6Operations, 10Wikimedia-Mailing-lists: Remove/ archive inspire@lists.wikimedia.org - https://phabricator.wikimedia.org/T126640#2126746 (10Dzahn) Hi Capt_Swing, this is done. If an admin logs in they wil lsee "Emergency moderation of all list traffic is enabled" and nobody can post but the archives are in pla... [17:25:39] 6Operations, 10Wikimedia-Mailing-lists: Remove/ archive inspire@lists.wikimedia.org - https://phabricator.wikimedia.org/T126640#2126747 (10Dzahn) 5Open>3Resolved [17:25:57] (03Merged) 10jenkins-bot: Enable completion suggester as default on ru, fr, pt and itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277808 (https://phabricator.wikimedia.org/T128776) (owner: 10DCausse) [17:27:13] mutante, maybe this is something to document on the ops clinic duty wikitech page? 
[17:27:31] or just the location to find all the useful scripts or something [17:28:16] https://wikitech.wikimedia.org/wiki/Mailman#Disable_or_re-enable_a_mailing_list [17:28:22] i'm linking it from clinic duty [17:28:28] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: Enable completion suggester as default on ru, fr, pt and itwiki (duration: 00m 28s) [17:28:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:30:04] legoktm: Respected human, time to deploy Enabling UrlShortener in read-only mode (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160316T1730). Please do the needful. [17:30:20] hello :D [17:30:33] I will start in a few minutes, just prepping patches [17:30:57] Krenair: done [17:31:04] PROBLEM - puppet last run on db2056 is CRITICAL: CRITICAL: puppet fail [17:31:16] ty [17:31:47] 6Operations, 10MediaWiki-JobQueue, 13Patch-For-Review: The refreshLinks jobs enqueue rate is 10 times the normal rate - https://phabricator.wikimedia.org/T129517#2126798 (10GWicke) [17:34:45] (03PS1) 10DCausse: Enable completion suggester as default on ja, zh, pl, ar and nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277810 (https://phabricator.wikimedia.org/T128776) [17:35:24] 6Operations, 10Wikimedia-Stream: reboot of rcs servers (stream.wikimedia.org) - https://phabricator.wikimedia.org/T130024#2126807 (10Dzahn) [17:36:46] (03PS1) 10Legoktm: Add UrlShortener to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277812 (https://phabricator.wikimedia.org/T108557) [17:36:48] (03PS1) 10Legoktm: Configure UrlShortener extension in read-only mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277813 (https://phabricator.wikimedia.org/T108557) [17:36:55] (03CR) 10EBernhardson: "since this is the last patch, lets just drop all the wiki => yes lines and leave it with the default and wikidata?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/277810 (https://phabricator.wikimedia.org/T128776) (owner: 10DCausse) [17:37:56] 6Operations, 10Traffic, 6WMF-Communications, 7HTTPS, 7Security-Other: Server certificate is classified as invalid on government computers - https://phabricator.wikimedia.org/T128182#2126811 (10Florian) Ok, we finally got someone, who uses Google Chrome and could take some screenshots from the certificate... [17:38:12] (03PS2) 10DCausse: Enable completion suggester as default on ja, zh, pl, ar and nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277810 (https://phabricator.wikimedia.org/T128776) [17:38:32] legoktm: just fyi we are pushing out one last patch for the comp suggest rollout in a sec [17:38:39] ok [17:38:55] (03CR) 10EBernhardson: [C: 032] Enable completion suggester as default on ja, zh, pl, ar and nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277810 (https://phabricator.wikimedia.org/T128776) (owner: 10DCausse) [17:41:14] (03Merged) 10jenkins-bot: Enable completion suggester as default on ja, zh, pl, ar and nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277810 (https://phabricator.wikimedia.org/T128776) (owner: 10DCausse) [17:43:00] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: Enable completion suggester as default on ja, zh, pl, ar and nlwiki (duration: 00m 29s) [17:43:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:44:25] legoktm: ok all done [17:44:32] 6Operations, 10MobileFrontend, 10Traffic, 3Reading-Web-Sprint-68-"Java and JavaScript are basically the same", and 4 others: Incorrect TOC and section edit links rendering in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getT... 
- https://phabricator.wikimedia.org/T124356#2126846 [17:44:40] thanks :) [17:45:46] 6Operations, 10MediaWiki-Interface, 10Traffic: Purge pages cached with mobile editlinks - https://phabricator.wikimedia.org/T125841#2126847 (10Jdlrobson) @ori @legoktm @tstarling I'm sure you understand this stuff far greater than I can. Could you let us know why the parser cache hook was reverted and whethe... [17:47:44] ummm [17:47:50] who has live hacks on tin? [17:47:59] jobqueue debugging? [17:49:12] legoktm: yes those are ori from yesterday. they are synced out already [17:49:46] ok [17:49:53] Should be committed, even if just locally to tin. [17:49:59] Otherwise they risk being lost. [17:49:59] it was preventing me from rebasing so I had to stash/pop them [17:50:03] ^ yes that :) [17:50:25] (03CR) 10Legoktm: [C: 032] Add UrlShortener to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277812 (https://phabricator.wikimedia.org/T108557) (owner: 10Legoktm) [17:51:18] (03Merged) 10jenkins-bot: Add UrlShortener to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277812 (https://phabricator.wikimedia.org/T108557) (owner: 10Legoktm) [17:52:19] !log legoktm@tin Started scap: Building l10n cache for UrlShortener - T108557 [17:52:20] T108557: Review and deploy UrlShortener extension to Wikimedia wikis - https://phabricator.wikimedia.org/T108557 [17:52:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:53:15] 6Operations, 6Project-Admins, 3DevRel-March-2016: Operations-related subprojects/tags reorganization - https://phabricator.wikimedia.org/T119944#2126905 (10faidon) >>! In T119944#2056247, @Aklapper wrote: >>>! In T119944#2044189, @faidon wrote: >>> **TO SORT OUT:** >>> * For #DC-Ops, tag along all of the #op... 
[17:54:04] PROBLEM - Disk space on tin is CRITICAL: DISK CRITICAL - free space: / 104 MB (0% inode=63%) [17:54:47] uhmmmmm [17:55:02] full [17:55:11] !log legoktm@tin scap aborted: Building l10n cache for UrlShortener - T108557 (duration: 02m 51s) [17:55:12] T108557: Review and deploy UrlShortener extension to Wikimedia wikis - https://phabricator.wikimedia.org/T108557 [17:55:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:55:28] I killed the scap [17:55:44] now 96% [17:55:58] ostriches: tin is running out of disk space? [17:56:04] !log rcs1001 - depool from rcstream service [17:56:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:56:20] 14GB on all homes [17:56:22] arg, the other one.. 1002 [17:56:28] legoktm: Ym yes. [17:56:32] * ostriches looks [17:56:44] RECOVERY - puppet last run on db2056 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [17:57:09] !log rcs1002 - the last message was about 1002 [17:57:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:59:09] !log rcs1002 - traffic graph flat in ganglia, reboot [17:59:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:00:07] legoktm: in /var/lib/l10nupdate => 8.0G caches, 3.1Gmediawiki [18:00:49] there are 247GB free on /srv [18:00:57] also all homes are 14GB [18:01:12] ostriches: uh, should we delete some of that? ^ [18:01:35] probably or move it to the big partition [18:01:46] killing scap helped because it writes on /tmp ? [18:01:49] that (deleting large l10nupdate) keeps coming back every once in a while [18:01:53] PROBLEM - Host rcs1002 is DOWN: PING CRITICAL - Packet loss = 100% [18:02:14] mutante: could we move it somewhere to /srv? 
[18:02:20] I think we can delete the old stuff like /var/lib/l10nupdate/caches/cache-1.27.0-wmf.12 [18:02:27] (03CR) 10Eevans: [C: 031] Increase purged entry point s-maxage from 12 to 48 hours [puppet] - 10https://gerrit.wikimedia.org/r/277112 (owner: 10GWicke) [18:02:50] volans: i think so, but also what legoktm said [18:03:02] i remember old stuff has been deleted before [18:03:14] it just keeps growing again [18:03:59] We prune them after they're 5 weeks old [18:04:14] RECOVERY - Host rcs1002 is UP: PING OK - Packet loss = 0%, RTA = 1.81 ms [18:04:16] Mainly because of static assets, but the general prune happens then [18:04:27] ah,so already has something automatic then, good [18:04:35] Er, l10n cache doesn't have static assets? [18:04:39] oldest is of 58 Feb [18:04:44] 5~8 [18:04:52] wmf.12 would get deleted next week then? can we delete it now? [18:04:54] legoktm: Well yeah, but we prune "old version cache shit" every 5 weeks. [18:04:57] ah [18:05:01] (or rather, prune them after age >=5w [18:05:12] More interesting is actually /var/lib/l10nupdate/mediawiki [18:05:18] I don't have a freaking clue why that exists ^ [18:05:27] !log repooling rcs1002 [18:05:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:06:03] ostriches: maybe because I killed it mid-l10n build? [18:06:14] ostriches: that's the clones of master that are used to gather the new l10n data each night [18:06:34] * ostriches chuckles [18:06:54] any objections to me deleting the wmf.12 cache and restarting the scap? [18:06:56] the nightly l10nudpate cron job updates that and then pushes the l10n changes into the cache files [18:07:02] legoktm: https://gerrit.wikimedia.org/r/#/c/277734/ [18:07:53] l10nupdate@tin /var/lib/l10nupdate/mediawiki/extensions (master)$ du -sh . [18:07:53] 2.6G . [18:07:53] l10nupdate@tin /srv/mediawiki-staging/php-1.27.0-wmf.17/extensions (wmf/1.27.0-wmf.17)$ du -sh . [18:07:53] 373M . 
[18:07:58] bd808: Hehehe ^ [18:08:10] There's a ton of wasted space on / [18:08:27] For all those extensions we don't deploy :P :P [18:09:01] !log deleting /var/lib/l10nupdate/caches/cache-1.27.0-wmf.12 on tin to free up some space [18:09:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:09:06] ostriches: *nod* [18:09:23] 2.8G free now, hopefully that's enough [18:09:24] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There are 2 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/). [18:09:41] !log legoktm@tin Started scap: Building l10n cache for UrlShortener - T108557 (try #2) [18:09:42] T108557: Review and deploy UrlShortener extension to Wikimedia wikis - https://phabricator.wikimedia.org/T108557 [18:09:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:10:12] ostriches: we also have 487M of apt cache on that host so ... yeah [18:10:52] !log depool rcs1001 [18:10:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:13:40] (03PS1) 10Sabya: Add support for running preached as a systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/277824 [18:14:59] !log rebooting rcs1001 [18:15:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:15:30] !log: tin: repacking all of the l10nupdate git clones in /var/lib/l10nupdate/mediawiki/* to free up some disk space on / [18:16:54] PROBLEM - Host rcs1001 is DOWN: PING CRITICAL - Packet loss = 100% [18:18:11] !log restarting elasticsearch server elastic1001.eqiad.wmnet [18:18:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:18:35] Hey I could still use a review of https://gerrit.wikimedia.org/r/#/c/268337/ which will make it easier to deploy new extensions [18:19:04] the other half of that is accepted in differential, waiting for many weeks now on that ^ to merge [18:19:34] PROBLEM - 
PyBal backends health check on lvs1002 is CRITICAL: PYBAL CRITICAL - streamlb_443 - Could not depool server rcs1002.eqiad.wmnet because of too many down!: streamlb6_443 - Could not depool server rcs1002.eqiad.wmnet because of too many down! [18:19:41] it solves a chicken-or-egg problem with the extension lists [18:19:53] RECOVERY - Host rcs1001 is UP: PING OK - Packet loss = 0%, RTA = 0.90 ms [18:19:56] Im getting this error QuotaExceededError QuotaExceededError in develeper tools on https://en.wikipedia.org/wiki/Sky_Go?veaction=edit [18:20:22] Using internet explorer and microsoft edge [18:20:30] o.O [18:20:34] I thought that only happened in Firefox [18:20:44] !log repooling rcs1001 [18:20:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:20:48] I wonder if Edge has the same broken quota handling that Firefox has [18:20:53] Nope i only recently got it [18:21:01] It never happend to me before [18:21:54] (03PS1) 10Mforns: Rsync browser reports to datasets.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/277826 (https://phabricator.wikimedia.org/T127326) [18:24:54] PROBLEM - PyBal backends health check on lvs1005 is CRITICAL: PYBAL CRITICAL - streamlb_443 - Could not depool server rcs1002.eqiad.wmnet because of too many down! [18:25:14] PROBLEM - PyBal backends health check on lvs1011 is CRITICAL: PYBAL CRITICAL - streamlb_80 - Could not depool server rcs1001.eqiad.wmnet because of too many down!: streamlb6_80 - Could not depool server rcs1002.eqiad.wmnet because of too many down! 
[18:27:55] (03CR) 10Ottomata: [C: 032] Rsync browser reports to datasets.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/277826 (https://phabricator.wikimedia.org/T127326) (owner: 10Mforns) [18:29:23] RECOVERY - Disk space on tin is OK: DISK OK [18:32:38] the PyBal checks there should recover soon [18:32:42] both rcs backends are repooled [18:32:44] ACKNOWLEDGEMENT - Restbase root url on restbase1012 is CRITICAL: Connection refused Filippo Giunchedi bootstrapping [18:32:44] ACKNOWLEDGEMENT - cassandra-a CQL 10.64.32.202:9042 on restbase1012 is CRITICAL: Connection refused Filippo Giunchedi bootstrapping [18:32:44] ACKNOWLEDGEMENT - restbase endpoints health on restbase1012 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.32.79, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) Filippo Giunchedi bootstrapping [18:33:12] and the service is up [18:35:24] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [18:37:05] (03CR) 10Faidon Liambotis: [C: 04-1] Add ferm rules for DNS auth servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/277258 (owner: 10Muehlenhoff) [18:39:05] (03PS1) 10Dereckson: Grammar fix for Bosnian [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277835 (https://phabricator.wikimedia.org/T130141) [18:39:11] (03PS1) 10Ppchelko: Set up outgoing request filter config. [puppet] - 10https://gerrit.wikimedia.org/r/277836 [18:41:30] (03PS2) 10Ppchelko: Set up outgoing request filter config. 
[puppet] - 10https://gerrit.wikimedia.org/r/277836 [18:42:12] (03PS2) 10Dereckson: Grammar fix for Bosnian [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277835 (https://phabricator.wikimedia.org/T130141) [18:42:14] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [18:42:29] (03PS1) 10Bmansurov: Remove Language Overlay experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277837 (https://phabricator.wikimedia.org/T129274) [18:43:32] (03CR) 10GWicke: [C: 031] Set up outgoing request filter config. [puppet] - 10https://gerrit.wikimedia.org/r/277836 (owner: 10Ppchelko) [18:44:01] (03PS3) 10Ottomata: Add mysql_wmf::mylvmbackup define, use this for backups of analytics-meta mysql instance [puppet] - 10https://gerrit.wikimedia.org/r/277640 (https://phabricator.wikimedia.org/T127991) [18:45:52] (03CR) 10Eevans: [C: 031] Set up outgoing request filter config. [puppet] - 10https://gerrit.wikimedia.org/r/277836 (owner: 10Ppchelko) [18:47:32] (03PS4) 10Ottomata: Add mysql_wmf::mylvmbackup define, use this for backups of analytics-meta mysql instance [puppet] - 10https://gerrit.wikimedia.org/r/277640 (https://phabricator.wikimedia.org/T127991) [18:47:47] 6Operations, 10Monitoring, 10Pybal: pybal backends health check streamlb could depool server - https://phabricator.wikimedia.org/T130143#2127105 (10fgiunchedi) [18:48:05] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: puppet fail [18:48:28] I have a couple of simple RESTBase config changes to deploy (https://gerrit.wikimedia.org/r/#/c/277112/ and https://gerrit.wikimedia.org/r/#/c/277836/), is there anyone with +2 I could persuade help me out? [18:48:44] _joe_, ema maybe?? 
^^^ [18:48:52] (03CR) 10jenkins-bot: [V: 04-1] Add mysql_wmf::mylvmbackup define, use this for backups of analytics-meta mysql instance [puppet] - 10https://gerrit.wikimedia.org/r/277640 (https://phabricator.wikimedia.org/T127991) (owner: 10Ottomata) [18:49:23] I'm looking at pybal on lvs1002 btw, icinga alerts are not going away [18:51:44] config-master shows both enabled, http://config-master.wikimedia.org/conftool/eqiad/stream [18:51:44] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:55:03] (03Abandoned) 10Dereckson: Grammar fix for Bosnian [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277835 (https://phabricator.wikimedia.org/T130141) (owner: 10Dereckson) [18:56:53] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:57:54] (03PS5) 10Ottomata: Add mysql_wmf::mylvmbackup define, use this for backups of analytics-meta mysql instance [puppet] - 10https://gerrit.wikimedia.org/r/277640 (https://phabricator.wikimedia.org/T127991) [18:59:30] (03CR) 10jenkins-bot: [V: 04-1] Add mysql_wmf::mylvmbackup define, use this for backups of analytics-meta mysql instance [puppet] - 10https://gerrit.wikimedia.org/r/277640 (https://phabricator.wikimedia.org/T127991) (owner: 10Ottomata) [18:59:32] anyways the service seems to be up also according to https://prometheus.wmflabs.org/grafana/dashboard/db/http-s-tcp-probes it isn't flapping [19:00:04] twentyafterfour: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160316T1900). Please do the needful. 
[19:00:13] eek [19:00:18] twentyafterfour: I'm still scapping :( [19:01:04] RECOVERY - MariaDB Slave SQL: s3 on dbstore2002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [19:01:36] !log restart pybal on lvs1011 T130143 [19:01:37] T130143: pybal backends health check streamlb could depool server - https://phabricator.wikimedia.org/T130143 [19:01:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:02:31] !log restarted mysql on dbstore2002 and all replicas except x1, still investigating T130128 [19:02:32] T130128: Fix dbstore2002 - https://phabricator.wikimedia.org/T130128 [19:02:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:03:34] 6Operations, 10Monitoring, 10Pybal: pybal backends health check streamlb could depool server - https://phabricator.wikimedia.org/T130143#2127176 (10fgiunchedi) restarted pybal on standby lvs1011 and it seems to have cleared the error: ``` lvs1011:~$ sudo service pybal stop lvs1011:~$ ps fwuax | grep -i pyba... [19:04:12] (03PS1) 10Eevans: restbase1012.eqiad.wmnet: enable instance 'b' [puppet] - 10https://gerrit.wikimedia.org/r/277843 (https://phabricator.wikimedia.org/T125842) [19:04:13] http://bethgittings.tumblr.com/post/24618245386 ^ [19:06:16] (03PS6) 10Ottomata: Add mysql_wmf::mylvmbackup define, use this for backups of analytics-meta mysql instance [puppet] - 10https://gerrit.wikimedia.org/r/277640 (https://phabricator.wikimedia.org/T127991) [19:06:51] legoktm: ok let me know when you're done [19:07:08] [12:05:15] not sure if i'm in the right place.. i'm getting a 502 when trying to access the stream at http://stream.wikimedia.org/rc [19:07:09] [12:05:24] is this service still available? [19:07:14] godog, mutante: ^ (via -tech) [19:07:43] twentyafterfour: Would we be able to create a redirect script so if phabricator fails we use git.wikimedia.org as a second solution primaryly for reviewing open patches. 
[19:07:49] https://phabricator.wikimedia.org/T110607 [19:08:09] !log restarting elasticsearch server elastic1002.eqiad.wmnet [19:08:11] (03CR) 10jenkins-bot: [V: 04-1] Add mysql_wmf::mylvmbackup define, use this for backups of analytics-meta mysql instance [puppet] - 10https://gerrit.wikimedia.org/r/277640 (https://phabricator.wikimedia.org/T127991) (owner: 10Ottomata) [19:08:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:08:15] (03CR) 10Eevans: [C: 04-1] "I'm -1ing this until the Puppet SWAT tomorrow (2016-03-17); This requires someone at the ready to start a bootstrap, when the changeset is" [puppet] - 10https://gerrit.wikimedia.org/r/277843 (https://phabricator.wikimedia.org/T125842) (owner: 10Eevans) [19:08:32] godog: :) [19:08:36] legoktm: thanks! [19:08:59] paladox: no, the entire point of the redirects is to allow us to kill git.wikimedia.org asap [19:09:11] twentyafterfour: Oh ok [19:09:39] (03PS7) 10Ottomata: Add mysql_wmf::mylvmbackup define, use this for backups of analytics-meta mysql instance [puppet] - 10https://gerrit.wikimedia.org/r/277640 (https://phabricator.wikimedia.org/T127991) [19:10:12] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 789 [19:10:14] 6Operations, 10Monitoring, 10Pybal: pybal backends health check streamlb could depool server - https://phabricator.wikimedia.org/T130143#2127199 (10fgiunchedi) lvs1002 and lvs1005 (standby) also report the error, all of class `high-traffic2` together with `lvs1011` and `lvs1008` [19:10:42] paladox: I'd rather put the effort into setting up a backup phabricator instance [19:10:46] (03CR) 10jenkins-bot: [V: 04-1] Add mysql_wmf::mylvmbackup define, use this for backups of analytics-meta mysql instance [puppet] - 10https://gerrit.wikimedia.org/r/277640 (https://phabricator.wikimedia.org/T127991) (owner: 10Ottomata) [19:11:21] !log regenerating puppet SSL certificates for elasticsearch 
codfw cluster [19:11:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:11:28] Ok [19:11:50] (03PS8) 10Ottomata: Add mysql_wmf::mylvmbackup define, use this for backups of analytics-meta mysql instance [puppet] - 10https://gerrit.wikimedia.org/r/277640 (https://phabricator.wikimedia.org/T127991) [19:12:34] twentyafterfour: Maybe we can update diffusion upstream that we add a config for to enable replicating refs/changes [19:12:43] since they hard code refs/head [19:12:47] !log legoktm@tin Finished scap: Building l10n cache for UrlShortener - T108557 (try #2) (duration: 63m 05s) [19:12:48] T108557: Review and deploy UrlShortener extension to Wikimedia wikis - https://phabricator.wikimedia.org/T108557 [19:12:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:12:57] (03PS9) 10Ottomata: Add mysql_wmf::mylvmbackup define, use this for backups of analytics-meta mysql instance [puppet] - 10https://gerrit.wikimedia.org/r/277640 (https://phabricator.wikimedia.org/T127991) [19:13:06] woo. [19:13:15] paladox: better to discuss this in #wikimedia-devtools [19:13:15] 63 min :( [19:13:25] twentyafterfour ok [19:13:42] (03CR) 10Legoktm: [C: 032] Configure UrlShortener extension in read-only mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277813 (https://phabricator.wikimedia.org/T108557) (owner: 10Legoktm) [19:14:16] (03Merged) 10jenkins-bot: Configure UrlShortener extension in read-only mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277813 (https://phabricator.wikimedia.org/T108557) (owner: 10Legoktm) [19:14:21] !log aaron@tin Synchronized php-1.27.0-wmf.17/includes/jobqueue/Job.php: 3da38ce (duration: 00m 37s) [19:14:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:14:38] all clear for the train? 
[19:14:52] twentyafterfour: I have one config patch left [19:14:56] ok [19:15:12] RECOVERY - check_mysql on lutetium is OK: Uptime: 104269 Threads: 2 Questions: 1345218 Slow queries: 2372 Opens: 32424 Flush tables: 2 Open tables: 64 Queries per second avg: 12.901 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [19:15:46] !log legoktm@tin Synchronized wmf-config/InitialiseSettings.php: Configure UrlShortener extension in read-only mode (1/2) (duration: 00m 28s) [19:15:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:16:03] RECOVERY - PyBal backends health check on lvs1011 is OK: PYBAL OK - All pools are healthy [19:16:10] (03PS10) 10Ottomata: Add mysql_wmf::mylvmbackup define, use this for backups of analytics-meta mysql instance [puppet] - 10https://gerrit.wikimedia.org/r/277640 (https://phabricator.wikimedia.org/T127991) [19:16:30] 6Operations, 10ArchCom-RfC, 6Commons, 10MediaWiki-File-management, and 13 others: Use content hash based image / thumb URLs & define an official thumb API - https://phabricator.wikimedia.org/T66214#2127221 (10GWicke) > What problem would such a VCL rule solve? We need both a good long-term API design, as... 
[19:16:48] !log legoktm@tin Synchronized wmf-config/CommonSettings.php: Configure UrlShortener extension in read-only mode (2/2) (duration: 00m 26s) [19:16:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:17:33] RECOVERY - PyBal backends health check on lvs1005 is OK: PYBAL OK - All pools are healthy [19:18:09] !log created urlshortcodes table on wikishared db for UrlShortener [19:18:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:18:42] twentyafterfour: ok, all done :) [19:18:54] RECOVERY - PyBal backends health check on lvs1002 is OK: PYBAL OK - All pools are healthy [19:19:01] (03PS11) 10Ottomata: Add mysql_wmf::mylvmbackup define, use this for backups of analytics-meta mysql instance [puppet] - 10https://gerrit.wikimedia.org/r/277640 (https://phabricator.wikimedia.org/T127991) [19:19:07] 6Operations, 10Wikimedia-Stream: redis not up after reboot on rcs machines - https://phabricator.wikimedia.org/T130147#2127222 (10fgiunchedi) [19:24:17] 6Operations, 10Wikimedia-Stream: redis not up after reboot on rcs machines - https://phabricator.wikimedia.org/T130147#2127258 (10fgiunchedi) also the service was down during this period as reported by users ``` 19:05 not sure if i'm in the right place.. i'm getting a 502 when trying to access the s... [19:24:21] 6Operations, 10Monitoring, 10Pybal: pybal backends health check streamlb could depool server - https://phabricator.wikimedia.org/T130143#2127260 (10fgiunchedi) also note that according to {T130147} the service was actually down [19:24:28] (03PS12) 10Ottomata: Add mysql_wmf::mylvmbackup define, use this for backups of analytics-meta mysql instance [puppet] - 10https://gerrit.wikimedia.org/r/277640 (https://phabricator.wikimedia.org/T127991) [19:27:02] 6Operations, 10Wikimedia-Stream: reboot of rcs servers (stream.wikimedia.org) - https://phabricator.wikimedia.org/T130024#2127265 (10Johan) Well, for several reasons? 
a) Because if anti-vandalism tools stop working, that's a real problem for the Wikimedians who depend on them but have never touched the code i... [19:27:06] (03CR) 10Ottomata: [C: 032] Add mysql_wmf::mylvmbackup define, use this for backups of analytics-meta mysql instance [puppet] - 10https://gerrit.wikimedia.org/r/277640 (https://phabricator.wikimedia.org/T127991) (owner: 10Ottomata) [19:28:22] 6Operations, 10ArchCom-RfC, 6Commons, 10MediaWiki-File-management, and 13 others: Use content hash based image / thumb URLs & define an official thumb API - https://phabricator.wikimedia.org/T66214#2127267 (10GWicke) [19:29:32] (03PS1) 10Ottomata: Include rsync::server in role::analytics_cluster::database::meta::backup_dest [puppet] - 10https://gerrit.wikimedia.org/r/277852 (https://phabricator.wikimedia.org/T127991) [19:29:49] (03CR) 10Ottomata: [C: 032 V: 032] Include rsync::server in role::analytics_cluster::database::meta::backup_dest [puppet] - 10https://gerrit.wikimedia.org/r/277852 (https://phabricator.wikimedia.org/T127991) (owner: 10Ottomata) [19:30:13] 6Operations, 10ArchCom-RfC, 6Commons, 10MediaWiki-File-management, and 13 others: Use content hash based image / thumb URLs & define an official thumb API - https://phabricator.wikimedia.org/T66214#1800774 (10GWicke) [19:31:29] !log restart pybal on lvs1005 T130143 [19:31:30] T130143: pybal backends health check streamlb could depool server - https://phabricator.wikimedia.org/T130143 [19:31:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:32:35] !log Deploying 1.27.0-wmf.17 to group1 wikis [19:32:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:32:41] 6Operations, 10Monitoring, 10Pybal: pybal backends health check streamlb could depool server - https://phabricator.wikimedia.org/T130143#2127285 (10fgiunchedi) I'll leave lvs1002 alone for diagnostic purposes but a simple `service pybal restart` fixes it [19:33:29] (03PS2) 10DCausse: 
Enable ICU Folding on greek wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277477 (https://phabricator.wikimedia.org/T129502) [19:34:52] 6Operations, 10Wikimedia-Stream: reboot of rcs servers (stream.wikimedia.org) - https://phabricator.wikimedia.org/T130024#2127305 (10Johan) Or maybe I misunderstand what you're referring to? If so, my apologies. (: [19:35:32] (03PS1) 10Ottomata: Add ferm rule for rsync daemon on analytics1002 in role::analytics_cluster::database::meta::backup_dest [puppet] - 10https://gerrit.wikimedia.org/r/277853 (https://phabricator.wikimedia.org/T127991) [19:36:03] PROBLEM - PyBal backends health check on lvs1002 is CRITICAL: PYBAL CRITICAL - streamlb_443 - Could not depool server rcs1001.eqiad.wmnet because of too many down!: streamlb6_80 - Could not depool server rcs1002.eqiad.wmnet because of too many down!: streamlb_80 - Could not depool server rcs1001.eqiad.wmnet because of too many down!: streamlb6_443 - Could not depool server rcs1002.eqiad.wmnet because of too many down! [19:36:31] uh [19:36:32] twentyafterfour@tin:/srv/mediawiki-staging$ git commit [19:36:34] error: insufficient permission for adding an object to repository database .git/objects [19:36:36] error: Error building trees [19:36:45] twentyafterfour: A root touched it? [19:37:10] permissions look ok to me :-/ [19:37:39] twentyafterfour: g.odog touched the SAL last. 
[19:37:52] (03CR) 10Ottomata: [C: 032] Add ferm rule for rsync daemon on analytics1002 in role::analytics_cluster::database::meta::backup_dest [puppet] - 10https://gerrit.wikimedia.org/r/277853 (https://phabricator.wikimedia.org/T127991) (owner: 10Ottomata) [19:38:39] drwxrwxr-x 2 root mwdeploy 4096 Mar 11 15:10 96 [19:38:42] PROBLEM - PyBal backends health check on lvs1011 is CRITICAL: PYBAL CRITICAL - streamlb_443 - Could not depool server rcs1001.eqiad.wmnet because of too many down!: streamlb6_443 - Could not depool server rcs1001.eqiad.wmnet because of too many down!: streamlb6_80 - Could not depool server rcs1001.eqiad.wmnet because of too many down! [19:38:42] PROBLEM - PyBal backends health check on lvs1005 is CRITICAL: PYBAL CRITICAL - streamlb_443 - Could not depool server rcs1002.eqiad.wmnet because of too many down!: streamlb6_443 - Could not depool server rcs1001.eqiad.wmnet because of too many down!: streamlb_80 - Could not depool server rcs1001.eqiad.wmnet because of too many down!: streamlb6_80 - Could not depool server rcs1002.eqiad.wmnet because of too many down! [19:39:20] godog: ^ :/ [19:39:34] can someone with root please `chgrp -R wikidev /srv/mediawiki-staging/.git/` ? [19:39:44] host please [19:39:48] tin [19:40:12] PROBLEM - check_recurring_gc_failures_missed on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=1746 [critical =325] [19:40:14] done [19:40:19] apergos: thank you! [19:40:22] yw [19:41:16] (03PS1) 1020after4: group1 wikis to 1.27.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277854 [19:41:48] (03CR) 1020after4: [C: 032] group1 wikis to 1.27.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277854 (owner: 1020after4) [19:42:22] !log rcs1001 - starting redis-server [19:42:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:42:28] what the.. 
[19:42:36] (03Merged) 10jenkins-bot: group1 wikis to 1.27.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277854 (owner: 1020after4) [19:42:58] !log rcs1002 - starting redis-server [19:43:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:43:12] RECOVERY - PyBal backends health check on lvs1002 is OK: PYBAL OK - All pools are healthy [19:43:54] RECOVERY - PyBal backends health check on lvs1011 is OK: PYBAL OK - All pools are healthy [19:44:02] RECOVERY - PyBal backends health check on lvs1005 is OK: PYBAL OK - All pools are healthy [19:44:36] mutante: should I hold off on the train and wait for redis errors to die down? [19:45:12] PROBLEM - check_recurring_gc_failures_missed on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=1746 [critical =325] [19:45:13] twentyafterfour: no [19:45:24] go ahead as normal [19:45:48] that was just related to rcstream [19:45:54] https://logstash.wikimedia.org/#/dashboard/elasticsearch/fatalmonitor-group1 doesn't look pretty [19:46:31] (03PS1) 10Ottomata: Move hiera for role::analytics_cluster::database::meta::backup::dest to eqiad.yaml [puppet] - 10https://gerrit.wikimedia.org/r/277855 [19:46:38] arg, but yes, that's it [19:47:12] PROBLEM - puppet last run on elastic2022 is CRITICAL: CRITICAL: Puppet has 10 failures [19:47:28] yes, so i can start redis-server and then it is stopped again [19:47:30] shortly after [19:47:43] (03PS2) 10Ottomata: Move hiera for role::analytics_cluster::database::meta::backup::dest to eqiad.yaml [puppet] - 10https://gerrit.wikimedia.org/r/277855 [19:47:51] :( [19:48:10] it's still alive on rcs1001 [19:49:10] [7890] 16 Mar 19:48:16.807 * The server is now ready [19:50:08] Notice: /Stage[main]/Redis/Service[redis-server]/ensure: ensure changed 'running' to 'stopped' [19:50:12] PROBLEM - check_recurring_gc_failures_missed on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=1746 [critical =325] [19:50:13] whhhaat [19:50:20] puppet is 
_stopping_ the service [19:50:23] (03CR) 10Ottomata: [C: 032] Move hiera for role::analytics_cluster::database::meta::backup::dest to eqiad.yaml [puppet] - 10https://gerrit.wikimedia.org/r/277855 (owner: 10Ottomata) [19:50:43] that's not "unpuppetized" that's "anti-puppetized" [19:51:34] PROBLEM - puppet last run on analytics1015 is CRITICAL: CRITICAL: puppet fail [19:51:55] 6Operations, 10ops-eqiad, 10Dumps-Generation: Rack and setup snapshot1005-1007 - https://phabricator.wikimedia.org/T129553#2127369 (10Cmjohnson) All 3 have installed [19:52:05] 6Operations, 10ArchCom-RfC, 6Commons, 10MediaWiki-File-management, and 13 others: Use content hash based image / thumb URLs & define an official thumb API - https://phabricator.wikimedia.org/T66214#2127370 (10GWicke) [19:52:15] 6Operations, 10Wikimedia-Stream: redis not up after reboot on rcs machines - https://phabricator.wikimedia.org/T130147#2127222 (10Dzahn) actually .. puppet actively STOPS the service.. shortly after you started it it was down again, i started it again., ran puppet,.. and: Notice: /Stage[main]/Redis/Service[r... 
[19:52:42] 6Operations, 10ArchCom-RfC, 6Commons, 10MediaWiki-File-management, and 13 others: Use content hash based image / thumb URLs & define an official thumb API - https://phabricator.wikimedia.org/T66214#1811983 (10GWicke) [19:53:24] RECOVERY - puppet last run on analytics1015 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [19:54:27] !log rcs1001 - starting redis, disabling puppet (T130147) [19:54:28] T130147: redis not up after reboot on rcs machines - https://phabricator.wikimedia.org/T130147 [19:54:29] 6Operations, 10ArchCom-RfC, 6Commons, 10MediaWiki-File-management, and 13 others: Use content hash based image / thumb URLs & define an official thumb API - https://phabricator.wikimedia.org/T66214#2127384 (10GWicke) [19:54:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:55:12] PROBLEM - check_recurring_gc_failures_missed on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=1746 [critical =325] [19:56:29] twentyafterfour: check that page again you pasted [19:56:56] (03PS1) 10Ottomata: Use direct hiera lookup instead of parameter in role::analytics_cluster::database::meta::backup [puppet] - 10https://gerrit.wikimedia.org/r/277857 [19:57:31] !log regenerating puppet SSL certificates for elasticsearch eqiad cluster [19:57:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:58:03] mutante: looks much better [19:58:29] 6Operations, 10ArchCom-RfC, 6Commons, 10MediaWiki-File-management, and 13 others: Use content hash based image / thumb URLs & define an official thumb API - https://phabricator.wikimedia.org/T66214#2127391 (10GWicke) I have updated the task description to more clearly separate out use cases and requirement... [19:58:36] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.27.0-wmf.17 [19:58:37] twentyafterfour: i started redis and stopped puppet. 
i have no idea yet _why_ it does that [19:58:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:59:43] ACKNOWLEDGEMENT - check_recurring_gc_failures_missed on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=1746 [critical =325] Jeff_Green awight backfilled some data, will decrease on its own [19:59:45] (03CR) 10Ottomata: [C: 032] Use direct hiera lookup instead of parameter in role::analytics_cluster::database::meta::backup [puppet] - 10https://gerrit.wikimedia.org/r/277857 (owner: 10Ottomata) [20:00:04] gwicke cscott arlolra subbu bearND mdholloway: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160316T2000). Please do the needful. [20:00:13] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [20:00:21] no mobileapps deploy today [20:03:13] * twentyafterfour wonders why does unmerged changes on mira keep popping up [20:03:43] 6Operations, 10ArchCom-RfC, 6Commons, 10MediaWiki-File-management, and 13 others: Use content hash based image / thumb URLs & define an official thumb API - https://phabricator.wikimedia.org/T66214#2127396 (10GWicke) [20:03:54] (03PS2) 10Jhobs: Remove Language Overlay experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277837 (https://phabricator.wikimedia.org/T129274) (owner: 10Bmansurov) [20:04:01] (03CR) 10Jhobs: [C: 031] Remove Language Overlay experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277837 (https://phabricator.wikimedia.org/T129274) (owner: 10Bmansurov) [20:04:32] 6Operations, 10ArchCom-RfC, 6Commons, 10MediaWiki-File-management, and 13 others: Use content hash based image / thumb URLs & define an official thumb API - https://phabricator.wikimedia.org/T66214#1842437 (10GWicke) [20:05:17] 6Operations, 10ops-eqiad, 10Dumps-Generation: Rack and 
setup snapshot1005-1007 - https://phabricator.wikimedia.org/T129553#2127401 (10Cmjohnson) [20:05:23] twentyafterfour: mira did not get synced yet for some reason [20:05:27] it does not have [20:05:32] group1 wikis to 1.27.0-wmf.17 [20:05:33] yet [20:05:47] 6Operations, 10ops-eqiad, 10Dumps-Generation: Rack and setup snapshot1005-1007 - https://phabricator.wikimedia.org/T129553#2108980 (10Cmjohnson) 5Open>3Resolved @ArielGlenn finished puppet certs and salt-keys. Racking and install is complete. Resolving the task [20:06:32] !log mira - git pull in mw-staging [20:06:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:06:45] twentyafterfour: i pulled on mira, that got it [20:06:51] (03PS1) 10Ottomata: mylvmbackup prebackup hook needs to be executable [puppet] - 10https://gerrit.wikimedia.org/r/277860 (https://phabricator.wikimedia.org/T127991) [20:07:03] now tin and mira have the same git history [20:07:54] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. 
[20:07:57] (03CR) 10Ottomata: [C: 032 V: 032] mylvmbackup prebackup hook needs to be executable [puppet] - 10https://gerrit.wikimedia.org/r/277860 (https://phabricator.wikimedia.org/T127991) (owner: 10Ottomata) [20:11:46] (03PS1) 10Tim Landscheidt: Tools: Fix Puppet error [puppet] - 10https://gerrit.wikimedia.org/r/277862 (https://phabricator.wikimedia.org/T128411) [20:13:04] RECOVERY - puppet last run on elastic2022 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:13:47] (03PS1) 10Ottomata: Use flock to make sure this only ever runs one mylvmbackup at a time [puppet] - 10https://gerrit.wikimedia.org/r/277863 (https://phabricator.wikimedia.org/T127991) [20:14:46] (03CR) 10Ottomata: [C: 032 V: 032] Use flock to make sure this only ever runs one mylvmbackup at a time [puppet] - 10https://gerrit.wikimedia.org/r/277863 (https://phabricator.wikimedia.org/T127991) (owner: 10Ottomata) [20:20:03] (03PS1) 10Ottomata: Change rsync flags for mylvmbackup to -rt [puppet] - 10https://gerrit.wikimedia.org/r/277864 (https://phabricator.wikimedia.org/T127991) [20:20:06] maybe sync-wikiversions doesn't do sync-masters... [20:20:08] hmm [20:20:46] (03CR) 10Ottomata: [C: 032 V: 032] Change rsync flags for mylvmbackup to -rt [puppet] - 10https://gerrit.wikimedia.org/r/277864 (https://phabricator.wikimedia.org/T127991) (owner: 10Ottomata) [20:22:03] PROBLEM - puppet last run on elastic1016 is CRITICAL: CRITICAL: Puppet has 23 failures [20:25:10] (03CR) 10Tim Landscheidt: "AFAICT from the mails I received this morning, exim4 has already been upgraded (probably by unattended-upgrades that is part of base::labs" [puppet] - 10https://gerrit.wikimedia.org/r/277776 (owner: 10Faidon Liambotis) [20:26:24] 6Operations, 10Monitoring, 10Pybal: pybal backends health check streamlb could depool server - https://phabricator.wikimedia.org/T130143#2127105 (10Joe) @fgiunchedi pybal doesn't report any alarm anymore on lvs1002... 
[20:27:24] RECOVERY - puppet last run on elastic1016 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [20:27:59] 6Operations, 10Wikimedia-Stream: redis not up after reboot on rcs machines - https://phabricator.wikimedia.org/T130147#2127222 (10Joe) that's because you should not start redis-server, but the specific instance for its port [20:28:16] <_joe_> mutante: what's up with rcs100*? [20:28:30] <_joe_> anyways, I'm going out for a beer, I commented on the tickets [20:28:57] _joe_: when puppet runs, it stops redis-server [20:29:07] <_joe_> and read my comment [20:29:13] <_joe_> it's not redis-server you should start [20:29:54] ok, then the other other redis-service did not come back up after boot [20:29:58] looks again [20:31:00] (03PS1) 10Eevans: [WIP]: write cassandra instance yaml descriptors [puppet] - 10https://gerrit.wikimedia.org/r/277865 [20:31:20] 6Operations, 10Wikimedia-Stream: redis not up after reboot on rcs machines - https://phabricator.wikimedia.org/T130147#2127449 (10Joe) The correct command would've been 'service redis-instance-tcp_6379 start'; now everything is in an incosistent state, since the service is running I'd try not to touch this now... [20:32:45] (03CR) 10jenkins-bot: [V: 04-1] [WIP]: write cassandra instance yaml descriptors [puppet] - 10https://gerrit.wikimedia.org/r/277865 (owner: 10Eevans) [20:35:09] ok, i won't touch it then and didn't before [20:37:05] (03PS1) 10Tim Landscheidt: Tools: Add add_environment/keep_environment to exim configurations [puppet] - 10https://gerrit.wikimedia.org/r/277866 [20:38:21] (03CR) 10Tim Landscheidt: "For Tools done with I48d7aaaeeee1ad7fbdb80afa476f87df784fcbbf." [puppet] - 10https://gerrit.wikimedia.org/r/277776 (owner: 10Faidon Liambotis) [20:39:04] (03CR) 10Tim Landscheidt: "Tested in Toolsbeta with "sudo /etc/cron.daily/exim4-base" (the command that caused the cron mails)." 
[puppet] - 10https://gerrit.wikimedia.org/r/277866 (owner: 10Tim Landscheidt) [20:43:00] 6Operations, 10Wikimedia-Stream: reboot of rcs servers (stream.wikimedia.org) - https://phabricator.wikimedia.org/T130024#2127468 (10Dzahn) Separate from the announce discussion, reboots have happened earlier today right before you added the last 2 comments. [20:49:36] (03CR) 10Krinkle: "May 1st, 2016?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/217858 (https://phabricator.wikimedia.org/T122933) (owner: 10Faidon Liambotis) [20:56:23] 6Operations, 10Wikimedia-Stream: reboot of rcs servers (stream.wikimedia.org) - https://phabricator.wikimedia.org/T130024#2127504 (10Dzahn) We had/have the following issues: T130147, T130143 [20:57:18] 6Operations, 10Wikimedia-Stream: reboot of rcs servers (stream.wikimedia.org) - https://phabricator.wikimedia.org/T130024#2127509 (10Dzahn) 5Open>3Resolved a:3Dzahn i'm closing the ticket as resolved, not to stop the discussion about announcing it, but because the reboots are technically done and the fol... [20:58:48] mobrovac: any updates re: mathosphere? [21:17:11] twentyafterfour: It seems some website that dosent look to be good is caching and copying phabricator.wikimedia.org users. [21:17:12] See http://jice.ddns.net:808/0/?url=L3hvZGFsYVAvcC9ncm8uYWlkZW1pa2l3LnJvdGFjaXJiYWhwLy9BMyVzcHR0aA== [21:17:18] I found that on google [21:17:52] A Chinese website. [21:18:15] !log restarting elasticsearch server elastic1003.eqiad.wmnet [21:18:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:18:50] 6Operations, 10MobileFrontend, 10Traffic, 3Reading-Web-Sprint-68-"Java and JavaScript are basically the same", and 4 others: Incorrect TOC and section edit links rendering in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getT... 
- https://phabricator.wikimedia.org/T124356#2127555 [21:19:07] twentyafterfour: It also views the login page for phabricator http://jice.ddns.net:808/0/?url=L2ZsZXNBMyVwYWRsL25pZ29sL2h0dWEvZ3JvLmFpZGVtaWtpdy5yb3RhY2lyYmFocC8vQTMlc3B0dGg= [21:19:15] which can compromise someones account [21:22:17] Is it a proxy? [21:22:23] I think we've seen something like this before [21:25:50] Reedy not sure [21:26:00] mutante: ^ Do you remember seeing something like that before [21:26:07] But security risk since it looks like a fraud and can steal your passwords. [21:26:18] Reedy: Ive opened https://phabricator.wikimedia.org/T130156 [21:27:05] Reedy: yes, not this one, but a similar one [21:27:15] mutante: remember what we did in that case? [21:27:15] What did we do about it? Can you remember [21:27:28] i dont remember any action we took [21:27:43] I found that site by doing something like this https://www.google.co.uk/#q=php+composer+on+nodepool&start=20 [21:27:51] yeah, there's not much we can do aside from blocking their IP, but then they can just change it [21:28:40] The fact it's ddns.net suggests it's dynamic dns [21:28:58] touche [21:29:09] greg-g: Yeh but woulden they use web crawlers to copy there data since there getting second by second of data recent data. [21:29:29] paladox: what are you suggesting to do? [21:29:40] Someone could do with checking if it's proxying [21:29:56] greg-g: Im not sure but if we can block google crawlers would it be posible todo it with the rest. 
[21:30:00] I mean, it's obvious from the domain it's not the official [21:30:12] we aren't going to block google crawling phab [21:31:40] greg-g: No i doint mean google, im saying if google can be blocked by others can we use that toward blocking other websites [21:31:41] paladox: if the bots dont want to listen to robots.txt, they just won't listen [21:31:57] heh [21:31:58] mutante: oh [21:32:29] "Despite the use of the terms "allow" and "disallow", the protocol is purely advisory.[15] and relies on the compliance of the web robot. " [21:32:37] the only tool is ip-blocking, which is easy to get around [21:32:37] we have found a few of those "mirrors" or reverse proxy versions over time [21:33:52] chasemp: Did we ever do anything? [21:34:23] Could we do something like http://www.inmotionhosting.com/support/website/security/block-unwanted-users-from-your-site-using-htaccess#block-by-user-agent [21:34:23] Block by referer [21:34:36] For example [21:34:37] RewriteEngine On [21:34:37] RewriteCond %{HTTP_REFERER} example\.com [NC] [21:34:37] RewriteRule .* - [F] [21:34:38] Reedy: no I do recall twentyafterfour forwarding an instance to legal previously but nothing came of it to my knoweldge [21:35:53] Hi ops folks, is there a way to find out which machine an alias is pointing to in prod? I'm trying to figure out what's going on with the master-slave setup for the analytics mysql stores [21:36:35] nslookup? [21:36:41] madhuvishy: Have you tried "host db1003.eqiad.wmnet" for example? [21:37:03] RoanKattouw: Reedy no I haven't - thanks will do [21:37:19] http://www.htaccess-guide.com/deny-visitors-by-referrer/ [21:37:48] greg-g Reedy mutante ^^^ [21:38:46] I wonder if we could put a message on the login page that checks the hostname, and if it doesn't come from phab.* it chucks up a warning message [21:40:15] csteipp: Any suggestions from a security pov? 
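Note on the robots.txt point above ("the protocol is purely advisory"): a minimal sketch of why that is, using Python's standard `urllib.robotparser`. The rules and the `example.org` URLs are made up for illustration and are not Phabricator's actual robots.txt. A well-behaved crawler asks the parser before fetching; a scraper or proxy that never asks is not stopped by anything.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, for illustration only.
rules = [
    "User-agent: *",
    "Disallow: /secret/",
]

parser = RobotFileParser()
parser.parse(rules)

# A compliant bot checks before each request and skips disallowed paths.
blocked = parser.can_fetch("PoliteBot", "https://example.org/secret/page")
allowed = parser.can_fetch("PoliteBot", "https://example.org/public")

# A non-compliant client simply never calls can_fetch(); the server
# cannot tell the difference from robots.txt alone.
print(blocked, allowed)
```

The check happens entirely on the client side, which is why IP or host blocking keeps coming up as the only enforcement tool.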
[21:40:41] madhuvishy, host, nslookup, or literally just reading the dns repository [21:41:10] keep in mind p858snake and paladox many if not most of the examples you find are apache 2.2 whereas we are running apache2.4 [21:41:15] you would be best off looking here https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/modules/phabricator/templates/phabbanlist.conf.erb [21:41:30] chasemp oh thanks [21:42:07] blocking by referrer is probably not terribly difficult but I haven't tested it [21:42:09] chasemp https://httpd.apache.org/docs/2.4/rewrite/access.html [21:42:49] Krenair: thanks - I'm seeing what it's pointing to etc - but having trouble figuring out how the entire setup is configured - for example m4-master is an alias to dbproxy1004 - is this in puppet? because i can't find it [21:43:47] no [21:43:50] dns is a separate repo [21:43:50] chasemp maybe Denying Hosts in a Blacklist [21:43:53] aah [21:43:55] dns [21:44:00] you want operations/dns.git instead of operations/puppet.git [21:44:10] specifically templates/wmnet [21:44:18] right [21:44:22] cool, thanks [21:44:35] madhuvishy: we use haproxy between some hosts and their db [21:44:47] Reedy: We can threaten legally for copyright violation, but there's nothing you can really do. That's why users need to be trained to know what https is, and browsers need to present valid domains well. [21:44:49] so the "real" backend is in something like /etc/haproxy or something on teh proxy host [21:46:02] chasemp we can block the domain so it makes it harder for them to change it since they would need a new domain since we wont be using a ip to detect but an website address [21:46:08] Reedy: We can always drop a popin banner at the top of our site, "^ does your address bar say phabricator.wikimedia.org? if not, someone is stealing your passwords..." 
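Note on the "check the hostname and warn" idea above: a rough sketch of the comparison such a banner would need. The function name `is_official_host` and the default domain argument are assumptions for illustration, not anything that exists in Phabricator. It only shows the string logic (exact match or a real subdomain), so look-alike names are rejected.

```python
def is_official_host(hostname: str, official_domain: str = "wikimedia.org") -> bool:
    """Return True if hostname is official_domain or one of its subdomains."""
    # Drop an optional port, surrounding whitespace, case, and a trailing dot.
    # (IPv6 literals are not handled in this sketch.)
    host = hostname.strip().lower().split(":", 1)[0].rstrip(".")
    # Require a dot boundary so "evilwikimedia.org" does not match.
    return host == official_domain or host.endswith("." + official_domain)


print(is_official_host("phabricator.wikimedia.org"))
print(is_official_host("jice.ddns.net:808"))
```

As noted in the discussion, a proxy that rewrites pages could strip such a check, so this would be a speed bump rather than a defence.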
[21:46:12] madhuvishy: http://git.wikimedia.org/blob/operations%2Fdns.git/master/templates%2Fwikimedia.org [21:46:33] csteipp ^^ [21:46:41] madhuvishy: eh, http://git.wikimedia.org/blob/operations%2Fdns.git/master/templates%2Fwmnet this [21:46:51] Just to see if the other site scrapes that and shows it too. For the lolz :) [21:47:05] mutante: yup, was just looking through that! Thanks :) [21:47:13] chasemp: right.. hmmm [21:47:36] !log deployed patch for T123071 [21:47:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:49:41] we could use JS to detect if we're not running under .wikimedia.org [21:51:32] 6Operations, 10Wikimedia-Stream: reboot of rcs servers (stream.wikimedia.org) - https://phabricator.wikimedia.org/T130024#2127650 (10Joe) >>! In T130024#2127265, @Johan wrote: > Well, for several reasons? > > a) Because if anti-vandalism tools stop working, that's a real problem for the Wikimedians who depend... [21:52:33] PROBLEM - puppet last run on db1027 is CRITICAL: CRITICAL: Puppet has 1 failures [21:58:12] <_joe_> paladox: they're stripping https, how ince [21:58:24] <_joe_> (sorry reading backlog now) [21:58:29] _joe_ Ok [21:58:47] well, that proxy can spoof even google login, http://jice.ddns.net:808/0/?url=L21vYy5lbGdvb2cud3d3Ly9BMyVzcHR0aEQzJWV1bml0bm9jQjMlcG1hNjIlZXVydEQzJWV2aXNzYXBCMyVwbWE2MiVzZUQzJWxoRjMlbmlnb0xlY2l2cmVTL21vYy5lbGdvb2cuc3RudW9jY2EvL0EzJXNwdHRo [21:58:49] Yeh im looking at there website info using http://who.is/website-information/jice.ddns.net [21:59:25] Vulpix oh [21:59:31] original url is reversed and then base64 encoded to generate the ?url= param [21:59:52] that's how I reached google.com from that proxy [21:59:52] Yeh [22:00:17] <_joe_> Vulpix: it's a generic proxy that is maybe used to "circumvent" some censorship [22:00:31] last time I looked it was doing a bunch of crappy ad injection and making $$ when I took things apart and it wasn't this source either, it's whacamole on these 
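Note on the `?url=` parameter described above ("original url is reversed and then base64 encoded"): a small sketch that undoes those two steps. The function name `decode_proxy_param` is made up here, and the final percent-decoding step is an assumption based on what the sample from the log decodes to (the colon comes out as `%3A`).

```python
import base64
from urllib.parse import unquote


def decode_proxy_param(param: str) -> str:
    """Recover the original URL from the proxy's ?url= value."""
    # Step 1: undo the base64 encoding.
    reversed_url = base64.b64decode(param).decode("ascii")
    # Step 2: undo the string reversal, then percent-decode (assumed step).
    return unquote(reversed_url[::-1])


# First ?url= value pasted in the channel at 21:17:12.
sample = "L3hvZGFsYVAvcC9ncm8uYWlkZW1pa2l3LnJvdGFjaXJiYWhwLy9BMyVzcHR0aA=="
print(decode_proxy_param(sample))
```

This is consistent with the proxy being generic: any target URL can be wrapped the same way, which matches how the Google login page was reached through it.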
[22:00:47] yes, I doubt it has been made with the sole purpose of stealing credentials [22:00:48] <_joe_> yeah always is [22:01:01] <_joe_> Vulpix: that's probably a convenient byproduct [22:01:37] <_joe_> hell, if I was doing censorship, I'd create such an honeypot and steal credentials of people trying to circumvent censorship :) [22:03:32] Ip is 68.32.11.161 [22:03:38] https://www.site24x7.com/find-ip-address-of-web-site.html [22:04:22] or just do a "ping jice.ddns.net" [22:04:56] https://www.site24x7.com/ping-test.html [22:05:04] It seems to have alot of servers [22:05:11] California - US [22:05:15] Toronto - CA [22:05:20] Singapore - SG [22:05:25] Chennai - IN [22:05:30] Johannesburg - ZA [22:05:58] that's the source of the ping test, not the target [22:06:23] that's just godaddy [22:06:30] they ping from multiple locations to see if connectivity is good from those places [22:08:45] (03CR) 10Alex Monk: "That's a Sunday... What about Monday 2nd of May?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/217858 (https://phabricator.wikimedia.org/T122933) (owner: 10Faidon Liambotis) [22:09:53] !log rebooting elastic1003.eqiad.wmnet for linux kernel upgrade [22:09:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:16:17] Oh [22:16:33] 6Operations, 5WMF-NDA: Migrate RT to Phabricator - https://phabricator.wikimedia.org/T38#2127742 (10Dzahn) [22:16:44] (03CR) 10Jdlrobson: [C: 04-1] "Not to merged till the new language overlay is in stable." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/277837 (https://phabricator.wikimedia.org/T129274) (owner: 10Bmansurov) [22:17:34] RECOVERY - puppet last run on db1027 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:19:55] (03PS1) 10Rush: phab: jice.ddns.net stop cloning our site thanks [puppet] - 10https://gerrit.wikimedia.org/r/277903 [22:21:16] (03CR) 10Rush: [C: 032 V: 032] phab: jice.ddns.net stop cloning our site thanks [puppet] - 10https://gerrit.wikimedia.org/r/277903 (owner: 10Rush) [22:21:38] (03CR) 10Paladox: "Thanks." [puppet] - 10https://gerrit.wikimedia.org/r/277903 (owner: 10Rush) [22:23:56] (03CR) 10Paladox: "We can use Require not host jice.ddns.net to make sure even if they change ip they still carn't clone" [puppet] - 10https://gerrit.wikimedia.org/r/277903 (owner: 10Rush) [22:26:04] paladox: if supported (which I don't know) that would make the server perform a RDNS query for every connection to see if it matches the hostname... [22:26:24] Vulpix: Oh but it would with an ip. [22:27:22] (03PS1) 10Paladox: Block access to jice.ddns.net instead of ip [puppet] - 10https://gerrit.wikimedia.org/r/277904 [22:28:32] paladox: I'm not going to go ahead w/ that for now, if they pop back up maybe, but not for the moment, maybe the releng folks disagree :) [22:28:47] chasemp ok [22:33:40] 6Operations, 10ArchCom-RfC, 6Commons, 10MediaWiki-File-management, and 13 others: Use content hash based image / thumb URLs & define an official thumb API - https://phabricator.wikimedia.org/T66214#2127797 (10RobLa-WMF) Discussed in {E148}. Releve 3:00 PM Minutes: https://tools.... [22:43:13] RECOVERY - MariaDB Slave Lag: s3 on dbstore2002 is OK: OK slave_sql_lag Replication lag: 192.32 seconds [22:59:41] Jerry Goldsmith and Alan Silvestri two wonderful but annoyingly highly identifiable film composers [23:00:04] RoanKattouw ostriches Krenair MaxSem: Dear anthropoid, the time has come. 
Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160316T2300). [23:00:04] RoanKattouw Dereckson Jdlrobson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:12] hey [23:00:13] Hi. [23:00:26] that was definitely meant for the staff channel... [23:00:28] Krenair: are you doing swat today? [23:00:37] not sure. Were you planning to RoanKattouw? [23:00:53] i have 2 additional patches which will need to go out [23:00:59] I do have some other things I should probably be getting on with instead... but if no one else is... [23:01:05] (relating to https://phabricator.wikimedia.org/T127823) [23:01:10] lemme know if you cant see it [23:01:23] I can do it [23:01:27] Sorry for spacing out there [23:01:40] I got distracted checking if the patch I just wrote needed SWATting, thankfully it doesn't [23:01:50] oh wow, you called your sprint "Java and JavaScript are basically the same" [23:03:21] Dereckson: Any particular order those Bosnian namespace changes need to be deployed in? [23:04:09] No, they are now indepednant. [23:04:30] awww. so nice BS is now namespaced [23:04:36] wait, oh shi... [23:04:55] Dereckson: Wait, 'Razgovor_sa_korisnikom' => NS_USER_TALK, doesn't need to be in mw-config, because it'll already be in MessagesBs.php, rigth? [23:05:10] nikerabbit would prefer it in both [23:05:15] :O [23:05:17] OK [23:05:21] If he says so, I trust him [23:05:32] We had a discussion about that on Gerrit. 
[23:05:41] (03CR) 10Catrope: [C: 032] Namespace configuration for bs.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247093 (https://phabricator.wikimedia.org/T115812) (owner: 10Luke081515) [23:05:49] Krenair: hehe [23:06:34] (03Merged) 10jenkins-bot: Namespace configuration for bs.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247093 (https://phabricator.wikimedia.org/T115812) (owner: 10Luke081515) [23:07:13] PROBLEM - puppet last run on mw1242 is CRITICAL: CRITICAL: puppet fail [23:07:41] (03PS1) 10EBernhardson: Define hhvm connection pools for cirrus [puppet] - 10https://gerrit.wikimedia.org/r/277919 [23:07:54] (03PS2) 10EBernhardson: Define hhvm connection pools for cirrus [puppet] - 10https://gerrit.wikimedia.org/r/277919 [23:09:00] bblack: are you around? [23:09:21] ebernhardson: did you actually get it to work? When I tried it, even after declaring the pool name, curl_init_pooled() always returned false. [23:09:45] (03CR) 10Ori.livneh: "Were you actually able to get it to work? When I tried it, even after declaring the pool name, curl_init_pooled() always returned false." [puppet] - 10https://gerrit.wikimedia.org/r/277919 (owner: 10EBernhardson) [23:09:47] ori: yea, it worked fine from the repl [23:09:55] hmm! [23:09:56] ori: (on an app server, talking to the es prod cluster) [23:09:59] let me try again [23:10:29] ori: make sure you provide -d 'curl.namedPools = poolA,poolB' to the config [23:10:58] yes, I did, and it made a difference -- it suppressed the warning about the code trying to use a pool which was not defined [23:11:02] but I still didn't get a curl handle [23:11:07] hmm, gave me a handle [23:11:07] but maybe I did something wrong; I'll try again [23:12:00] what is the advantage of having multiple discrete pools, by the way? [23:13:12] ori: well, hhvm just recycles the handles. 
It you point the handle at a different host it will have to re-connet [23:13:23] so you need a pool for cirrus, and a pool for restbase, etc. [23:13:36] or really a pool for cirrus eqiad, a pool for cirrus codfw [23:13:38] * ebernhardson updates patch... [23:13:42] it actually stays connected? [23:13:46] yes [23:13:51] via curl keep alive [23:14:01] I thought it merely spared you the cost of having to constantly allocate and free handles [23:14:26] nope, it keeps the connection open. Brings down per-connection setup for https on elastic from 40ms to 2ms [23:15:09] (03CR) 10EBernhardson: "yup, run hhvm like this:" [puppet] - 10https://gerrit.wikimedia.org/r/277919 (owner: 10EBernhardson) [23:15:15] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Namespace config changes for bswiki (duration: 00m 44s) [23:15:21] Testing. [23:15:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:16:39] ebernhardson: that is fantastic [23:16:42] RoanKattouw: 277919 tested [23:16:46] er nope [23:17:18] legoktm: Did you modify JobQueueGroup.php on tin without checking in your changes? [23:17:33] RoanKattouw: 247093 tested [23:17:35] I'll stash, pull, reapply, but it's really annoying when there are uncommitted changes because git pull --rebase just refuses to run in that case [23:18:00] (03PS3) 10EBernhardson: Define hhvm connection pools for cirrus [puppet] - 10https://gerrit.wikimedia.org/r/277919 [23:18:57] (03PS4) 10EBernhardson: Define curl connection pools in hhvm for cirrus [puppet] - 10https://gerrit.wikimedia.org/r/277919 [23:19:07] ebernhardson: wow, 40ms to open a tcp connection? [23:19:48] Dereckson: Your MessagesBs.php changes will have to wait a bit actually because I have to run a scap for them, so that's last [23:19:55] gwicke: tls, but yes i thought that was really excessive. It's much much less for plain http [23:20:39] jdlrobson: https://gerrit.wikimedia.org/r/#/c/276469/ is not an MMV change, ID typo? 
[23:20:43] yeah, with tls that's not as outrageous [23:20:56] RoanKattouw: k [23:22:36] interesting. a .yaml file full of bash script [23:22:51] gwicke: suprisingly though, i'm seeing 5 ms average for a regular http connection which still seems a bit much. Using the pool brings that down to .5ms (averaged across 1k req's) [23:23:26] RoanKattouw: yeh typo it's related to MMV but the id is correct [23:23:39] browser tests for MMV [23:23:53] I don't know how to deploy integration/config changes [23:23:55] gwicke: I'm going to merge ebernhardson's change. Since it involves restarting HHVM across the cluster, would you maybe like to add a named restbase pool as well? [23:24:18] But I can find out [23:24:35] ori: why are those configured manually? [23:24:55] I thought most of the time you specify # of connections per destination [23:25:20] gwicke: it's just how they work in hhvm. You can provide a per pool size, max reuse, and timeout (how long to wait for a connection to become available) [23:25:33] they init the pool on startup of the server [23:25:39] I guess HHVM wants to initialize them on startup, .. right [23:26:03] what happens if we use the default pool? [23:26:03] we could probably work up a sane patch that allows runtime config though [23:26:16] there is no default pool afaik [23:26:24] right [23:26:32] this is curl_multi? [23:26:38] curl_init_pooled [23:26:39] no; it's a new HHVM-specific feature [23:26:55] basically it just keeps curl handles across requests, allowing curl's internal keep alive to work [23:26:57] ebernhardson: it'd be an interesting experiment to see if these pools can be configured with ini_set [23:27:06] oh, afaik all connections to RB are using MultiHTTPClient, which is based on curl_multi [23:27:08] I doubt it would [23:27:22] ori: i poked around the code in ext_curl.cpp, it wont :( [23:27:39] which probably has its own keepalive handling / pooling [23:27:49] gwicke: wouldn't that just be curl_multi_add_handle($mh, $ch) ? 
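[editor's note] The pooling behaviour discussed above (pools declared by name at startup, one pool per destination, each with a size, a reuse limit and a timeout for obtaining a handle) can be modelled roughly as follows. This is a sketch of the idea in Python, not HHVM's implementation. `Handle`, `NamedPools` and every default value are hypothetical. Returning `None` for an undeclared or exhausted pool is an assumption, loosely mirroring the report that `curl_init_pooled()` returned false.

```python
import queue
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass
class Handle:
    """Stand-in for a connection handle that stays open between requests."""
    pool: str
    uses: int = 0


class NamedPools:
    """Fixed-size handle pools, declared by name up front (hypothetical model)."""

    def __init__(self, names: Iterable[str], size: int = 5,
                 reuse_limit: int = 100, get_timeout: float = 0.05) -> None:
        self._reuse_limit = reuse_limit
        self._get_timeout = get_timeout
        self._pools: Dict[str, "queue.Queue[Handle]"] = {}
        for name in names:
            pool: "queue.Queue[Handle]" = queue.Queue(maxsize=size)
            for _ in range(size):
                pool.put(Handle(name))  # all handles are created at startup
            self._pools[name] = pool

    def acquire(self, name: str) -> Optional[Handle]:
        """Return a pooled handle, or None if the pool is undeclared or exhausted."""
        pool = self._pools.get(name)
        if pool is None:
            return None
        try:
            handle = pool.get(timeout=self._get_timeout)
        except queue.Empty:
            return None
        handle.uses += 1
        return handle

    def release(self, handle: Handle) -> None:
        """Return a handle to its pool, replacing it once the reuse limit is hit."""
        if handle.uses >= self._reuse_limit:
            handle = Handle(handle.pool)
        self._pools[handle.pool].put(handle)
```

Keeping one pool per destination matters in this model because a recycled handle is only useful if it is still pointed at the same host. Sizing is the open question: a pool that is too small makes `acquire` return `None` under parallel load.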
[23:27:56] curl_init_pooled gives you the $ch [23:28:07] gwicke: https://github.com/facebook/hhvm/blob/cabdaa14fef494fabd7d1b415c2f41e1b97d4c71/hphp/runtime/ext/curl/ext_curl.php#L131-L147 [23:29:13] multihttpclient uses a plain curl_init(): https://github.com/wikimedia/mediawiki/blob/c339459eb3ee100c0abda4ebf7e4266f7b3d58bd/includes/libs/MultiHttpClient.php#L275 [23:29:36] (03CR) 10Ori.livneh: "Weird, it works for me now. I must have done something wrong earlier. Do you want to configure size / connGetTimeout / reuseLimit as well," [puppet] - 10https://gerrit.wikimedia.org/r/277919 (owner: 10EBernhardson) [23:29:39] jdlrobson: More practical issue: I do not have +2 rights in the integration/config repo, so I'm afraid you'll have to find somebody who does [23:29:53] okay no worries that can wait [23:29:54] jdlrobson: RoanKattouw: CI changes aren't swatted or related to mediawiki deployment [23:29:57] gwicke: ahh, then yes it would be a tiny change to use curl_init_pooled instead. The difficulty will be deciding the right pool size, the default of 5 is probably not right if you are using multiple handles per req [23:30:07] Hashar or Lego can re-compile/merge/push to Jenkins [23:30:12] I can also do it. [23:30:30] Krinkle: RoanKattouw don't worry then i'll leave hashar and zeiljko to take care of it [23:30:33] Krinkle: I know that, but I figured I could try deploying his change anyway. But it turns out that not only have I never done it before (which made me a bit uncomfortable already), I don't have +2 rights [23:30:38] ebernhardson: yeah, I'd rather wait until they have sorted this out better [23:31:12] i guarantee they won't do anything, but they take patches :) [23:31:18] connections failing because of pool exhaustion sounds like a bad prospect [23:31:24] ebernhardson: are you planning on submitting a patch to core that allows curl_init_pooled to be used? 
(you should) [23:31:37] gwicke: you can fall back to initializing a new handle [23:32:03] we do a lot of requests in parallel [23:32:15] I'm not sure where that would be detected / handled [23:32:32] and without thorough testing I wouldn't trust this to work out of the box [23:32:35] ori: so far i've only put together a patch that implements an Elastica transport using it [23:33:08] gwicke: ok, that's not unwise. let's see how well it works for es [23:33:14] (03CR) 10EBernhardson: "reuse limit and timeout seem like reasonable defaults. I'm not entirely sure about the pool size. We are planning to enable wiki's in batc" [puppet] - 10https://gerrit.wikimedia.org/r/277919 (owner: 10EBernhardson) [23:34:03] ebernhardson, ori: kk, thanks for letting us know & giving this a try [23:34:19] we did some tweaking to the default pooling in node as well, so I can relate ;) [23:34:37] ebernhardson: last nit: can you do this in one place, in modules/mediawiki/manifests/hhvm.pp? [23:34:56] ori: i thought hiera overrode, so that only the final thing defining a particular key 'won' ? 
[23:35:08] don't define it in yaml [23:35:12] ahh, ok [23:35:13] sure [23:35:34] RECOVERY - puppet last run on mw1242 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [23:35:59] i think that is the best use-case for hiera (overriding defaults on a host-specific basis) [23:36:35] moving standard configuration there is a mistake, both because it adds another layer of indirection, and because the move is inevitably incomplete, leaving some wikimedia-specific configuration data in the manifests, and some in the yaml [23:37:21] one of the best things about puppet is that the declarative style makes manifests a formal specification of how a particular software stack is configured [23:37:48] 6Operations, 10Phabricator: migrate RT maint-announce into phabricator - https://phabricator.wikimedia.org/T118176#2128035 (10RobH) [23:37:48] the fact that hiera lookups are so torturous now that we need a tool to understand how a key would be looked up (because it's so hard to reason about) is a disaster [23:37:57] :) [23:38:26] i agitated against it but lost [23:39:47] (03PS5) 10EBernhardson: Define curl connection pools in hhvm for cirrus [puppet] - 10https://gerrit.wikimedia.org/r/277919 [23:42:49] !log catrope@tin Started scap: bswiki namespace changes; MobileFrontend SWAT patches [23:42:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:44:47] (03CR) 10Ori.livneh: "Commit message needs to be updated :)" [puppet] - 10https://gerrit.wikimedia.org/r/277919 (owner: 10EBernhardson) [23:44:57] :P right [23:45:41] (03PS6) 10EBernhardson: Define curl connection pools in hhvm for cirrus [puppet] - 10https://gerrit.wikimedia.org/r/277919 [23:46:22] 6Operations, 10Traffic, 10Wikimedia-IRC-RC-Server, 7HTTPS, and 2 others: Remove the "HTTPS to HTTP" url filter in the IRC feed - https://phabricator.wikimedia.org/T122933#2128060 (10Krinkle) [23:47:00] 6Operations, 10Traffic, 10Wikimedia-IRC-RC-Server, 7HTTPS, and 2 others: 
Remove the "HTTPS to HTTP" url filter in the IRC feed - https://phabricator.wikimedia.org/T122933#2128063 (10Krinkle) [23:47:12] (03CR) 10Krinkle: "Sounds good." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/217858 (https://phabricator.wikimedia.org/T122933) (owner: 10Faidon Liambotis) [23:47:53] !log catrope@tin Finished scap: bswiki namespace changes; MobileFrontend SWAT patches (duration: 05m 04s) [23:47:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:48:01] Testing. [23:48:13] (03CR) 10Krinkle: [C: 04-1] "Don't merge before May 2nd, 2016." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/217858 (https://phabricator.wikimedia.org/T122933) (owner: 10Faidon Liambotis) [23:49:09] Wow that was fast [23:49:13] jdlrobson: That's your patches too --^^ [23:49:24] RoanKattouw: works [23:49:36] jdlrobson: I think you meant to merge https://phabricator.wikimedia.org/T121734 in the other direction? [23:50:19] Or did you mean to add as blocking? [23:50:57] RoanKattouw: lgtm thanks. Krinkle i forget, they're the same root cause so should be merged someway. [23:51:05] probably with merged descriptions [23:51:08] feel free to correct [23:53:28] ebernhardson: sorry, last thing: you tested that having a space after the comma is ok? [23:53:30] (03CR) 10Catrope: [C: 032] Enable Flow by default in all talk namespaces on gomwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277697 (https://phabricator.wikimedia.org/T128359) (owner: 10Catrope) [23:54:07] (03Merged) 10jenkins-bot: Enable Flow by default in all talk namespaces on gomwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277697 (https://phabricator.wikimedia.org/T128359) (owner: 10Catrope) [23:54:27] ebernhardson: it doesn't, i just checked [23:54:30] s/doesn't/isn't [23:54:45] you'r right, it uses the space as part of the name. i should have checked. 
updating [23:55:33] (03PS7) 10EBernhardson: Define curl connection pools in hhvm for cirrus [puppet] - 10https://gerrit.wikimedia.org/r/277919 [23:56:27] !log rebooting elastic1004.eqiad.wmnet for kernel update [23:56:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:59:33] PROBLEM - Host elastic1004 is DOWN: PING CRITICAL - Packet loss = 100%
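[editor's note] The pool-name problem found above (a space after the comma ends up as part of the name) is what a plain comma split produces. The Python sketch below only illustrates that failure mode and a defensive normalization. It is not HHVM's parser, and both function names are hypothetical.

```python
from typing import List


def split_naive(value: str) -> List[str]:
    """Split on commas only, so surrounding whitespace stays in each name."""
    return value.split(",")


def split_trimmed(value: str) -> List[str]:
    """Split on commas, strip whitespace, and drop empty entries."""
    return [name.strip() for name in value.split(",") if name.strip()]


if __name__ == "__main__":
    setting = "poolA, poolB"
    print(split_naive(setting))               # second name keeps its leading space
    print(",".join(split_trimmed(setting)))   # normalized form without spaces
```

Writing the setting without spaces, as in the `-d 'curl.namedPools = poolA,poolB'` example given earlier in the log, avoids depending on how the consumer trims names.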