[00:00:04] RoanKattouw ostriches Krenair: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160209T0000).
[00:00:04] Krenair jgirault jan_drewniak MatmaRex AaronSchulz James_F tgr: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[00:00:17] RECOVERY - cassandra-a CQL 10.64.16.188:9042 on praseodymium is OK: TCP OK - 0.001 second response time on port 9042
[00:00:22] * James_F waves.
[00:00:27] wow, how many patches are there suddenly
[00:00:27] hey
[00:00:33] D:
[00:00:52] MatmaRex, people adding patches shortly before swat? never!
[00:00:55] * James_F repeats his question.
[00:00:58] :)
[00:01:09] James_F, um, maybe I missed your question
[00:01:12] MatmaRex: 7.
[00:01:19] ohh
[00:01:32] Krenair: Mostly, can we even do SWAT with Jenkins in its current state?
[00:01:51] Which answer do you want to that question?
[00:02:04] The correct one.
[00:02:10] Well.
[00:02:10] operations, HHVM: HHVM lock-ups - https://phabricator.wikimedia.org/T89912#2010270 (greg)
[00:03:32] I think I can do the config ones, right?
[00:03:34] https://gerrit.wikimedia.org/r/#/c/269333/ can be force-merged if Jenkins is lagging
[00:03:36] jgirault, around?
[00:03:54] it only touches a special page, and that can't get any more broken
[00:03:56] what's the issue, jenkins said +2 ?
[00:04:07] 9 minutes ago
[00:04:28] https://integration.wikimedia.org/zuul/
[00:04:34] some mediawiki gate-and-submit jobs backing up >1hr
[00:04:45] Krenair: yeah config changes are moving through the queue pretty quickly
[00:04:57] jan_drewniak, what about you?
[00:05:48] operations, Mail: Move most (all?) exim personal aliases to OIT - https://phabricator.wikimedia.org/T122144#2010293 (Dzahn)
[00:05:50] operations, Mail: delete exim alias wikilibrary@ library@ - https://phabricator.wikimedia.org/T123666#2010291 (Dzahn) Open>Resolved removed on our side
[00:06:04] * Krenair sighs
[00:06:34] surely we can force-merge stuff, right?
[00:06:51] yes
[00:06:52] that causes other problems afaik
[00:06:54] i mean, it's deployment branches, who cares.
[00:07:06] integration is already all fucked, it won't hurt it much
[00:07:09] heh
[00:07:31] Krenair: hey
[00:07:34] yes, iirc force-merging causes other patches to begin testing again?
[00:07:43] jgirault, okay, is https://gerrit.wikimedia.org/r/#/c/268849/ ready to go?
[00:08:11] Krenair: yes
[00:08:42] Krenair: we have this script https://github.com/wikimedia/wikimedia-portals/blob/master/sync-portals that should be run for doing the deployment
[00:09:09] (PS3) Alex Monk: Bump portals to master (color standardization, CSS sprites, updated stats) [mediawiki-config] - https://gerrit.wikimedia.org/r/268849 (https://phabricator.wikimedia.org/T124993) (owner: JGirault)
[00:09:16] (CR) Alex Monk: [C: 2] Bump portals to master (color standardization, CSS sprites, updated stats) [mediawiki-config] - https://gerrit.wikimedia.org/r/268849 (https://phabricator.wikimedia.org/T124993) (owner: JGirault)
[00:11:29] Well, it has at least appeared on the zuul status page
[00:12:06] (Merged) jenkins-bot: Bump portals to master (color standardization, CSS sprites, updated stats) [mediawiki-config] - https://gerrit.wikimedia.org/r/268849 (https://phabricator.wikimedia.org/T124993) (owner: JGirault)
[00:12:47] ok
[00:15:56] !log krenair@mira Synchronized portals/prod/wikipedia.org/assets: https://gerrit.wikimedia.org/r/#/c/268849/ (duration: 01m 16s)
[00:15:59] jgirault, ^
[00:15:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:17:08] Krenair: all good :)
[00:17:14] !log krenair@mira Synchronized portals: https://gerrit.wikimedia.org/r/#/c/268849/ (duration: 01m 17s)
[00:17:15] jgirault, hang on, that's part 1 of the script
[00:17:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:17:18] jgirault, ^
[00:17:53] (PS1) Ottomata: Updates to work with hiera, Hive/Oozie MySQL db can now be hosted on remote node [puppet/cdh] - https://gerrit.wikimedia.org/r/269340 (https://phabricator.wikimedia.org/T109859)
[00:18:01] enwiki article count should jump from 5.027m to 5.073m
[00:18:20] (PS5) Ottomata: [WIP] Refactor manifests/role/analytics/* into modules/role, use hiera to configure [puppet] - https://gerrit.wikimedia.org/r/267797 (https://phabricator.wikimedia.org/T109859)
[00:18:20] maybe the purge didn't work
[00:18:57] Krenair: we were seeing a few issues with the purge on Friday when Max tried it
[00:19:07] RECOVERY - puppet last run on mw2092 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[00:19:15] does varnish do anything funny with this domain?
[00:19:28] Krenair: he then ran it manually, we were not sure if it there was some lag
[00:20:05] Krenair: it should just purge https://www.wikipedia.org/
[00:20:41] yeah
[00:21:24] same problem at https://www.wikipedia.org/portal/wikipedia.org/index.html
[00:22:13] Krenair: it’s weird… it doesn’t look like deployment was… effective..
[00:22:54] (CR) jenkins-bot: [V: -1] [WIP] Refactor manifests/role/analytics/* into modules/role, use hiera to configure [puppet] - https://gerrit.wikimedia.org/r/267797 (https://phabricator.wikimedia.org/T109859) (owner: Ottomata)
[00:23:20] Ohh, hang on.
[00:23:35] Yeah this is my fault
[00:23:48] It requires a submodule update, not just a merge :)
[00:24:00] oh, :)
[00:24:25] I got used to changes in this repo being relatively simple
[00:24:59] !log krenair@mira Synchronized portals/prod/wikipedia.org/assets: https://gerrit.wikimedia.org/r/#/c/268849/ (duration: 01m 18s)
[00:25:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:26:18] !log krenair@mira Synchronized portals: https://gerrit.wikimedia.org/r/#/c/268849/ (duration: 01m 18s)
[00:26:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:27:20] jgirault, aha
[00:27:45] jgirault, I ran the purge manually with a trailing /, swapped to my browser and refreshed, and it updated properly
[00:27:46] operations, Mail, Mobile, Patch-For-Review: consolidate mailman redirects in exim aliases file - https://phabricator.wikimedia.org/T123581#2010331 (Dzahn) >>! In T123581#1934528, @faidon wrote: > It might be worth it to grep logs for them and see if there has been anything but spam. on mx1001, zgrep...
[00:28:03] don't know if that's a coincidence or not
[00:28:09] Krenair: awesome!
[00:28:25] The other submodule is something like WikipediaFirefoxOS :)
[00:28:55] Krenair: It may have been actually, the age was close to 3600 (max-age)
[00:29:03] * AaronSchulz lurks
[00:29:07] hi AaronSchulz
[00:29:15] I am looking at your change now
[00:30:46] (PS1) Jdlrobson: Test HTML stripping in production mobile beta [mediawiki-config] - https://gerrit.wikimedia.org/r/269341 (https://phabricator.wikimedia.org/T124959)
[00:30:48] operations, Mail, Mobile, Patch-For-Review: consolidate mailman redirects in exim aliases file - https://phabricator.wikimedia.org/T123581#2010345 (Dzahn) search@ has been requested by Tomasz in 2015 only -> T98415 i removed: ops-private@, otrsadmin@
[00:31:21] AaronSchulz, okay so this also moves search to search.svc.codfw.wmnet in codfw
[00:31:28] is that all ready to go?
[00:32:25] suppose it's fine as it's just codfw
[00:32:36] gah, rebase fails
[00:32:39] ok
[00:32:51] yeah
[00:32:59] no apache traffic there anyway
[00:33:15] rebasing manually
[00:33:18] PROBLEM - restbase endpoints health on praseodymium is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200): /page/html/{title} (Get html by title from storage) is CRITICAL: Test G
[00:33:18] though it still would have been nice if split off, but meh
[00:33:27] PROBLEM - cassandra-a CQL 10.64.16.188:9042 on praseodymium is CRITICAL: Connection refused
[00:33:46] yeah, I'm not too worried about it
[00:34:40] so the conflict is with the change from parsoidcache to http://parsoid.svc.eqiad.wmnet:8000
[00:34:56] PROBLEM - cassandra-a service on praseodymium is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[00:35:07] RECOVERY - restbase endpoints health on praseodymium is OK: All endpoints are healthy
[00:35:55] (PS7) Alex Monk: Rationalize definition of service hosts [mediawiki-config] - https://gerrit.wikimedia.org/r/266509 (https://phabricator.wikimedia.org/T114273) (owner: Giuseppe Lavagetto)
[00:36:00] renaming that key to parsoid instead of parsoidcache
[00:36:37] AaronSchulz, oh, will this break beta?
[00:37:44] wmfLocalServices will only be set if $wmfRealm == 'production'
[00:38:24] i guess we're not doing non-config changes? :(
[00:38:44] the morning swat was full of config stuff. i should've bumped someone off, grumble.
[00:38:57] Krenair: only if labs is reusing the same dns host names but local
[00:39:16] zuul status is broken, heh
[00:39:45] * AaronSchulz assumed that would be crazy...but who knows, hmm
[00:39:59] AaronSchulz, undefined variables though
[00:40:33] wmfLocalServices will only be set if $wmfRealm == 'production'
[00:40:45] but then we start assuming that it's set
[00:42:24] !log killed Zuul scheduler. On gallium edited /usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/trigger/gerrit.py and modified: replication_timeout = 300 -> replication_timeout = 10 . Started Zuul
[00:42:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:42:38] yeah there is some insanity in the conf using prod values and then overriding them in -labs, which would make some warning spam with that
[00:43:37] AaronSchulz, let's just remove the check for prod and introduce sanity in a separate commit?
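The concern raised in the exchange above (a setting defined only when the realm is production, then read unconditionally elsewhere) can be illustrated with a small Python sketch. The function names, the `realm` argument, and the labs override host are hypothetical stand-ins chosen for illustration; the actual mediawiki-config code is PHP and is not reproduced here.

```python
# Hypothetical sketch of the pitfall discussed above; not the real wmf-config code.

PARSOID_PROD = "http://parsoid.svc.eqiad.wmnet:8000"  # host quoted in the log


def load_services_guarded(realm):
    """Define the services map only for production (the guarded variant)."""
    settings = {}
    if realm == "production":
        settings["local_services"] = {"parsoid": PARSOID_PROD}
    return settings


def load_services_unguarded(realm):
    """Always define the map; a labs-style override may replace values later."""
    settings = {"local_services": {"parsoid": PARSOID_PROD}}
    if realm == "labs":
        # Placeholder override value, purely illustrative.
        settings["local_services"]["parsoid"] = "http://parsoid.example.invalid:8000"
    return settings


def parsoid_url(settings):
    """Consumer code that assumes the map is always present."""
    return settings["local_services"]["parsoid"]
```

With the guarded variant, `parsoid_url(load_services_guarded("labs"))` raises `KeyError`, which is the Python analogue of the undefined-variable warnings mentioned in the conversation. The unguarded variant always defines the map and leaves cleaning up the override logic to a later change, which is roughly the option proposed at 00:43:37.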
[00:45:26] sigh, I suppose
[00:46:48] (PS8) Alex Monk: Rationalize definition of service hosts [mediawiki-config] - https://gerrit.wikimedia.org/r/266509 (https://phabricator.wikimedia.org/T114273) (owner: Giuseppe Lavagetto)
[00:47:30] (CR) Alex Monk: [C: 2] Rationalize definition of service hosts [mediawiki-config] - https://gerrit.wikimedia.org/r/266509 (https://phabricator.wikimedia.org/T114273) (owner: Giuseppe Lavagetto)
[00:47:43] operations, Education-Program-Dashboard: Spike: What do we have to package to run the Programs and Events dashboard on production? - https://phabricator.wikimedia.org/T126295#2010395 (awight) NEW a:dduvall
[00:48:38] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 56.00% of data above the critical threshold [5000000.0]
[00:49:03] (Merged) jenkins-bot: Rationalize definition of service hosts [mediawiki-config] - https://gerrit.wikimedia.org/r/266509 (https://phabricator.wikimedia.org/T114273) (owner: Giuseppe Lavagetto)
[00:49:40] operations, Education-Program-Dashboard: Spike: What do we have to package to run the Programs and Events dashboard on production? - https://phabricator.wikimedia.org/T126295#2010420 (Dzahn) Where is that dashboard now?
[00:50:22] operations, Education-Program-Dashboard: Spike: What do we have to package to run the Programs and Events dashboard on production? - https://phabricator.wikimedia.org/T126295#2010421 (awight) @Dzahn: On WMF Labs, https://wikitech.wikimedia.org/wiki/Nova_Resource:Globaleducation
[00:50:57] !log krenair@mira Synchronized wmf-config/ProductionServices.php: https://gerrit.wikimedia.org/r/#/c/266509/8 (duration: 01m 18s)
[00:51:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:52:47] !log krenair@mira Synchronized docroot/noc: https://gerrit.wikimedia.org/r/#/c/266509/8 (duration: 01m 17s)
[00:52:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:53:16] looks good on mw1017
[00:54:08] MatmaRex, James_F, tgr: I'm going to try to get the others done, will go over the end of the window though
[00:54:32] * James_F nods.
[00:54:32] I can take MatmaRex's one if he need to sleep.
[00:54:45] !log krenair@mira Synchronized wmf-config/CommonSettings.php: https://gerrit.wikimedia.org/r/#/c/266509/ (duration: 01m 17s)
[00:54:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:55:20] i'm still hanging around.
[00:56:39] will do yours first
[00:57:10] AaronSchulz, done
[00:57:28] operations, Education-Program-Dashboard: Spike: What do we have to package to run the Programs and Events dashboard on production? - https://phabricator.wikimedia.org/T126295#2010434 (dduvall) >>! In T126295#2010420, @Dzahn wrote: > Where is that dashboard now? For a little more background, we're working...
[00:57:35] Krenair: MatmaRex's is mine. ;-)
[00:57:52] Well, my responsibility.
[00:58:11] ok
[00:58:47] RECOVERY - cassandra-a service on praseodymium is OK: OK - cassandra-a is active
[00:59:06] RECOVERY - cassandra-a CQL 10.64.16.188:9042 on praseodymium is OK: TCP OK - 0.002 second response time on port 9042
[00:59:44] operations, Mail, Mobile, Patch-For-Review: consolidate mailman redirects in exim aliases file - https://phabricator.wikimedia.org/T123581#2010436 (Dzahn) while most things have been removed ops@, search@, mobile@, engineering@ and watchmouse@ are left and i'd keep them announcements@ waiting for O...
[01:01:53] Ugh.
[01:01:57] There's a bunch of things in the MW queue
[01:02:38] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[01:07:28] PROBLEM - dhclient process on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:08:06] PROBLEM - salt-minion processes on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:08:07] PROBLEM - DPKG on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:08:07] PROBLEM - Disk space on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:08:16] PROBLEM - puppet last run on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:08:26] PROBLEM - configured eth on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:08:36] PROBLEM - RAID on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:08:47] PROBLEM - Check size of conntrack table on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:12:14] FFS.
[01:12:39] MatmaRex, James_F: jenkins is failing the commit, qunit test
[01:13:13] 01:12:00 09 02 2016 01:12:00.592:WARN [Chromium 48.0.2564 (Ubuntu 0.0.0)]: Disconnected (1 times), because no message in 60000 ms.
[01:13:26] bullshit
[01:13:34] Krenair: That's just the flakiness of CI.
[01:13:52] will it lose its mind if you just let it finish, then quietly V+2 and submit?
[01:14:24] MatmaRex: Less now that hashar live-hacked it.
[01:14:38] Might be for the best. Krenair, do you want to do that?
[01:14:53] yes
[01:15:10] Go for it.
[01:19:21] I think it heard us and is now taking forever.
[01:20:32] Krenair: You didn't do it yet. :-)
[01:20:40] yeah, the plan is to "just let it finish"
[01:20:59] Oh. I think you should just merge it.
[01:21:01] *then* V+2 and submit
[01:21:03] Deploying this late is risky as it is.
[01:21:12] The longer we wait the less cover there is.
[01:22:08] * ori is around and available to help, if needed.
[01:22:39] * James_F crosses fingers that it'll be fine.
[01:24:02] !log krenair@mira Synchronized php-1.27.0-wmf.12/resources/src: https://gerrit.wikimedia.org/r/#/c/269140/ (duration: 01m 19s)
[01:24:04] James_F, MatmaRex ^
[01:24:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:25:01] whoo it worked
[01:25:24] tgr, still around?
[01:25:31] yes
[01:26:07] * James_F glares at 269344
[01:26:23] Krenair: Should we V+2 that too?
[01:27:23] !log krenair@mira Synchronized php-1.27.0-wmf.12/extensions/OAuth/frontend/specialpages/SpecialMWOAuthManageConsumers.php: https://gerrit.wikimedia.org/r/#/c/269333/ (duration: 01m 19s)
[01:27:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:27:29] tgr, ^
[01:28:07] Krenair: works, thanks!
[01:29:39] James_F, https://gerrit.wikimedia.org/r/#/c/269293/ net
[01:29:40] next
[01:29:50] James_F, what's the best way to test this one?
[01:30:02] Krenair: https://grafana.wikimedia.org/dashboard/db/eventlogging-schema
[01:30:07] Krenair: But ideally could you do the MW-core one and the MF one together?
[01:30:12] (VE and MF)
[01:30:14] right, I knew there'd be a better way than what I had in mind :)
[01:34:01] !log krenair@mira Synchronized php-1.27.0-wmf.12/extensions: https://gerrit.wikimedia.org/r/#/c/269344/ and https://gerrit.wikimedia.org/r/#/c/269293/1 (duration: 01m 51s)
[01:34:03] James_F, ^
[01:34:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:34:07] (PS1) Dzahn: admin: remove ssh key of jkrauska [puppet] - https://gerrit.wikimedia.org/r/269350 (https://phabricator.wikimedia.org/T126260)
[01:34:18] Thanks.
[01:36:46] Krenair: LGTM.
[01:37:31] I haven't seen a great deal of change in the graph
[01:41:08] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [100000000.0]
[01:44:54] (PS17) 20after4: Puppet provider for scap3 [puppet] - https://gerrit.wikimedia.org/r/262742 (https://phabricator.wikimedia.org/T113072) (owner: Alexandros Kosiaris)
[01:46:40] (CR) jenkins-bot: [V: -1] Puppet provider for scap3 [puppet] - https://gerrit.wikimedia.org/r/262742 (https://phabricator.wikimedia.org/T113072) (owner: Alexandros Kosiaris)
[01:48:35] (PS18) 20after4: Puppet provider for scap3 [puppet] - https://gerrit.wikimedia.org/r/262742 (https://phabricator.wikimedia.org/T113072) (owner: Alexandros Kosiaris)
[01:48:50] Krenair: It's roughly what I anticipated. MF and VE are 'only' a third or so of the events, or something. Radically up/down would be bad.
[01:49:14] Krenair: The number of errors has dropped to zero, though.
[01:49:21] :/
[01:58:04] (CR) 20after4: "This works now. The biggest change is I had to change package_settings to install_options" [puppet] - https://gerrit.wikimedia.org/r/262742 (https://phabricator.wikimedia.org/T113072) (owner: Alexandros Kosiaris)
[01:58:42] !log added SMalyshev to wikidata-query gerrit group
[01:58:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:00:43] (CR) 20after4: Add scap3 deployment option for services (1 comment) [puppet] - https://gerrit.wikimedia.org/r/269143 (owner: Thcipriani)
[02:03:36] PROBLEM - NTP on alsafi is CRITICAL: NTP CRITICAL: No response from NTP server
[02:05:26] RECOVERY - Outgoing network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0]
[02:10:16] PROBLEM - SSH on alsafi is CRITICAL: Server answer
[02:11:07] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [24.0]
[02:13:37] RECOVERY - SSH on alsafi is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[02:14:36] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0]
[02:20:18] (PS1) Andrew Bogott: This is going to cause /var/log to fill up if I let it run. [puppet] - https://gerrit.wikimedia.org/r/269354
[02:20:47] (PS2) Andrew Bogott: Revert "Turn pdns loglevels WAY UP" [puppet] - https://gerrit.wikimedia.org/r/269354
[02:20:53] (PS3) Andrew Bogott: Revert "Turn pdns loglevels WAY UP" [puppet] - https://gerrit.wikimedia.org/r/269354
[02:22:15] (CR) Andrew Bogott: [C: 2] Revert "Turn pdns loglevels WAY UP" [puppet] - https://gerrit.wikimedia.org/r/269354 (owner: Andrew Bogott)
[02:43:07] PROBLEM - SSH on alsafi is CRITICAL: Server answer
[02:58:36] RECOVERY - SSH on alsafi is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[03:08:58] PROBLEM - SSH on alsafi is CRITICAL: Server answer
[03:20:57] RECOVERY - SSH on alsafi is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[03:31:26] PROBLEM - SSH on alsafi is CRITICAL: Server answer
[03:33:07] RECOVERY - SSH on alsafi is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[03:38:31] ori: greg-g: I should probably have mentioned this sooner, but we didn't enable $wgAuthenticationTokenVersion in production yesterday because of some kind of hhvm caching issue that got me all confused, so I plan to do it tonight
[03:39:29] same considerations as yesterday: wait until _joe_ is up, pick a time when most users are asleep or at least don't edit yet, so that would be midnight-ish PDT
[03:40:05] it has been deployed on beta since yesterday evening and I haven't heard of any problems
[03:43:50] tgr: kk, I'll be asleep
[03:46:47] PROBLEM - SSH on alsafi is CRITICAL: Server answer
[03:48:28] RECOVERY - SSH on alsafi is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[03:52:20] (CR) Santhosh: "This is good to go now. Lifting my -1" [mediawiki-config] - https://gerrit.wikimedia.org/r/267236 (owner: KartikMistry)
[03:53:36] PROBLEM - SSH on alsafi is CRITICAL: Server answer
[03:55:18] RECOVERY - SSH on alsafi is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[03:55:19] (CR) Santhosh: "The branching happening today will pick up https://gerrit.wikimedia.org/r/#/c/267217 and will be everywhere by 11th. After that we can dep" [mediawiki-config] - https://gerrit.wikimedia.org/r/267236 (owner: KartikMistry)
[04:10:48] PROBLEM - SSH on alsafi is CRITICAL: Server answer
[04:12:36] RECOVERY - SSH on alsafi is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[04:17:37] PROBLEM - SSH on alsafi is CRITICAL: Server answer
[04:19:17] RECOVERY - SSH on alsafi is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[04:22:10] Blocked-on-Operations, operations, Wikimedia-General-or-Unknown: Invalidate all users sessions - https://phabricator.wikimedia.org/T124440#2010709 (Tgr) The plan is to set it tonight. The time was chosed over IRC based on four factors: 1) someone from ops should be around, just in case something goes w...
[04:24:28] PROBLEM - SSH on alsafi is CRITICAL: Server answer
[04:25:34] Blocked-on-Operations, operations, Wikimedia-General-or-Unknown: Invalidate all users sessions - https://phabricator.wikimedia.org/T124440#2010712 (Tgr) Sent a notification to same places as T126069#2004474.
[04:26:16] RECOVERY - SSH on alsafi is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[04:27:06] Blocked-on-Operations, operations, Wikimedia-General-or-Unknown: Invalidate all users sessions - https://phabricator.wikimedia.org/T124440#2010716 (Tgr)
[04:35:22] Blocked-on-Operations, operations, Wikimedia-General-or-Unknown: Invalidate all users sessions - https://phabricator.wikimedia.org/T124440#2010723 (bd808) a:Legoktm>Tgr
[04:40:07] PROBLEM - SSH on alsafi is CRITICAL: Server answer
[04:44:24] (PS1) Andrew Bogott: Fix ldap_user_name_attribute in keystone config again [puppet] - https://gerrit.wikimedia.org/r/269363
[04:45:07] RECOVERY - SSH on alsafi is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[04:57:07] PROBLEM - SSH on alsafi is CRITICAL: Server answer
[04:58:47] RECOVERY - SSH on alsafi is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[05:05:47] PROBLEM - SSH on alsafi is CRITICAL: Server answer
[05:07:36] RECOVERY - SSH on alsafi is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[05:16:28] PROBLEM - puppet last run on cp2023 is CRITICAL: CRITICAL: puppet fail
[05:17:46] PROBLEM - SSH on alsafi is CRITICAL: Server answer
[05:22:49] RECOVERY - SSH on alsafi is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[05:27:58] PROBLEM - SSH on alsafi is CRITICAL: Server answer
[05:28:47] PROBLEM - cassandra-a service on cerium is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[05:29:26] PROBLEM - cassandra-a CQL 10.64.16.153:9042 on cerium is CRITICAL: Connection refused
[05:31:38] RECOVERY - SSH on alsafi is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[05:32:27] RECOVERY - cassandra-a service on cerium is OK: OK - cassandra-a is active
[05:32:57] RECOVERY - cassandra-a CQL 10.64.16.153:9042 on cerium is OK: TCP OK - 0.001 second response time on port 9042
[05:37:08] Ops-Access-Requests, operations, Services: Requesting restbase-admins access to RESTBase cluster for Petr Pchelko - https://phabricator.wikimedia.org/T126283#2010766 (Dzahn) p:Triage>Normal
[05:37:39] Ops-Access-Requests, operations, Services: Requesting restbase-admins access to RESTBase cluster for Petr Pchelko - https://phabricator.wikimedia.org/T126283#2010768 (Dzahn) a:Cmjohnson
[05:40:07] PROBLEM - SSH on alsafi is CRITICAL: Server answer
[05:44:06] RECOVERY - puppet last run on cp2023 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:45:17] RECOVERY - SSH on alsafi is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[05:46:17] (PS1) Dzahn: admin: add user ppchelko [puppet] - https://gerrit.wikimedia.org/r/269368 (https://phabricator.wikimedia.org/T126283)
[05:50:26] (PS1) Dzahn: admin: add ppchelko to restbase-admins [puppet] - https://gerrit.wikimedia.org/r/269369 (https://phabricator.wikimedia.org/T126283)
[05:50:27] PROBLEM - SSH on alsafi is CRITICAL: Server answer
[05:52:33] (CR) Papaul: [V: 1] admin: add user ppchelko [puppet] - https://gerrit.wikimedia.org/r/269368 (https://phabricator.wikimedia.org/T126283) (owner: Dzahn)
[05:54:39] (CR) Papaul: [C: 1 V: 1] admin: add ppchelko to restbase-admins [puppet] - https://gerrit.wikimedia.org/r/269369 (https://phabricator.wikimedia.org/T126283) (owner: Dzahn)
[06:00:47] RECOVERY - SSH on alsafi is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[06:05:57] PROBLEM - SSH on alsafi is CRITICAL: Server answer
[06:09:18] RECOVERY - SSH on alsafi is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[06:26:36] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [24.0]
[06:28:08] PROBLEM - SSH on alsafi is CRITICAL: Server answer
[06:29:09] (PS1) Legoktm: contint: Use slave-scripts/bin/php wrapper script [puppet] - https://gerrit.wikimedia.org/r/269370 (https://phabricator.wikimedia.org/T126211)
[06:29:37] PROBLEM - puppet last run on cp4016 is CRITICAL: CRITICAL: puppet fail
[06:30:27] PROBLEM - puppet last run on db1028 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:35:36] _joe_: can you ping me when you are up and have some time?
[06:35:50] <_joe_> tgr: I am :)
[06:35:56] oh cool
[06:35:58] <_joe_> what's up?
[06:36:07] PROBLEM - cassandra-a CQL 10.64.16.188:9042 on praseodymium is CRITICAL: Connection refused
[06:36:12] I want to deploy https://phabricator.wikimedia.org/T124440#2010709 tonight
[06:36:17] PROBLEM - cassandra-a service on praseodymium is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[06:36:21] (that's PDT tonight)
[06:36:50] and wanted to be sure someone from ops is available in case something unexpected happens
[06:37:16] it changes login cookie calculation which in effect drops all the sessions
[06:37:51] PDT night / UTC morning seemed the least disruptive time for that
[06:38:36] <_joe_> tgr: agreed
[06:38:57] <_joe_> tgr: you just need someone to be awake, or some active action?
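The remark at 06:37:16 (changing the login cookie calculation drops every session at once) can be sketched in Python. This is a minimal illustration under the assumption that a site-wide version string is mixed into each user's stored secret before it is compared with the cookie; the function names and the choice of HMAC-SHA256 are hypothetical and are not taken from MediaWiki's implementation.

```python
import hashlib
import hmac


def cookie_token(user_secret, token_version):
    """Derive the value that would be stored in the login cookie.

    With no version configured the stored secret is used as-is; with a
    version, the secret is mixed with it, so changing the version changes
    every derived token at once. (Illustrative scheme, not MediaWiki's.)
    """
    if token_version is None:
        return user_secret
    return hmac.new(
        token_version.encode("utf-8"),
        user_secret.encode("utf-8"),
        hashlib.sha256,
    ).hexdigest()


def session_is_valid(cookie_value, user_secret, token_version):
    """Check a presented cookie against the currently expected token."""
    expected = cookie_token(user_secret, token_version)
    return hmac.compare_digest(cookie_value, expected)
```

A cookie issued under version `"1"` validates while the version stays `"1"` and stops validating as soon as the version becomes `"2"`, without any per-user secret being touched. That is why a single config sync can log everyone out, and why the timing discussed above matters.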
[06:39:12] <_joe_> tgr: I think *right now* is early enough in europe
[06:39:15] just awake
[06:39:20] <_joe_> at least western europe :)
[06:40:01] from https://grafana.wikimedia.org/dashboard/db/edit-count?from=1454386018304&to=1454990758304 this is probably as good as it gets
[06:40:15] <_joe_> I would've guessed so
[06:41:13] <_joe_> so, let's go :)
[06:45:27] RECOVERY - SSH on alsafi is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0)
[06:52:17] PROBLEM - SSH on alsafi is CRITICAL: Server answer
[06:57:17] RECOVERY - puppet last run on cp4016 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[06:57:57] RECOVERY - puppet last run on db1028 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:00:08] RECOVERY - cassandra-a CQL 10.64.16.188:9042 on praseodymium is OK: TCP OK - 0.002 second response time on port 9042
[07:00:17] RECOVERY - cassandra-a service on praseodymium is OK: OK - cassandra-a is active
[07:01:01] !log tgr@mira Synchronized private/PrivateSettings.php: Mass logout via $wgAuthenticationTokenVersion - T124440#2010709 (duration: 01m 19s)
[07:02:43] !log tgr@mira Synchronized wmf-config/PrivateSettings.php: Mass logout via $wgAuthenticationTokenVersion - T124440#2010709 (duration: 01m 20s)
[07:02:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:11:55] <_joe_> !log manually touched (with -h) the wmf-config/PrivateSettings.php symlink on all mw* hosts
[07:11:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:14:57] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0]
[07:25:57] PROBLEM - cassandra-a service on xenon is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[07:26:16] PROBLEM - cassandra-a CQL 10.64.0.202:9042 on xenon is CRITICAL: Connection refused
[07:32:57] RECOVERY - cassandra-a service on xenon is OK: OK - cassandra-a is active
[07:34:56] RECOVERY - cassandra-a CQL 10.64.0.202:9042 on xenon is OK: TCP OK - 0.010 second response time on port 9042
[07:38:17] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0; xe-1/2/0: down - Core: cr1-eqord:xe-0/0/1 (Telia, IC-313592, 51ms) {#1502} [10Gbps wave]
[07:39:46] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0; xe-0/0/1: down - Core: cr1-ulsfo:xe-1/2/0 (Telia, IC-313592, 51ms) {#11372} [10Gbps wave]
[07:46:37] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0
[07:46:37] (PS1) KartikMistry: Enable Yandex for Persian Wikipedia [puppet] - https://gerrit.wikimedia.org/r/269371 (https://phabricator.wikimedia.org/T125118)
[07:46:56] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 72, down: 0, dormant: 0, excluded: 0, unused: 0
[08:19:26] PROBLEM - cassandra-a CQL 10.64.16.188:9042 on praseodymium is CRITICAL: Connection refused
[08:19:37] PROBLEM - cassandra-a service on praseodymium is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[08:23:07] RECOVERY - cassandra-a service on praseodymium is OK: OK - cassandra-a is active
[08:23:30] !log restarted cassandra-a service on praseodymium
[08:23:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:24:06] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [24.0]
[08:24:28] RECOVERY - cassandra-a CQL 10.64.16.188:9042 on praseodymium is OK: TCP OK - 0.003 second response time on port 9042
[08:30:07] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0; xe-0/0/1: down - Core: cr1-ulsfo:xe-1/2/0 (Telia, IC-313592, 51ms) {#11372} [10Gbps wave]
[08:30:27] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0; xe-1/2/0: down - Core: cr1-eqord:xe-0/0/1 (Telia, IC-313592, 51ms) {#1502} [10Gbps wave]
[08:35:02] (PS1) Muehlenhoff: Remove access credentials for Victoria Baranetsky [puppet] - https://gerrit.wikimedia.org/r/269374
[08:35:46] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 72, down: 0, dormant: 0, excluded: 0, unused: 0
[08:36:12] (CR) Muehlenhoff: [C: 2 V: 2] Remove access credentials for Victoria Baranetsky [puppet] - https://gerrit.wikimedia.org/r/269374 (owner: Muehlenhoff)
[08:37:08] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0
[08:43:23] Puppet, operations, Wikimedia-Apache-configuration: Refactor the mediawiki puppet classes to make HHVM default, drop zend compatibility - https://phabricator.wikimedia.org/T126310#2010867 (Joe) NEW
[08:43:47] <_joe_> ^^ pretty high-yield low-hanging fruit refactoring
[08:44:08] <_joe_> if one of the rookies feels adventurous it could be a good homework
[08:45:07] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0]
[08:49:48] <_joe_> !log installing the new pybal package in esams and eqiad backups
[08:49:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:52:36] Puppet, operations, Wikimedia-Apache-configuration: Refactor the mediawiki puppet classes to make HHVM default, drop zend compatibility - https://phabricator.wikimedia.org/T126310#2010876 (MoritzMuehlenhoff) It would also be great to fix up the package dependencies so that we can stop installing the Ze...
[08:54:52] Puppet, operations, Wikimedia-Apache-configuration: Refactor the mediawiki puppet classes to make HHVM default, drop zend compatibility - https://phabricator.wikimedia.org/T126310#2010877 (Joe) @MoritzMuehlenhoff I think that should be targeted when we upgrade the appservers to jessie (or stretch). Ano...
[09:03:36] (CR) Alexandros Kosiaris: [C: 2] Enable Yandex for Persian Wikipedia [puppet] - https://gerrit.wikimedia.org/r/269371 (https://phabricator.wikimedia.org/T125118) (owner: KartikMistry)
[09:03:43] (PS2) Alexandros Kosiaris: Enable Yandex for Persian Wikipedia [puppet] - https://gerrit.wikimedia.org/r/269371 (https://phabricator.wikimedia.org/T125118) (owner: KartikMistry)
[09:03:49] (CR) Alexandros Kosiaris: [V: 2] Enable Yandex for Persian Wikipedia [puppet] - https://gerrit.wikimedia.org/r/269371 (https://phabricator.wikimedia.org/T125118) (owner: KartikMistry)
[09:08:22] akosiaris: see PM.
[09:11:17] (PS1) KartikMistry: Revert "Enable Yandex for Persian Wikipedia" [puppet] - https://gerrit.wikimedia.org/r/269377
[09:16:23] (PS1) Elukey: Remove mc1004.eqiad.wmnet from memcached/redis pools for maintenance. Bug: T123711 [puppet] - https://gerrit.wikimedia.org/r/269378 (https://phabricator.wikimedia.org/T123711)
[09:16:53] operations, Traffic: Forward-port VCL to Varnish 4 - https://phabricator.wikimedia.org/T124279#2010889 (ema) p:Triage>Normal a:ema
[09:18:38] (CR) Giuseppe Lavagetto: [C: 1] Remove mc1004.eqiad.wmnet from memcached/redis pools for maintenance. Bug: T123711 [puppet] - https://gerrit.wikimedia.org/r/269378 (https://phabricator.wikimedia.org/T123711) (owner: Elukey)
[09:20:21] (CR) Elukey: [C: 2] Remove mc1004.eqiad.wmnet from memcached/redis pools for maintenance. Bug: T123711 [puppet] - https://gerrit.wikimedia.org/r/269378 (https://phabricator.wikimedia.org/T123711) (owner: Elukey)
[09:21:17] !log restarted hhvm on mw1132
[09:21:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:22:06] RECOVERY - HHVM rendering on mw1132 is OK: HTTP OK: HTTP/1.1 200 OK - 67651 bytes in 3.530 second response time
[09:22:57] RECOVERY - Apache HTTP on mw1132 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 485 bytes in 0.194 second response time
[09:30:11] operations, Monitoring: [RFC] Alert about *when* partitions will run out of space, not a percentage/absolute number - https://phabricator.wikimedia.org/T126158#2010900 (jcrespo) I think what Daniel means, correct me if you are wrong "ok with the idea", see problems with the implementation. Allow me to try...
[09:36:45] (CR) Muehlenhoff: "@andrew: I'll merge this later the day when you're around (in case there's a problem with OSM), we can make a final duplicate check on dup" [puppet] - https://gerrit.wikimedia.org/r/269155 (owner: Muehlenhoff)
[09:37:09] operations, RESTBase-Cassandra: impact of large sstables on cassandra - https://phabricator.wikimedia.org/T126221#2010911 (fgiunchedi) there are a few open questions even in the multi instance case, for example: # what determines the size of the biggest sstable? that has a direct impact on capacity planni...
[09:37:09] (PS3) ArielGlenn: puppetize salt master pub/priv keys for labs and prod masters [puppet] - https://gerrit.wikimedia.org/r/252424 (https://phabricator.wikimedia.org/T118385)
[09:37:34] (CR) jenkins-bot: [V: -1] puppetize salt master pub/priv keys for labs and prod masters [puppet] - https://gerrit.wikimedia.org/r/252424 (https://phabricator.wikimedia.org/T118385) (owner: ArielGlenn)
[09:38:28] PROBLEM - cassandra-a service on praseodymium is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[09:39:57] PROBLEM - cassandra-a CQL 10.64.16.188:9042 on praseodymium is CRITICAL: Connection refused
[09:40:53] !log restarted cassandra-a service on praseodymium
[09:41:46] RECOVERY - cassandra-a CQL 10.64.16.188:9042 on praseodymium is OK: TCP OK - 0.003 second response time on port 9042
[09:42:07] RECOVERY - cassandra-a service on praseodymium is OK: OK - cassandra-a is active
[09:47:05] <_joe_> moritzm: I'm unsure just restarting it will achieve anything
[09:47:24] (CR) Alexandros Kosiaris: [C: 1] "We should probably also set a filter like (objectclass=posixaccount) on the line above as well" [puppet] - https://gerrit.wikimedia.org/r/269155 (owner: Muehlenhoff)
[09:49:58] _joe_: I also reported this in #wikimedia-services, it's apparently due to a load test Filippo just said
[09:50:09] (PS4) ArielGlenn: puppetize salt master pub/priv keys for labs and prod masters [puppet] - https://gerrit.wikimedia.org/r/252424 (https://phabricator.wikimedia.org/T118385)
[09:50:17] Running alter table: "110% of stage done"
[09:51:30] (CR) jenkins-bot: [V: -1] puppetize salt master pub/priv keys for labs and prod masters [puppet] - https://gerrit.wikimedia.org/r/252424 (https://phabricator.wikimedia.org/T118385) (owner: ArielGlenn)
[09:51:36] PROBLEM - puppet last run on serpens is CRITICAL: CRITICAL: puppet fail
[09:53:41] (PS5) ArielGlenn: puppetize salt master pub/priv keys for labs and prod masters [puppet] - https://gerrit.wikimedia.org/r/252424 (https://phabricator.wikimedia.org/T118385)
[09:54:57] PROBLEM - puppet last run on db2003 is CRITICAL: CRITICAL: puppet fail
[09:55:11] (CR) jenkins-bot: [V: -1] puppetize salt master pub/priv keys for labs and prod masters [puppet] - https://gerrit.wikimedia.org/r/252424 (https://phabricator.wikimedia.org/T118385) (owner: ArielGlenn)
[09:56:17] !log running table engine conversion script on db1069 (potential small lag on labs for 1 day)
[09:56:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:58:54] (PS6) ArielGlenn: puppetize salt master pub/priv keys for labs and prod masters [puppet] - https://gerrit.wikimedia.org/r/252424 (https://phabricator.wikimedia.org/T118385)
[10:01:06] operations, DBA, Labs, Labs-Infrastructure: db1069 is running low on space - https://phabricator.wikimedia.org/T124464#2010935 (jcrespo) We are already down to 85% disk usage, the conversion is still ongoing. It may be worth checking labs, too.
[10:04:44] <_joe_> !log depooling elastic1021.eqiad.wmnet as RAM has failed
[10:04:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:05:42] (PS7) ArielGlenn: puppetize salt master pub/priv keys for labs and prod masters [puppet] - https://gerrit.wikimedia.org/r/252424 (https://phabricator.wikimedia.org/T118385)
[10:08:36] operations, ops-eqiad: Hardware problem (probably memory) on elastic1021 - https://phabricator.wikimedia.org/T125973#2010956 (Joe) FTR, when a server is down for a prolonged period of time, it's a good idea to remove it from the pybal pools: ``` confctl --find --action set/pooled=inactive elastic1021.eqi...
[10:09:35] <_joe_> !log upgrading pybal on active nodes in esams and eqiad
[10:09:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:10:50] what's the plan _joe_?
[10:11:04] asking because we want to test linux 4.4 on LVSes as well at some point
[10:11:21] <_joe_> paravoid: I'm just apt-get install-ing pybal
[10:11:33] <_joe_> codfw and ulsfo were done yesterday
[10:11:42] <_joe_> this is a small fix for the etcd driver
[10:12:10] <_joe_> so well, I'll be done in a few minutes
[10:12:53] alright
[10:14:06] (CR) Muehlenhoff: "Good point, I'll amend the uidNumber statement in a PS2" [puppet] - https://gerrit.wikimedia.org/r/269155 (owner: Muehlenhoff)
[10:17:58] PROBLEM - puppet last run on wtp2001 is CRITICAL: CRITICAL: puppet fail
[10:18:17] RECOVERY - puppet last run on serpens is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[10:20:44] operations, Wikimedia-Stream: occasional 502 from rcstream seen by pybal - https://phabricator.wikimedia.org/T126313#2010977 (fgiunchedi) NEW
[10:21:01] _joe_: ^
[10:21:26] <_joe_> godog: thanks
[10:23:18] RECOVERY - puppet last run on db2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:24:03] !log elasticsearch eqiad: cleanup leftover logs /var/log/elasticsearch/*.[2-7]
[10:24:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:24:14] (PS1) Jcrespo: s2-master now points to db1018 (instead of db1024) [dns] - https://gerrit.wikimedia.org/r/269381 (https://phabricator.wikimedia.org/T125215)
[10:25:22] (CR) Jcrespo: [C: -2] "Do not submit until swithover is in place." [dns] - https://gerrit.wikimedia.org/r/269381 (https://phabricator.wikimedia.org/T125215) (owner: Jcrespo)
[10:28:51] _joe_: np, don't have much more time to look into it
[10:30:07] (CR) ArielGlenn: [C: 2] puppetize salt master pub/priv keys for labs and prod masters [puppet] - https://gerrit.wikimedia.org/r/252424 (https://phabricator.wikimedia.org/T118385) (owner: ArielGlenn)
[10:32:38] !log elasticsearch codfw: cleanup leftover logs /var/log/elasticsearch/*.[2-7]
[10:32:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:33:10] <_joe_> !log pybal updated everywhere
[10:33:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:33:50] (Abandoned) KartikMistry: Revert "Enable Yandex for Persian Wikipedia" [puppet] - https://gerrit.wikimedia.org/r/269377 (owner: KartikMistry)
[10:35:16] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 730
[10:37:01] (PS2) Muehlenhoff: Use slapo-unique to ensure uniqueness of gidNumber for groups [puppet] - https://gerrit.wikimedia.org/r/269155
[10:39:38] (CR) Giuseppe Lavagetto: [C: 1] "Should we merge this and maybe refine it in case of need?" [puppet] - https://gerrit.wikimedia.org/r/267670 (https://phabricator.wikimedia.org/T124761) (owner: ArielGlenn)
[10:40:16] RECOVERY - check_mysql on db1008 is OK: Uptime: 1796515 Threads: 2 Questions: 10712354 Slow queries: 12110 Opens: 4418 Flush tables: 2 Open tables: 402 Queries per second avg: 5.962 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[10:40:45] (PS1) ArielGlenn: salt master public key file needs newline [puppet] - https://gerrit.wikimedia.org/r/269383
[10:42:33] (CR) ArielGlenn: [C: 2] salt master public key file needs newline [puppet] - https://gerrit.wikimedia.org/r/269383 (owner: ArielGlenn)
[10:44:48] RECOVERY - puppet last run on wtp2001 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[10:45:42] !log disabled puppet, redis and memcached on mc1004 for jessie migration
[10:45:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:46:10] (PS1) Filippo Giunchedi: swiftrepl: name-based filter for objects [software] - https://gerrit.wikimedia.org/r/269387 (https://phabricator.wikimedia.org/T125791)
[10:46:24] (PS1) ArielGlenn: get rid of leading blanks in pem end line of salt master key [puppet] - https://gerrit.wikimedia.org/r/269388
[10:48:26] (CR) ArielGlenn: [C: 2] get rid of leading blanks in pem end line of salt master key [puppet] - https://gerrit.wikimedia.org/r/269388 (owner: ArielGlenn)
[10:48:51] (CR) Filippo Giunchedi: "ping?" [puppet] - https://gerrit.wikimedia.org/r/268080 (owner: Filippo Giunchedi)
[10:49:13] (PS1) Jcrespo: Final configuration after failover (db1018 as the new s2-master) [mediawiki-config] - https://gerrit.wikimedia.org/r/269389 (https://phabricator.wikimedia.org/T125215)
[10:51:00] (CR) Jcrespo: [C: -2] "This should be cherry-picked (not merged) for the swithover, as the final state. Of course, that is assuming no other unrelated change is " [mediawiki-config] - https://gerrit.wikimedia.org/r/269389 (https://phabricator.wikimedia.org/T125215) (owner: Jcrespo)
[10:52:51] (CR) Filippo Giunchedi: [C: -1] "thanks! I think we should be enabling it gradually though, perhaps with smaller wikis first but certainly not with commons right away" [mediawiki-config] - https://gerrit.wikimedia.org/r/266609 (https://phabricator.wikimedia.org/T91869) (owner: Aaron Schulz)
[10:53:23] !log salt minions on labs instances that respond to labcontrol1001 will be coming back up over the next 1/2 hour as puppet runs (salt master key fixes)
[10:53:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:56:04] (CR) Giuseppe Lavagetto: [C: -1] "I have two suggestions:" [puppet] - https://gerrit.wikimedia.org/r/268080 (owner: Filippo Giunchedi)
[10:56:51] <_joe_> apergos: ok with me merging https://gerrit.wikimedia.org/r/267670 ?
[10:57:15] <_joe_> after a more thorough review ofc
[10:57:26] looking
[10:58:04] yes, it works fine, it might print things when not needed, probably ok
[10:58:19] <_joe_> I will amend the patch slightly maybe
[10:58:27] but don't know what wmf-reimage needs exactly for returns, you can see what values I have in there
[10:58:37] <_joe_> I see "print ret/return ret"
[10:58:40] yes
[10:58:45] <_joe_> shouldn't it just return a value
[10:58:46] <_joe_> ?
[10:58:48] the print is from running it from salt master
[10:58:59] the ret is for running it from peer
[10:59:33] <_joe_> ok, I guess we will only run it from the peer
[10:59:37] <_joe_> but still
[10:59:37] there's likely a way to know what's right
[10:59:50] I'd have to dig around but could do so
[10:59:52] <_joe_> how would it be called from the peer?
[11:00:03] well via wmf-reimage right
[11:00:33] I would have to look at a couple other runners and see how they distinguish, if they do
[11:00:43] maybe they pass in an extra parameter or something (eww)
[11:01:07] <_joe_> apergos: yeah via wmf-reimage
[11:01:21] <_joe_> salt-call keys.status fqdn ?
[11:01:38] <_joe_> salt-run sorry
[11:03:04] (PS2) Filippo Giunchedi: swift: point swift to local datacenter imagescalers [puppet] - https://gerrit.wikimedia.org/r/268080
[11:03:56] no, it would be salt-call publish.runner keys.status fqdn
[11:03:57] like that
[11:04:24] runner says 'send it to the master to run'
[11:04:33] otherwise you're trying to run a local salt module
[11:05:06] PROBLEM - HHVM rendering on mw1037 is CRITICAL: HTTP CRITICAL - No data received from host
[11:05:23] !log installing linux-image-4.4.0 on lvs2005 and rebooting for testing
[11:05:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:05:37] PROBLEM - Apache HTTP on mw1037 is CRITICAL: HTTP CRITICAL - No data received from host
[11:06:03] (PS1) Jcrespo: Enabling read only mode for s2 before its master failover [mediawiki-config] - https://gerrit.wikimedia.org/r/269391 (https://phabricator.wikimedia.org/T125215)
[11:06:28] (CR) jenkins-bot: [V: -1] Enabling read only mode for s2 before its master failover [mediawiki-config] - https://gerrit.wikimedia.org/r/269391 (https://phabricator.wikimedia.org/T125215) (owner: Jcrespo)
[11:06:56] (CR) Jcrespo: [C: -2] "Do not apply until the master failover starts. Is this really the syntax to make s2 read only? I need a review." [mediawiki-config] - https://gerrit.wikimedia.org/r/269391 (https://phabricator.wikimedia.org/T125215) (owner: Jcrespo)
[11:07:37] PROBLEM - HHVM processes on mw1037 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm
[11:07:58] (CR) Filippo Giunchedi: "1. I think it makes sense not to have a default to mark what parameters are really required for the class to work" [puppet] - https://gerrit.wikimedia.org/r/268080 (owner: Filippo Giunchedi)
[11:08:17] (PS2) Jcrespo: Enabling read only mode for s2 before its master failover [mediawiki-config] - https://gerrit.wikimedia.org/r/269391 (https://phabricator.wikimedia.org/T125215)
[11:09:26] (PS1) ArielGlenn: fix up last line for salt master pub key for production [puppet] - https://gerrit.wikimedia.org/r/269392
[11:09:27] too much time "programming" puppet to forget about the final ";"
[11:11:09] (CR) ArielGlenn: [C: 2] fix up last line for salt master pub key for production [puppet] - https://gerrit.wikimedia.org/r/269392 (owner: ArielGlenn)
[11:11:55] (CR) Alexandros Kosiaris: [C: 1] Use slapo-unique to ensure uniqueness of gidNumber for groups [puppet] - https://gerrit.wikimedia.org/r/269155 (owner: Muehlenhoff)
[11:12:06] PROBLEM - Host lvs2005 is DOWN: PING CRITICAL - Packet loss = 100%
[11:12:57] RECOVERY - Host lvs2005 is UP: PING OK - Packet loss = 0%, RTA = 36.99 ms
[11:14:01] _joe_: lvs2005 just came up, pybal.log has tons of backtraces
[11:14:23] <_joe_> paravoid: that's from shutting down I guess
[11:14:29] <_joe_> let me see anyways
[11:14:55] Feb 9 11:09:39 lvs2005 pybal[53226]: [pybal] ERROR: failed: [Failure instance: Traceback: .ValueError'>: No JSON object could be decoded
[11:15:11] <_joe_> yes
[11:15:25] yes what? :)
[11:15:34] <_joe_> that's because the deferred for reading etcd doesn't have shut down cleanly
[11:15:44] <_joe_> it's next on my list of bugs
[11:15:46] that's *after* it booted up
[11:15:54] <_joe_> are you sure?
[11:15:58] <_joe_> let me check
[11:16:25] <_joe_> Feb 9 11:09:39 lvs2005 pybal[53226]: [pybal] INFO: Exiting...
[11:16:42] hm
[11:16:44] might be right yeah
[11:16:46] <_joe_> Feb 9 11:14:03 lvs2005 pybal[2192]: [pybal] INFO: Created LVS service 'misc_weblb_80'
[11:16:49] <_joe_> Feb 9 11:14:03 lvs2005 pybal[2192]: Memory allocation problem
[11:16:54] <_joe_> this is a bit more worrying
[11:17:29] nah that's the usual one
[11:18:05] <_joe_> that backtrace got me almost fainting yesterday
[11:18:11] <_joe_> (the one you found too)
[11:18:34] heh
[11:18:41] ok, I'll do a test switchover from 2002 -> 2005 now
[11:19:16] !log stopping pybal on lvs2002
[11:19:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:19:56] ok, I see counters increasing for all IPv6 realservers under ipvsadm -L
[11:20:02] so I guess the 4.3 bug is fixed?
[11:20:13] also no kernel backtraces and/or crashes :P
[11:21:10] yeah, seems sane to me so far
[11:22:38] PROBLEM - PyBal backends health check on lvs2002 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090
[11:22:53] (PS1) Filippo Giunchedi: cassandra: provision restbase1007 with new hw specs [puppet] - https://gerrit.wikimedia.org/r/269394 (https://phabricator.wikimedia.org/T119935)
[11:23:59] so I don't think last time we tried the kernel in codfw
[11:24:07] PROBLEM - pybal on lvs2002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal
[11:24:21] so there's a tiny possibility that the bug we were seeing was eqiad-specific
[11:24:28] considering the eqiad lvs boxes have vastly different hardware
[11:24:48] the bug seemed to be IPVS-specific though, so I don't think that's the case
[11:25:57] RECOVERY - pybal on lvs2002 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal
[11:26:16] RECOVERY - PyBal backends health check on lvs2002 is OK: PYBAL OK - All pools are healthy
[11:36:00] !log reverting lvs2005 to 3.19 and rebooting, test is over and was successful
[11:36:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:38:21] (PS2) Legoktm: contint: Use slave-scripts/bin/php wrapper script [puppet] - https://gerrit.wikimedia.org/r/269370 (https://phabricator.wikimedia.org/T126211)
[11:39:07] PROBLEM - Host lvs2005 is DOWN: PING CRITICAL - Packet loss = 100%
[11:39:47] RECOVERY - Host lvs2005 is UP: PING OK - Packet loss = 0%, RTA = 37.30 ms
[11:41:09] (PS1) Aude: Enable math data type on Wikidata and everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/269398 (https://phabricator.wikimedia.org/T124931)
[11:43:39] operations: 4.4 Linux kernel - https://phabricator.wikimedia.org/T126320#2011106 (MoritzMuehlenhoff) NEW a:MoritzMuehlenhoff
[11:43:47] !log upgrading lvs2001, lvs2002, lvs2003 to kernel 4.4.0
[11:43:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:45:37] I'm going to leave the backup of each pair running the old kernel
[11:45:54] if we hit any trouble, we can just stop pybal on the primary for a quick fix
[11:46:02] (PS3) Legoktm: contint: Use slave-scripts/bin/php wrapper script [puppet] - https://gerrit.wikimedia.org/r/269370 (https://phabricator.wikimedia.org/T126211)
[11:46:04] I'll stick around, but jfyi :)
[11:46:06] (PS1) Aude: Enable ArticlePlaceholder on test wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/269399 (https://phabricator.wikimedia.org/T125901)
[11:48:22] (PS4) Legoktm: contint: Use slave-scripts/bin/php wrapper script [puppet] - https://gerrit.wikimedia.org/r/269370 (https://phabricator.wikimedia.org/T126211)
[11:49:40] (PS2) Aude: Enable ArticlePlaceholder on test wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/269399 (https://phabricator.wikimedia.org/T125901)
[11:52:13] (CR) JanZerebecki: [C: 1] Enable ArticlePlaceholder on test wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/269399 (https://phabricator.wikimedia.org/T125901) (owner: Aude)
[11:53:48] (CR) Legoktm: "This is currently cherry-picked on the integration puppetmaster." [puppet] - https://gerrit.wikimedia.org/r/269370 (https://phabricator.wikimedia.org/T126211) (owner: Legoktm)
[11:54:14] (CR) JanZerebecki: [C: 1] Enable math data type on Wikidata and everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/269398 (https://phabricator.wikimedia.org/T124931) (owner: Aude)
[11:56:06] PROBLEM - Host lvs2002 is DOWN: PING CRITICAL - Packet loss = 100%
[11:58:36] PROBLEM - Host lvs2003 is DOWN: PING CRITICAL - Packet loss = 100%
[11:59:16] RECOVERY - Host lvs2003 is UP: PING OK - Packet loss = 0%, RTA = 36.27 ms
[12:02:30] The AHS file system mount failed with (No such device)
[12:02:33] (PS2) Tim Landscheidt: Tools: Outfactor the configuration for outgoing HBA connections [puppet] - https://gerrit.wikimedia.org/r/267832
[12:02:37] Embedded Flash/SD-CARD
[12:02:38] uhm..
[12:02:50] (PS3) Tim Landscheidt: shinken: Only regenerate configuration when there are changes [puppet] - https://gerrit.wikimedia.org/r/267423
[12:03:29] (PS4) Tim Landscheidt: puppetmaster: Fix git-sync-upstream for unclean rebases [puppet] - https://gerrit.wikimedia.org/r/264692
[12:03:52] (PS4) Tim Landscheidt: Tools: Fix argument quoting in jlocal [puppet] - https://gerrit.wikimedia.org/r/266935
[12:04:02] (PS4) Tim Landscheidt: Tools: Allow proxymanager to add and remove proxy forward entries [puppet] - https://gerrit.wikimedia.org/r/266448
[12:05:42] operations, ops-codfw: lvs2002 Embedded Flash/SD-CARD iLO errors - https://phabricator.wikimedia.org/T126321#2011132 (faidon) NEW
[12:06:17] RECOVERY - Host lvs2002 is UP: PING OK - Packet loss = 0%, RTA = 36.40 ms
[12:08:53] moritzm: 2001-2002-2003 are all upgraded
[12:09:04] let's leave them for a while (e.g. until tomorrow) to make sure everything is working properly
[12:09:19] they're passing traffic now
[12:09:39] ulsfo + esams are at 4.3.0, so the delta is smaller, I don't expect much trouble there
[12:09:49] eqiad would be the most risky one at this point
[12:11:48] ok!
[12:14:38] operations: Failing wikitech logins - https://phabricator.wikimedia.org/T126322#2011152 (MoritzMuehlenhoff) NEW
[12:17:15] operations, Labs, wikitech.wikimedia.org: Failing wikitech logins - https://phabricator.wikimedia.org/T126322#2011152 (valhallasw)
[12:21:18] Blocked-on-Operations, operations, Wikimedia-General-or-Unknown: Invalidate all users sessions - https://phabricator.wikimedia.org/T124440#2011164 (Aklapper) Wondering if {T126322} is related or whether timing is just a coincidence?
[12:21:24] operations, Labs, wikitech.wikimedia.org: Failing wikitech logins - https://phabricator.wikimedia.org/T126322#2011166 (Aklapper) p:Triage>High
[12:21:48] PROBLEM - puppet last run on serpens is CRITICAL: CRITICAL: puppet fail
[12:23:25] operations, ops-ulsfo: ulsfo temperature-related exceptions - https://phabricator.wikimedia.org/T119631#2011172 (faidon) This is still happening: ``` root@lvs4001:~# tail /var/log/kern.log Feb 9 12:19:34 lvs4001 kernel: [6120361.221303] CPU10: Package temperature/speed normal Feb 9 12:19:34 lvs4001 kern...
[12:33:21] !log all CI slaves looping to death because of a php loop
[12:33:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:36:28] !log Jenkins no more accept new jobs until the slaves are fixed :/
[12:36:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:36:42] <_joe_> hashar: what happened?
[12:36:54] operations, Labs, wikitech.wikimedia.org: Failing wikitech logins - https://phabricator.wikimedia.org/T126322#2011187 (aude) I can login, but maybe something is different with my sessions, tokens, etc.
[12:38:18] _joe_: /usr/bin/php points to a slave script which might run `exec php` which resolves to /usr/bin/php ---> death loop of doom
[12:38:33] all because puppet git::clone doesn't properly refresh repos apparently :-}
[12:38:45] and or we forgot to update the script via a mass git pull
[12:39:04] operations, Labs, wikitech.wikimedia.org: Failing wikitech logins - https://phabricator.wikimedia.org/T126322#2011189 (JanZerebecki) From logstash, the only relevant message my try created seems to be: "2.1.0 OpenStackNovaController::authenticate return code: 401"
[12:40:57] (CR) Physikerwelt: [C: 1] Enable math data type on Wikidata and everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/269398 (https://phabricator.wikimedia.org/T124931) (owner: Aude)
[12:42:11] operations, Labs, wikitech.wikimedia.org: Failing wikitech logins - https://phabricator.wikimedia.org/T126322#2011194 (hashar) I have managed to login properly. I have two factor authentication if that matter.
[12:43:21] operations, Salt, Patch-For-Review: setup/deploy sarin(WMF5851) as a salt master in codfw - https://phabricator.wikimedia.org/T125752#2011198 (ArielGlenn)
[12:44:32] (PS9) Giuseppe Lavagetto: Define Production service entries for InitialiseSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/266510 (https://phabricator.wikimedia.org/T114273)
[12:45:05] <_joe_> git::clone doesn't refresh the repo if you don't ask it to
[12:45:29] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 53.57% of data above the critical threshold [5000000.0]
[12:47:55] !log Updated faulty script that caused 'php' too loop infinitely. Jenkins back up.
[12:47:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:48:21] _joe_: I think we have it. But most probably we should salt git pull the repo
[12:48:52] (PS10) Giuseppe Lavagetto: Define Production service entries for InitialiseSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/266510 (https://phabricator.wikimedia.org/T114273)
[12:51:56] (PS4) ArielGlenn: sarin as codfw redundant salt master [puppet] - https://gerrit.wikimedia.org/r/269209
[12:53:02] !log reboot seaborgium to apply memory increase of 2G
[12:53:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:56:16] RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[12:58:56] PROBLEM - puppet last run on labservices1001 is CRITICAL: CRITICAL: Puppet has 1 failures
[13:00:33] (CR) Giuseppe Lavagetto: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/266510 (https://phabricator.wikimedia.org/T114273) (owner: Giuseppe Lavagetto)
[13:01:58] !log Jenkins disabled again :(
[13:02:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:02:10] <_joe_> hashar: oh, sigh
[13:02:24] <_joe_> ok then, gonna do something else in the meanwhile
[13:02:37] RECOVERY - puppet last run on labservices1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:06:51] ah that would be why I'm waiting forever
[13:06:59] time to eat "breakfast" then
[13:07:06] no more excuses
[13:07:31] !log installing linux 4.4.0 on lvs1001
[13:07:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:10:46] PROBLEM - Host lvs1001 is DOWN: CRITICAL - Host Unreachable (208.80.154.55)
[13:11:27] RECOVERY - Host lvs1001 is UP: PING OK - Packet loss = 0%, RTA = 0.93 ms
[13:11:55] !log reboot serpens to apply memory increase of 2G
[13:11:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:12:03] 8752 ? S 0:00 | \_ /bin/bash -eu /srv/deployment/integration/slave-scripts/bin/mw-install-mysql.sh
[13:12:03] 8767 ? S 0:00 | \_ /bin/bash /usr/bin/php maintenance/install.php --confpath /mnt/jenkins-workspace/workspace/mediawiki-core-qunit/src --dbtype=mysql --dbserver=127.0.0.1:3306 --dbuser=jenkins_u2 --dbpass=pw_jenkins_u2 --dbname=jenkins_u2_mw --pass testpass TestWiki WikiAdmin
[13:12:05] 8768 ? S 0:00 | \_ /bin/bash /usr/bin/php maintenance/install.php --confpath /mnt/jenkins-workspace/workspace/mediawiki-core-qunit/src --dbtype=mysql --dbserver=127.0.0.1:3306 --dbuser=jenkins_u2 --dbpass=pw_jenkins_u2 --dbname=jenkins_u2_mw --pass testpass TestWiki WikiAdmin
[13:12:05] 8769 ? S 0:00 | \_ /bin/bash /usr/bin/php maintenance/install.php --confpath /mnt/jenkins-workspace/workspace/mediawiki-core-qunit/src --dbtype=mysql --dbserver=127.0.0.1:3306 --dbuser=jenkins_u2 --dbpass=pw_jenkins_u2 --dbname=jenkins_u2_mw --pass testpass TestWiki WikiAdmin
[13:12:07] argh wrong paste
[13:13:11] operations, Salt, Patch-For-Review: setup/deploy sarin(WMF5851) as a salt master in codfw - https://phabricator.wikimedia.org/T125752#2011264 (ArielGlenn)
[13:13:46] PROBLEM - Host serpens is DOWN: PING CRITICAL - Packet loss = 100%
[13:14:17] RECOVERY - Host serpens is UP: PING OK - Packet loss = 0%, RTA = 37.38 ms
[13:15:49] (CR) Hashar: [C: -1] "removed from puppet master. That is a death loop of doom :(" [puppet] - https://gerrit.wikimedia.org/r/269370 (https://phabricator.wikimedia.org/T126211) (owner: Legoktm)
[13:15:52] !log upgrading lvs1001/lvs1007/lvs1002/lvs1008/lvs1003/lvs1009 to 4.4.0
[13:15:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:15:57] RECOVERY - puppet last run on serpens is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[13:18:03] (CR) Giuseppe Lavagetto: "some minor comments, but I'm inclined on fixing those and merging this." (5 comments) [puppet] - https://gerrit.wikimedia.org/r/267670 (https://phabricator.wikimedia.org/T124761) (owner: ArielGlenn)
[13:18:35] <_joe_> apergos: ^^ I'm going to lunch now, but if you're not around after I'm back, I'm going to merge this
[13:20:06] PROBLEM - Host lvs1002 is DOWN: CRITICAL - Host Unreachable (208.80.154.56)
[13:20:37] RECOVERY - Host lvs1002 is UP: PING OK - Packet loss = 0%, RTA = 1.09 ms
[13:22:56] PROBLEM - Host lvs1007 is DOWN: PING CRITICAL - Packet loss = 100%
[13:24:17] RECOVERY - Host lvs1007 is UP: PING OK - Packet loss = 0%, RTA = 0.64 ms
[13:27:07] PROBLEM - Host lvs1003 is DOWN: CRITICAL - Host Unreachable (208.80.154.57)
[13:27:46] RECOVERY - Host lvs1003 is UP: PING OK - Packet loss = 0%, RTA = 1.38 ms
[13:27:56] PROBLEM - Host lvs1008 is DOWN: PING CRITICAL - Packet loss = 100%
[13:29:06] _joe_: here now so lemme look
[13:29:11] enjoy your lunch
[13:29:16] RECOVERY - Host lvs1008 is UP: PING OK - Packet loss = 0%, RTA = 10.64 ms
[13:31:48] PROBLEM - Host lvs1009 is DOWN: PING CRITICAL - Packet loss = 100%
[13:32:10] (CR) jenkins-bot: [V: -1] Define Production service entries for InitialiseSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/266510 (https://phabricator.wikimedia.org/T114273) (owner: Giuseppe Lavagetto)
[13:32:17] RECOVERY - Host lvs1009 is UP: PING OK - Packet loss = 0%, RTA = 1.76 ms
[13:32:46] (CR) ArielGlenn: "I don't think those constants exist in our version of salt, I looked around for something like that." [puppet] - https://gerrit.wikimedia.org/r/267670 (https://phabricator.wikimedia.org/T124761) (owner: ArielGlenn)
[13:33:27] PROBLEM - PyBal backends health check on lvs1008 is CRITICAL: PYBAL CRITICAL - ocg_8000 - Could not depool server ocg1002.eqiad.wmnet because of too many down!
[13:41:07] PROBLEM - Disk space on kafka1012 is CRITICAL: DISK CRITICAL - free space: /var/spool/kafka/b 73705 MB (3% inode=99%): /var/spool/kafka/f 127290 MB (6% inode=99%) [13:41:44] 6operations, 10ops-ulsfo: ulsfo temperature-related exceptions - https://phabricator.wikimedia.org/T119631#2011327 (10BBlack) See also/merge: T125205 [13:41:47] PROBLEM - puppet last run on cp3033 is CRITICAL: CRITICAL: puppet fail [13:44:18] elukey: ^^^ [13:44:26] I was writing :) [13:44:37] moritzm, bblack: eqiad + codfw primary LVSes all run 4.4.0 now [13:45:03] ulsfo + esams primary/backup run 4.3.0, will upgrade the primaries next (but I have an interview in 15mins) [13:45:18] we are aware of the issue, the last time the node was down two partitions got a huge amount of "old" messages that will get deleted only on the 12th/13th (after a week of retention) [13:45:34] so I was waiting for ottomata to decide how to truncate/delete the old ones [13:49:01] paravoid: should we start experimenting with it on cp* too? I have 40 machines left to reboot for 3.19 updates anyways :P [13:49:15] heh [13:49:17] up to you :) [13:49:33] 4.4 is the next LTS, so we might settle on it as our next stable kernel [13:49:54] be nice to know the v6 routing stuff is fixed for them too, and who knows what other improvements [13:50:03] maybe I'll do a canary from each cluster type [13:50:16] we see increased CPU on the LVSes (we've been over this when we tried 4.3.0 if you recall) [13:50:22] yeah [13:50:22] still the case with 4.4.0 [13:50:32] so it might be interesting to see the impact on cp* [13:50:37] maybe it's busy doing something right that was being done cheaply and wrongly before :) [13:50:43] Who has the power to rename gerrit repos? [13:51:36] !log Restarting Jenkins. 
It can not manage to add slaves [13:51:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:51:45] addshore: sadly we dont rename repositories [13:51:52] addshore: we create a new one and repopulate to the new name using git push [13:51:57] hashar: okay! :) [13:55:02] arhghg [13:55:15] wikibase takes all the phpunit processing ... :D [13:56:02] * hoo hides [13:56:17] that should not be a mad thing, but a good thing [13:56:26] (03PS1) 10Faidon Liambotis: base: apt-listchanges ensure => absent [puppet] - 10https://gerrit.wikimedia.org/r/269410 [13:56:33] moar phpunit tests! [13:56:36] (03CR) 10ArielGlenn: [C: 032] sarin as codfw redundant salt master [puppet] - 10https://gerrit.wikimedia.org/r/269209 (owner: 10ArielGlenn) [13:57:01] jynus: Yes… but many of them are just slow because doing integration testing with MediaWiki is painful [13:58:31] !log shutting down jenkins finally, and restarting it [13:58:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:08:57] RECOVERY - puppet last run on cp3033 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [14:11:47] (03PS2) 10BBlack: maps DNS 1/2: define at all DCs [dns] - 10https://gerrit.wikimedia.org/r/268239 (https://phabricator.wikimedia.org/T109162) [14:12:36] (03CR) 10BBlack: [C: 032] maps DNS 1/2: define at all DCs [dns] - 10https://gerrit.wikimedia.org/r/268239 (https://phabricator.wikimedia.org/T109162) (owner: 10BBlack) [14:14:23] (03PS3) 10BBlack: cache_maps: define tier-2 backending [puppet] - 10https://gerrit.wikimedia.org/r/268233 (https://phabricator.wikimedia.org/T109162) [14:15:03] (03CR) 10BBlack: [C: 032 V: 032] cache_maps: define tier-2 backending [puppet] - 10https://gerrit.wikimedia.org/r/268233 (https://phabricator.wikimedia.org/T109162) (owner: 10BBlack) [14:15:27] (03PS3) 10BBlack: cache_maps: define global service IPs [puppet] - 10https://gerrit.wikimedia.org/r/268234 
(https://phabricator.wikimedia.org/T109162) [14:15:34] (03CR) 10BBlack: [C: 032 V: 032] cache_maps: define global service IPs [puppet] - 10https://gerrit.wikimedia.org/r/268234 (https://phabricator.wikimedia.org/T109162) (owner: 10BBlack) [14:15:47] (03PS3) 10BBlack: cache_maps: add all sites to ganglia [puppet] - 10https://gerrit.wikimedia.org/r/268235 (https://phabricator.wikimedia.org/T109162) [14:15:54] (03CR) 10BBlack: [C: 032 V: 032] cache_maps: add all sites to ganglia [puppet] - 10https://gerrit.wikimedia.org/r/268235 (https://phabricator.wikimedia.org/T109162) (owner: 10BBlack) [14:17:10] hashar: I guess I should request the new repos by the normal route? [14:17:16] (03PS3) 10BBlack: cache_maps: re-role old mobile servers [puppet] - 10https://gerrit.wikimedia.org/r/268236 (https://phabricator.wikimedia.org/T109162) [14:17:18] addshore: yeah [14:17:27] cool! [14:23:38] 6operations, 10DBA, 5Patch-For-Review: Prepare db1018 and s2-slaves for s2 master failover - https://phabricator.wikimedia.org/T125215#2011396 (10aude) I suggest we enable a (non-intrusive) central notice banner on the affected wikis, such as https://meta.wikimedia.org/wiki/Special:CentralNoticeBanners/edit/... [14:24:20] 6operations, 6Labs, 10wikitech.wikimedia.org: Failing wikitech logins - https://phabricator.wikimedia.org/T126322#2011398 (10JanZerebecki) I can still log in to https://horizon.wikimedia.org/ . [14:26:08] 6operations, 10DBA, 5Patch-For-Review: Prepare db1018 and s2-slaves for s2 master failover - https://phabricator.wikimedia.org/T125215#2011400 (10aude) and https://meta.wikimedia.org/wiki/Special:CentralNoticeLogs lists some of the staff members that handle banners, though any meta admin can do this. (if we... [14:27:08] 6operations, 10DBA, 5Patch-For-Review: Prepare db1018 and s2-slaves for s2 master failover - https://phabricator.wikimedia.org/T125215#2011401 (10jcrespo) >>! 
In T125215#2011396, @aude wrote: > I suggest we enable a (non-intrusive) central notice banner on the affected wikis, such as https://meta.wikimedia.o... [14:32:12] 6operations, 6Labs, 10wikitech.wikimedia.org: Rename specific account in LDAP, Wikitech, Gerrit and Phabricator - https://phabricator.wikimedia.org/T85913#2011403 (10Krenair) @demon: Ping. [14:33:24] (03PS2) 10Filippo Giunchedi: grafana: add dashboard import tool [puppet] - 10https://gerrit.wikimedia.org/r/268082 (https://phabricator.wikimedia.org/T125644) [14:33:26] (03PS2) 10Filippo Giunchedi: grafana: import varnish-http-errors [puppet] - 10https://gerrit.wikimedia.org/r/268085 (https://phabricator.wikimedia.org/T125644) [14:35:05] 6operations, 10Traffic, 7Pybal: pybal fails to detect dead servers under production lb IPs for port 80 - https://phabricator.wikimedia.org/T113151#2011406 (10Joe) 5Open>3Resolved [14:35:18] 6operations, 10Traffic, 5Patch-For-Review, 7Pybal: pybal etcd coroutine crashed - https://phabricator.wikimedia.org/T125397#2011407 (10Joe) 5Open>3Resolved [14:35:38] 6operations: Conftool and etcd should represent boolean values as booleans, not 'yes' / 'no' - https://phabricator.wikimedia.org/T106738#2011409 (10Joe) 5Open>3declined [14:36:49] 6operations: Conftool and etcd should represent boolean values as booleans, not 'yes' / 'no' - https://phabricator.wikimedia.org/T106738#2011420 (10Joe) There is a valid use of multi-value pooling logic in terms of our own operations, and the pybal etcd driver supports yes/no/inactive now, so I don't think we sh... 
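The reasoning in T106738 above (pooling is a multi-valued state, yes / no / inactive, so a plain boolean would lose information) could be sketched roughly as follows. This is only an illustration: `Pooled`, `parse_pooled` and `serves_traffic` are hypothetical names, not conftool's or pybal's actual API, and the meaning given to "inactive" in the comments is an assumption.

```python
from enum import Enum


class Pooled(Enum):
    """Hypothetical three-valued pooling state (illustrative names only)."""
    YES = "yes"            # serving traffic
    NO = "no"              # temporarily depooled
    INACTIVE = "inactive"  # assumed: not treated as part of the pool at all


def parse_pooled(raw: str) -> Pooled:
    """Map a stored string to a Pooled state, rejecting unknown values."""
    try:
        return Pooled(raw.strip().lower())
    except ValueError:
        raise ValueError(f"unknown pooled value: {raw!r}") from None


def serves_traffic(state: Pooled) -> bool:
    """Only YES serves traffic; a boolean would merge NO and INACTIVE."""
    return state is Pooled.YES
```

With a boolean, "no" and "inactive" would collapse into the same `False`, which is the distinction the comment argues is worth keeping.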
[14:37:02] (03PS2) 10Faidon Liambotis: base: apt-listchanges ensure => absent [puppet] - 10https://gerrit.wikimedia.org/r/269410 [14:37:14] (03CR) 10Faidon Liambotis: [C: 032] base: apt-listchanges ensure => absent [puppet] - 10https://gerrit.wikimedia.org/r/269410 (owner: 10Faidon Liambotis) [14:37:19] 6operations, 10Traffic, 7Monitoring, 5Patch-For-Review, 7Pybal: Implement pybal pool state monitoring and alerting via icinga - https://phabricator.wikimedia.org/T102394#2011421 (10Joe) 5Open>3Resolved [14:37:30] 7Blocked-on-Operations, 6operations, 10Wikimedia-General-or-Unknown: Invalidate all users sessions - https://phabricator.wikimedia.org/T124440#2011422 (10Krenair) >>! In T124440#2011164, @Aklapper wrote: > Wondering if {T126322} is related or whether timing is just a coincidence? `OpenStackNovaController::a... [14:40:46] 6operations, 6Labs, 10wikitech.wikimedia.org: Failing wikitech logins - https://phabricator.wikimedia.org/T126322#2011429 (10Krenair) WFM. Interestingly I was getting that same OSNC auth 401 error when trying to log into labtestwikitech recently. 
[14:40:59] (03CR) 10Filippo Giunchedi: [C: 031] "what happens is that the grafana url changes, /dashboard/db/ vs /dashboard/file/ and changes from the UI get silently discarde" [puppet] - 10https://gerrit.wikimedia.org/r/268082 (https://phabricator.wikimedia.org/T125644) (owner: 10Filippo Giunchedi) [14:41:20] <_joe_> !log setting mw1026-1050 as inactive in the appservers pool (T126242) [14:41:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:43:28] (03PS3) 10Filippo Giunchedi: grafana: add dashboard import tool [puppet] - 10https://gerrit.wikimedia.org/r/268082 (https://phabricator.wikimedia.org/T125644) [14:43:30] (03PS3) 10Filippo Giunchedi: grafana: import varnish-http-errors [puppet] - 10https://gerrit.wikimedia.org/r/268085 (https://phabricator.wikimedia.org/T125644) [14:44:04] 6operations, 6Labs, 10wikitech.wikimedia.org: Failing wikitech logins - https://phabricator.wikimedia.org/T126322#2011431 (10jcrespo) I had the same problem, changing the LDAP password through wikitech fixed it for me. [14:45:07] !log resuming cpNNNN rolling kernel reboots [14:45:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:46:47] it seems openstack is rejecting some wikitech logins, see https://phabricator.wikimedia.org/T126322 . i remember something like this happened before. like keystone needing some kicking. can someone help? [14:46:48] 6operations, 10DBA: Reimage db2012 - https://phabricator.wikimedia.org/T126209#2011434 (10jcrespo) p:5Triage>3Normal [14:46:58] !log re-enabled puppet on mc1004.eqiad [14:47:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:47:12] andrewbogott, YuviPanda: ^^? [14:49:43] jzerebecki: did that help?
[14:50:24] 6operations, 5Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Reduce the number of appservers we're using in eqiad - https://phabricator.wikimedia.org/T126242#2011435 (10Joe) I depooled mw1025-1050 for now setting all of them to 'inactive'. I'll wait tomorrow to merge the patches to make th... [14:50:26] (03PS2) 10Andrew Bogott: Fix ldap_user_name_attribute in keystone config again [puppet] - 10https://gerrit.wikimedia.org/r/269363 [14:50:40] andrewbogott: nope. what did you do, restart keystone? [14:50:46] yes [14:50:57] PROBLEM - very high load average likely xfs on ms-be1004 is CRITICAL: CRITICAL - load average: 160.16, 117.95, 66.81 [14:51:02] 7Blocked-on-Operations, 6operations, 10Wikimedia-General-or-Unknown: Invalidate all users sessions - https://phabricator.wikimedia.org/T124440#2011438 (10Anomie) >>! In T124440#2010709, @Tgr wrote: > * apart from confusing users, possible fallout is that bots which do not implement `assert=user` will continu... [14:51:19] !log Cutting branches 1.27.0-wmf.13 [14:51:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:51:38] 6operations, 6Labs, 10wikitech.wikimedia.org: Failing wikitech logins - https://phabricator.wikimedia.org/T126322#2011439 (10JanZerebecki) Restarting keystone didn't help. [14:52:01] (03CR) 10Andrew Bogott: [C: 032] Fix ldap_user_name_attribute in keystone config again [puppet] - 10https://gerrit.wikimedia.org/r/269363 (owner: 10Andrew Bogott) [14:53:25] !log reboot ms-be1004, xfs hosed [14:53:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:55:07] PROBLEM - Host cp3043 is DOWN: PING CRITICAL - Packet loss = 100% [14:56:47] <_joe_> bblack: ^^ that you? 
[14:56:51] <_joe_> oh yes [14:56:53] _joe_: yes [14:56:57] <_joe_> sorry kinda lost your log [14:56:57] PROBLEM - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3043_v4, cp3043_v6 [14:57:06] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3043_v4, cp3043_v6 [14:57:06] PROBLEM - IPsec on cp1051 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3043_v4, cp3043_v6 [14:57:07] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3043_v4, cp3043_v6 [14:57:07] PROBLEM - IPsec on cp1062 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3043_v4, cp3043_v6 [14:57:17] PROBLEM - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 140 not-conn: cp3043_v4, cp3043_v6 [14:57:17] PROBLEM - IPsec on kafka1014 is CRITICAL: Strongswan CRITICAL - ok: 140 not-conn: cp3043_v4, cp3043_v6 [14:57:17] PROBLEM - IPsec on cp1064 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3043_v4, cp3043_v6 [14:57:19] I have a "system" that makes it silent and simple mostly, but cp30[34]x have issues rebooting, so almost every one of them ends up alerting :/ [14:57:27] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3043_v4, cp3043_v6 [14:57:39] PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3043_v4, cp3043_v6 [14:57:39] PROBLEM - IPsec on cp1061 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3043_v4, cp3043_v6 [14:57:47] PROBLEM - IPsec on cp1063 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3043_v4, cp3043_v6 [14:57:57] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3043_v4, cp3043_v6 [14:58:07] PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3043_v4, cp3043_v6 [14:58:08] PROBLEM - IPsec on kafka1013 is CRITICAL: Strongswan CRITICAL - ok: 138 not-conn: cp3043_v4, cp3043_v6, cp4016_v4, cp4016_v6 [14:58:15] https://phabricator.wikimedia.org/T126062 for cp30[34]x 
issues [14:58:17] PROBLEM - IPsec on kafka1018 is CRITICAL: Strongswan CRITICAL - ok: 138 not-conn: cp3043_v4, cp3043_v6, cp4016_v4, cp4016_v6 [14:58:37] PROBLEM - IPsec on kafka1012 is CRITICAL: Strongswan CRITICAL - ok: 138 not-conn: cp3043_v4, cp3043_v6, cp4016_v4, cp4016_v6 [14:58:37] PROBLEM - IPsec on kafka1020 is CRITICAL: Strongswan CRITICAL - ok: 138 not-conn: cp3043_v4, cp3043_v6, cp4016_v4, cp4016_v6 [14:58:38] PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3043_v4, cp3043_v6 [14:59:31] !log upgrading lvs3001/3002 to linux 4.4.0 [14:59:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:01:07] RECOVERY - IPsec on cp1064 is OK: Strongswan OK - 60 ESP OK [15:01:09] jzerebecki: how about now? [15:01:17] 6operations, 10ops-esams, 10Traffic: cp30[34]x hw/firmware/BMC issues - https://phabricator.wikimedia.org/T126062#2011443 (10BBlack) Add cp3043 to the list of nodes that needed ipmi_si blacklist [15:01:18] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 60 ESP OK [15:01:30] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 60 ESP OK [15:01:30] RECOVERY - Host cp3043 is UP: PING OK - Packet loss = 0%, RTA = 93.50 ms [15:01:30] RECOVERY - IPsec on cp1061 is OK: Strongswan OK - 60 ESP OK [15:01:49] RECOVERY - IPsec on cp1049 is OK: Strongswan OK - 60 ESP OK [15:01:51] andrewbogott: works [15:01:58] RECOVERY - IPsec on cp1051 is OK: Strongswan OK - 60 ESP OK [15:01:58] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 60 ESP OK [15:01:59] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 60 ESP OK [15:02:00] RECOVERY - IPsec on kafka1020 is OK: Strongswan OK - 142 ESP OK [15:02:00] RECOVERY - IPsec on cp1062 is OK: Strongswan OK - 60 ESP OK [15:02:03] hm, well I just found an interesting bug :( [15:02:09] RECOVERY - IPsec on cp1050 is OK: Strongswan OK - 60 ESP OK [15:02:09] RECOVERY - IPsec on kafka1022 is OK: Strongswan OK - 142 ESP OK [15:02:09] RECOVERY - IPsec on kafka1014 is OK: 
Strongswan OK - 142 ESP OK [15:02:13] andrewbogott: what did you do to fix it? [15:02:18] PROBLEM - NTP on cp3043 is CRITICAL: NTP CRITICAL: Offset unknown [15:02:40] RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 60 ESP OK [15:02:48] RECOVERY - IPsec on kafka1018 is OK: Strongswan OK - 142 ESP OK [15:02:49] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 60 ESP OK [15:02:50] PROBLEM - Host lvs3001 is DOWN: PING CRITICAL - Packet loss = 100% [15:02:59] RECOVERY - IPsec on kafka1012 is OK: Strongswan OK - 142 ESP OK [15:03:18] RECOVERY - IPsec on cp1063 is OK: Strongswan OK - 60 ESP OK [15:03:26] (03CR) 10Giuseppe Lavagetto: [C: 031] "I was worried by the change in permission to the /var/lib/puppet directory, and didn't find any outstanding issues. Nonetheless, I guess a" [puppet] - 10https://gerrit.wikimedia.org/r/268684 (owner: 10Ema) [15:04:06] (03PS2) 10BBlack: SPDY support toggle, off for cp1008 canary [puppet] - 10https://gerrit.wikimedia.org/r/268892 (https://phabricator.wikimedia.org/T125979) [15:04:18] RECOVERY - very high load average likely xfs on ms-be1004 is OK: OK - load average: 13.37, 4.02, 1.40 [15:04:19] RECOVERY - Host lvs3001 is UP: PING OK - Packet loss = 0%, RTA = 85.61 ms [15:04:37] 6operations: setup/deploy oresrdb1001-oresrdb1002 - https://phabricator.wikimedia.org/T125562#2011445 (10Cmjohnson) [15:04:39] 6operations, 10ops-eqiad: Update Label for oresrdb1001 (WMF4577) & relocate and update label for oresrdb1002 (WMF4578) - https://phabricator.wikimedia.org/T125565#2011444 (10Cmjohnson) 5Open>3Resolved [15:04:43] 7Blocked-on-Operations, 6operations, 10Wikimedia-General-or-Unknown: Invalidate all users sessions - https://phabricator.wikimedia.org/T124440#2011447 (10matmarex) >>! In T124440#2010709, @Tgr wrote: > 2) it should happen when the least people are editing ([[ https://grafana.wikimedia.org/dashboard/db/edit-c... 
[15:05:32] (03PS1) 10Andrew Bogott: Change ldap_user_name_attribute back to 'uid' [puppet] - 10https://gerrit.wikimedia.org/r/269415 (https://phabricator.wikimedia.org/T126322) [15:05:35] jzerebecki: ^ [15:06:50] PROBLEM - NTP on cp4016 is CRITICAL: NTP CRITICAL: Offset unknown [15:07:07] (03PS1) 10KartikMistry: CX: Enable specialcx campaign [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269416 (https://phabricator.wikimedia.org/T125306) [15:07:08] PROBLEM - Host lvs3002 is DOWN: PING CRITICAL - Packet loss = 100% [15:07:22] !log stop cassandra on restbase1007, cpu/mem upgrade and reimage [15:07:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:07:58] thx [15:08:00] RECOVERY - Host lvs3002 is UP: PING OK - Packet loss = 0%, RTA = 86.29 ms [15:09:33] (03PS2) 10Filippo Giunchedi: cassandra: provision restbase1007 with new hw specs [puppet] - 10https://gerrit.wikimedia.org/r/269394 (https://phabricator.wikimedia.org/T119935) [15:09:39] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: provision restbase1007 with new hw specs [puppet] - 10https://gerrit.wikimedia.org/r/269394 (https://phabricator.wikimedia.org/T119935) (owner: 10Filippo Giunchedi) [15:10:00] andrewbogott: that fixes it [15:10:37] moritzm: what’s your shell name in labs? [15:10:39] (given that you applied the change from gerrit locally to silver ATM) [15:11:06] "jmm", so different from my wikitech user "Muehlenhoff" [15:11:33] (03CR) 10Giuseppe Lavagetto: "you are right, I keep looking at the current code and it's remarkably different from what we're running." 
[puppet] - 10https://gerrit.wikimedia.org/r/267670 (https://phabricator.wikimedia.org/T124761) (owner: 10ArielGlenn) [15:12:14] (03PS1) 10Muehlenhoff: Update to 3.19.8-ckt13 [debs/linux] - 10https://gerrit.wikimedia.org/r/269417 [15:12:51] (03PS10) 10ArielGlenn: new salt runner to accept/delete/check status of salt key [puppet] - 10https://gerrit.wikimedia.org/r/267670 (https://phabricator.wikimedia.org/T124761) [15:14:57] moritzm: ok. I need to investigate more, thanks for reporting. [15:15:18] (03CR) 10ArielGlenn: [C: 032] new salt runner to accept/delete/check status of salt key [puppet] - 10https://gerrit.wikimedia.org/r/267670 (https://phabricator.wikimedia.org/T124761) (owner: 10ArielGlenn) [15:15:31] !log upgrading lvs4001/4002 to linux 4.4.0 [15:15:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:15:38] PROBLEM - IPsec on kafka1020 is CRITICAL: Strongswan CRITICAL - ok: 140 not-conn: cp4017_v4, cp4017_v6 [15:15:42] (03PS2) 10Andrew Bogott: Change ldap_user_name_attribute back to 'uid' [puppet] - 10https://gerrit.wikimedia.org/r/269415 (https://phabricator.wikimedia.org/T126322) [15:15:59] PROBLEM - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 140 not-conn: cp4017_v4, cp4017_v6 [15:15:59] PROBLEM - IPsec on kafka1014 is CRITICAL: Strongswan CRITICAL - ok: 140 not-conn: cp4017_v4, cp4017_v6 [15:16:49] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:17:18] PROBLEM - Host snapshot1002 is DOWN: PING CRITICAL - Packet loss = 100% [15:17:50] !log snapshot1002 mistakenly taken offline -- booting now [15:17:52] moritzm, jzerebecki, can I experiment on you once more? 
[15:17:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:18:38] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [15:18:48] PROBLEM - Host cp4017 is DOWN: PING CRITICAL - Packet loss = 100% [15:19:09] RECOVERY - Host snapshot1002 is UP: PING OK - Packet loss = 0%, RTA = 0.66 ms [15:19:24] (03CR) 10Ottomata: "PINNNGGGGG! ok for this to go through?" [puppet] - 10https://gerrit.wikimedia.org/r/256954 (owner: 10DCausse) [15:19:49] RECOVERY - NTP on cp3043 is OK: NTP OK: Offset -0.003277659416 secs [15:20:07] andrewbogott: sure, shall I logout/login? [15:20:12] yes please [15:20:28] PROBLEM - Host lvs4002 is DOWN: PING CRITICAL - Packet loss = 100% [15:20:35] goddammit [15:20:36] still working for me [15:20:43] Gave up waiting for root device. Common problems: [15:21:10] PROBLEM - IPsec on cp1054 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4017_v4, cp4017_v6 [15:21:12] moritzm: ok, I may have found a better fix then. 
Thank you [15:21:28] PROBLEM - IPsec on cp1065 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4017_v4, cp4017_v6 [15:21:29] RECOVERY - NTP on cp4016 is OK: NTP OK: Offset -0.0001652240753 secs [15:21:35] still going to revert for now though [15:21:50] (03PS1) 10ArielGlenn: permit puppet master (palladium) to run salt key commands on master [puppet] - 10https://gerrit.wikimedia.org/r/269419 [15:21:54] (03CR) 10Andrew Bogott: [C: 032] Change ldap_user_name_attribute back to 'uid' [puppet] - 10https://gerrit.wikimedia.org/r/269415 (https://phabricator.wikimedia.org/T126322) (owner: 10Andrew Bogott) [15:21:58] PROBLEM - IPsec on cp1066 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4017_v4, cp4017_v6 [15:22:00] PROBLEM - IPsec on cp1053 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4017_v4, cp4017_v6 [15:22:10] PROBLEM - IPsec on cp1052 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4017_v4, cp4017_v6, cp4018_v4, cp4018_v6 [15:22:18] (03PS2) 10ArielGlenn: permit puppet master (palladium) to run salt key commands on master [puppet] - 10https://gerrit.wikimedia.org/r/269419 [15:22:19] PROBLEM - IPsec on kafka1018 is CRITICAL: Strongswan CRITICAL - ok: 138 not-conn: cp3045_v4, cp3045_v6, cp4018_v4, cp4018_v6 [15:22:29] RECOVERY - Host cp4017 is UP: PING OK - Packet loss = 0%, RTA = 75.29 ms [15:22:30] 6operations, 10hardware-requests: new labstore hardware for eqiad - https://phabricator.wikimedia.org/T126089#2011488 (10mark) [15:22:39] PROBLEM - IPsec on cp1068 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4018_v4, cp4018_v6 [15:22:39] PROBLEM - IPsec on cp1055 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4018_v4, cp4018_v6 [15:22:40] PROBLEM - IPsec on kafka1012 is CRITICAL: Strongswan CRITICAL - ok: 138 not-conn: cp3045_v4, cp3045_v6, cp4018_v4, cp4018_v6 [15:22:48] PROBLEM - IPsec on cp1067 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4018_v4, cp4018_v6 [15:23:31] 6operations, 6Labs, 10wikitech.wikimedia.org, 
5Patch-For-Review: Failing wikitech logins - https://phabricator.wikimedia.org/T126322#2011499 (10Andrew) a:3Andrew This is resolved for the moment, but the settings are wrong... keeping open until I can fix properly. [15:23:36] (03CR) 10ArielGlenn: [C: 032] permit puppet master (palladium) to run salt key commands on master [puppet] - 10https://gerrit.wikimedia.org/r/269419 (owner: 10ArielGlenn) [15:24:33] 6operations, 10ops-esams, 10Traffic: cp30[34]x hw/firmware/BMC issues - https://phabricator.wikimedia.org/T126062#2011504 (10BBlack) and cp3045 ... [15:25:58] RECOVERY - IPsec on kafka1018 is OK: Strongswan OK - 142 ESP OK [15:26:18] RECOVERY - IPsec on cp1068 is OK: Strongswan OK - 58 ESP OK [15:26:18] RECOVERY - IPsec on cp1055 is OK: Strongswan OK - 58 ESP OK [15:26:19] RECOVERY - Host lvs4002 is UP: PING OK - Packet loss = 0%, RTA = 75.81 ms [15:26:19] RECOVERY - IPsec on kafka1012 is OK: Strongswan OK - 142 ESP OK [15:26:19] RECOVERY - IPsec on cp1067 is OK: Strongswan OK - 58 ESP OK [15:26:29] RECOVERY - IPsec on kafka1020 is OK: Strongswan OK - 142 ESP OK [15:26:39] RECOVERY - IPsec on cp1054 is OK: Strongswan OK - 58 ESP OK [15:26:49] RECOVERY - IPsec on cp1065 is OK: Strongswan OK - 58 ESP OK [15:26:50] RECOVERY - IPsec on kafka1022 is OK: Strongswan OK - 142 ESP OK [15:26:50] RECOVERY - IPsec on kafka1014 is OK: Strongswan OK - 142 ESP OK [15:26:57] (03CR) 10DCausse: "damn... 
sorry, I completely forgot to test :/" [puppet] - 10https://gerrit.wikimedia.org/r/256954 (owner: 10DCausse) [15:26:59] RECOVERY - IPsec on kafka1013 is OK: Strongswan OK - 142 ESP OK [15:27:07] (03PS1) 10Jcrespo: db1012 default installer = jessie [puppet] - 10https://gerrit.wikimedia.org/r/269421 (https://phabricator.wikimedia.org/T126209) [15:27:29] RECOVERY - IPsec on cp1066 is OK: Strongswan OK - 58 ESP OK [15:27:36] 6operations, 10Traffic: Upgrade LVS servers to a 4.3+ kernel - https://phabricator.wikimedia.org/T119515#2011509 (10faidon) 4.4.0 was released and subsequently packaged by @MoritzMuehlenhoff. After installing it on a couple of canary hosts it was determined that it doesn't suffer from 4.3's (nor 4.2's) issues... [15:27:40] RECOVERY - IPsec on cp1053 is OK: Strongswan OK - 58 ESP OK [15:27:41] RECOVERY - IPsec on cp1052 is OK: Strongswan OK - 58 ESP OK [15:27:44] moritzm: ^^^ [15:28:00] moritzm: TL;DR: eqiad/codfw/esams/ulsfo primaries all upgraded to 4.4.0 [15:28:02] (03PS3) 10Muehlenhoff: Use slapo-unique to ensure uniqueness of gidNumber for groups [puppet] - 10https://gerrit.wikimedia.org/r/269155 [15:28:04] (03PS2) 10Jcrespo: db2012 default installer = jessie [puppet] - 10https://gerrit.wikimedia.org/r/269421 (https://phabricator.wikimedia.org/T126209) [15:28:15] paravoid: nice! [15:28:50] 7Puppet, 6operations, 10Salt, 5Patch-For-Review: Make it possible for wmf-reimage to work seamlessly with a non-local salt master - https://phabricator.wikimedia.org/T124761#2011513 (10ArielGlenn) root@palladium:~# salt-call publish.runner keys.status dataset1001.wikimedia.org [INFO ] Publishing runner... [15:29:45] moritzm: and bblack is waiting for 4.4.0 to be available from carbon to upgrade canary cp* boxes as well [15:31:25] <_joe_> apergos: how can I call that runner from palladium? 
[15:31:36] see the reimage ticket :-P [15:31:44] like, the comment right above ^^ [15:32:00] _joe_: [15:32:51] 6operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint: Maps hardware planning for FY16/17 - https://phabricator.wikimedia.org/T125126#2011528 (10mark) >>! In T125126#2007868, @mark wrote: >>>! In T125126#1995337, @EBernhardson wrote: >> The varnishes, having previously served the entirety of mobile traff... [15:33:37] <_joe_> apergos: it's a bit clunky to call but I can work with it :) [15:33:44] (03PS1) 10Elukey: Add mc1004.eqiad back to the memcached/redis pool Bug: T123711 [puppet] - 10https://gerrit.wikimedia.org/r/269422 (https://phabricator.wikimedia.org/T123711) [15:34:23] we could see how you do it using salt function calls if you prefer that (instead of command line script) [15:34:24] _joe_: [15:34:31] PROBLEM - Host protactinium is DOWN: CRITICAL - Host Unreachable (208.80.154.13) [15:35:13] <_joe_> apergos: no point in doing that [15:35:22] ok, just thought it might be "nicer" [15:35:28] paravoid, bblack: the sync to carbon should be done any minute, I'll add it afterwards [15:35:33] <_joe_> is someone rebooting protactinium? [15:35:41] 6operations, 10RESTBase-Cassandra: impact of large sstables on cassandra - https://phabricator.wikimedia.org/T126221#2011540 (10Eevans) >>! In T126221#2010911, @fgiunchedi wrote: > there are a few open questions even in the multi instance case, for example: > # what determines the size of the biggest sstable?...
[15:35:56] _joe_ https://phabricator.wikimedia.org/T123798 [15:36:05] ^ it was supposed to be removed from monitoring [15:36:17] according to the task [15:36:20] <_joe_> oh ok, heh [15:36:23] !log disabled puppet on kafka1012, changing temporary kafka retention to purge some extra logs [15:36:24] <_joe_> cmjohnson1: I'll remove it [15:36:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:36:53] PROBLEM - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3046_v4, cp3046_v6 [15:36:55] k elukey i'm watching kafka logs there [15:37:01] PROBLEM - Host cp3046 is DOWN: PING CRITICAL - Packet loss = 100% [15:37:15] proceed with broker restart whenever you are ready [15:37:53] (03CR) 10Santhosh: [C: 031] CX: Enable specialcx campaign [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269416 (https://phabricator.wikimedia.org/T125306) (owner: 10KartikMistry) [15:38:13] (03CR) 10Muehlenhoff: [C: 032 V: 032] Use slapo-unique to ensure uniqueness of gidNumber for groups [puppet] - 10https://gerrit.wikimedia.org/r/269155 (owner: 10Muehlenhoff) [15:38:32] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3046_v4, cp3046_v6 [15:38:42] PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3046_v4, cp3046_v6 [15:38:52] PROBLEM - IPsec on kafka1020 is CRITICAL: Strongswan CRITICAL - ok: 140 not-conn: cp3046_v4, cp3046_v6 [15:38:58] <_joe_> Notice: No entries found for protactinium.eqiad.wmnet in storedconfigs. 
[15:39:01] <_joe_> uhm [15:39:02] PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3046_v4, cp3046_v6 [15:39:02] PROBLEM - IPsec on cp1061 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3046_v4, cp3046_v6 [15:39:05] 6operations, 10ops-eqiad, 10DBA: Decommission pc1001-1003 - https://phabricator.wikimedia.org/T124962#2011550 (10jcrespo) p:5Triage>3Low I will start the decom, then, later assign it to @Cmjohnson for in-person steps. Low until space is needed soon. [15:39:08] <_joe_> let me look at neon then [15:39:11] 6operations, 10ops-eqiad, 10DBA: Decommission pc1001-1003 - https://phabricator.wikimedia.org/T124962#2011552 (10jcrespo) a:5RobH>3jcrespo [15:39:12] PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3046_v4, cp3046_v6 [15:39:25] (03PS1) 10Faidon Liambotis: lvs: add schedule_icmp ipvs sysctl [puppet] - 10https://gerrit.wikimedia.org/r/269423 [15:39:33] PROBLEM - IPsec on kafka1018 is CRITICAL: Strongswan CRITICAL - ok: 140 not-conn: cp3046_v4, cp3046_v6 [15:39:41] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3046_v4, cp3046_v6 [15:39:42] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3046_v4, cp3046_v6 [15:39:42] PROBLEM - IPsec on kafka1012 is CRITICAL: Strongswan CRITICAL - ok: 140 not-conn: cp3046_v4, cp3046_v6 [15:39:58] (03PS3) 10Jcrespo: db2012 default installer = jessie [puppet] - 10https://gerrit.wikimedia.org/r/269421 (https://phabricator.wikimedia.org/T126209) [15:40:12] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3046_v4, cp3046_v6, cp4014_v4, cp4014_v6 [15:40:18] ottomata: kafka broker restarted [15:40:26] log retention 96 hours (4 days) [15:40:42] PROBLEM - IPsec on cp1064 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3046_v4, cp3046_v6, cp4014_v4, cp4014_v6 [15:40:42] PROBLEM - IPsec on cp1063 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3046_v4, 
cp3046_v6, cp4014_v4, cp4014_v6 [15:40:45] !log echo 1 > /proc/sys/net/ipv4/vs/schedule_icmp on lvs3001 [15:40:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:41:12] PROBLEM - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 138 not-conn: cp3046_v4, cp3046_v6, cp4014_v4, cp4014_v6 [15:41:12] PROBLEM - IPsec on kafka1014 is CRITICAL: Strongswan CRITICAL - ok: 138 not-conn: cp3046_v4, cp3046_v6, cp4014_v4, cp4014_v6 [15:41:26] 6operations, 10RESTBase-Cassandra: impact of large sstables on cassandra - https://phabricator.wikimedia.org/T126221#2011557 (10Eevans) a:3Eevans [15:41:35] elenah: k, not muh change yet, i did see it truncating [15:41:42] RECOVERY - Host cp3046 is UP: PING OK - Packet loss = 0%, RTA = 85.98 ms [15:41:43] PROBLEM - IPsec on cp1062 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp4014_v4, cp4014_v6 [15:42:39] elukey: let's wait at least 5 minutes [15:42:49] default log.retention.check.interval.ms is 5 mins [15:42:52] PROBLEM - IPsec on cp1051 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp4014_v4, cp4014_v6 [15:42:53] PROBLEM - IPsec on kafka1013 is CRITICAL: Strongswan CRITICAL - ok: 140 not-conn: cp4014_v4, cp4014_v6 [15:43:00] ottomata: yep got it [15:43:52] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 60 ESP OK [15:44:03] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 60 ESP OK [15:44:13] RECOVERY - IPsec on cp1049 is OK: Strongswan OK - 60 ESP OK [15:44:21] RECOVERY - IPsec on cp1050 is OK: Strongswan OK - 60 ESP OK [15:44:22] RECOVERY - IPsec on cp1064 is OK: Strongswan OK - 60 ESP OK [15:44:22] RECOVERY - IPsec on cp1063 is OK: Strongswan OK - 60 ESP OK [15:44:22] RECOVERY - IPsec on kafka1020 is OK: Strongswan OK - 142 ESP OK [15:44:33] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 60 ESP OK [15:44:33] RECOVERY - IPsec on cp1061 is OK: Strongswan OK - 60 ESP OK [15:44:39] that was the last of the suspect cp30[34]x, should be much less noise from cache reboots now 
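Note on the Kafka exchange above (temporary 96-hour retention, then "let's wait at least 5 minutes" because log.retention.check.interval.ms defaults to 5 minutes): a rough Python sketch of why the wait is needed. It assumes time-based retention is judged per log segment by last-modified time, which simplifies what the broker does. The function name, segment file names and timestamps are hypothetical. Only the 96-hour and 5-minute figures come from the log.

```python
from datetime import datetime, timedelta

# Figures quoted in the log; the names are illustrative, not Kafka APIs.
RETENTION = timedelta(hours=96)        # "log retention 96 hours (4 days)"
CHECK_INTERVAL = timedelta(minutes=5)  # default log.retention.check.interval.ms


def expired_segments(segment_mtimes, now, retention=RETENTION):
    """Return the names of segments last modified longer ago than `retention`.

    Assumption: the broker only evaluates this on its periodic retention
    check, so disk space is not freed at the moment the setting changes.
    """
    return sorted(
        name for name, mtime in segment_mtimes.items()
        if now - mtime > retention
    )


if __name__ == "__main__":
    now = datetime(2016, 2, 9, 15, 45)
    # Hypothetical segment files with made-up ages.
    segments = {
        "00000000000000000000.log": now - timedelta(hours=120),
        "00000000000000500000.log": now - timedelta(hours=97),
        "00000000000001000000.log": now - timedelta(hours=10),
    }
    print(expired_segments(segments, now))
```

Under these assumptions the two oldest segments become eligible for deletion, but only once the next retention check runs, hence the wait before looking at disk usage on kafka1012.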
[15:44:42] RECOVERY - IPsec on cp1051 is OK: Strongswan OK - 60 ESP OK [15:44:42] RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 60 ESP OK [15:44:53] RECOVERY - IPsec on kafka1013 is OK: Strongswan OK - 142 ESP OK [15:44:54] RECOVERY - IPsec on kafka1022 is OK: Strongswan OK - 142 ESP OK [15:44:54] RECOVERY - IPsec on kafka1014 is OK: Strongswan OK - 142 ESP OK [15:45:13] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 60 ESP OK [15:45:21] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 60 ESP OK [15:45:32] RECOVERY - IPsec on cp1062 is OK: Strongswan OK - 60 ESP OK [15:47:16] 6operations, 10Dumps-Generation, 10hardware-requests: determine hardware needs for dumps in eqiad and codfw - https://phabricator.wikimedia.org/T118154#2011586 (10ArielGlenn) [15:47:59] <_joe_> !log re-removed the puppet facts for protactinium [15:48:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:48:16] 6operations, 10Dumps-Generation: Make dumps run via cron on each snapshot host - https://phabricator.wikimedia.org/T107750#2011597 (10ArielGlenn) 5Open>3Resolved Why do I forget to log success here? Anyways the jobs are humming along so I can finally close this. No more screen sessions! [15:48:40] 6operations, 6Labs: RDNS for 10.68.18.65 resolves to two different instances - https://phabricator.wikimedia.org/T115194#2011601 (10Krenair) [15:49:03] RECOVERY - IPsec on kafka1018 is OK: Strongswan OK - 142 ESP OK [15:49:21] RECOVERY - IPsec on kafka1012 is OK: Strongswan OK - 142 ESP OK [15:49:43] 6operations, 6Labs: RDNS for some labs instance IPs resolve to multiple different instances - https://phabricator.wikimedia.org/T115194#2011608 (10Krenair) [15:50:07] 7Blocked-on-Operations, 6operations, 10Wikimedia-General-or-Unknown: Invalidate all users sessions - https://phabricator.wikimedia.org/T124440#2011610 (10Tgr) >>! In T124440#2011164, @Aklapper wrote: > Wondering if {T126322} is related or whether timing is just a coincidence? 
If something went really bad (e... [15:51:27] 6operations, 6Labs: RDNS for some labs instance IPs resolve to multiple different instances - https://phabricator.wikimedia.org/T115194#2011613 (10Krenair) [15:52:11] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 [15:52:12] 6operations, 10ops-eqiad, 5Patch-For-Review: decomission the netapps in EQIAD: nas1001-a, nas1001-b - https://phabricator.wikimedia.org/T124156#2011615 (10faidon) nas1001-a/b were also connected to cr1/2-eqiad with 10G. I disabled ports xe-5/3/2 on both but left the descriptions. Do not forget to unplug, the... [15:52:56] twentyafterfour, chasemp, apergos et all: iridium has puppet disabled for the past 5 days [15:53:01] (03PS3) 10BBlack: SPDY support toggle, off for cp1008 canary [puppet] - 10https://gerrit.wikimedia.org/r/268892 (https://phabricator.wikimedia.org/T125979) [15:53:02] reason is "DO NO ENABLE AS IT WILL BREAK THINGS CONTACT MUKUNDA" [15:53:03] (03PS2) 10BBlack: disable SPDY for all cache_text [puppet] - 10https://gerrit.wikimedia.org/r/268893 (https://phabricator.wikimedia.org/T125979) [15:53:21] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 206, down: 0, dormant: 0, excluded: 0, unused: 0 [15:53:25] twentyafterfour is going to know whether it's safe [15:53:47] I think [15:54:04] we can probably re-enable it, but I would be responsible for patching crap together if it break again [15:54:10] yes that was the deal, mukunda has to fix things so it won't break [15:54:11] paravoid: [15:54:11] (03CR) 10jenkins-bot: [V: 04-1] SPDY support toggle, off for cp1008 canary [puppet] - 10https://gerrit.wikimedia.org/r/268892 (https://phabricator.wikimedia.org/T125979) (owner: 10BBlack) [15:54:43] yeah chasemp but this was supposed to be a short time fix [15:54:49] alright, let's wait for him [15:54:53] i.e. 
make sure puppet isn't going to break it again [15:55:02] fine by me [15:55:25] !log uploaded linux 4.4-1~wmf1 (jessie-wikimedia/experimental) to carbon [15:55:26] (03PS4) 10BBlack: SPDY support toggle, off for cp1008 canary [puppet] - 10https://gerrit.wikimedia.org/r/268892 (https://phabricator.wikimedia.org/T125979) [15:55:28] (03PS3) 10BBlack: disable SPDY for all cache_text [puppet] - 10https://gerrit.wikimedia.org/r/268893 (https://phabricator.wikimedia.org/T125979) [15:55:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:56:21] !log "power"cycling alsafi [15:56:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:56:52] RECOVERY - Check size of conntrack table on alsafi is OK: OK: nf_conntrack is 0 % full [15:56:59] I thought akosiaris fixed that issue (with alsafi) [15:57:02] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:57:06] ^bblack [15:57:11] RECOVERY - Disk space on kafka1012 is OK: DISK OK [15:57:24] moritzm: thanks :) [15:57:32] RECOVERY - DPKG on alsafi is OK: All packages OK [15:57:49] note this is based on 4.4._0_, so maybe limit to cp1008 initially :-) [15:57:52] RECOVERY - Disk space on alsafi is OK: DISK OK [15:57:52] RECOVERY - configured eth on alsafi is OK: OK - interfaces up [15:57:52] RECOVERY - RAID on alsafi is OK: OK: no RAID installed [15:58:02] RECOVERY - salt-minion processes on alsafi is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:58:02] RECOVERY - dhclient process on alsafi is OK: PROCS OK: 0 processes with command name dhclient [15:58:22] RECOVERY - SSH on alsafi is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0) [15:58:32] RECOVERY - puppet last run on alsafi is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [15:58:52] (03CR) 10BBlack: [C: 032] SPDY support toggle, off for cp1008 canary [puppet] - 
10https://gerrit.wikimedia.org/r/268892 (https://phabricator.wikimedia.org/T125979) (owner: 10BBlack) [15:59:21] !log puppet re-enabled on kafka1012 [15:59:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:00:05] anomie ostriches thcipriani marktraceur Krenair: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160209T1600). [16:00:05] aude: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [16:00:41] PROBLEM - puppet last run on mw1090 is CRITICAL: CRITICAL: Puppet has 1 failures [16:01:01] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [16:01:03] (03CR) 10Alex Monk: "T126338, T125941" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158409 (owner: 10Andrew Bogott) [16:01:32] * aude wavse [16:01:34] wave* [16:01:46] I can SWAT [16:01:51] paravoid: kafka1012 is good now :) [16:02:07] :) [16:02:26] jouncebot doesn't like me [16:03:47] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269398 (https://phabricator.wikimedia.org/T124931) (owner: 10Aude) [16:04:22] (03Merged) 10jenkins-bot: Enable math data type on Wikidata and everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269398 (https://phabricator.wikimedia.org/T124931) (owner: 10Aude) [16:04:29] moritzm: re: 4.4, are they involved with linux-meta, or should I include other packages? or just "apt-get install linux-image-4.4"? [16:04:48] well 4.4.whatever [16:05:38] so it's just wikidata in swat today? 
[16:05:53] no, i see tgr there [16:06:45] the bot probably did not approve of adding myself one minute before SWAT started [16:09:05] !log thcipriani@mira Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable math data type on Wikidata and everywhere [[gerrit:269398]] (duration: 02m 31s) [16:09:07] ^ aude check please [16:09:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:09:40] ok [16:09:57] i see the property in Special:NewProperty [16:10:08] so i think is ok [16:10:46] got an error on mw1037.eqiad.wmnet looks like "rsync: failed to set times on "/srv/mediawiki/.": Read-only file system (30)" [16:12:30] :( [16:12:47] (03PS1) 10Andrew Bogott: Tell keystone to use the actual username as the username field. [puppet] - 10https://gerrit.wikimedia.org/r/269426 [16:13:32] hmm mw1037 doesn't let me ssh in either, I see the banner, then I get connection closed exit status -1 [16:14:47] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269399 (https://phabricator.wikimedia.org/T125901) (owner: 10Aude) [16:14:49] I haven't updated linux-meta yet, that's still TBD, on the lvs* simply linux-image-4.4.0-1-amd64 was installed [16:14:59] 6operations, 10ops-esams, 10Traffic: cp30[34]x hw/firmware/BMC issues - https://phabricator.wikimedia.org/T126062#2011692 (10BBlack) So for the record, the total list of hosts that are now running with ipmi_si blacklisted are: cp3032, cp3039, cp3043, and cp3045 [16:15:08] note there's also an updated firmware-nonfree package, though [16:15:15] (03Merged) 10jenkins-bot: Enable ArticlePlaceholder on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269399 (https://phabricator.wikimedia.org/T125901) (owner: 10Aude) [16:15:22] 6operations, 10ops-esams, 10Traffic: cp30[34]x hw/firmware/BMC issues - https://phabricator.wikimedia.org/T126062#2011695 (10BBlack) (if I had to guess, these machines won't correctly reboot/poweroff due to that, but who 
knows until we try) [16:15:24] IIRC it was the bnx2x driver which required an updated firmware [16:15:39] moritzm: yes [16:15:52] article placeholder should be no-op [16:15:52] moritzm: we have bnx2x on LVSes too [16:15:58] !log mw1037.eqiad.wmnet error during SWAT rsync: failed to set times on "/srv/mediawiki/.": Read-only file system (30) [16:16:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:16:25] yeah, I think Faidon upgraded these along [16:16:26] (i'll be around later during the train to verify things, when the code is deployed) [16:16:31] RECOVERY - NTP on alsafi is OK: NTP OK: Offset 0.0009155273438 secs [16:16:38] 20151018-2~wmf1 is the jessie-wikimedia backport [16:17:11] so what's the set of packages to install? [16:18:47] !log thcipriani@mira Synchronized wmf-config: SWAT: Enable ArticlePlaceholder on test wikis [[gerrit:269399]] (duration: 01m 19s) [16:18:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:18:51] ^ aude sync'd [16:19:22] I guess half my problem is our apt sources are inconsistent on these machines [16:19:27] 7Blocked-on-Operations, 6operations, 10Wikimedia-General-or-Unknown: Invalidate all users sessions - https://phabricator.wikimedia.org/T124440#2011701 (10Tgr) @Billinghurst reports that autologin does not work for him (he needs to log in separately on every wiki). I guess there is not much point in trying t... [16:19:50] thcipriani: thanks [16:20:13] (cp1008 has jessie-backports commented out) [16:20:18] tgr: you saw Siebrand's message looks like? 
[16:20:29] er code review (since I don't see comments) [16:20:31] then again lvs3001 doesn't have it at all [16:21:00] thcipriani: yes [16:21:47] MatmaRex or I will follow up on that, but the message change is urgent due to the mass session invalidation earlier today [16:22:19] tgr: ack, I've +2 [16:23:53] (03PS4) 10Jcrespo: db2012 default installer = jessie [puppet] - 10https://gerrit.wikimedia.org/r/269421 (https://phabricator.wikimedia.org/T126209) [16:24:46] (03CR) 10Giuseppe Lavagetto: [C: 031] "LGTM, minor nit" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/269422 (https://phabricator.wikimedia.org/T123711) (owner: 10Elukey) [16:25:37] (03CR) 10Jcrespo: [C: 032] db2012 default installer = jessie [puppet] - 10https://gerrit.wikimedia.org/r/269421 (https://phabricator.wikimedia.org/T126209) (owner: 10Jcrespo) [16:26:25] ah! [16:26:26] modules/role/manifests/lvs/balancer.pp: apt::repository { 'wikimedia-experimental': [16:27:21] RECOVERY - puppet last run on mw1090 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:27:44] (03PS2) 10Elukey: Add mc1004.eqiad back to the memcached/redis pool Bug: T123711 [puppet] - 10https://gerrit.wikimedia.org/r/269422 (https://phabricator.wikimedia.org/T123711) [16:29:06] 10Ops-Access-Requests, 6operations, 6Services, 5Patch-For-Review: Requesting restbase-admins access to RESTBase cluster for Petr Pchelko - https://phabricator.wikimedia.org/T126283#2011714 (10mobrovac) Hm, actually, with `restbase-admins` you are not able to deploy, you can: - log in, - read logs - start/s... [16:29:10] (03PS3) 10Elukey: Add mc1004.eqiad back to the memcached/redis pool Bug: T123711 [puppet] - 10https://gerrit.wikimedia.org/r/269422 (https://phabricator.wikimedia.org/T123711) [16:29:17] ---^ _joe_ - last one [16:29:58] 7Blocked-on-Operations, 6operations, 10Wikimedia-General-or-Unknown: Invalidate all users sessions - https://phabricator.wikimedia.org/T124440#2011716 (10Tgr) >>! 
In T124440#2011447, @matmarex wrote: > The error messages about "loss of session data" are really lame. Could we merge and deploy the patch for {T... [16:30:18] !log thcipriani@mira Started scap: SWAT: Clarify and expand messages mentioning loss of session data [[gerrit:269424]] [16:30:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:34:00] !log reimage db2012 [16:34:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:35:37] PROBLEM - Restbase root url on restbase1007 is CRITICAL: Connection refused [16:36:06] PROBLEM - cassandra-a CQL 10.64.0.230:9042 on restbase1007 is CRITICAL: Connection refused [16:36:19] that's me ^ reimaged [16:36:56] PROBLEM - puppet last run on mw1037 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [16:38:47] thcipriani: gotta run but it's an i18n change so no real need to verify, I think [16:38:54] tgr: kk. also worth mentioning that wmf.13 has been cut already, so you might need to backport the i18n messages to that branch before the train [16:39:06] ack, thanks [16:42:25] 6operations, 10Traffic: Upgrade LVS servers to a 4.3+ kernel - https://phabricator.wikimedia.org/T119515#2011775 (10BBlack) I see we also have a 4.4.0-rt to try as well. It sounds like it might be beneficial on LVS and/or cp, but probably needs separate testing. [16:46:21] 6operations, 10hardware-requests: new labstore hardware for eqiad - https://phabricator.wikimedia.org/T126089#2011782 (10chasemp) Need some guidance here outlining how we can sort out new servers with breaking up the existing shelves. [16:47:20] 6operations, 6Performance-Team, 10Traffic, 5Patch-For-Review: Disable SPDY on cache_text for a week - https://phabricator.wikimedia.org/T125979#2011788 (10BBlack) The cache kernel reboots will be done in a few hours. I figure allow the rest of the day for the perf impact there to settle back to "normal",... 
[16:48:33] 6operations, 6Performance-Team, 10Traffic, 5Patch-For-Review: Disable SPDY on cache_text for a week - https://phabricator.wikimedia.org/T125979#2011789 (10BBlack) (also note pinkunicorn/cp1008 already has SPDY removed. You can locally hack e.g. en.wikipedia.org DNS to point at 208.80.154.42 to see how the... [16:50:56] PROBLEM - cassandra-a CQL 10.64.16.153:9042 on cerium is CRITICAL: Connection refused [16:51:15] PROBLEM - cassandra-a service on cerium is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [16:53:29] !log rebooting cp1008/pinkunicorn for 4.4 kernel [16:53:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:56:54] (03PS14) 10Elukey: Adding a new email template for Burrow lag alerts. [puppet] - 10https://gerrit.wikimedia.org/r/268682 [16:57:55] !log thcipriani@mira Finished scap: SWAT: Clarify and expand messages mentioning loss of session data [[gerrit:269424]] (duration: 27m 36s) [16:57:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:58:28] ^ tgr i18n messages updated. [17:00:05] jynus moritzm: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160209T1700). Please do the needful. [17:00:32] 6operations, 10DBA, 5Patch-For-Review: Prepare db1018 and s2-slaves for s2 master failover - https://phabricator.wikimedia.org/T125215#2011805 (10greg) >>! In T125215#2011401, @jcrespo wrote: >>>! In T125215#2011396, @aude wrote: >> I suggest we enable a (non-intrusive) central notice banner on the affected... 
[17:00:43] no patches, but you have time to add more [17:01:27] PROBLEM - restbase endpoints health on xenon is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200): /page/html/{title} (Get html by title from storage) is CRITICAL: Test Get html [17:02:06] PROBLEM - restbase endpoints health on cerium is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200): /page/html/{title} (Get html by title from storage) is CRITICAL: Test Get htm [17:02:07] PROBLEM - cassandra-a service on xenon is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [17:02:46] RECOVERY - cassandra-a service on cerium is OK: OK - cassandra-a is active [17:02:56] PROBLEM - cassandra-a CQL 10.64.0.202:9042 on xenon is CRITICAL: Connection refused [17:03:15] PROBLEM - restbase endpoints health on praseodymium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[17:03:45] (03CR) 10Muehlenhoff: [C: 032 V: 032] Update to 3.19.8-ckt13 [debs/linux] - 10https://gerrit.wikimedia.org/r/269417 (owner: 10Muehlenhoff) [17:04:05] RECOVERY - cassandra-a service on xenon is OK: OK - cassandra-a is active [17:04:17] RECOVERY - cassandra-a CQL 10.64.16.153:9042 on cerium is OK: TCP OK - 0.005 second response time on port 9042 [17:04:55] RECOVERY - cassandra-a CQL 10.64.0.202:9042 on xenon is OK: TCP OK - 0.004 second response time on port 9042 [17:05:05] RECOVERY - restbase endpoints health on praseodymium is OK: All endpoints are healthy [17:05:17] RECOVERY - restbase endpoints health on xenon is OK: All endpoints are healthy [17:05:56] RECOVERY - restbase endpoints health on cerium is OK: All endpoints are healthy [17:07:40] !log start cassandra-a on restbase1007 with replace_address=10.64.0.230 [17:07:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:11:45] PROBLEM - Host cp4015 is DOWN: PING CRITICAL - Packet loss = 100% [17:12:50] (03PS1) 10Jcrespo: Upgrading phabricator mysql setting for db2012 [puppet] - 10https://gerrit.wikimedia.org/r/269436 (https://phabricator.wikimedia.org/T126209) [17:13:41] 6operations, 6Labs, 10wikitech.wikimedia.org: Rename specific account in LDAP, Wikitech, Gerrit and Phabricator - https://phabricator.wikimedia.org/T85913#2011847 (10demon) @adrianheine find me on IRC (ostriches or ^d), prefer to do it synchronously so we can troubleshoot if it goes wrong. 
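Note on the IPsec alerts in this log: each cache reboot produces a burst of "PROBLEM - IPsec on <host> ... not-conn: <peer>_v4, <peer>_v6" lines, one per host that lost its tunnels to the rebooting machine. A small Python sketch for pulling the rebooting peer out of such a line. The regular expression is inferred only from the alert lines shown in this channel and the function name is hypothetical. This is not part of the Icinga check itself.

```python
import re

# Pattern inferred from the alert text in this log (an assumption, not a spec).
ALERT_RE = re.compile(
    r"PROBLEM - IPsec on (?P<host>\S+) is CRITICAL: "
    r"Strongswan CRITICAL - ok: (?P<ok>\d+) not-conn: (?P<tunnels>.+)"
)


def unreachable_peers(alert):
    """Parse one IPsec PROBLEM alert.

    Returns (reporting_host, ok_count, sorted_peer_hosts), or None when the
    text is not an IPsec PROBLEM alert of the expected shape.
    """
    match = ALERT_RE.search(alert)
    if match is None:
        return None
    tunnels = [t.strip() for t in match.group("tunnels").split(",") if t.strip()]
    # "cp3046_v4" and "cp3046_v6" are two tunnels to the same peer host.
    peers = sorted({re.sub(r"_v[46]$", "", t) for t in tunnels})
    return match.group("host"), int(match.group("ok")), peers


if __name__ == "__main__":
    line = (
        "PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - "
        "ok: 56 not-conn: cp3046_v4, cp3046_v6, cp4014_v4, cp4014_v6"
    )
    print(unreachable_peers(line))
```

Grouping the results by peer shows quickly whether a wall of alerts traces back to one rebooting host or to several.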
[17:14:07] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp4015_v4, cp4015_v6 [17:14:08] PROBLEM - IPsec on kafka1013 is CRITICAL: Strongswan CRITICAL - ok: 140 not-conn: cp4015_v4, cp4015_v6 [17:14:08] PROBLEM - IPsec on cp1051 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp4015_v4, cp4015_v6 [17:14:26] PROBLEM - IPsec on cp1063 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp4015_v4, cp4015_v6 [17:14:30] (03CR) 10Ottomata: [C: 031] Adding a new email template for Burrow lag alerts. [puppet] - 10https://gerrit.wikimedia.org/r/268682 (owner: 10Elukey) [17:14:37] PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp4015_v4, cp4015_v6 [17:14:39] (03PS1) 10Giuseppe Lavagetto: puppetmaster: adapt wmf-reimage to use remote salt calls [puppet] - 10https://gerrit.wikimedia.org/r/269437 (https://phabricator.wikimedia.org/T124761) [17:14:45] PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp4015_v4, cp4015_v6 [17:14:45] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp4015_v4, cp4015_v6 [17:14:46] PROBLEM - IPsec on cp1064 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp4015_v4, cp4015_v6 [17:14:47] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp4015_v4, cp4015_v6 [17:14:47] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp4015_v4, cp4015_v6 [17:15:16] RECOVERY - Host cp4015 is UP: PING OK - Packet loss = 0%, RTA = 74.95 ms [17:16:06] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 60 ESP OK [17:16:06] RECOVERY - IPsec on kafka1013 is OK: Strongswan OK - 142 ESP OK [17:16:06] RECOVERY - IPsec on cp1051 is OK: Strongswan OK - 60 ESP OK [17:16:16] RECOVERY - IPsec on cp1063 is OK: Strongswan OK - 60 ESP OK [17:16:20] (03CR) 10Elukey: [C: 032] Adding a new email template for Burrow lag alerts. 
[puppet] - 10https://gerrit.wikimedia.org/r/268682 (owner: 10Elukey) [17:16:35] RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 60 ESP OK [17:16:36] RECOVERY - IPsec on cp1050 is OK: Strongswan OK - 60 ESP OK [17:16:36] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 60 ESP OK [17:16:37] RECOVERY - IPsec on cp1064 is OK: Strongswan OK - 60 ESP OK [17:16:37] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 60 ESP OK [17:16:45] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 60 ESP OK [17:17:45] PROBLEM - Host cp1053 is DOWN: PING CRITICAL - Packet loss = 100% [17:19:25] PROBLEM - Host mw1037 is DOWN: PING CRITICAL - Packet loss = 100% [17:19:33] (03CR) 10Jcrespo: [C: 032] "https://puppet-compiler.wmflabs.org/1700/" [puppet] - 10https://gerrit.wikimedia.org/r/269436 (https://phabricator.wikimedia.org/T126209) (owner: 10Jcrespo) [17:19:55] <_joe_> !log powered down mw1037 [17:19:56] <_joe_> meh [17:19:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:20:11] 6operations, 10ops-eqiad: Decommission mw1037 - https://phabricator.wikimedia.org/T126350#2011866 (10Joe) 3NEW [17:20:23] (03PS2) 10Jcrespo: Upgrading phabricator mysql setting for db2012 [puppet] - 10https://gerrit.wikimedia.org/r/269436 (https://phabricator.wikimedia.org/T126209) [17:20:36] PROBLEM - IPsec on cp4018 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: cp1053_v4, cp1053_v6 [17:20:37] PROBLEM - IPsec on cp2010 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: cp1053_v4, cp1053_v6 [17:20:45] PROBLEM - IPsec on cp2023 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: cp1053_v4, cp1053_v6 [17:20:45] PROBLEM - IPsec on cp4010 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: cp1053_v4, cp1053_v6 [17:20:46] PROBLEM - IPsec on cp2019 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: cp1053_v4, cp1053_v6 [17:20:46] PROBLEM - IPsec on cp4009 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: cp1053_v4, cp1053_v6 [17:20:46] PROBLEM - IPsec on cp3013 is 
CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: cp1053_v4, cp1053_v6 [17:20:46] PROBLEM - IPsec on cp3006 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: cp1053_v4, cp1053_v6 [17:20:46] PROBLEM - IPsec on cp3005 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: cp1053_v4, cp1053_v6 [17:20:55] PROBLEM - IPsec on cp3007 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: cp1053_v4, cp1053_v6 [17:21:26] RECOVERY - Host cp1053 is UP: PING OK - Packet loss = 0%, RTA = 0.81 ms [17:22:00] !log rebooting cp1008/pinkunicorn for 4.4-rt kernel test [17:22:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:22:26] RECOVERY - IPsec on cp4018 is OK: Strongswan OK - 28 ESP OK [17:22:35] RECOVERY - IPsec on cp2010 is OK: Strongswan OK - 28 ESP OK [17:22:35] RECOVERY - IPsec on cp2023 is OK: Strongswan OK - 28 ESP OK [17:22:36] RECOVERY - IPsec on cp4010 is OK: Strongswan OK - 28 ESP OK [17:22:36] RECOVERY - IPsec on cp2019 is OK: Strongswan OK - 28 ESP OK [17:22:37] RECOVERY - IPsec on cp4009 is OK: Strongswan OK - 28 ESP OK [17:22:37] RECOVERY - IPsec on cp3013 is OK: Strongswan OK - 28 ESP OK [17:22:45] RECOVERY - IPsec on cp3006 is OK: Strongswan OK - 28 ESP OK [17:22:45] RECOVERY - IPsec on cp3005 is OK: Strongswan OK - 28 ESP OK [17:22:45] RECOVERY - IPsec on cp3007 is OK: Strongswan OK - 28 ESP OK [17:22:59] (03PS4) 10Elukey: Add mc1004.eqiad back to the memcached/redis pool Bug: T123711 [puppet] - 10https://gerrit.wikimedia.org/r/269422 (https://phabricator.wikimedia.org/T123711) [17:23:31] (03CR) 10Elukey: [C: 032] Add mc1004.eqiad back to the memcached/redis pool Bug: T123711 [puppet] - 10https://gerrit.wikimedia.org/r/269422 (https://phabricator.wikimedia.org/T123711) (owner: 10Elukey) [17:23:46] !log nodetool-a removenode ec0c5a3d-2648-4933-8434-a8d163b92188 in preparation for restbase1007 bootstrap [17:23:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:26:11] (03CR) 10DCausse: 
"I'll verify the next run after it's merged" [puppet] - 10https://gerrit.wikimedia.org/r/256954 (owner: 10DCausse) [17:26:30] !log mc1004.eqiad put back into redis/memcached pool [17:26:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:27:42] (03PS1) 10Alexandros Kosiaris: package_builder: Set PATH for cron updates [puppet] - 10https://gerrit.wikimedia.org/r/269441 (https://phabricator.wikimedia.org/T125999) [17:28:07] PROBLEM - puppet last run on ms-be1003 is CRITICAL: CRITICAL: Puppet has 1 failures [17:29:03] (03PS1) 10Giuseppe Lavagetto: mediawiki: decommission mw1037 [puppet] - 10https://gerrit.wikimedia.org/r/269442 (https://phabricator.wikimedia.org/T126350) [17:29:34] 6operations, 10ops-eqiad, 10DBA: Decommission pc1001-1003 - https://phabricator.wikimedia.org/T124962#2011894 (10RobH) Please note that pc1001-1003 & labsdb1001-1003 are the last 6 remaining cisco systems in use. As such, this task is a blocker for T103374. [17:29:53] 6operations, 10ops-eqiad, 10DBA: Decommission pc1001-1003 - https://phabricator.wikimedia.org/T124962#2011897 (10RobH) [17:29:55] 6operations, 10ops-eqiad: What to do with decommissioned ciscos? - https://phabricator.wikimedia.org/T103374#1388150 (10RobH) [17:30:55] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: decommission mw1037 [puppet] - 10https://gerrit.wikimedia.org/r/269442 (https://phabricator.wikimedia.org/T126350) (owner: 10Giuseppe Lavagetto) [17:31:49] 6operations, 10ops-eqiad: decom protactinium (datacenter) - https://phabricator.wikimedia.org/T123798#2011900 (10Cmjohnson) 5Open>3Resolved Wiped, added to spares/Decom list. [17:31:59] 6operations, 10ops-eqiad: What to do with decommissioned ciscos? - https://phabricator.wikimedia.org/T103374#2011902 (10RobH) I'll be adding in the cisco decommission tasks for both eqiad and codfw as blockers to this task. Once they are all complete, we can pull all the paperwork from accounting and try to f... 
[17:33:43] (PS1) Jcrespo: Add exception to /srv partition for db1043 and db1048 [puppet] - https://gerrit.wikimedia.org/r/269443 (https://phabricator.wikimedia.org/T126209) [17:33:55] (PS2) Jcrespo: Add exception to /srv partition for db1043 and db1048 [puppet] - https://gerrit.wikimedia.org/r/269443 (https://phabricator.wikimedia.org/T126209) [17:35:22] PROBLEM - puppet last run on cp4006 is CRITICAL: CRITICAL: puppet fail [17:35:26] (PS1) Chad: Stop including RandomRootPage [mediawiki-config] - https://gerrit.wikimedia.org/r/269444 [17:35:28] (CR) Jcrespo: [C: 2] Add exception to /srv partition for db1043 and db1048 [puppet] - https://gerrit.wikimedia.org/r/269443 (https://phabricator.wikimedia.org/T126209) (owner: Jcrespo) [17:36:30] PROBLEM - NTP on cp1055 is CRITICAL: NTP CRITICAL: Offset unknown [17:37:46] (PS2) DCausse: Use snappy for mediawiki avro logs [puppet] - https://gerrit.wikimedia.org/r/256954 [17:38:30] (PS2) Andrew Bogott: Tell keystone to use the actual username as the username field. 
[puppet] - 10https://gerrit.wikimedia.org/r/269426 [17:40:00] (03PS2) 10Chad: Stop including RandomRootPage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269444 [17:40:58] (03CR) 10Andrew Bogott: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/269426 (owner: 10Andrew Bogott) [17:42:35] (03PS1) 10Joal: Make MediaWiki camus run in essential queue [puppet] - 10https://gerrit.wikimedia.org/r/269445 (https://phabricator.wikimedia.org/T125967) [17:43:32] PROBLEM - zuul_gearman_service on gallium is CRITICAL: Connection refused [17:44:01] PROBLEM - zuul_service_running on gallium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-server [17:44:25] 6operations, 10RESTBase, 10hardware-requests: normalize eqiad restbase cluster - replace restbase1001-1006 - https://phabricator.wikimedia.org/T125842#2011938 (10RobH) [17:47:50] RECOVERY - zuul_service_running on gallium is OK: PROCS OK: 2 processes with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-server [17:48:34] (03CR) 10Alexandros Kosiaris: [C: 032] package_builder: Set PATH for cron updates [puppet] - 10https://gerrit.wikimedia.org/r/269441 (https://phabricator.wikimedia.org/T125999) (owner: 10Alexandros Kosiaris) [17:48:36] (03CR) 10Andrew Bogott: [C: 032 V: 032] "Overriding Jenkins to get this done before another deploy window starts" [puppet] - 10https://gerrit.wikimedia.org/r/269426 (owner: 10Andrew Bogott) [17:48:40] (03PS2) 10Alexandros Kosiaris: package_builder: Set PATH for cron updates [puppet] - 10https://gerrit.wikimedia.org/r/269441 (https://phabricator.wikimedia.org/T125999) [17:48:44] (03CR) 10Alexandros Kosiaris: [V: 032] package_builder: Set PATH for cron updates [puppet] - 10https://gerrit.wikimedia.org/r/269441 (https://phabricator.wikimedia.org/T125999) (owner: 10Alexandros Kosiaris) [17:49:02] RECOVERY - zuul_gearman_service on gallium is OK: TCP OK - 0.000 second response time on port 4730 [17:51:18] 
7Blocked-on-Operations, 6operations, 10Wikimedia-General-or-Unknown: Invalidate all users sessions - https://phabricator.wikimedia.org/T124440#2011962 (10Tgr) ...didn't make it yet into wmf.13 though. [17:51:20] RECOVERY - NTP on cp1055 is OK: NTP OK: Offset -3.814697266e-05 secs [17:52:31] (03PS3) 10Ottomata: Use snappy for mediawiki avro logs [puppet] - 10https://gerrit.wikimedia.org/r/256954 (owner: 10DCausse) [17:52:51] (03CR) 10ArielGlenn: "Do we need to be able to avoid rotating the aes salt master key on key deletes? It just means that after the rotation there will be a del" [puppet] - 10https://gerrit.wikimedia.org/r/269437 (https://phabricator.wikimedia.org/T124761) (owner: 10Giuseppe Lavagetto) [17:53:00] (03CR) 10Ottomata: [C: 032 V: 032] Use snappy for mediawiki avro logs [puppet] - 10https://gerrit.wikimedia.org/r/256954 (owner: 10DCausse) [17:53:25] (03PS1) 10Jcrespo: Disable phabricator db mariadb10-only features until upgraded [puppet] - 10https://gerrit.wikimedia.org/r/269447 (https://phabricator.wikimedia.org/T126352) [17:53:41] !log krenair@mira Synchronized php-1.27.0-wmf.12/extensions/OpenStackManager: https://gerrit.wikimedia.org/r/#/c/269439/ (duration: 03m 15s) [17:53:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:53:49] andrewbogott, ^ [17:54:01] RECOVERY - puppet last run on ms-be1003 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [17:54:02] (03CR) 10jenkins-bot: [V: 04-1] Disable phabricator db mariadb10-only features until upgraded [puppet] - 10https://gerrit.wikimedia.org/r/269447 (https://phabricator.wikimedia.org/T126352) (owner: 10Jcrespo) [17:54:07] it wasted a minute trying to connect to mw1037 [17:54:14] !log ssh: connect to host mw1037.eqiad.wmnet port 22: Connection timed out [17:54:15] ok, let’s see if I can still log in... 
[17:54:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:54:20] (03PS2) 10Jcrespo: Disable phabricator db mariadb10-only features until upgraded [puppet] - 10https://gerrit.wikimedia.org/r/269447 (https://phabricator.wikimedia.org/T126352) [17:55:20] moritzm: jzerebecki: We now have what I think is a correct fix in place. Can you verify that wikitech logins still work for you? [17:56:10] andrewbogott: still works for me [17:56:28] jzerebecki: and you can still see all the things you could see before? [17:56:40] (03CR) 10Jcrespo: [C: 032] Disable phabricator db mariadb10-only features until upgraded [puppet] - 10https://gerrit.wikimedia.org/r/269447 (https://phabricator.wikimedia.org/T126352) (owner: 10Jcrespo) [17:57:09] andrewbogott: uh do you have an example? [17:57:25] jzerebecki: project management stuff, mostly [17:57:35] if you don’t notice anything being broken, probably nothing is broken :) [17:57:57] nova project and other nova stuff I assume [17:58:17] yeah [18:00:05] yurik gwicke cscott arlolra subbu bearND mdholloway: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160209T1800). [18:00:38] andrewbogott: yea from a quick look I can still see all the nova stuff I [18:00:42] 'm supposed to [18:00:51] that’s great. 
thanks for checking [18:02:32] rats and I didn't fix up the timeout stuff [18:03:00] RECOVERY - puppet last run on cp4006 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [18:03:22] yw [18:05:44] !log bringing down db1048's mysql for cloning to db2012 [18:05:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:06:20] PROBLEM - NTP on cp1066 is CRITICAL: NTP CRITICAL: Offset unknown [18:09:31] (03PS1) 10ArielGlenn: add timeout option to restart() in git-deploy runner [puppet] - 10https://gerrit.wikimedia.org/r/269450 [18:09:36] fixing it right now. [18:10:21] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [18:11:02] Krenair: yeah, known, _joe_ is working on mw1013 [18:11:04] (03CR) 10ArielGlenn: [C: 032] add timeout option to restart() in git-deploy runner [puppet] - 10https://gerrit.wikimedia.org/r/269450 (owner: 10ArielGlenn) [18:11:22] greg-g, so what about mw1037? [18:11:31] oh [18:12:02] sorry, I meant 1037 [18:12:09] PROBLEM - Host cp1099 is DOWN: PING CRITICAL - Packet loss = 100% [18:12:34] https://tools.wmflabs.org/sal/production?p=0&q=mw1037&d= [18:12:59] RECOVERY - Host cp1099 is UP: PING OK - Packet loss = 0%, RTA = 9.39 ms [18:12:59] PROBLEM - haproxy failover on dbproxy1003 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [18:13:38] paravoid: apergos: I've been working as fast as I can on it. _joe_ veto'd our short term fixes in favor of the puppet provider for scap3 which I've been devoting 100% of my time on ever since then [18:13:44] chasemp: ^ [18:14:08] can we get something in there that will let us enable puppet on the box? [18:15:44] chase and I both submitted patches that would have done that... 
[18:15:58] (the veto'd patches I mentioned) [18:16:29] maybe even one got merged let me check [18:16:32] the dbproxy is me, but a) it is the backup host, not the main one and b) there are no users, which is why I thought it was not monitored [18:18:42] apergos: this was the biggest pre-requisite for actually fixing things on iridium: https://gerrit.wikimedia.org/r/#/c/262742/ and it should be ready to go [18:20:11] now I just need to schedule a short downtime to symlink /srv/phab/ to /srv/deployment/phabricator/deployment and move /srv/phab/repos to /srv/repos .. then kill most of the phabricator class in puppet and the job is done [18:21:00] RECOVERY - NTP on cp1066 is OK: NTP OK: Offset -4.839897156e-05 secs [18:22:59] 6operations, 10DBA, 5Patch-For-Review: Prepare db1018 and s2-slaves for s2 master failover - https://phabricator.wikimedia.org/T125215#2012042 (10Pcoombe) @greg @jcrespo I can help with this. Is there a page/email you would like the "Read more" link to point at? Or should we just leave that out? [18:23:10] 10Ops-Access-Requests, 6operations, 6Services, 5Patch-For-Review: Requesting restbase-admins access to RESTBase cluster for Petr Pchelko - https://phabricator.wikimedia.org/T126283#2012043 (10Dzahn) or maybe an alternative is to amend the permissions the restbase-admins have with the deploy commands, so th... 
[18:23:18] I'll write the patch to kill most of the phabricator class [18:23:25] and we can merge that first [18:24:38] 10Ops-Access-Requests, 6operations, 6Services, 5Patch-For-Review: Requesting restbase-roots access to RESTBase cluster for Petr Pchelko - https://phabricator.wikimedia.org/T126283#2012045 (10GWicke) [18:25:01] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:25:56] 10Ops-Access-Requests, 6operations, 6Services, 5Patch-For-Review: Requesting restbase-roots access to RESTBase cluster for Petr Pchelko - https://phabricator.wikimedia.org/T126283#2010042 (10GWicke) Re-titled to ask for `restbase-roots` access, per @mobrovac. [18:32:42] 10Ops-Access-Requests, 6operations, 6Services, 5Patch-For-Review: Requesting restbase-roots access to RESTBase cluster for Petr Pchelko - https://phabricator.wikimedia.org/T126283#2012067 (10mobrovac) >>! In T126283#2012043, @Dzahn wrote: > or maybe an alternative is to amend the permissions the restbase-a... [18:34:01] 7Blocked-on-Operations, 6operations, 10Wikimedia-General-or-Unknown: Invalidate all users sessions - https://phabricator.wikimedia.org/T124440#2012071 (10Amire80) Might be relevant: A user in the Hebrew Wikipedia wrote that he became logged out of his desktop account, but remained logged-in in the mobile app. [18:35:04] (03CR) 1020after4: [C: 031] "I don't know how to be sure it's a noop but I say deploy and find out? ;)" [puppet] - 10https://gerrit.wikimedia.org/r/260937 (owner: 10Dzahn) [18:35:52] (03CR) 1020after4: [C: 031] Add \n to wikiversions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268468 (owner: 10Dereckson) [18:36:12] twentyafterfour: ok I'm around [18:36:38] when's the best time to break beta [18:37:09] in the evening [18:37:17] ok! 
[18:37:26] pdt evening, to be clear :P [18:37:38] yes [18:37:40] * twentyafterfour was about to say, whose evening ;) [18:37:41] PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.223, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [18:37:41] PROBLEM - cassandra-a service on restbase1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [18:37:45] i want to split those roles [18:37:51] what twentyafterfour just reviewed [18:38:04] I'm not on clinic duty but I'm sticking around to get this done [18:38:17] it's gonna be way easier to read than one large role/beta.pp [18:38:24] yeah, if there's not much risk of fallout, you can do it, we do other puppet changes during the day, but if you're especially worried, wait until after swat is done [18:38:39] subbu: are you really deploying parsoid or was that bot spam just a tease? [18:38:41] evening swat [18:38:53] ok, fair, thank you, i won't do it right now [18:38:56] apergos, no, no parsoid deploy today. [18:38:59] heh [18:39:02] ok thanks [18:39:10] PROBLEM - puppet last run on ms-fe2002 is CRITICAL: CRITICAL: Puppet has 1 failures [18:39:11] i think it is just a standard services deploy line copied over to tues and thu. [18:39:18] for any of those services that need to be deployed. [18:39:19] gotcha [18:39:20] gwicke cscott arlolra subbu bearND mdholloway: anyone deploying any services? if not, i would like to deploy graphoid. We have 20 min left. cc: greg-g [18:39:35] no mobileapps deployment /cc bearloga [18:39:40] oops, cc bearND [18:40:09] yurik, no [18:40:23] thx :) [18:40:50] apergos: what are you up to re deploying? 
cc yurik [18:41:18] apergos: since you asked re parsoid, just making sure you're OK with a graphoid deploy [18:41:22] 7Blocked-on-Operations, 6operations, 10RESTBase, 10hardware-requests: Expand SSD space in Cassandra cluster - https://phabricator.wikimedia.org/T121575#2012084 (10RobH) [18:41:27] greg-g: I wanted to do some live testing with subbu but I have twentyafterfour's phab fixes to babysit for the evening [18:41:31] so that will likely do it [18:41:43] ahh, kk [18:41:47] yurik: ok, doit [18:41:48] so all's clear? [18:41:49] I just don't know anything about graphoid but ok [18:41:50] cool :) [18:41:58] anything breaks you own it etc :-P [18:42:08] apergos: nah, just making sure you didn't need it to be on hold (graphoid deploy) [18:42:13] nope [18:42:17] word [18:42:26] you cali ppl you! [18:42:34] apergos: I'm around to babysit phabricator, I just need someone to merge patches [18:42:43] I can look at em too [18:42:54] if mutante's also staring at them, so much the better [18:43:12] apergos: I have root on iridium, just don't have +2 on puppet [18:43:26] apergos: ? which one [18:44:32] we'll see in a minute, remember the phab outage? this is patches to fix that up [18:45:05] permanently [18:46:20] (03PS1) 10Chad: Prune a ton of old branches, add wmf.13 branch symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269458 [18:47:22] ok [18:47:31] oh, I should probably push that wmf.13 OSM change [18:47:53] on a related note, are the phab labs instances still not reachable? [18:48:05] 6operations, 5Patch-For-Review: reinstall eqiad memcache servers with jessie - https://phabricator.wikimedia.org/T123711#2012105 (10elukey) For some reason the last code reviews didn't get into this phab task: de-pool mc1004: https://gerrit.wikimedia.org/r/#/c/269378/ re-pool mc1004: https://gerrit.wikimedia.... [18:48:05] or... maybe that was already done? 
hm [18:48:25] 6operations, 10Deployment-Systems, 10Salt, 5Patch-For-Review: [Trebuchet] Salt times out on parsoid restarts - https://phabricator.wikimedia.org/T63882#2012106 (10ArielGlenn) https://gerrit.wikimedia.org/r/#/c/269450/ to fix the runner so it has a timeout option. the other two places that need fixes are in... [18:48:34] ahh, wmf.13 wasn't pushed out yet, okay then [18:48:39] mutante: I am still not able to log in to phab-01 or phab-02 [18:48:51] the phab instances in labs were not reachable by me earlier today [18:48:54] from bastion-restricted [18:49:10] phab-03 works [18:49:27] 6operations, 10DBA, 5Patch-For-Review: Prepare db1018 and s2-slaves for s2 master failover - https://phabricator.wikimedia.org/T125215#2012107 (10jcrespo) @Pcoombe There is: https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160209T2300 But we can create a page or email with the following in... [18:49:29] and phab-scap, where I have been testing the puppet provider for scap packages [18:49:32] (03CR) 10Chad: [C: 032] Prune a ton of old branches, add wmf.13 branch symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269458 (owner: 10Chad) [18:49:36] 6operations, 5Patch-For-Review: reinstall eqiad memcache servers with jessie - https://phabricator.wikimedia.org/T123711#2012108 (10elukey) Next steps: 1) Work with Joe on https://phabricator.wikimedia.org/T124761 to get wmf-reimage up and running again. 2) Rollout Jessie to the other nodes following the abov... 
[18:49:55] phab-scap is ok [18:49:59] (03Merged) 10jenkins-bot: Prune a ton of old branches, add wmf.13 branch symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269458 (owner: 10Chad) [18:50:00] I don't know bout phab-03 [18:50:03] 01 and 02 are fail [18:50:14] (03PS19) 1020after4: Puppet provider for scap3 [puppet] - 10https://gerrit.wikimedia.org/r/262742 (https://phabricator.wikimedia.org/T113072) (owner: 10Alexandros Kosiaris) [18:51:09] https://phabricator.wikimedia.org/T126323 these [18:51:15] (03CR) 1020after4: [C: 031] "can we merge this now? it works for me in testing on deploy.phabricator.eqiad.wmflabs + phab-scap.phabricator.eqiad.wmflabs" [puppet] - 10https://gerrit.wikimedia.org/r/262742 (https://phabricator.wikimedia.org/T113072) (owner: 10Alexandros Kosiaris) [18:52:58] so twentyafterfour this is going to be a slow review for me because I haven't looked at the code before [18:53:36] apergos: would it help if we got thcipriani or marxarelli to look it over again? Most of the code isn't my handiwork [18:54:06] (03PS1) 10Chad: group0 to 1.27.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269463 [18:54:18] or even akosiaris if he's around? [18:54:23] (03PS2) 10Dzahn: admin: add ppchelko to restbase-roots [puppet] - 10https://gerrit.wikimedia.org/r/269369 (https://phabricator.wikimedia.org/T126283) [18:54:31] 6operations, 10Deployment-Systems, 6Release-Engineering-Team, 6Services: `git deploy service restart` asked for sudo password - https://phabricator.wikimedia.org/T126359#2012131 (10mobrovac) [18:55:22] it fails one of the spec tests, but that's because the test was written before I had to remove package_settings... 
I could use marxarelli's help with fixing that test [18:55:54] yes let's have both of them look at it if they are available [18:55:59] I will read through it carefully as well [18:56:22] twentyafterfour: i can help refactor the tests [18:56:29] RECOVERY - haproxy failover on dbproxy1003 is OK: OK check_failover servers up 2 down 0 [18:57:22] !log deployed graphoid [18:57:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:58:56] (03PS3) 10Dzahn: admin: add ppchelko to restbase-roots [puppet] - 10https://gerrit.wikimedia.org/r/269369 (https://phabricator.wikimedia.org/T126283) [19:00:21] (03PS1) 10ArielGlenn: adapt trebuchet-trigger for timeout to restart function [software/deployment/trebuchet-trigger] - 10https://gerrit.wikimedia.org/r/269465 (https://phabricator.wikimedia.org/T63882) [19:00:48] (03PS2) 10Chad: group0 to 1.27.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269463 [19:01:11] 10Ops-Access-Requests, 6operations, 6Services, 5Patch-For-Review: Requesting restbase-roots access to RESTBase cluster for Petr Pchelko - https://phabricator.wikimedia.org/T126283#2012151 (10Dzahn) Amended the Gerrit change to become "add to roots" as well. [19:01:42] marxarelli: the package_settings feature is gone, replaced with install_options [19:01:43] twentyafterfour: looking at the puppet provider—is install only running the fetch stage? [19:02:07] thcipriani: yes because it seems that fetch does the checkout on the first run [19:02:20] running the promote stage didn't do that [19:02:38] (it actually promoted an empty directory) [19:02:58] twentyafterfour: it should be run without a stage, actually [19:03:02] Should be able to leave the stage off and it'll run all stages. [19:03:04] really? 
[19:03:17] I thought stage was a required option [19:03:32] twentyafterfour: up until recently, it was [19:04:10] RECOVERY - puppet last run on ms-fe2002 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [19:04:59] (03PS2) 10Dzahn: admin: remove ssh key of jkrauska [puppet] - 10https://gerrit.wikimedia.org/r/269350 (https://phabricator.wikimedia.org/T126260) [19:05:06] (03CR) 10Dzahn: [C: 032] admin: remove ssh key of jkrauska [puppet] - 10https://gerrit.wikimedia.org/r/269350 (https://phabricator.wikimedia.org/T126260) (owner: 10Dzahn) [19:07:08] (03CR) 10Dduvall: [C: 04-1] Puppet provider for scap3 (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/262742 (https://phabricator.wikimedia.org/T113072) (owner: 10Alexandros Kosiaris) [19:07:17] !log demon@mira Started scap: pruning tons of stale branches + sync wmf.13 files for later + testwiki to wmf.13 to build l10n cache [19:07:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:07:33] twentyafterfour: do you want to pair on fixing it up? [19:07:52] !log demon@mira scap failed: CalledProcessError Command '/usr/local/bin/mwscript rebuildLocalisationCache.php --wiki="testwiki" --outdir="/tmp/scap_l10n_2315818744" --threads=10 --lang en --quiet' returned non-zero exit status 255 (duration: 00m 34s) [19:07:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:08:25] (03CR) 10Chad: [C: 032] Stop including RandomRootPage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269444 (owner: 10Chad) [19:08:33] 6operations, 10DBA, 5Patch-For-Review: Prepare db1018 and s2-slaves for s2 master failover - https://phabricator.wikimedia.org/T125215#2012161 (10greg) My simple/simplifying edits: > Between 23:00 and 23:59 UTC, February 9th 2016 there is a scheduled maintenance window that will affect some of the wikis hos... 
[19:08:47] twentyafterfour: there is /srv/deployment/scap/scap/bin/deploy-local but not /usr/bin/deploy-local it seems [19:08:54] (03Merged) 10jenkins-bot: Stop including RandomRootPage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269444 (owner: 10Chad) [19:09:02] Whoops, shoulda merged first. [19:09:05] ah I see others are looking at it. yay [19:09:36] !log demon@mira Started scap: pruning tons of stale branches + sync wmf.13 files for later + testwiki to wmf.13 to build l10n cache (try 2) [19:09:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:10:05] <_joe_> ostriches: what is wrong atm? [19:10:28] I forgot to merge my "stop loading this deprecated extension patch" prior to building wmf.13 l10n cache. [19:10:35] It rightly said "where's the damn extension" [19:11:04] !log cp4006 (upload ulsfo) rebooting -> kernel 4.4 canary [19:11:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:11:14] 6operations, 10ops-eqiad: dbstore1001 management interface has saturated the number of available ssh connections - https://phabricator.wikimedia.org/T126227#2012168 (10Cmjohnson) @jcrespo this will require me to shut down for a minute and remove power cables and drain flea power. [19:11:17] marxarelli: sure [19:11:42] apergos: are you referring to iridium? it needs the scap package from apt [19:11:53] to install /usr/bin/deploy-local and friends [19:12:18] I was on mira because deployment-host and no brain [19:12:22] 6operations, 10ops-eqiad: dbstore1002.mgmt.eqiad.wmnet: "No more sessions are available for this type of connection!" - https://phabricator.wikimedia.org/T119488#2012171 (10Cmjohnson) @jcrespo this will also require me to power off and remove power cables and drain flea power. 
[19:12:54] 6operations, 10ops-eqiad: Decommission cp1037-1040 - https://phabricator.wikimedia.org/T83553#2012174 (10Cmjohnson) p:5Normal>3Low [19:13:01] so that's local to be run on the target host then [19:13:02] ic [19:13:21] yeah [19:13:39] !log gerrit - add ppchelko to mediawiki-services [19:13:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:14:20] twentyafterfour: cool, let's move on over to #scap [19:14:21] apergos: we need to puppetize the scap package, I don't know if there is a way to make the provider require the scap package from apt but thcipriani is working on a scap class in puppet which would require both the provider and the package [19:14:30] marxarelli: ok [19:15:01] 6operations, 10ops-eqiad: dbstore1002.mgmt.eqiad.wmnet: "No more sessions are available for this type of connection!" - https://phabricator.wikimedia.org/T119488#2012179 (10jcrespo) 5Open>3stalled Let's block it for now, I will ping you back when I can setup with the users (analytics) a maintenance window. [19:15:26] (03PS1) 10Ema: WIP: Maps VCL forward-porting to Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/269466 (https://phabricator.wikimedia.org/T124279) [19:18:26] 7Blocked-on-Operations, 6operations, 10Wikimedia-General-or-Unknown: Invalidate all users sessions - https://phabricator.wikimedia.org/T124440#2012199 (10Tgr) >>! In T124440#2012071, @Amire80 wrote: > Might be relevant: A user in the Hebrew Wikipedia wrote that he became logged out of his desktop account, bu... [19:21:13] 6operations, 10DBA, 5Patch-For-Review: Prepare db1018 and s2-slaves for s2 master failover - https://phabricator.wikimedia.org/T125215#2012206 (10jcrespo) @Pcoombe I've copied Greg's version into https://wikitech.wikimedia.org/wiki/Planned_Maintenance-February_9_2016 [19:22:01] ^Should I send an email to wikitech? 
[19:22:46] 6operations, 10ops-eqiad: Hardware problem (probably memory) on elastic1021 - https://phabricator.wikimedia.org/T125973#2012208 (10Cmjohnson) Finally shipped Dear Johnson, Christopher, Your dispatch shipped on 2/9/2016 11:25:53 AM What's Next? If you need to make any changes to the dispatch contact inform... [19:24:34] jynus: that'd be great [19:24:49] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [24.0] [19:26:56] _joe_: Can we prune mw1037 from mediawiki-installation too since it's down? [19:27:15] <_joe_> ostriches: I think I did? [19:27:24] <_joe_> wtf? [19:27:52] (03PS4) 10EBernhardson: Better mediawiki REPL [puppet] - 10https://gerrit.wikimedia.org/r/268541 [19:28:24] _joe_: eg: 19:25:54 ['/srv/deployment/scap/scap/bin/sync-common', '--no-update-l10n', 'mw1010.eqiad.wmnet', 'mw1033.eqiad.wmnet', 'mw1070.eqiad.wmnet', 'mw1097.eqiad.wmnet', 'mw1216.eqiad.wmnet', 'mw1161.eqiad.wmnet', 'mw1201.eqiad.wmnet', 'mw2001.codfw.wmnet', 'mw2041.codfw.wmnet', 'mw2080.codfw.wmnet', 'mw2119.codfw.wmnet', 'mw2187.codfw.wmnet'] on [19:28:24] mw1037.eqiad.wmnet returned [255]: ssh: connect to host mw1037.eqiad.wmnet port 22: Connection timed out [19:28:56] <_joe_> ostriches: yeah I know you're right :) [19:29:26] just for funsies: https://tools.wmflabs.org/sal/production?p=0&q=mw1037&d= [19:29:50] <_joe_> ostriches: puppet is failing post-compilation on mira [19:30:40] Aw boo :( [19:30:51] <_joe_> ostriches: I'm going to dinner though, I hope someone else will fix it [19:31:12] <_joe_> greg-g: I removed that server 2 hours ago [19:31:24] no comment :) [19:31:30] <_joe_> the problem is someone broke puppet in a non-obvious way [19:32:25] <_joe_> ostriches: or, remove it from the file for now, puppet will agree [19:34:30] PROBLEM - Last backup of the tools filesystem on labstore1001 is CRITICAL: CRITICAL - Last run result for unit replicate-tools was exit-code [19:34:32] Well the file's root owned so I 
can't exactly do that hehe [19:35:47] !log cp3048 (upload esams) rebooting -> kernel 4.4 canary [19:35:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:37:00] !log demon@mira Finished scap: pruning tons of stale branches + sync wmf.13 files for later + testwiki to wmf.13 to build l10n cache (try 2) (duration: 27m 24s) [19:37:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:38:07] (03PS1) 10ArielGlenn: Revert "Reload keyholder-agent on keyholder-auth change" [puppet] - 10https://gerrit.wikimedia.org/r/269468 [19:38:20] (03PS2) 10ArielGlenn: Revert "Reload keyholder-agent on keyholder-auth change" [puppet] - 10https://gerrit.wikimedia.org/r/269468 [19:38:58] (03CR) 10Ori.livneh: [C: 031] "Thanks Ariel." [puppet] - 10https://gerrit.wikimedia.org/r/269468 (owner: 10ArielGlenn) [19:39:50] come on jenkins [19:40:52] is jenkins asleep or what? [19:41:03] ori: you're welcome if I ever get a chance to merge it [19:42:11] getting ready to not care and merge it over jenkins' silence [19:42:58] (03CR) 10ArielGlenn: [C: 032 V: 032] Revert "Reload keyholder-agent on keyholder-auth change" [puppet] - 10https://gerrit.wikimedia.org/r/269468 (owner: 10ArielGlenn) [19:43:07] sorry jenkins [19:45:31] jynus: thanks for sending that [19:46:32] 6operations, 10DBA, 5Patch-For-Review: Reimage db2012 - https://phabricator.wikimedia.org/T126209#2012334 (10jcrespo) db2012 was successfully reimaged, but I could not reconnect dbstore2001- I need to reimport that database into it. [19:47:37] ostriches: mira is up to date for puppet now [19:47:46] Thx! [19:49:01] 6operations, 10DBA, 5Patch-For-Review: Prepare db1018 and s2-slaves for s2 master failover - https://phabricator.wikimedia.org/T125215#2012338 (10jcrespo) I've sent an email with that text to wikitech, too. 
[19:55:08] !log Updated operations/dumps/dcat on snapshot1003 from 0a71deb232 to 92ab37d94e [19:55:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:55:15] twentyafterfour: can you do something like [19:55:49] Class['::scap3::packages'] -> Package <| provider == 'scap3' |> or whatever it is, in puppet? [19:56:01] I'm sure that syntax is broken but you get what I mean [19:57:38] apergos: yeah that sounds good [19:57:49] apergos: you actually had it right [19:57:52] heh [19:58:26] 7Blocked-on-Operations, 10Beta-Cluster-Infrastructure, 6Discovery, 6Release-Engineering-Team, and 2 others: Beta: submodule update reverts new portals commits - https://phabricator.wikimedia.org/T126061#2012357 (10greg) [19:58:34] I think some of the ops team are allergic to the -> syntax though [19:59:40] don't care [19:59:57] let's get this in and working and happy and the syntax can be changed later if someone is allergic enough [20:00:00] twentyafterfour: not for the spaceship operator [20:00:01] as long as the code doesn't suck is all [20:00:05] hashar: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160209T2000). Please do the needful. [20:00:16] 6operations, 10Analytics, 10MediaWiki-extensions-ContentTranslation: Make the command `sql wikishared` work on terbium like `sql enwiki`, `sql centralauth`, etc. - https://phabricator.wikimedia.org/T122474#2012373 (10Nuria) @Amire80 closing as data is been gathered on 1002 now [20:00:34] (03CR) 10Chad: [C: 032] group0 to 1.27.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269463 (owner: 10Chad) [20:00:44] 6operations, 10MediaWiki-extensions-ContentTranslation: Make the command `sql wikishared` work on terbium like `sql enwiki`, `sql centralauth`, etc. 
- https://phabricator.wikimedia.org/T122474#2012377 (10Nuria) 5Open>3Resolved [20:01:01] (03Merged) 10jenkins-bot: group0 to 1.27.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269463 (owner: 10Chad) [20:01:43] 6operations, 10MediaWiki-extensions-ContentTranslation: Make the command `sql wikishared` work on terbium like `sql enwiki`, `sql centralauth`, etc. - https://phabricator.wikimedia.org/T122474#2012384 (10Krenair) [20:02:33] !log demon@mira Started scap: all group0 to wmf.13 [20:02:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:06:05] (03PS2) 10Ottomata: Updates to work with hiera, Hive/Oozie MySQL db can now be hosted on remote node [puppet/cdh] - 10https://gerrit.wikimedia.org/r/269340 (https://phabricator.wikimedia.org/T109859) [20:06:26] ostriches: you are doing the train? [20:07:34] aude: yeah, antoine is out tonight for family reasons [20:07:41] ok [20:07:59] (03PS6) 10Ottomata: Refactor manifests/role/analytics/* into modules/role, use hiera to configure [puppet] - 10https://gerrit.wikimedia.org/r/267797 (https://phabricator.wikimedia.org/T109859) [20:08:25] (03PS20) 10Dduvall: Puppet provider for scap3 [puppet] - 10https://gerrit.wikimedia.org/r/262742 (https://phabricator.wikimedia.org/T113072) (owner: 10Alexandros Kosiaris) [20:08:37] i don't need anything, but am making dinner :) and if there is a wikidata-related problem, it might take me a few minutes to respond [20:08:48] * aude expects no problems, of course [20:08:57] of course [20:09:24] (03CR) 10jenkins-bot: [V: 04-1] Refactor manifests/role/analytics/* into modules/role, use hiera to configure [puppet] - 10https://gerrit.wikimedia.org/r/267797 (https://phabricator.wikimedia.org/T109859) (owner: 10Ottomata) [20:11:25] (03CR) 10Ottomata: [C: 032] Updates to work with hiera, Hive/Oozie MySQL db can now be hosted on remote node [puppet/cdh] - 10https://gerrit.wikimedia.org/r/269340 (https://phabricator.wikimedia.org/T109859) (owner: 
10Ottomata) [20:11:32] !log cp1067, cp1071 (text, upload in eqiad) -> 4.4 canaries (rebooting over the next ~8 mins or so) [20:11:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:11:59] (03PS3) 10Ori.livneh: add a dependency on xhprof/xhgui [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251137 [20:12:02] (03PS7) 10Ottomata: Refactor manifests/role/analytics/* into modules/role, use hiera to configure [puppet] - 10https://gerrit.wikimedia.org/r/267797 (https://phabricator.wikimedia.org/T109859) [20:12:23] (03CR) 10jenkins-bot: [V: 04-1] add a dependency on xhprof/xhgui [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251137 (owner: 10Ori.livneh) [20:12:48] (03PS21) 10Dduvall: Puppet provider for scap3 [puppet] - 10https://gerrit.wikimedia.org/r/262742 (https://phabricator.wikimedia.org/T113072) (owner: 10Alexandros Kosiaris) [20:13:13] (03Restored) 10Subramanya Sastry: parsoid-vd-client: Set screenShotDelay to 5 seconds [puppet] - 10https://gerrit.wikimedia.org/r/269314 (owner: 10Subramanya Sastry) [20:13:20] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/3/0: down - Core: cr1-codfw:xe-5/0/2 (Zayo, OGYX/124337//ZYO, 38.8ms) {#?} [10Gbps wave]BR [20:13:46] (03PS2) 10Subramanya Sastry: parsoid-vd-client: Set screenShotDelay to 2 seconds [puppet] - 10https://gerrit.wikimedia.org/r/269314 [20:14:39] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/0/2: down - Core: cr2-ulsfo:xe-1/3/0 (Zayo, OGYX/124337//ZYO, 38.8ms) {#11541} [10Gbps wave]BR [20:18:00] subbu: want that now? 
[20:20:08] (03PS1) 10Reedy: [WIP] Enable ORES on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269478 (https://phabricator.wikimedia.org/T120923) [20:22:04] (03Abandoned) 10Ottomata: Revert "Update AQS config with new syntax" [puppet] - 10https://gerrit.wikimedia.org/r/269216 (owner: 10Ottomata) [20:22:09] (03CR) 10Ori.livneh: [C: 031] Implement /w/static.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/263566 (https://phabricator.wikimedia.org/T99096) (owner: 10Krinkle) [20:24:06] (03PS2) 10Krinkle: mediawiki: Clean up beta sites Apache configs [puppet] - 10https://gerrit.wikimedia.org/r/268578 [20:25:40] !log cache kernel reboots done (all on '3.19.0-2-amd64 #1 SMP Debian 3.19.3-9 (2016-01-04)', except 4x canaries on '4.4.0-1-amd64 #1 SMP Debian 4.4-1~wmf1 (2016-01-26)') [20:25:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:25:44] (03PS2) 10Dzahn: Beta: Rebase mw-config submodules [puppet] - 10https://gerrit.wikimedia.org/r/268737 (https://phabricator.wikimedia.org/T126061) (owner: 10Thcipriani) [20:26:49] (03PS4) 10Ori.livneh: add a dependency on xhprof/xhgui [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251137 [20:27:03] 7Puppet, 10Continuous-Integration-Infrastructure: Need a better way of testing puppet patches for contint/integration stuff - https://phabricator.wikimedia.org/T126370#2012566 (10Legoktm) 3NEW [20:27:20] (03CR) 10jenkins-bot: [V: 04-1] add a dependency on xhprof/xhgui [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251137 (owner: 10Ori.livneh) [20:30:54] (03CR) 10Dzahn: [C: 031] "has it been cherry-picked on the beta puppetmaster already? 
i believe that is what usually happens" [puppet] - 10https://gerrit.wikimedia.org/r/268737 (https://phabricator.wikimedia.org/T126061) (owner: 10Thcipriani) [20:31:30] 6operations, 7HHVM, 7Tracking: Complete the use of HHVM over Zend PHP on the Wikimedia cluster (tracking) - https://phabricator.wikimedia.org/T86081#2012595 (10ori) >>! In T86081#2010135, @hashar wrote: > This task is a blocker for getting rid of Zend 5.3 Jenkins jobs for WMF branches and master branches dep... [20:31:42] (03CR) 10Thcipriani: "Working on cherry-picking. puppet is failing for deployment-bastion currently due to another issue I'm working out presently :)" [puppet] - 10https://gerrit.wikimedia.org/r/268737 (https://phabricator.wikimedia.org/T126061) (owner: 10Thcipriani) [20:31:43] _joe_: for 266609, it seems fine to start with the global backend (math/captcha). Then maybe non-commons and commons? [20:32:09] er, I mean godog [20:32:18] !log demon@mira Finished scap: all group0 to wmf.13 (duration: 29m 45s) [20:32:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:32:23] (03CR) 10Ottomata: [C: 032] "This has been tested in labs, and is not yet applying any of the new roles in production. That will be done in small pieces in further co" [puppet] - 10https://gerrit.wikimedia.org/r/267797 (https://phabricator.wikimedia.org/T109859) (owner: 10Ottomata) [20:33:41] (03PS3) 10Krinkle: mediawiki: Clean up beta sites Apache configs [puppet] - 10https://gerrit.wikimedia.org/r/268578 [20:35:42] (03PS22) 10Dduvall: Puppet provider for scap3 [puppet] - 10https://gerrit.wikimedia.org/r/262742 (https://phabricator.wikimedia.org/T113072) (owner: 10Alexandros Kosiaris) [20:36:38] (03CR) 10Dduvall: [C: 031] "We've refactored the specs and thoroughly tested on labs (phab-scap.eqiad.wmflabs)." 
[puppet] - 10https://gerrit.wikimedia.org/r/262742 (https://phabricator.wikimedia.org/T113072) (owner: 10Alexandros Kosiaris) [20:37:20] !log restarting nginx for libssl update on cp1049.eqiad.wmnet,cp4008.ulsfo.wmnet,cp3042.esams.wmnet,cp3049.esams.wmnet [20:37:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:39:17] (03CR) 1020after4: [C: 031] "now we pass uid to puppet's execute method and let puppet handle the user context switching instead of calling sudo. That should be an imp" [puppet] - 10https://gerrit.wikimedia.org/r/262742 (https://phabricator.wikimedia.org/T113072) (owner: 10Alexandros Kosiaris) [20:40:00] apergos: ^ [20:40:17] looking [20:40:38] (03CR) 10Krinkle: mediawiki: Clean up beta sites Apache configs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/268578 (owner: 10Krinkle) [20:40:44] (03PS1) 10Dzahn: wikitech: add wikitech.m.wikimedia.org as server alias [puppet] - 10https://gerrit.wikimedia.org/r/269504 (https://phabricator.wikimedia.org/T120527) [20:40:46] (03PS4) 10Krinkle: mediawiki: Clean up beta sites Apache configs [puppet] - 10https://gerrit.wikimedia.org/r/268578 [20:41:29] (03CR) 10Dzahn: "https://gerrit.wikimedia.org/r/#/c/269504/" [dns] - 10https://gerrit.wikimedia.org/r/268734 (https://phabricator.wikimedia.org/T120527) (owner: 10Dzahn) [20:42:09] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [20:45:00] PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL: CRITICAL: 62.50% of data above the critical threshold [5000000.0] [20:47:12] (03CR) 10Alexandros Kosiaris: "Maybe abandon if it's not the root cause ?" [puppet] - 10https://gerrit.wikimedia.org/r/269103 (https://phabricator.wikimedia.org/T125999) (owner: 10Hashar) [20:47:39] (03CR) 10Ori.livneh: "Could you migrate this to the mediawiki module? It is not really part of scap. Other than that, I think it's ready to go." 
[puppet] - 10https://gerrit.wikimedia.org/r/268541 (owner: 10EBernhardson) [20:48:39] huh an akosiaris [20:48:41] unbelievable [20:48:57] twentyafterfour: are you all in agreement that it's ready to go? [20:49:09] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [20:50:11] by "all" I mean you, marxarelli, thcipriani? [20:50:17] apergos: ? [20:50:30] ah akosiaris, how would you feel at being last reviewer of [20:50:39] https://gerrit.wikimedia.org/r/#/c/262742/ ? [20:50:44] this is the scap3 provider [20:51:14] 22 patchsets... wow this thing has grown [20:51:16] yeah sure [20:51:18] I'm about ready to pull the trigger but given you did a lot of work on it [20:51:20] thanks alot [20:52:24] (03PS3) 10Dzahn: wikitech: add wikitech.m.wikimedia.org as server alias [puppet] - 10https://gerrit.wikimedia.org/r/269504 (https://phabricator.wikimedia.org/T120527) [20:53:43] (03CR) 10Southparkfan: [C: 031] wikitech: add wikitech.m.wikimedia.org as server alias [puppet] - 10https://gerrit.wikimedia.org/r/269504 (https://phabricator.wikimedia.org/T120527) (owner: 10Dzahn) [20:55:14] (03PS4) 10Dzahn: wikitech: add wikitech.m.wikimedia.org as server alias [puppet] - 10https://gerrit.wikimedia.org/r/269504 (https://phabricator.wikimedia.org/T120527) [20:57:56] (03CR) 10Dzahn: [C: 032] wikitech: add wikitech.m.wikimedia.org as server alias [puppet] - 10https://gerrit.wikimedia.org/r/269504 (https://phabricator.wikimedia.org/T120527) (owner: 10Dzahn) [20:59:49] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [21:00:46] 6operations, 10Traffic, 6Zero: Use Text IP for Mobile hostnames to gain SPDY/H2 coalesce between the two - https://phabricator.wikimedia.org/T124482#2012719 (10BBlack) p:5Triage>3Low Updates: 1. We're still trying to get to the bottom of historical and present mysteries about Zero-rated whitelist subnet... 
[21:03:00] RECOVERY - Kafka Broker Replica Max Lag on kafka1012 is OK: OK: Less than 50.00% above the threshold [1000000.0] [21:04:27] (03CR) 10Ladsgroup: "Per includes/Api.php in the extension $wgOresWikiId = 'fawiki' is not needed" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269478 (https://phabricator.wikimedia.org/T120923) (owner: 10Reedy) [21:04:37] (03CR) 10Krinkle: [C: 031] "Applied locally on puppetmaster in Beta Labs. Working as expected :)" [puppet] - 10https://gerrit.wikimedia.org/r/268578 (owner: 10Krinkle) [21:04:58] (03PS1) 10Andrew Bogott: Move promethium into wikitechexp so subbu can use it. [puppet] - 10https://gerrit.wikimedia.org/r/269526 (https://phabricator.wikimedia.org/T125166) [21:06:22] Krinkle: re static_host stuff, what about hieradata/labs.yaml:role::cache::base::static_host: 'deployment.wikimedia.beta.wmflabs.org' [21:06:30] does that still work for labs? [21:06:38] (03CR) 10Andrew Bogott: [C: 032] Move promethium into wikitechexp so subbu can use it. [puppet] - 10https://gerrit.wikimedia.org/r/269526 (https://phabricator.wikimedia.org/T125166) (owner: 10Andrew Bogott) [21:07:41] PROBLEM - puppet last run on analytics1027 is CRITICAL: CRITICAL: puppet fail [21:08:08] (not that it matters in the short term I guess) [21:08:18] (03PS2) 10BBlack: cache: Change static_host from www.wikimedia.org to en.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/269224 (owner: 10Krinkle) [21:10:18] (03CR) 10BBlack: [C: 032] cache: Change static_host from www.wikimedia.org to en.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/269224 (owner: 10Krinkle) [21:10:37] 6operations, 10ops-ulsfo: ulsfo temperature-related exceptions - https://phabricator.wikimedia.org/T119631#2012758 (10RobH) I didn't end up opening that ticket when I said I would, but it has been opened as of today. I've requested they investigate the temperature differential between racks 1.23 (which is not... 
[21:10:44] (03PS4) 10BBlack: cache: Normalise hostname for /w/skins,resources,extensions [puppet] - 10https://gerrit.wikimedia.org/r/269149 (https://phabricator.wikimedia.org/T99096) (owner: 10Krinkle) [21:11:02] (03CR) 10Dzahn: "[tin:~] $ curl -H "Host:wikitech.m.wikimedia.org" http://silver.wikimedia.org" [dns] - 10https://gerrit.wikimedia.org/r/268734 (https://phabricator.wikimedia.org/T120527) (owner: 10Dzahn) [21:11:22] (03PS4) 10Dzahn: wikitech: wikitech.m.wikimedia.org -> CNAME silver [dns] - 10https://gerrit.wikimedia.org/r/268734 (https://phabricator.wikimedia.org/T120527) [21:11:39] (03CR) 10BBlack: [C: 032 V: 032] cache: Normalise hostname for /w/skins,resources,extensions [puppet] - 10https://gerrit.wikimedia.org/r/269149 (https://phabricator.wikimedia.org/T99096) (owner: 10Krinkle) [21:14:18] (03PS5) 10Dzahn: wikitech: wikitech.m.wikimedia.org -> CNAME silver [dns] - 10https://gerrit.wikimedia.org/r/268734 (https://phabricator.wikimedia.org/T120527) [21:15:01] (03CR) 10Alex Monk: "Note the cert does not cover it" [dns] - 10https://gerrit.wikimedia.org/r/268734 (https://phabricator.wikimedia.org/T120527) (owner: 10Dzahn) [21:15:40] (03PS6) 10Dzahn: wikitech.m.wikimedia.org -> silver, just showed portal [dns] - 10https://gerrit.wikimedia.org/r/268734 (https://phabricator.wikimedia.org/T120527) [21:15:56] sigh @ cert [21:16:58] what's the smaller bug? wikitech.m. does not get you to wikitech at all, just random wikimedia portal. or it does get you to wikitech but cert error ? [21:17:07] do we need a wikitech.m.? [21:17:24] cert error is a no-go [21:18:01] probably not enough https://phabricator.wikimedia.org/T120527#1870935 [21:18:11] it was an attempt because [21:18:14] "Well, wikitech has MobileFrontend installed: https://wikitech.wikimedia.org/wiki/?useformat=mobile" [21:18:20] meh, actually I think wikitech.m.wm.o wouldn't be needed as all traffic to wikitech.wm.o would be passed to the mwserver all the time anyway? 
[21:18:26] so it would be able to show a mobile friendly version [21:18:31] 6operations, 10Wikimedia-Etherpad: Specific etherpad link hangs then shows JavaScript error - https://phabricator.wikimedia.org/T126379#2012792 (10Krenair) [21:18:34] but not really existing [21:18:42] 6operations, 10DBA, 5Patch-For-Review: Prepare db1018 and s2-slaves for s2 master failover - https://phabricator.wikimedia.org/T125215#2012795 (10Pcoombe) Okay, there are CentralNotice banners set up for logged-in users between 2245 and 2359 UTC on the affected wikis, linking to the above page. [[ https://en... [21:19:46] MobileFrontend should automatically serve the mobile version based on the url or user-agent, and wikitech.m shouldn't be needed [assuming MF has the right configuration and no Varnish is involved]? [21:19:59] (03CR) 10Dzahn: [C: 04-2] "do not merge because of the certificate issue" [dns] - 10https://gerrit.wikimedia.org/r/268734 (https://phabricator.wikimedia.org/T120527) (owner: 10Dzahn) [21:20:03] not really, sadly [21:20:31] MobileFrontend, at least as I've seen us do it for the normal wikis, relies heavily on varnish mangling of request hostnames, and varnish detection of user-agent, etc [21:20:40] and varnish setting X-Subdomain [21:20:59] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [21:21:07] ?useformat=mobile *might* work? I'm really not sure [21:21:16] it does [21:21:25] how does it persist that? [21:21:30] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 [21:21:53] ?useformat=mobile has always looked a bit scary, I've never dug into exactly how it operates [21:21:57] MobileFrontend serves me the mobile version at my phone [21:22:07] well it basically forces the mobile version [21:22:20] right but then further links in the page don't all have ?useformat=mobile do they? 
[21:22:51] MobileFrontend can decide whether to serve the desktop version or the mobile version based on the user-agent header anyway [21:23:04] No useformat param needed for that [21:23:06] at the moment you click the mobile view link and it sends you to ?mobileaction=toggle_view_mobile [21:23:10] which persists somehow [21:23:18] because further clicks on that wiki don't require it [21:23:24] cookie? [21:23:27] probably in a scary way that pollutes caches that I don't want to know about [21:23:37] I see a mf_useformat=true cookie [21:23:48] which we totally don't vary on in the caches.... [21:23:49] MobileFrontend has been a pain in the ass for me [21:24:03] https://github.com/miraheze/puppet/blob/master/modules/varnish/templates/default.vcl#L159 top 1 ugly hack [21:24:11] apergos: I believe it's ready, marxarelli|afk signed off on it as well. feedback from thcipriani and akosiaris is welcomed [21:24:30] yeah akosiaris said he'd give it one last review [21:24:38] 6operations, 10DBA, 5Patch-For-Review: Prepare db1018 and s2-slaves for s2 master failover - https://phabricator.wikimedia.org/T125215#2012803 (10jcrespo) Thank you! [21:24:42] one he signs off I'm gonna call it time to merge [21:24:51] *once he [21:24:54] :) [21:25:12] if MFE can do UA detection for redirect on its own, why does our VCL also detect UA for redirect? [21:25:24] 6operations, 6Labs, 10Labs-Infrastructure, 10Reading-Web, and 3 others: https://wikitech.m.wikimedia.org/ serves wikimedia.org portal - https://phabricator.wikimedia.org/T120527#2012804 (10Dzahn) added the ServerAlias in Apache config, silver answers for wikitech.m now, but merging the DNS change to switch... [21:25:42] Why do we use m. at all? ;) [21:25:55] (03CR) 10Thcipriani: [C: 031] "scap parts all look correct." [puppet] - 10https://gerrit.wikimedia.org/r/262742 (https://phabricator.wikimedia.org/T113072) (owner: 10Alexandros Kosiaris) [21:26:08] well m.
is different content than non-mdot, so caches need something to vary on [21:26:41] I would vary on the mf_useformat + stopmobileredirect cookies and the user-agent [21:27:20] the sane/normal case you'd expect from a cleanly-designed fresh app would be that m-dot and non-m-dot hostnames are what the app looks at for emitting desktop-vs-mobile content [21:27:24] couldn't you just assume a different default cookie value depending on UA, if the cookie is explicitly set then show that format instead? [21:27:38] and then on top of that, you may do UA detection and/or sticky cookies for automatically redirecting from one to the other [21:27:42] Adding ?useformat=true for 'mobile users' is my varnish hack [21:27:52] or would that be problematic because desktop users would never accidentally get to mobile pages? [21:27:56] which is basically what we have on the production wikis, but that functionality is split between MFE + Varnish [21:27:58] * Krenair grumbles [21:28:27] I actually don't want to mess with URLs, but there seems no other possibility for me to force the mobile version [21:29:24] the way it works for our production wikis is that MediaWiki always only sees requests for the desktop hostname (e.g. en.wikipedia.org). 
when a request comes in for en.m.wikipedia.org, we rewrite it in varnish to en.wikipedia.org and then also inject the header X-Subdomain: M [21:29:36] yeah [21:29:53] and then we do other magic in varnish to make sure the cache stays split between the two sets of content MW emits for en.wikipedia.org [21:30:15] If MobileFrontend could force the mobile version based on a header (like X-Use-MobileFrontend or whatever), that would be cool [21:30:27] it does, using the header X-Subdomain [21:30:38] 6operations, 7Mail, 7Mobile, 5Patch-For-Review: consolidate mailman redirects in exim aliases file - https://phabricator.wikimedia.org/T123581#2012819 (10Dzahn) Eliza has setup announcements@ (http://wmf.zendesk.com/requests/10081) removing it on our side [21:30:40] ah? /me looks [21:31:05] but really, it (it being some combination of MFE and MW-core) should just split based on configured desktop and mobile request hostnames and natively handle the m-dot request hostname without rewrite or X-Subdomain injection, IMHO. [21:33:08] !log ran package upgrades on wikitech-static [21:33:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:34:54] X-Subdomain works because wgMFMobileHeader is set to that I guess? [21:35:54] 6operations, 7Mail: Move most (all?) exim personal aliases to OIT - https://phabricator.wikimedia.org/T122144#2012834 (10Dzahn) [21:35:56] 6operations, 7Mail, 7Mobile, 5Patch-For-Review: consolidate mailman redirects in exim aliases file - https://phabricator.wikimedia.org/T123581#2012833 (10Dzahn) 5Open>3Resolved [21:36:12] 6operations, 7Mail, 7Mobile: consolidate mailman redirects in exim aliases file - https://phabricator.wikimedia.org/T123581#1933178 (10Dzahn) [21:37:08] 6operations, 7Mail: move grants aliases to OIT? - https://phabricator.wikimedia.org/T83791#2012848 (10Dzahn) Thank you for submitting a support request. Your request (#10080) has been received, and will be reviewed by our support staff soon. 
http://wmf.zendesk.com/requests/10080 [21:37:29] SPF|Cloud: I honestly have no idea, I try not to look at that code or my brain might explode :) [21:37:39] I just know how it works here from a blackbox perspective [21:38:01] at least it looks like wgMFMobileHeader is not used in MobileFrontend at all outside of phpunit tests [21:38:11] it is set in our config repo [21:38:12] (03CR) 1020after4: Add scap3 deployment option for services (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/269143 (owner: 10Thcipriani) [21:38:15] wmf-config/mobile.php: $wgMFMobileHeader = 'X-Subdomain'; [21:38:31] so you are using some magic I don't understand [21:39:01] includes/MobileContext.php: $mobileHeader = $this->getMFConfig()->get( 'MFMobileHeader' ); [21:39:04] includes/MobileContext.php: $mobileHeader = $config->get( 'MFMobileHeader' ); [21:39:08] ^ there's the missing link you're looking for [21:39:20] bah [21:39:58] I've literally spent weeks trying to make a proper VCL >_> [21:40:16] I don't know if recommending to read ours helps or hurts :) [21:40:32] you should blog as "the bblackbox perspective" [21:40:44] heh [21:41:22] 6operations, 10ops-eqiad: dbstore1002.mgmt.eqiad.wmnet: "No more sessions are available for this type of connection!" - https://phabricator.wikimedia.org/T119488#2012862 (10Cmjohnson) p:5Normal>3Triage [21:41:28] What I do know is that Varnish 4 sucks a little bit [21:41:40] :) [21:42:07] ema is working through our Varnish 3->4 transition, but starting with our simpler clusters, not text that handles this stuff yet [21:42:37] Yeah, I saw that. 
When setting up Varnish I had absolutely no experience with it, so I was literally copy-pasting some of your text cache stuff [21:42:51] if you ignore the zero-related stuff, the parts that handle m-dot are here in vcl_recv: https://github.com/wikimedia/operations-puppet/blob/production/templates/varnish/text-frontend.inc.vcl.erb#L155 [21:43:32] but this also critically goes with it: https://github.com/wikimedia/operations-puppet/blob/production/templates/varnish/text-backend.inc.vcl.erb#L56 [21:43:37] Not realizing that req.hash_ignore_busy does not make sense if you have only one layer (instead of a front- and backend), so guess how bad the site worked the first time [21:43:59] (that's where we rewrite the request hostname to the desktop hostname at the last second before talking to MW, so that varnish still differentiates the two hostnames for all caching purposes) [21:44:20] I think that if you refactor the mobile vcl a bit that it might make sense to put clear instructions on the mediawikiwiki help pages [21:44:51] in the medium term, yeah [21:45:22] in the long term, we should really work to undo the fact that a lot of the front edge of MW functionality is actually encoded in WMF apache and/or varnish configurations [21:45:34] by having MW do some of that work for itself, with URL routing and hostname handling, etc.... [21:45:53] there are some related tickets out there somewhere [21:47:01] another way to think of the same thing: a MW installation should be able to be fully functional on its own using some embedded HTTP listener, or worst case with a minimal/empty apache config forwarding into fastcgi or whatever. 
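[editor's note] The m-dot handling described in the discussion above (a request for en.m.wikipedia.org is cached under its own hostname, then rewritten to en.wikipedia.org with the header X-Subdomain: M just before it reaches MediaWiki) can be sketched roughly as follows. This is an illustrative Python model, not the actual VCL from operations-puppet; the function names and the simple `.m.` matching rule are assumptions made for the sketch.

```python
from typing import Dict, Tuple

# Assumption: mobile hostnames carry an ".m." label, e.g. en.m.wikipedia.org.
MOBILE_LABEL = ".m."


def is_mobile_host(host: str) -> bool:
    """Return True if the hostname contains the m-dot label (hypothetical rule)."""
    return MOBILE_LABEL in host


def cache_key(host: str, url: str) -> Tuple[str, str]:
    """Key cache lookups on the ORIGINAL hostname, so the mobile and desktop
    variants of the same URL never share a cache entry."""
    return (host.lower(), url)


def to_backend_request(host: str, headers: Dict[str, str]) -> Tuple[str, Dict[str, str]]:
    """Rewrite an m-dot request to the desktop hostname and mark it with
    X-Subdomain. Meant to run as the last step before contacting the backend,
    after all cache decisions have been made."""
    out = dict(headers)  # copy, so the caller's headers are not mutated
    if is_mobile_host(host):
        host = host.replace(MOBILE_LABEL, ".", 1)
        out["X-Subdomain"] = "M"
    return host, out
```

The point of doing the rewrite last is that the backend only ever sees the desktop hostname, while everything in front of it still distinguishes the two hostnames for caching purposes.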
[21:47:18] Biggest issue I had until now (which might not be experienced with your upgrade at all - it is possible that it's just an error of my side) is that sometimes purge doesn't work [21:47:24] (03PS1) 10Ottomata: Hardcoding path to refinery-camus versioned jar in mediawiki camus job [puppet] - 10https://gerrit.wikimedia.org/r/269538 [21:47:26] everything a production site does with more-complicated apache/varnish/nginx config should be about performance and operational concerns, not making things basically-work. [21:47:49] +1 [21:48:55] bblack, MF can do autodetection with a single config setting on wikis without HTTP caching [21:48:59] 6operations, 6Labs, 10wikitech.wikimedia.org: Update wikitech-static OS/PHP version - https://phabricator.wikimedia.org/T126385#2012874 (10Krenair) 3NEW [21:49:10] and we're actually using it on wikitech [21:49:14] MaxSem: can it differentiate on request hostname? [21:49:15] (03CR) 10Ottomata: [C: 032] Hardcoding path to refinery-camus versioned jar in mediawiki camus job [puppet] - 10https://gerrit.wikimedia.org/r/269538 (owner: 10Ottomata) [21:49:44] bblack, not outta the box but I think you can do it with hooks [21:50:19] (if not, then something else in front of it has to detect the m-dot hostname, rewrite it to the standard/desktop hostname, and inject X-Subdomain. 
and then if there's a cache in front, the cache has to be smart about varying on X-Subdomain or something similar) [21:51:33] (or not rewriting the hostname until after cache-differentiation is over with) [21:51:49] https://www.mediawiki.org/wiki/Extension:MobileFrontend/Configuring_browser_auto-detection [21:51:55] the main problem here is not in MF but in MW where lots of things can break in non-obvious ways if you serve stuff from 2 domains [21:52:06] yeah [21:52:21] I imagine the work to make it work "right" is on both sides, with new hooks and whatnot [21:52:55] maybe an early hook in MW (early as in near the start of an inbound request) for MF to examine and alter the incoming HTTP request's hostname [21:53:01] (03PS1) 10BryanDavis: Monolog: reorder Monolog processors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269539 (https://phabricator.wikimedia.org/T124985) [21:53:16] where it can be configured to s/\.m\.// and set some flag to emit mobile content for that request, which causes other MF code to be invoked later [21:53:54] 6operations: Adding/Removing users from enWP Arbcom Mailinglist archives - https://phabricator.wikimedia.org/T123787#2012898 (10Dzahn) >>! In T123787#1940361, @yuvipanda wrote: > If only we didn't use software from the early 90s... Yea, i also think we should get rid of the email protocol. ? >>! In T123787#200... [21:54:03] (then MW core still sees it as the desktop hostname, but it works right from the external world's perspective) [21:54:44] ostriches, greg-g: I've got a config change to fix a logging misconfiguration. Can I merge and sync it? 
-- https://gerrit.wikimedia.org/r/#/c/269539 [21:55:36] Fine by me [21:57:14] something along those lines anyways, should allow MW/MF to take over the stuff in varnish that currently rewrites mobile hostnames and sets X-Subdomain [21:59:21] ostriches: thx [21:59:57] (03CR) 10BryanDavis: [C: 032] Monolog: reorder Monolog processors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269539 (https://phabricator.wikimedia.org/T124985) (owner: 10BryanDavis) [22:00:52] (03Merged) 10jenkins-bot: Monolog: reorder Monolog processors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269539 (https://phabricator.wikimedia.org/T124985) (owner: 10BryanDavis) [22:04:04] !log bd808@mira Synchronized wmf-config/logging.php: Monolog: reorder Monolog processors (b356eeb) (duration: 02m 15s) [22:04:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:04:37] (03PS5) 10Ori.livneh: mediawiki: Clean up beta sites Apache configs [puppet] - 10https://gerrit.wikimedia.org/r/268578 (owner: 10Krinkle) [22:04:40] anomie, tgr: ^^ that fixed it [22:04:46] (03CR) 10Ori.livneh: [C: 032 V: 032] mediawiki: Clean up beta sites Apache configs [puppet] - 10https://gerrit.wikimedia.org/r/268578 (owner: 10Krinkle) [22:08:34] 6operations, 10Deployment-Systems, 6Performance-Team, 10Traffic, 5Patch-For-Review: Make Varnish cache for /static/$wmfbranch/ expire when resources change within branch lifetime - https://phabricator.wikimedia.org/T99096#2012953 (10Krinkle) [22:08:56] (03PS3) 10Aaron Schulz: Set initial $wgMaxUserDBWriteDuration value [mediawiki-config] - 10https://gerrit.wikimedia.org/r/260507 (https://phabricator.wikimedia.org/T95501) [22:09:04] (03PS5) 10Krinkle: [DONT MERGE] Set $wgResourceBasePath to "/w" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268715 (https://phabricator.wikimedia.org/T99096) [22:09:34] 6operations: Adding/Removing users from enWP Arbcom Mailinglist archives - https://phabricator.wikimedia.org/T123787#2012962 (10Dzahn) 
Kelapstick - done https://en.wikipedia.org/wiki/User_talk:Kelapstick#Arbcom [22:16:08] (03PS4) 10Krinkle: Set initial $wgMaxUserDBWriteDuration value [mediawiki-config] - 10https://gerrit.wikimedia.org/r/260507 (https://phabricator.wikimedia.org/T95501) (owner: 10Aaron Schulz) [22:20:49] PROBLEM - puppet last run on analytics1027 is CRITICAL: CRITICAL: puppet fail [22:22:36] 6operations, 6Phabricator, 7Mail: DomainKeys Identified Mail (DKIM) for phabricator.wikimedia.org - https://phabricator.wikimedia.org/T116805#2013031 (10Aklapper) >>! In T116805#1852017, @fgiunchedi wrote: > if I'm reading exim's configuration right For anybody else who wants to take a look: https://phabri... [22:23:35] (03PS1) 10Ottomata: Fix for old role::analytics::hive::server with new cdh module changes [puppet] - 10https://gerrit.wikimedia.org/r/269543 [22:25:55] (03CR) 10Ottomata: [C: 032] Fix for old role::analytics::hive::server with new cdh module changes [puppet] - 10https://gerrit.wikimedia.org/r/269543 (owner: 10Ottomata) [22:27:38] (03PS1) 10Ottomata: Another fix for old analytics role with new cdh module [puppet] - 10https://gerrit.wikimedia.org/r/269544 [22:27:53] (03CR) 10Ottomata: [C: 032 V: 032] Another fix for old analytics role with new cdh module [puppet] - 10https://gerrit.wikimedia.org/r/269544 (owner: 10Ottomata) [22:28:06] 6operations: Adding/Removing users from enWP Arbcom Mailinglist archives - https://phabricator.wikimedia.org/T123787#2013077 (10Jalexander) Sorry, I had not provided the links this time just because for Yuvi it was easiest for me to send the email out (and he provided it on terbium). Happy to have you do it thou... [22:28:42] 7Puppet, 6Phabricator, 5Patch-For-Review: Create puppet role for Phabricator hosted repo testing - https://phabricator.wikimedia.org/T104827#2013084 (10Aklapper) @Negative24 : Could you share the status of this task (as it's assigned to you)? 
[22:31:53] (03PS5) 10EBernhardson: Better mediawiki REPL [puppet] - 10https://gerrit.wikimedia.org/r/268541 [22:33:18] 6operations: Adding/Removing users from enWP Arbcom Mailinglist archives - https://phabricator.wikimedia.org/T123787#2013098 (10Dzahn) a:3Dzahn [22:41:11] 7Puppet, 6Phabricator, 5Patch-For-Review: Create puppet role for Phabricator hosted repo testing - https://phabricator.wikimedia.org/T104827#2013128 (10Negative24) I believe @chasemp was going to do something with the code in the change set. [22:49:05] banners are live [22:59:22] 6operations: Adding/Removing users from enWP Arbcom Mailinglist archives - https://phabricator.wikimedia.org/T123787#2013206 (10Dzahn) Ok, cool. Glad it works this way :) Drmies: https://en.wikipedia.org/wiki/User_talk:Drmies#arbcom Gamaliel: https://en.wikipedia.org/wiki/User_talk:Gamaliel#arbcom [22:59:40] 6operations: Adding/Removing users from enWP Arbcom Mailinglist archives - https://phabricator.wikimedia.org/T123787#2013207 (10Dzahn) 5Open>3Resolved [23:00:04] jynus: Respected human, time to deploy s2 database master switch (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160209T2300). Please do the needful. [23:01:00] (03PS3) 10Jcrespo: Enabling read only mode for s2 before its master failover [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269391 (https://phabricator.wikimedia.org/T125215) [23:01:09] any op still around? 
[23:01:50] (03CR) 10Jcrespo: [C: 031] Enabling read only mode for s2 before its master failover [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269391 (https://phabricator.wikimedia.org/T125215) (owner: 10Jcrespo) [23:01:59] sounds like mutante is here [23:02:18] !log changing topology of s2 slaves in preparation for master failover [23:02:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:03:39] after this I will "break" tendril/dbtree because it does not support circular replication yet [23:04:12] plus know we will start to get database read-only exceptions from the queue [23:04:21] due to a known mediawiki bug [23:04:40] PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL: CRITICAL: 64.00% of data above the critical threshold [5000000.0] [23:05:24] has the job queue for those wikis not been paused jynus? [23:05:36] it will not be paused [23:05:44] it should not affect it [23:06:15] but it refuses to run on non immediate slaves like these: https://tendril.wikimedia.org/tree [23:06:39] keep that handy, I will not break it [23:07:18] fatalmonitor is already complaining https://logstash.wikimedia.org/#/dashboard/elasticsearch/fatalmonitor [23:07:39] I'm around but not really [23:07:41] those do not affect end users directly [23:07:52] however if everything goes to hell then I guess I will be around :-D [23:08:09] (03PS1) 10Dzahn: admin: remove user jkrauska [puppet] - 10https://gerrit.wikimedia.org/r/269549 (https://phabricator.wikimedia.org/T126260) [23:09:13] (03CR) 10Dzahn: [C: 032] admin: remove user jkrauska [puppet] - 10https://gerrit.wikimedia.org/r/269549 (https://phabricator.wikimedia.org/T126260) (owner: 10Dzahn) [23:09:25] !log setting up circular replication between db1018 and db1024 for potential rollback [23:09:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:10:05] that worked [23:10:30] next item on checklist is start the actual downtime [23:10:43] not 
really downtime, "read-only" [23:12:20] (03CR) 10Jcrespo: [C: 032] Enabling read only mode for s2 before its master failover [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269391 (https://phabricator.wikimedia.org/T125215) (owner: 10Jcrespo) [23:13:34] _joe_: is there a chance some server was missed when you touched the symlinks for https://phabricator.wikimedia.org/T124440 ? [23:13:49] that would maybe explain https://phabricator.wikimedia.org/T126395 [23:13:59] although that's a very uncertain maybe [23:14:12] <_joe_> tgr: it might be the case, but pretty low chance [23:14:50] syncing [23:15:20] RECOVERY - Kafka Broker Replica Max Lag on kafka1012 is OK: OK: Less than 50.00% above the threshold [1000000.0] [23:15:32] <_joe_> tgr: I can touch it again, but given what thedj is reporting there, I doubt 1 server out of 100 being out of sync could have that effect [23:16:10] yeah, I am clearly grasping at straws here [23:16:14] <_joe_> tgr: we could ask thedj to use the debug extension and restart hhvm on mw1017 if the problem persists [23:16:41] plus, I don't see how that would cause a CSRF token fail [23:16:47] a full logout, maybe [23:16:47] <_joe_> tgr: this *might* have to do with us re-inserting a redis sessions server in rotation [23:16:52] !log jynus@mira Synchronized wmf-config/db-eqiad.php: Enabling read only mode for s2 before its master failover (duration: 02m 14s) [23:16:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:17:02] <_joe_> see the SAL for times [23:17:08] it probably does, in some form [23:17:16] (03PS1) 10Jcrespo: Master failover for s2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269553 [23:17:18] can we edit on pt? [23:17:18] <_joe_> tgr: but then again, doesn't make that much sense [23:17:25] the SAL log mentioned in the task correlates perfectly [23:17:37] but as you say, that doesn't explain anything [23:17:46] <_joe_> correlate with what? 
[23:17:48] no we cannot, next phase
[23:17:48] the graph should go down with time
[23:18:23] <_joe_> oh ok
[23:18:27] if you check the graph that's linked, the spike at the end starts exactly when the redis server is put back
[23:18:31] writes continue, I suppose from background threads
[23:18:37] <_joe_> tgr: and yeah, that spike is expected
[23:18:41] !log setting db1024 in read only mode
[23:18:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:19:19] <_joe_> so what happens when you put a server back in rotation is keys get rebalanced
[23:19:25] <_joe_> but let me check that server is working
[23:19:33] indeed
[23:19:46] !log Changed /src/mediawiki/wikiverisons.php on mw1017 (X-Wikimedia-Debug) to set all wikis to 1.27.0-wmf.13
[23:19:48] (CR) Jcrespo: [C: 2] Master failover for s2 [mediawiki-config] - https://gerrit.wikimedia.org/r/269553 (owner: Jcrespo)
[23:19:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:19:57] plus, it probably still had the old unsalted user tokens when put back
[23:20:45] but 1) the spike should be decreasing 2) that doesn't explain how the same user can repeatedly get session losses
[23:21:10] <_joe_> ok that server is just empty
[23:21:11] fatalmonitor is relativelly happy
[23:21:22] <_joe_> tgr: this explains what you're seeing
[23:21:37] sync taking forever again
[23:21:55] <_joe_> tgr: so I'd say either some opsen in usa takes a look, or we just revert the change to nutcracker
[23:22:16] jynus: yeah :/ it's not fast to rsync 250K files between the masters
[23:22:23] oh, I agree
[23:22:30] but I will discuss that later
[23:22:32] :-)
[23:22:45] !log jynus@mira Synchronized wmf-config/db-eqiad.php: Actual mediawiki master failover (duration: 02m 14s)
[23:22:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:22:51] I am not complaining about the deployment tool
[23:23:14] the times have been 1-2m since
last week
[23:23:15] <_joe_> tgr: tcp 0 0 127.0.0.1:6379 0.0.0.0:* LISTEN 1549/redis-server 1
[23:23:18] <_joe_> shit
[23:23:22] <_joe_> ok, fixing this
[23:23:23] any pages rerendered due to job queue mean it tries to update page_links_something
[23:23:30] so that could be a cause of writes too
[23:23:31] anyways
[23:23:41] _joe_: thanks!
[23:23:56] !log setting db1018 in read/write mode
[23:23:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:24:26] too many writes in read-only mode
[23:25:04] time for mediawiki to return to r/w and test the shit out of it
[23:25:16] (CR) Mobrovac: [C: -1] Add scap3 deployment option for services (2 comments) [puppet] - https://gerrit.wikimedia.org/r/269143 (owner: Thcipriani)
[23:25:16] :-)
[23:26:11] (PS1) Jcrespo: Set mediawiki back in read-only mode [mediawiki-config] - https://gerrit.wikimedia.org/r/269557
[23:26:43] jynus: read write mode ;)
[23:27:09] (CR) Jcrespo: "I mean read-write" [mediawiki-config] - https://gerrit.wikimedia.org/r/269557 (owner: Jcrespo)
[23:27:23] (CR) Jcrespo: [C: 2] Set mediawiki back in read-only mode [mediawiki-config] - https://gerrit.wikimedia.org/r/269557 (owner: Jcrespo)
[23:27:31] <_joe_> !log disabled puppet on mc1004, added "bind 0.0.0.0" to its redis config, restarted redis (T126395)
[23:27:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:28:43] "Warning: Invalid argument: function: method 'ob_gzhandler' not found in
[23:29:00] /srv/mediawiki/docroot/noc/conf/highlight.php on line 38"
[23:29:08] * bd808 looks into that
[23:29:24] how are those logs coming?
[23:29:39] too silent for my taste
[23:30:20] !log jynus@mira Synchronized wmf-config/db-eqiad.php: Disable read only mode for s2 after its master failover (duration: 02m 09s)
[23:30:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:30:39] writes work again
[23:30:51] <_joe_> tgr: the errors should go up for a few and then go down
[23:31:32] (PS1) Andrew Bogott: Add wgOpenStackManagerNovaIdentityV3URI to wikitech configs. [mediawiki-config] - https://gerrit.wikimedia.org/r/269558
[23:31:47] I'm looking at writes on pt wp it seems
[23:32:20] yep coming in fine
[23:33:55] small incident on db1024, incompatible binlog formats
[23:33:58] fixed now
[23:34:04] (PS1) Subramanya Sastry: ruthenium: puppetize script to update parsoid + restart services [puppet] - https://gerrit.wikimedia.org/r/269559
[23:34:09] not a huge issue because it is depooled
[23:34:27] bah wikibugs
[23:34:39] <_joe_> thedj: still here?
[23:34:44] apergos: my fault :)
[23:34:57] things seem pretty stable
[23:35:03] this I gotta hear, greg-g :-D
[23:35:06] <_joe_> tgr: the obvious issue is not fixed I guess
[23:35:21] there will be a lot of tuning now, but I am not worried about that now
[23:35:25] apergos: I bulk edited 48 tasks, which kills wikibugs
[23:35:29] hahahaha
[23:35:32] I guess it would
[23:35:42] well you get to bring it back from the dead then
[23:35:51] (I don't have access onwhatever tool thingie that is)
[23:36:05] it just comes back automatically (in -labs, then as needed in other channels)
[23:36:13] what does one do to 48 tasks at once anyways?
[23:36:25] in fact, the only db logs that I get now are in mediawikiwiki
[23:36:28] moving projects after I created a new/better one
[23:36:42] ah makes sense
[23:36:46] translate_groupstats
[23:36:48] apergos: https://phabricator.wikimedia.org/T126261 you might be interested and/or will also be spammed in email :)
[23:36:51] (PS2) Subramanya Sastry: ruthenium: puppetize script to update parsoid + restart services [puppet] - https://gerrit.wikimedia.org/r/269559
[23:37:16] I flag the trebushet stuff as salt (project)
[23:37:24] there's not so much in there so
[23:38:14] !log setting db1018's binlog_format as STATEMENT
[23:38:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:38:51] added myself anyways
[23:39:14] (PS23) 20after4: Puppet provider for scap3 [puppet] - https://gerrit.wikimedia.org/r/262742 (https://phabricator.wikimedia.org/T113072) (owner: Alexandros Kosiaris)
[23:39:16] (PS1) 20after4: scap::target to configure scap3 deployment repository and deploy-user. [puppet] - https://gerrit.wikimedia.org/r/269560 (https://phabricator.wikimedia.org/T113072)
[23:39:18] (PS1) 20after4: Clean up phabricator roles in puppet to remove tag pinning.
[puppet] - https://gerrit.wikimedia.org/r/269561 (https://phabricator.wikimedia.org/T125851)
[23:40:25] tendril/dbtree should work now, althoug pointing to the wrong master, fixing
[23:40:35] operations, Release-Engineering-Team, Services, Trebuchet: `git deploy service restart` asked for sudo password - https://phabricator.wikimedia.org/T126359#2013457 (greg)
[23:40:37] Puppet, Trebuchet: Trebuchet master should be separate from scap - https://phabricator.wikimedia.org/T96042#2013458 (greg)
[23:40:39] operations, Trebuchet: git fat/git deploy doesn't always unstub files [Trebuchet] - https://phabricator.wikimedia.org/T98962#2013459 (greg)
[23:40:44] (PS2) Jcrespo: s2-master now points to db1018 (instead of db1024) [dns] - https://gerrit.wikimedia.org/r/269381 (https://phabricator.wikimedia.org/T125215)
[23:40:56] (CR) Jcrespo: [C: 2] s2-master now points to db1018 (instead of db1024) [dns] - https://gerrit.wikimedia.org/r/269381 (https://phabricator.wikimedia.org/T125215) (owner: Jcrespo)
[23:40:57] operations, Salt, Trebuchet, Patch-For-Review: [Trebuchet] Salt times out on parsoid restarts - https://phabricator.wikimedia.org/T63882#2013469 (greg)
[23:42:09] (PS3) Tim Starling: parsoid-vd-client: Set screenShotDelay to 2 seconds [puppet] - https://gerrit.wikimedia.org/r/269314 (owner: Subramanya Sastry)
[23:42:19] (CR) Tim Starling: [C: 2] parsoid-vd-client: Set screenShotDelay to 2 seconds [puppet] - https://gerrit.wikimedia.org/r/269314 (owner: Subramanya Sastry)
[23:42:28] (PS1) BryanDavis: Remove ob_start() from docroot/noc/conf/highlight.php [mediawiki-config] - https://gerrit.wikimedia.org/r/269562
[23:42:56] (CR) 20after4: "the last patch set was unintentional but it contained no changes - only a rebase" [puppet] - https://gerrit.wikimedia.org/r/262742 (https://phabricator.wikimedia.org/T113072) (owner: Alexandros Kosiaris)
[23:43:10] there it goes, mediawiki bak in all its glory
https://tendril.wikimedia.org/tree
[23:43:34] https://dbtree.wikimedia.org/ for non NDAs
[23:43:47] now with more SSL!
[23:43:57] with MariaDB10!
[23:44:11] _joe_: you mean the session failure count?
[23:44:15] <_joe_> yes
[23:44:22] why are labs and dbstore etc. excluded from dbtree?
[23:45:14] as in if there is any specific reason no, why I think it is- because it runs no production code
[23:45:18] (CR) 20after4: "I've submitted the same changes to `scap::target` in a separate change, maybe we can merge that before this one?" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/269143 (owner: Thcipriani)
[23:45:33] so people can see if production is down or not
[23:45:54] it might be useful to have the labs dbs in a labs view separately
[23:45:57] but I accept a patch of a toggle "show only production servers"
[23:45:58] when someone had free time
[23:46:16] Krenair, do not think what dbtree can do for you
[23:46:22] haha
[23:46:27] bout what you can do for dbtree!
[23:46:36] show me da patch!
[23:46:39] :-)
[23:46:42] heh
[23:47:04] I would love for that to be interactive and do everthing I did here with a mouse click
[23:47:05] _joe_ tgr don't have much oppertunity to test, i'm already in bed and about to close the lid on the laptop, but i just added a space char, and it saved instantly
[23:47:12] we are not yet there
[23:47:32] <_joe_> thedj: ok it's pretty late indeed :)
[23:47:50] PROBLEM - cassandra-a service on praseodymium is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[23:48:01] <_joe_> errors are down to the norm afaics
[23:48:06] <_joe_> since 15 minutes
[23:48:26] cool, let's call this fixed then
[23:48:47] I have some exceptions on the logs
[23:49:00] PROBLEM - cassandra-a CQL 10.64.16.188:9042 on praseodymium is CRITICAL: Connection refused
[23:49:21] Database is read-only: The database has been automatically locked while the slave database servers catch up to the master.
[23:49:27] oh
[23:49:33] operations, MediaWiki-Page-editing: Unexplained edit token errors - https://phabricator.wikimedia.org/T126395#2013504 (Joe) p:Triage>High
[23:49:44] I do not know if they are retries of the queue
[23:49:55] that error out later, or they are real
[23:50:12] they are all from rpc
[23:50:33] catchup should be fast if they are real
[23:51:14] when does the queue servers reset its config?
[23:51:16] got a sample one?
[23:51:26] /rpc/RunJobs.php?wiki=enwiktionary&type=refreshLinks&maxtime=30&maxmem=300M
[23:51:34] at 23:48
[23:51:42] Database is read-only: The database has been automatically locked while the slave database servers catch up to the master.
[23:51:45] oh runjobs.php ... ugh
[23:51:46] um
[23:51:49] lemme see something
[23:51:51] operations, MediaWiki-Page-editing: Unexplained edit token errors - https://phabricator.wikimedia.org/T126395#2013518 (Joe) I see the error count has normalized since 12:30, so I guess my manual action (I disabled puppet on the server and added a 'bind 0.0.0.0' rule by hand) had a positive effect. I'll cl...
[23:52:15] operations, MediaWiki-Page-editing: Unexplained edit token errors - https://phabricator.wikimedia.org/T126395#2013531 (Joe) a:Joe
[23:52:30] <_joe_> you might need to restart 'jobrunner' and 'jobchron'
[23:52:30] operations, Performance-Team, scap, HHVM, Scap3: Make scap able to depool/repool servers via the conftool API - https://phabricator.wikimedia.org/T104352#2013535 (greg)
[23:52:39] <_joe_> as they are long-running processes
[23:52:59] <_joe_> ori and AaronSchulz can help :)
[23:53:00] ok, trying
[23:53:14] actually
[23:53:18] it is going down now
[23:53:31] <_joe_> jynus: well I'd check with them anyways
[23:53:37] yes
[23:53:40] of course
[23:53:41] <_joe_> it could also be long running jobs
[23:54:03] <_joe_> that happened to last several minutes
[23:54:06] <_joe_> who knows?
[23:54:08] <_joe_> :)
[23:54:10] restarting the services is easy and safe:
[23:54:23] sudo salt -G 'cluster:jobrunner' cmd.run 'service jobrunner status | grep running && service jobrunner restart'
[23:54:26] I think as soon as one of those finishes with its list of refreshes the next one will be ok
[23:54:26] ok, tell me because even if I think is no longer an issue
[23:54:29] sudo salt -G 'cluster:jobrunner' cmd.run 'service jobchron status | grep running && service jobchron restart'
[23:54:33] I need to know
[23:54:51] the 'grep running' test is to avoid starting the services on codfw where they are not enabled
[23:55:09] the refreshlinks ones take longer
[23:55:27] !log restarting jobrunner and jobchron
[23:55:29] so you should see mostly those
[23:55:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:56:07] oh it is relativelly fast
[23:56:09] I don't think the restart will make any difference but it doesn't hurt
[23:56:50] I think it was already fixed by the time I noticed it
[23:57:01] but if it continues, we can discard that
[23:57:05] (PS1) EBernhardson: Increase completion suggester replicas for busy wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/269568 (https://phabricator.wikimedia.org/T125667)
[23:57:47] operations, Trebuchet: pmtpa remnants in trebuchet redis - https://phabricator.wikimedia.org/T111301#2013580 (greg)
[23:58:05] * _joe_ off
[23:58:25] connection errors are to non-s2 servers, and were already realativelly common
[23:58:30] RECOVERY - cassandra-a service on praseodymium is OK: OK - cassandra-a is active
[23:58:50] so I think it went pretty good
[23:59:01] my biggest fear now is a long-term issue
[23:59:34] 2 days later, a replication incompatibility, or "oh, we forgot to do this"