[00:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor I � Unicode. All rise for Evening SWAT (Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171116T0000). [00:00:04] No GERRIT patches in the queue for this window AFAICS. [00:01:57] (03PS1) 10RobH: adding jdrewniak to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/391732 (https://phabricator.wikimedia.org/T180639) [00:03:57] (03PS5) 10Thcipriani: Scap: scap_source correct gid [puppet] - 10https://gerrit.wikimedia.org/r/361796 [00:08:30] (03PS1) 10Dzahn: pmacct: move firewall to role, use profile [puppet] - 10https://gerrit.wikimedia.org/r/391735 [00:13:44] (03PS1) 10Dzahn: openldap: move firewall/standard to roles, use profile [puppet] - 10https://gerrit.wikimedia.org/r/391737 [00:15:06] (03PS2) 10Dzahn: openldap: move firewall/standard to roles, use profile [puppet] - 10https://gerrit.wikimedia.org/r/391737 [00:16:01] (03CR) 10Dzahn: [C: 032] pmacct: move firewall to role, use profile [puppet] - 10https://gerrit.wikimedia.org/r/391735 (owner: 10Dzahn) [00:23:46] (03PS1) 10Dzahn: icinga,deployment_server: use profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/391739 [00:24:10] (03PS1) 10Madhuvishy: bootstrapvz-stretch: Remove backports from sources [puppet] - 10https://gerrit.wikimedia.org/r/391740 (https://phabricator.wikimedia.org/T158583) [00:24:39] (03CR) 10jerkins-bot: [V: 04-1] bootstrapvz-stretch: Remove backports from sources [puppet] - 10https://gerrit.wikimedia.org/r/391740 (https://phabricator.wikimedia.org/T158583) (owner: 10Madhuvishy) [00:25:39] (03PS1) 10Madhuvishy: bootstrapvz-stretch: Remove backports from sources [puppet] - 10https://gerrit.wikimedia.org/r/391741 (https://phabricator.wikimedia.org/T158583) [00:25:51] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in 
puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [00:26:34] (03Abandoned) 10Madhuvishy: bootstrapvz-stretch: Remove backports from sources [puppet] - 10https://gerrit.wikimedia.org/r/391740 (https://phabricator.wikimedia.org/T158583) (owner: 10Madhuvishy) [00:27:00] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. [00:27:52] (03PS1) 10Dzahn: ci::master/firewall: move base::firewall to role [puppet] - 10https://gerrit.wikimedia.org/r/391742 [00:31:12] (03PS1) 10Dzahn: piwik: move firewall to role, include vs instantiate [puppet] - 10https://gerrit.wikimedia.org/r/391744 [00:34:43] (03PS3) 10Dzahn: webperf: Skip upper limits for values in navtiming2 [puppet] - 10https://gerrit.wikimedia.org/r/391496 (https://phabricator.wikimedia.org/T104902) (owner: 10Phedenskog) [00:35:15] (03PS2) 10Madhuvishy: bootstrapvz-stretch: Remove backports from sources [puppet] - 10https://gerrit.wikimedia.org/r/391741 (https://phabricator.wikimedia.org/T158583) [00:35:21] (03CR) 10Dzahn: [C: 032] webperf: Skip upper limits for values in navtiming2 [puppet] - 10https://gerrit.wikimedia.org/r/391496 (https://phabricator.wikimedia.org/T104902) (owner: 10Phedenskog) [00:41:26] (03CR) 10Dzahn: [C: 031] bootstrapvz-stretch: Remove backports from sources [puppet] - 10https://gerrit.wikimedia.org/r/391741 (https://phabricator.wikimedia.org/T158583) (owner: 10Madhuvishy) [00:43:07] (03PS3) 10Madhuvishy: bootstrapvz-stretch: Remove backports from sources [puppet] - 10https://gerrit.wikimedia.org/r/391741 (https://phabricator.wikimedia.org/T158583) [00:43:12] (03CR) 10Madhuvishy: [V: 032 C: 032] bootstrapvz-stretch: Remove backports from sources [puppet] - 10https://gerrit.wikimedia.org/r/391741 (https://phabricator.wikimedia.org/T158583) (owner: 10Madhuvishy) [00:48:08] (03PS1) 10Chad: Clean up branch referencing logic, should fix scap prep on master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391745 [00:50:51] 
(03PS5) 10Dzahn: Phabricator: Override the frog token's label [puppet] - 10https://gerrit.wikimedia.org/r/371660 (https://phabricator.wikimedia.org/T173208) (owner: 10Greg Grossmeier) [00:51:17] (03CR) 10Dzahn: [C: 032] Phabricator: Override the frog token's label [puppet] - 10https://gerrit.wikimedia.org/r/371660 (https://phabricator.wikimedia.org/T173208) (owner: 10Greg Grossmeier) [00:54:10] (03CR) 10Dzahn: "also: https://gerrit.wikimedia.org/r/#/c/383375/" [puppet] - 10https://gerrit.wikimedia.org/r/391741 (https://phabricator.wikimedia.org/T158583) (owner: 10Madhuvishy) [00:56:02] (03CR) 10Thcipriani: [C: 031] Clean up branch referencing logic, should fix scap prep on master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391745 (owner: 10Chad) [00:58:08] (03CR) 10Dzahn: "https://phabricator.wikimedia.org/T173233 is resolved and closed" [puppet] - 10https://gerrit.wikimedia.org/r/371663 (https://phabricator.wikimedia.org/T173233) (owner: 10Greg Grossmeier) [00:58:54] (03Abandoned) 10Dzahn: Add addshore to contint-admins [puppet] - 10https://gerrit.wikimedia.org/r/371663 (https://phabricator.wikimedia.org/T173233) (owner: 10Greg Grossmeier) [00:59:31] !log Removing 2FA from Twotwo2019 (T180438) [00:59:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:59:39] T180438: Reset 2 Factor Authentication for twotwo2019@kowiki - https://phabricator.wikimedia.org/T180438 [00:59:57] !log Removing 2FA from AuburnPilot (T180654) [01:00:04] twentyafterfour: #bothumor I � Unicode. All rise for Phabricator update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171116T0100). [01:00:04] No GERRIT patches in the queue for this window AFAICS. 
[01:00:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:00:06] T180654: Disable Two-factor authentication for user AuburnPilot (enwiki) - https://phabricator.wikimedia.org/T180654 [01:00:19] (03PS2) 10Chad: Clean up branch referencing logic, should fix scap prep on master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391745 [01:01:51] 10Operations, 10Cassandra, 10Services (doing), 10User-Eevans: Aberrant load on instances involved in recent bootstrap - https://phabricator.wikimedia.org/T180568#3765369 (10Eevans) 05Open>03Resolved a:03Eevans These nodes have been stable now for some time so I'll close this ticket to signal that it... [01:03:05] (03CR) 10Chad: [C: 032] Clean up branch referencing logic, should fix scap prep on master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391745 (owner: 10Chad) [01:04:25] (03Merged) 10jenkins-bot: Clean up branch referencing logic, should fix scap prep on master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391745 (owner: 10Chad) [01:05:51] !log demon@tin Synchronized scap/plugins/prep.py: no-op, co-master sync (duration: 00m 55s) [01:05:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:06:22] (03CR) 10jenkins-bot: Clean up branch referencing logic, should fix scap prep on master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391745 (owner: 10Chad) [01:22:40] !log updating phabricator (belatedly) [01:22:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:28:55] !log Phabricator will be offline for a couple of minutes while I apply database migrations. [01:29:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:29:12] ah, cool, I won't report the 502 then! 
:) [01:29:21] 503* [01:29:40] quiddity: yeah hopefully it doesn't take long, the migration script is going to run a bit longer than usual I think [01:29:41] PROBLEM - https://phabricator.wikimedia.org on phab2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - string focus on bug not found on https://phabricator.wikimedia.org:443https://phabricator.wikimedia.org/ - 2210 bytes in 0.015 second response time [01:30:11] quiddity: Now if that 503 increases to like 504 or 505....? [01:30:12] :p [01:30:17] That's how it works, right? [01:30:33] Phabricator is 503ing for me. Guess you know already. [01:30:52] error inflation! if we can get deflation, we could go down to error #420 [01:31:01] andre__, yup :) [01:31:04] upgrades [01:31:52] Oh it's that time of the week, true. Thanks [01:31:59] * andre__ lost in timezones [01:33:03] twentyafterfour: Ideally we could depool them or something and report a more sane error like "Down for maintenance, back in a jiffy" [01:33:18] I tried to do something like that for Gerrit, but tbh it was more hassle than it was worth considering the length of downtime usually [01:35:04] only twitter has an error 420?? https://developer.twitter.com/en/docs/basics/response-codes -- Hmm, maybe someone will submit something good for the 40th anniversary of https://en.wikipedia.org/wiki/April_Fools%27_Day_Request_for_Comments [01:37:02] Spring Framework too, apparently: https://en.wikipedia.org/wiki/List_of_HTTP_status_codes#Unofficial_codes [01:37:37] Interestingly, Twitter using 420 for 429...I wonder if rate-limit errors via the API should do similar....
[01:38:03] Ah, ThrottledError already does that [01:38:45] But ErrorPageError which works ugly outside of the web ui [01:38:50] (I found that recently in cli stuff) [01:42:41] Nooo not spring framework [01:42:55] Eh, may work ok in the api.php [01:44:06] ACKNOWLEDGEMENT - https://phabricator.wikimedia.org on phab2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - string focus on bug not found on https://phabricator.wikimedia.org:443https://phabricator.wikimedia.org/ - 2210 bytes in 0.010 second response time 20after4 maintenance [01:45:35] man this migration is slow [01:46:19] Upstream needs to get a bigger codebase so that they will have realistically representative performance...then they might try to optimize this stuff more [01:46:51] are you asking us to fill up their database with lots of feature requests and bug reports? ;) [01:47:13] progress: 433494 rows out of 751477 [01:47:36] legoktm: in this instance, it would require hundreds of repositories with hundreds of commits [01:47:44] er hundreds of thousands of commits I mean [01:48:21] ah [01:48:35] they have a measly 5 repositories or something at phacility [01:49:02] we have 1912 [01:51:10] almost done... 650k out of 750k [01:51:42] twentyafterfour: Tbh, we should look at cleaning some of this up. Only mirror what "matters" [01:51:49] 99 git repositories on the wall, 99 git repositories, take one down, pass it 'round, 98 git repositories on the wall [01:52:06] AndyRussG: No no, you take one down, pass it around, now there's 147 git repos [01:52:14] For each one you take down, you gain at least 48 or so [01:52:23] hehehe good point yes [01:53:13] if only you could fork other things to make more of them so easily [01:53:41] Anyway, I'm out.
It's beer o'clock [01:53:56] no_justification: good night :) [01:53:57] enjoy [01:57:21] I don't want to intervene while you're in the middle of this, but just a quick note [01:57:39] that we should figure out a way to communicate similar downtimes in the future better [01:58:09] paravoid: I agree, though it's actually hard to predict the duration of the downtime [01:58:22] upstream gives some indication but their guidance was ~ 1.5 minutes [01:58:43] yeah, even a better error page would go a long way [01:59:15] paravoid: can we depool like no_justification suggested? Would that change the error page? [01:59:16] I was a bit worried and came in here to check :) [02:00:09] I'm actually not sure why it's still running at this point, all of the database rows appear to be there. I guess I can start apache while it finishes garbage collecting or whatever it's doing [02:00:27] I'm not saying this to put pressure on you to finish :) [02:01:07] just mentioning it for the future, we should brainstorm on ways to communicate this better [02:01:59] paravoid: ideally there'd be an easy flag we could set from the service side that varnish could look for and respond nicer [02:02:37] Gerrit, even without varnish, does the same opaque 503s [02:02:59] We built a flag thing in but it requires a puppet change to enact [02:03:11] Which... Kind of takes awhile when Gerrit is what's down! [02:03:28] let's file a phab task for all that [02:03:29] oh wait :P [02:03:49] but yeah, when phab gets back up let's discuss on a task [02:07:54] FINALLY finished that one migration [02:08:00] now just a few more (fast ones) to go [02:09:08] aaaannd done. 
[02:09:44] !log phabricator database migrations complete, service is back online [02:09:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:09:51] RECOVERY - https://phabricator.wikimedia.org on phab2001 is OK: HTTP OK: HTTP/1.1 200 OK - 34523 bytes in 0.265 second response time [02:11:52] thanks, twentyafterfour :) [02:12:36] quiddity: np, sorry for the interruption [02:13:00] paravoid: you want me to make a task for better error pages / phab downtime notifications? [02:13:38] yeah if you agreed [02:13:40] *agree [02:13:42] sounds like gerrit too? [02:14:37] indeed, it would be great to have an alternative error page (plus some prior warning in the instances when long downtime can be anticipated) [02:15:45] The gerrit implementation varies based on a hiera property called $maint_mode [02:17:44] Which is nice, but the hoop to jump through to change that, merge, apply, do your maintenance, then undo it is kinda a PITA [02:17:57] Especially in gerrit's case, where you can't undo it until the service is reachable again [02:23:17] 10Operations, 10Gerrit, 10Phabricator, 10Traffic, 10periodic-update: Phabricator and Gerrit: Improve the way that maintenance downtime is communicated to users. - https://phabricator.wikimedia.org/T180655#3765427 (10mmodell) [02:23:24] 10Operations, 10Gerrit, 10Phabricator, 10Traffic, 10periodic-update: Phabricator and Gerrit: Improve the way that maintenance downtime is communicated to users. - https://phabricator.wikimedia.org/T180655#3765440 (10mmodell) p:05Triage>03Normal [02:23:26] https://phabricator.wikimedia.org/T180655 [02:24:43] no_justification: yeah it would be nice if we had a way to do it without merging a patch in operations/puppet ... [02:25:04] paravoid: no_justification: T180655 [02:25:05] T180655: Phabricator and Gerrit: Improve the way that maintenance downtime is communicated to users.
- https://phabricator.wikimedia.org/T180655 [02:25:25] I wonder if we could do something fancy with a file on disk with apache directives... [02:25:42] if ( /etc/maint-mode ) { responseWithDowntimeMessage(); } [02:25:58] Although those file-existence checks might be cached at startup? [02:25:59] idk [02:26:23] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.7) (duration: 07m 54s) [02:26:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:27:14] 10Operations, 10Gerrit, 10Phabricator, 10Traffic, 10periodic-update: Phabricator and Gerrit: Improve the way that maintenance downtime is communicated to users. - https://phabricator.wikimedia.org/T180655#3765442 (10mmodell) [02:28:13] no_justification: yes we could do something like that, I think... but if it's not cached then we incur the cost of a stat() for each request (probably not too big of a deal but not ideal either) [02:28:32] RewriteCond can do file existence checks [02:28:53] In gerrit's case, you're going to hit the disk on almost any request [02:28:55] Anyway [02:30:25] I'm filing a task about the Wikimedia Error page having a layout error... what #tag would I use - just #operations? (I can't find prior discussions, because "error page" is a hard keyphrase to search for in a tracking tool!) [02:31:09] It depends on which one! [02:31:16] Some are in wmf-config, some are in puppet [02:31:18] :D [02:32:41] /o\ ok, #general-or-unknown it is! https://phabricator.wikimedia.org/T180656 [02:32:49] twentyafterfour: Environment variables are easy, but might require apache restarts. [02:34:12] quiddity: Ah, that one would definitely be varnish [02:34:14] :) [02:34:52] no_justification, #varnish tag was archived... [02:35:17] I meant it's served by varnish.
Error page itself is in puppet...somewhere [02:35:49] 10Operations, 10wikidiff2, 10Patch-For-Review, 10User-Addshore, and 2 others: Update and use php-wikidiff2 1.5.1 & MovedParagraphDetectionCutoff in production - https://phabricator.wikimedia.org/T177891#3765457 (10Legoktm) In Timeless if you hover over the arrows the dots show up underneath: https://www.me... [02:36:01] oic. I clearly need food. thanks :) [02:39:24] 10Operations, 10Gerrit, 10Phabricator, 10Traffic, 10periodic-update: Phabricator and Gerrit: Improve the way that maintenance downtime is communicated to users. - https://phabricator.wikimedia.org/T180655#3765463 (10demon) Ideas for implementation from IRC: * Environment variables--might require Apache r... [02:41:04] Brain dumped a little [02:41:10] Ok, I'm out for realz this time [02:41:23] (03PS1) 10Tim Starling: In furl use /usr/bin/php instead of php5 [puppet] - 10https://gerrit.wikimedia.org/r/391748 [03:01:21] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0 [03:25:01] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 751.43 seconds [03:34:30] PROBLEM - MariaDB Slave Lag: m3 on db2012 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 318.85 seconds [03:48:51] PROBLEM - MariaDB Slave Lag: m3 on db1059 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 303.05 seconds [04:01:20] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 210.52 seconds [04:07:20] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [04:37:02] 10Operations, 10Traffic: Change "CP" cookie from subdomain to project level - https://phabricator.wikimedia.org/T180407#3765500 (10Krinkle) >>! 
In T180407#3761847, @BBlack wrote: > Does RL make use of the CP cookie information to use different module-loading strategies for H/1 vs H/2? I remember that being th... [04:57:05] (03CR) 10Krinkle: [C: 031] In furl use /usr/bin/php instead of php5 [puppet] - 10https://gerrit.wikimedia.org/r/391748 (owner: 10Tim Starling) [05:01:16] (03CR) 10Krinkle: [WIP] Split profile.php from StartProfiler, and create PhpAutoPrepend.php (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391162 (https://phabricator.wikimedia.org/T180183) (owner: 10Krinkle) [05:01:39] (03PS5) 10Krinkle: Split profile.php from StartProfiler, and create PhpAutoPrepend.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391162 (https://phabricator.wikimedia.org/T180183) [05:08:28] (03PS6) 10Krinkle: Split profile.php from StartProfiler, and create PhpAutoPrepend.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391162 (https://phabricator.wikimedia.org/T180183) [05:10:35] (03PS7) 10Krinkle: Split profile.php from StartProfiler, and create PhpAutoPrepend.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391162 (https://phabricator.wikimedia.org/T180183) [05:19:28] 10Operations, 10Domains, 10Traffic: Domain Hacks - https://phabricator.wikimedia.org/T180657#3765537 (10UpsandDowns1234) [05:19:47] (03CR) 10Tim Starling: [C: 031] "Looks good, can be merged. 
I suggest self-merging when you are ready to do a quick test in beta, followed by production deployment when te" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391162 (https://phabricator.wikimedia.org/T180183) (owner: 10Krinkle) [05:25:22] 10Operations, 10Domains, 10Traffic: Purchase domains mediawi.ki and media.wiki to use as a url shortener - https://phabricator.wikimedia.org/T180657#3765540 (10Bawolff) [05:27:41] PROBLEM - Check Varnish expiry mailbox lag on cp4026 is CRITICAL: CRITICAL: expiry mailbox lag is 2119037 [05:43:30] RECOVERY - MariaDB Slave Lag: m3 on db1059 is OK: OK slave_sql_lag Replication lag: 9.23 seconds [05:45:17] 10Operations, 10Domains, 10Traffic: Purchase domains mediawi.ki and media.wiki to use as a url shortener - https://phabricator.wikimedia.org/T180657#3765523 (10greg) w.wiki is the current domain that we'll use for short links, per T108649 (and T108557). Do we need more? [05:57:54] (03PS4) 10Marostegui: install_server: Reimage db1101 as stretch [puppet] - 10https://gerrit.wikimedia.org/r/391527 (https://phabricator.wikimedia.org/T178359) [05:58:38] (03CR) 10Marostegui: [C: 032] install_server: Reimage db1101 as stretch [puppet] - 10https://gerrit.wikimedia.org/r/391527 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [06:01:03] (03PS1) 10Marostegui: db1101.yaml: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/391752 (https://phabricator.wikimedia.org/T178359) [06:02:05] (03PS2) 10Marostegui: db1101.yaml: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/391752 (https://phabricator.wikimedia.org/T178359) [06:02:50] (03CR) 10Marostegui: [C: 032] db1101.yaml: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/391752 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [06:06:30] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0 [06:06:50] PROBLEM - 
Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0 [06:07:10] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1071 and db1099" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391753 [06:07:14] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1071 and db1099" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391753 [06:10:11] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1071 and db1099" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391753 (owner: 10Marostegui) [06:11:25] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1071 and db1099" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391753 (owner: 10Marostegui) [06:11:35] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1071 and db1099" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391753 (owner: 10Marostegui) [06:14:37] !log smalyshev@tin Started deploy [wdqs/wdqs@b44cf27]: data reload/T176593 [06:14:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:15:05] !log smalyshev@tin Finished deploy [wdqs/wdqs@b44cf27]: data reload/T176593 (duration: 00m 28s) [06:15:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:18:30] What's up with wikiversions.json locally modified in tin? [06:19:15] no_justification ^ [06:20:10] marostegui: had to roll back wikidatawiki. Forgot to commit. [06:20:33] Ideally committed but could last til morning long as nobody does a full scap [06:21:57] no_justification: but it is preventing me from rebasing other changes and deploying :-( [06:22:41] Just toss it in a local commit? I'm not near my laptop at all [06:22:45] I'm sorry [06:23:02] I'd rather wait than mess in that repo :-) [06:23:28] It's just wiki versions json right? [06:23:35] wikiversions.json yep [06:23:58] I promise it's safe to commit!
[06:24:11] I'll even +2 the change in Gerrit [06:30:51] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1099 and db1071 - T174569 (duration: 00m 49s) [06:30:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:30:59] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [06:34:23] (03PS1) 10Marostegui: db-eqiad.php: Depool db1096 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391754 (https://phabricator.wikimedia.org/T174569) [06:41:00] RECOVERY - MariaDB Slave Lag: m3 on db2012 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [06:41:05] (03PS1) 10Krinkle: wikidatawiki back to wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391755 [06:42:25] (03CR) 10Krinkle: [C: 032] wikidatawiki back to wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391755 (owner: 10Krinkle) [06:43:34] (03Merged) 10jenkins-bot: wikidatawiki back to wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391755 (owner: 10Krinkle) [06:43:56] no_justification: :) [06:46:20] (03CR) 10jenkins-bot: wikidatawiki back to wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391755 (owner: 10Krinkle) [06:46:57] (03PS2) 10Marostegui: db-eqiad.php: Depool db1096 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391754 (https://phabricator.wikimedia.org/T174569) [06:47:05] Krinkle: thanks! 
:) [06:49:17] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1096 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391754 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [06:50:09] (03CR) 10Smalyshev: [C: 031] wdqs: ensure blazegraph data file has correct ownership [puppet] - 10https://gerrit.wikimedia.org/r/391039 (owner: 10Gehel) [06:50:25] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1096 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391754 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [06:50:34] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1096 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391754 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [06:51:27] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1096 - T174569 (duration: 00m 48s) [06:51:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:51:33] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [06:54:57] !log Deploy alter table on db1096 - T174569 [06:55:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:55:58] Krinkle: ty [07:16:39] (03PS1) 10Marostegui: mariadb: Convert db1101 to multi-instance [puppet] - 10https://gerrit.wikimedia.org/r/391757 (https://phabricator.wikimedia.org/T178359) [07:17:34] !log ppchelko@tin Started deploy [eventlogging/eventbus@872cfb3]: Revert gerrit 302372 due to AssertionError T180017 [07:17:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:41] T180017: Timeouts on event delivery to EventBus - https://phabricator.wikimedia.org/T180017 [07:17:49] !log ppchelko@tin Finished deploy [eventlogging/eventbus@872cfb3]: Revert gerrit 302372 due to AssertionError T180017 (duration: 00m 14s) [07:17:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:27] (03Draft2) 10Jayprakash12345: Enable Single edit 
tab in Catalan Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391756 [07:19:17] (03PS3) 10Jayprakash12345: Enable Single edit tab in Catalan Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391756 (https://phabricator.wikimedia.org/T180660) [07:20:39] (03CR) 10Marostegui: [C: 032] mariadb: Convert db1101 to multi-instance [puppet] - 10https://gerrit.wikimedia.org/r/391757 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [07:35:41] (03PS1) 10Marostegui: db-eqiad.php: Depool db1034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391760 (https://phabricator.wikimedia.org/T178359) [07:37:24] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391760 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [07:38:33] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391760 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [07:38:43] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391760 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [07:39:53] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1034 - T178359 (duration: 00m 49s) [07:40:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:01] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359 [07:40:21] !log Stop MySQL on db1034 to copy its content to db1101.s7 - T178359 [07:40:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:44:56] (03PS5) 10Gehel: wdqs: ensure blazegraph data file has correct ownership [puppet] - 10https://gerrit.wikimedia.org/r/391039 [07:45:43] (03CR) 10Gehel: [C: 032] wdqs: ensure blazegraph data file has correct ownership [puppet] - 10https://gerrit.wikimedia.org/r/391039 (owner: 10Gehel) [07:46:51] 
10Operations, 10Domains, 10Traffic: Purchase domains mediawi.ki and media.wiki to use as a url shortener - https://phabricator.wikimedia.org/T180657#3765669 (10UpsandDowns1234) I like domain hacks, and they make lives a lot easier. (Goo.gl/e) or (google.com)? (Group.me) or (GroupMe.com)? [07:48:08] 10Operations, 10Domains, 10Traffic: Purchase domains mediawi.ki and media.wiki to use as a url shortener - https://phabricator.wikimedia.org/T180657#3765672 (10UpsandDowns1234) And w.wiki url shortening is disabled. Bummer. [07:49:28] 10Operations, 10Domains, 10Traffic: Purchase domains mediawi.ki and media.wiki to use as a url shortener - https://phabricator.wikimedia.org/T180657#3765673 (10UpsandDowns1234) And the idea with url shortening is that the article name will be the thing after the url so that it is easy to enter in. [07:50:38] (03PS1) 10Marostegui: db-eqiad.php: Repool db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391762 [07:52:06] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391762 (owner: 10Marostegui) [07:53:15] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391762 (owner: 10Marostegui) [07:53:27] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391762 (owner: 10Marostegui) [07:53:45] (03CR) 10Giuseppe Lavagetto: [C: 031] [WIP] Puppetize Netbox [puppet] - 10https://gerrit.wikimedia.org/r/387880 (https://phabricator.wikimedia.org/T170144) (owner: 10Ayounsi) [07:54:32] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1106 (duration: 00m 49s) [07:54:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:49] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [08:11:10] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 
208.80.154.197, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 [08:13:08] (03PS1) 10Marostegui: db-eqiad.php: Depool db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391763 (https://phabricator.wikimedia.org/T174569) [08:14:38] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391763 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [08:15:51] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391763 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [08:16:20] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391763 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [08:17:12] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1092 - T174569 (duration: 00m 49s) [08:17:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:19] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [08:18:59] !log Deploy schema change on db1092 - T174569 [08:19:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:15] (03CR) 10Muehlenhoff: [C: 031] adding jdrewniak to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/391732 (https://phabricator.wikimedia.org/T180639) (owner: 10RobH) [09:11:06] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Nice work! Overall this is a nice module/profile/role combo. 
Here's a first round of comments, most minor" (21 comments) [puppet] - https://gerrit.wikimedia.org/r/387880 (https://phabricator.wikimedia.org/T170144) (owner: Ayounsi) [09:18:50] (CR) Alexandros Kosiaris: [C: 1] etherpad,ganglia,tor_relay: use profile::base::firewall [puppet] - https://gerrit.wikimedia.org/r/391723 (owner: Dzahn) [09:19:28] !log upgrade grafana to 4.6.1 on https://grafana.wikimedia.org/ - T180428 [09:19:29] (CR) Elukey: First commit (10 comments) [software/druid_exporter] - https://gerrit.wikimedia.org/r/389475 (https://phabricator.wikimedia.org/T177459) (owner: Elukey) [09:19:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:35] T180428: Upgrade to latest Grafana 4.6 - https://phabricator.wikimedia.org/T180428 [09:21:55] (CR) Alexandros Kosiaris: [C: 1] lists,otrs: use profile::base::firewall [puppet] - https://gerrit.wikimedia.org/r/391722 (owner: Dzahn) [09:23:04] !log bootstrap restbase2002-c - T179422 [09:23:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:11] T179422: Reshape RESTBase Cassandra clusters - https://phabricator.wikimedia.org/T179422 [09:23:29] RECOVERY - Check systemd state on restbase2002 is OK: OK - running: The system is fully operational [09:23:39] RECOVERY - cassandra-c service on restbase2002 is OK: OK - cassandra-c is active [09:24:39] RECOVERY - cassandra-c SSL 10.192.16.167:7001 on restbase2002 is OK: SSL OK - Certificate restbase2002-c valid until 2018-08-17 16:11:46 +0000 (expires in 274 days) [09:30:00] !log uploaded icu57.1-6+wmf2 to jessie-wikimedia/component/icu57 [09:30:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:07] (PS18) Elukey: First commit [software/druid_exporter] - https://gerrit.wikimedia.org/r/389475 (https://phabricator.wikimedia.org/T177459) [09:33:26] (PS2) Addshore: Stop using extension-list-wikidata from Wikidata build for prod
[mediawiki-config] - https://gerrit.wikimedia.org/r/391251 (https://phabricator.wikimedia.org/T177060) [09:33:30] (PS3) Addshore: Stop using extension-list-wikidata from Wikidata build for prod [mediawiki-config] - https://gerrit.wikimedia.org/r/391251 (https://phabricator.wikimedia.org/T177060) [09:33:36] Operations, monitoring, Graphite, Performance-Team (Radar): Upgrade to latest Grafana 4.6 - https://phabricator.wikimedia.org/T180428#3765801 (fgiunchedi) Open>Resolved a:fgiunchedi The scope of this task is done, but let's followup on {T175708} on @volans' concerns re: sqlite scalabi... [09:36:28] (PS1) Elukey: role::aqs: restore hyperswitch's consistency to localQuorum [puppet] - https://gerrit.wikimedia.org/r/391765 (https://phabricator.wikimedia.org/T164348) [09:38:22] (CR) Elukey: [C: 2] "https://puppet-compiler.wmflabs.org/compiler02/8805/aqs1008.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/391765 (https://phabricator.wikimedia.org/T164348) (owner: Elukey) [09:38:49] !log rebooting prometheus servers in codfw for update to 4.9.51 [09:38:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:37] (CR) Filippo Giunchedi: "> see how it compiles fine and without differences on bast1001/2001" [puppet] - https://gerrit.wikimedia.org/r/353599 (owner: Dzahn) [09:42:17] jouncebot: next [09:42:18] In 2 hour(s) and 17 minute(s): Kill the Wikidata build (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171116T1200) [09:42:22] jouncebot: refresh [09:42:25] I refreshed my knowledge about deployments.
[09:42:27] jouncebot: next [09:42:27] In 0 hour(s) and 17 minute(s): Kill the Wikidata build (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171116T1000) [09:44:42] !log restart aqs on aqs1004 to apply localQuorum (https://gerrit.wikimedia.org/r/391765) - T164348 [09:44:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:49] T164348: Investigate the use of local_quorum for AQS - https://phabricator.wikimedia.org/T164348 [09:46:56] Operations, Domains, Traffic: Purchase domains mediawi.ki and media.wiki to use as a url shortener - https://phabricator.wikimedia.org/T180657#3765523 (Dzahn) Please see T88873, T88873#1691739 (and T105829) for a long history of buying .wiki domains (that never ended up being used). [09:53:55] (CR) Filippo Giunchedi: "> > see how it compiles fine and without differences on bast1001/2001" [puppet] - https://gerrit.wikimedia.org/r/353599 (owner: Dzahn) [09:55:17] (PS2) Dzahn: lists,otrs: use profile::base::firewall [puppet] - https://gerrit.wikimedia.org/r/391722 [09:55:35] (PS1) Marostegui: db1101: Moved it from s2 to s5 and s7 [software] - https://gerrit.wikimedia.org/r/391766 (https://phabricator.wikimedia.org/T178359) [09:56:04] (CR) Dzahn: [C: 2] lists,otrs: use profile::base::firewall [puppet] - https://gerrit.wikimedia.org/r/391722 (owner: Dzahn) [09:56:52] (CR) Dzahn: "thanks for confirming and making a ticket:) cool" [puppet] - https://gerrit.wikimedia.org/r/353599 (owner: Dzahn) [09:59:34] !log rebooting prometheus servers in eqiad for update to 4.9.51 [09:59:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:05] addshore: It is that lovely time of the day again! You are hereby commanded to deploy Kill the Wikidata build. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171116T1000). [10:00:05] No GERRIT patches in the queue for this window AFAICS.
[10:00:39] (CR) Marostegui: [C: 2] db1101: Moved it from s2 to s5 and s7 [software] - https://gerrit.wikimedia.org/r/391766 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui) [10:01:52] (Merged) jenkins-bot: db1101: Moved it from s2 to s5 and s7 [software] - https://gerrit.wikimedia.org/r/391766 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui) [10:02:06] o/ [10:02:15] * addshore waits for things to merge [10:02:34] (PS2) Dzahn: etherpad,ganglia,tor_relay: use profile::base::firewall [puppet] - https://gerrit.wikimedia.org/r/391723 [10:07:41] (CR) Dzahn: [C: 2] etherpad,ganglia,tor_relay: use profile::base::firewall [puppet] - https://gerrit.wikimedia.org/r/391723 (owner: Dzahn) [10:09:43] (PS1) Marostegui: db-eqiad.php: Repool db1034 with low weight [mediawiki-config] - https://gerrit.wikimedia.org/r/391774 [10:09:55] (CR) Marostegui: [C: -2] "Wait for the lag to be gone" [mediawiki-config] - https://gerrit.wikimedia.org/r/391774 (owner: Marostegui) [10:14:25] Operations, puppet-compiler: puppet compiler fail compilation on manifests using puppetdb - https://phabricator.wikimedia.org/T180671#3766021 (Joe) [10:15:49] !log addshore@tin Synchronized php-1.31.0-wmf.8/extensions/WikibaseQualityConstraints: T180634 Bring WikibaseQualityConstraints up to date with .8 branch of Wikidata build extension (duration: 00m 53s) [10:15:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:56] T180634: Bring extensions that were in the Wikidata build up to date with the code currently in the build - https://phabricator.wikimedia.org/T180634 [10:17:02] akosiaris: re: the firewall includes. it was no-op on everything until etherpad.. unexpectedly: /Stage[main]/Sysctl/File[/etc/sysctl.d/70-ferm_conntrack.conf]/ensure: removed .. ehmm...
[10:17:17] tries to figure out why [10:18:15] mutante: if you have 5 seconds today could you have a look at why my shell scripts with +x in git in puppet for my home dir landed on the servers with no +x? :( [10:19:01] addshore: sounds like https://gerrit.wikimedia.org/r/#/c/377056/ [10:19:23] oooh, that looks about right! [10:19:55] (CR) Addshore: [C: 1] user homes: Allow git to control +x for $HOME files [puppet] - https://gerrit.wikimedia.org/r/377056 (owner: BryanDavis) [10:21:04] mutante: looks like some race ? I just ran puppet and: Notice: /Stage[main]/Base::Firewall/Sysctl::Parameters[ferm_conntrack]/Sysctl::Conffile[ferm_conntrack]/File[/etc/sysctl.d/70-ferm_conntrack.conf]/ensure: created [10:21:31] akosiaris: ooh.. so maybe it has nothing to do with my change then [10:21:59] yeah and probably nothing to do with etherpad either [10:22:09] right.. hmm [10:22:48] it's a noop now though [10:23:06] I 've ran it like 10 times in a row and all have been noops [10:23:23] there is still /etc/ferm/conntrack-sysctl.conf [10:23:40] ok, thanks for checking [10:24:45] !log addshore@tin Synchronized php-1.31.0-wmf.8/extensions/Wikibase: T180634 Bring Wikibase up to date with .8 branch of Wikidata build extension (duration: 01m 46s) [10:24:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:51] T180634: Bring extensions that were in the Wikidata build up to date with the code currently in the build - https://phabricator.wikimedia.org/T180634 [10:26:49] !log Kill the build deploy slot done!
[10:26:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:40] (CR) Marostegui: [C: 2] db-eqiad.php: Repool db1034 with low weight [mediawiki-config] - https://gerrit.wikimedia.org/r/391774 (owner: Marostegui) [10:29:22] (Merged) jenkins-bot: db-eqiad.php: Repool db1034 with low weight [mediawiki-config] - https://gerrit.wikimedia.org/r/391774 (owner: Marostegui) [10:29:35] (CR) jenkins-bot: db-eqiad.php: Repool db1034 with low weight [mediawiki-config] - https://gerrit.wikimedia.org/r/391774 (owner: Marostegui) [10:30:36] (PS19) Elukey: First commit [software/druid_exporter] - https://gerrit.wikimedia.org/r/389475 (https://phabricator.wikimedia.org/T177459) [10:30:47] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1034 with low weight - T178359 (duration: 00m 48s) [10:30:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:56] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359 [10:32:23] (CR) Elukey: "Removed the debian/ dir, it will be added to a separate debian branch."
[software/druid_exporter] - https://gerrit.wikimedia.org/r/389475 (https://phabricator.wikimedia.org/T177459) (owner: Elukey) [10:32:43] (CR) Alexandros Kosiaris: [C: -1] user homes: Allow git to control +x for $HOME files (1 comment) [puppet] - https://gerrit.wikimedia.org/r/377056 (owner: BryanDavis) [10:35:13] (PS1) Jcrespo: wikireplicas: Point all labsdb replica hosts to labsdb1011 [puppet] - https://gerrit.wikimedia.org/r/391783 (https://phabricator.wikimedia.org/T179244) [10:35:52] (PS2) Dzahn: piwik: move firewall to role, include vs instantiate [puppet] - https://gerrit.wikimedia.org/r/391744 [10:36:21] (CR) Dzahn: [C: 2] piwik: move firewall to role, include vs instantiate [puppet] - https://gerrit.wikimedia.org/r/391744 (owner: Dzahn) [10:36:27] (CR) Marostegui: [C: 1] wikireplicas: Point all labsdb replica hosts to labsdb1011 [puppet] - https://gerrit.wikimedia.org/r/391783 (https://phabricator.wikimedia.org/T179244) (owner: Jcrespo) [10:36:51] thanks mutante ! [10:37:51] RECOVERY - Check Varnish expiry mailbox lag on cp4026 is OK: OK: expiry mailbox lag is 0 [10:38:24] elukey: :) no-op on bohrium [10:40:17] (PS1) Marostegui: db-eqiad.php: Increase traffic for db1034 [mediawiki-config] - https://gerrit.wikimedia.org/r/391784 [10:42:01] (PS2) Jcrespo: wikireplicas: Point all labsdb replica hosts to labsdb1011 [puppet] - https://gerrit.wikimedia.org/r/391783 (https://phabricator.wikimedia.org/T179244) [10:42:42] (CR) Jcrespo: [C: 2] wikireplicas: Point all labsdb replica hosts to labsdb1011 [puppet] - https://gerrit.wikimedia.org/r/391783 (https://phabricator.wikimedia.org/T179244) (owner: Jcrespo) [10:45:39] Operations, Continuous-Integration-Infrastructure (shipyard), Patch-For-Review, User-Joe: Unify production and CI docker image build process - https://phabricator.wikimedia.org/T177276#3766088 (Joe) Thanks @thcipriani for converting the use of ops/puppet already!
[10:49:14] (PS1) Lucas Werkmeister (WMDE): Add rudimentary WBQC configuration on testwikidatawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/391785 [10:49:38] (PS3) DCausse: [cirrus] Add overridden iw prefix for svwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/389986 (https://phabricator.wikimedia.org/T177913) [10:50:39] (CR) Gehel: [C: 1] "LGTM for wdqs and elasticsearch" [puppet] - https://gerrit.wikimedia.org/r/391236 (https://phabricator.wikimedia.org/T170353) (owner: Volans) [10:51:27] (PS2) Addshore: Add rudimentary WBQC configuration on testwikidatawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/391785 (https://phabricator.wikimedia.org/T180665) (owner: Lucas Werkmeister (WMDE)) [10:53:25] (PS1) Marostegui: tools.my.cnf: Remove filters for s51290 [puppet] - https://gerrit.wikimedia.org/r/391789 (https://phabricator.wikimedia.org/T180560) [10:54:11] (PS2) Marostegui: tools.my.cnf: Remove filters for s51290 [puppet] - https://gerrit.wikimedia.org/r/391789 (https://phabricator.wikimedia.org/T180560) [10:55:02] (CR) Marostegui: [C: 2] tools.my.cnf: Remove filters for s51290 [puppet] - https://gerrit.wikimedia.org/r/391789 (https://phabricator.wikimedia.org/T180560) (owner: Marostegui) [10:58:20] (CR) Marostegui: [C: 2] db-eqiad.php: Increase traffic for db1034 [mediawiki-config] - https://gerrit.wikimedia.org/r/391784 (owner: Marostegui) [10:58:24] Operations, Patch-For-Review: Revisit Pybal depool thresholds for app servers - https://phabricator.wikimedia.org/T178799#3766127 (Joe) To summarize the historical reasons of those values: - Appservers used to be super-overloaded at times, so we had a very steep depool threshold to avoid thundering herd...
[10:58:46] (CR) Addshore: [C: 2] Add rudimentary WBQC configuration on testwikidatawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/391785 (https://phabricator.wikimedia.org/T180665) (owner: Lucas Werkmeister (WMDE)) [11:01:17] !log shutting down labsdb1010 to clone to labsdb1009 [11:01:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:29] addshore: sorry, didn't realise you had a deployment slot, my change will be merged, but I can wait to deploy [11:02:40] marostegui: thats fine! [11:03:00] I'll sync mine and leave you to sync yours :) [11:03:04] can I deploy then? it shouldn't take more than a minute :) [11:03:07] yup! [11:03:10] (Merged) jenkins-bot: db-eqiad.php: Increase traffic for db1034 [mediawiki-config] - https://gerrit.wikimedia.org/r/391784 (owner: Marostegui) [11:03:11] awesome! [11:03:15] doing it [11:03:23] (CR) jenkins-bot: db-eqiad.php: Increase traffic for db1034 [mediawiki-config] - https://gerrit.wikimedia.org/r/391784 (owner: Marostegui) [11:03:28] (Merged) jenkins-bot: Add rudimentary WBQC configuration on testwikidatawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/391785 (https://phabricator.wikimedia.org/T180665) (owner: Lucas Werkmeister (WMDE)) [11:03:36] (CR) jenkins-bot: Add rudimentary WBQC configuration on testwikidatawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/391785 (https://phabricator.wikimedia.org/T180665) (owner: Lucas Werkmeister (WMDE)) [11:04:17] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase db1034 weight - T178359 (duration: 00m 48s) [11:04:18] addshore: all yours - thanks again! [11:04:23] thanks!
[11:04:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:24] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359 [11:09:42] !log rebooting debug proxies (hassium/hassaleh) for update to 4.9.51 [11:09:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:30] !log upgrade prometheus to 1.8.1+ds+k8s-1 in ulsfo/esams/eqiad - T177395 [11:21:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:37] T177395: Improve monitoring of the Kubernetes clusters - https://phabricator.wikimedia.org/T177395 [11:24:57] (PS1) Lucas Werkmeister (WMDE): Fix WBQC configuration for testwikidatawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/391793 (https://phabricator.wikimedia.org/T180665) [11:25:47] (CR) Addshore: [C: 2] Fix WBQC configuration for testwikidatawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/391793 (https://phabricator.wikimedia.org/T180665) (owner: Lucas Werkmeister (WMDE)) [11:26:56] (Merged) jenkins-bot: Fix WBQC configuration for testwikidatawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/391793 (https://phabricator.wikimedia.org/T180665) (owner: Lucas Werkmeister (WMDE)) [11:27:05] (CR) jenkins-bot: Fix WBQC configuration for testwikidatawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/391793 (https://phabricator.wikimedia.org/T180665) (owner: Lucas Werkmeister (WMDE)) [11:30:45] (CR) Alexandros Kosiaris: [C: -1] user homes: Allow git to control +x for $HOME files (1 comment) [puppet] - https://gerrit.wikimedia.org/r/377056 (owner: BryanDavis) [11:32:23] !log addshore@terbium:~$ mwscript extensions/WikibaseQualityConstraints/maintenance/ImportConstraintStatements.php --wiki=testwikidatawiki > importConstraintStatements.log [11:32:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:22] !log addshore@tin Synchronized
wmf-config/Wikibase-production.php: testwikidata only, T180665, WBQC configuration for testwikidatawiki (duration: 00m 49s) [11:37:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:30] T180665: Warning: preg_match() expects parameter 2 to be string, array given in /srv/mediawiki/php-1.31.0-wmf.8/extensions/Wikidata/extensions/Constraints/includes/ConstraintCheck/Helper/SparqlHelper.php on line 462 - https://phabricator.wikimedia.org/T180665 [11:40:06] (PS1) Giuseppe Lavagetto: lvs::configuration: standardize depool thresholds for mw servers [puppet] - https://gerrit.wikimedia.org/r/391797 (https://phabricator.wikimedia.org/T178799) [11:42:26] !log installing openssl updates on conf* clusters [11:42:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:13] (PS1) Elukey: [WIP] profile::redids::jobqueue: stagger redis slave restarts [puppet] - https://gerrit.wikimedia.org/r/391798 [11:54:24] this is horrible and I am only seeing pcc outputs :) [11:54:55] PROBLEM - DPKG on conf1004 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:55:55] RECOVERY - DPKG on conf1004 is OK: All packages OK [11:56:17] (PS2) Filippo Giunchedi: prometheus: add redis jobs [puppet] - https://gerrit.wikimedia.org/r/391024 (https://phabricator.wikimedia.org/T148637) [11:57:55] PROBLEM - DPKG on conf1005 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:58:04] !log addshore@tin Synchronized php-1.31.0-wmf.8/extensions/WikibaseQualityConstraints: T180665 [[gerrit:391787|Fix SparqlHelper::getCacheMaxAge()]] (duration: 00m 53s) [11:58:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:12] T180665: Warning: preg_match() expects parameter 2 to be string, array given in /srv/mediawiki/php-1.31.0-wmf.8/extensions/Wikidata/extensions/Constraints/includes/ConstraintCheck/Helper/SparqlHelper.php on line 462 - https://phabricator.wikimedia.org/T180665 [11:58:55]
RECOVERY - DPKG on conf1005 is OK: All packages OK [12:00:21] (PS2) Elukey: [WIP] profile::redis::jobqueue: stagger redis slave restarts [puppet] - https://gerrit.wikimedia.org/r/391798 [12:00:52] (CR) jerkins-bot: [V: -1] [WIP] profile::redis::jobqueue: stagger redis slave restarts [puppet] - https://gerrit.wikimedia.org/r/391798 (owner: Elukey) [12:01:16] RECOVERY - cassandra-c CQL 10.192.16.167:9042 on restbase2002 is OK: TCP OK - 0.036 second response time on 10.192.16.167 port 9042 [12:01:58] (CR) Filippo Giunchedi: [C: 2] "Currently untestable with pcc https://puppet-compiler.wmflabs.org/compiler03/8809/" [puppet] - https://gerrit.wikimedia.org/r/391024 (https://phabricator.wikimedia.org/T148637) (owner: Filippo Giunchedi) [12:03:01] mobrovac: ^ 2002-c just finished fyi [12:03:01] !log addshore@tin Synchronized php-1.31.0-wmf.8/extensions/Wikidata/extensions/Constraints: T180665 [[gerrit:391790|Manually applied: Fix SparqlHelper::getCacheMaxAge()]] (duration: 00m 52s) [12:03:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:56] (PS3) Elukey: [WIP] profile::redis::jobqueue: stagger redis slave restarts [puppet] - https://gerrit.wikimedia.org/r/391798 [12:05:13] godog: yup, just seen it! yay :) [12:07:13] (Abandoned) Hoo man: Declare requirements in mediawiki::maintenance::wikidata [puppet] - https://gerrit.wikimedia.org/r/386662 (owner: Hoo man) [12:09:25] PROBLEM - puppet last run on rdb1002 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [12:09:26] (PS1) Filippo Giunchedi: profile: fix prometheus::redis_exporter invocation [puppet] - https://gerrit.wikimedia.org/r/391802 [12:09:42] rdb puppet failures is me, fixing with ^ [12:09:42] !log rebooting dbmonitor* hosts for update to 4.9.51 [12:09:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:17] (CR) Filippo Giunchedi: [C: 2] profile: fix prometheus::redis_exporter invocation [puppet] - https://gerrit.wikimedia.org/r/391802 (owner: Filippo Giunchedi) [12:10:55] PROBLEM - puppet last run on maps1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:11:42] ditto maps, should be recovering [12:11:56] PROBLEM - puppet last run on rdb1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:12:05] PROBLEM - puppet last run on rdb2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:14:25] RECOVERY - puppet last run on rdb1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:15:55] RECOVERY - puppet last run on maps1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:16:55] PROBLEM - puppet last run on maps-test2004 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [12:16:56] RECOVERY - puppet last run on rdb1004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:17:05] RECOVERY - puppet last run on rdb2002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:17:12] (CR) Hoo man: [C: 2] Work around HHVM bug by using XMLWriter::writeAttribute() [dumps/dcat] - https://gerrit.wikimedia.org/r/391489 (https://phabricator.wikimedia.org/T117534) (owner: Tim Starling) [12:21:38] (CR) Hoo man: [V: 2 C: 2] Work around HHVM bug by using XMLWriter::writeAttribute() [dumps/dcat] - https://gerrit.wikimedia.org/r/391489 (https://phabricator.wikimedia.org/T117534) (owner: Tim Starling) [12:23:04] (CR) Hoo man: [V: 2 C: 2] "Confirmed to produce the exact same output (compared to the deployed version) on both php 5.5 and hhvm 3.12.7 when run with production dat" [dumps/dcat] - https://gerrit.wikimedia.org/r/391489 (https://phabricator.wikimedia.org/T117534) (owner: Tim Starling) [12:25:57] (PS1) Alexandros Kosiaris: Add k8s::kubeconfig define [puppet] - https://gerrit.wikimedia.org/r/391804 (https://phabricator.wikimedia.org/T177393) [12:26:01] (PS1) Alexandros Kosiaris: Add parameter for kubelet's kubeconfig [puppet] - https://gerrit.wikimedia.org/r/391805 (https://phabricator.wikimedia.org/T177393) [12:26:03] (PS1) Alexandros Kosiaris: Add kubeconfig parameter to k8s::proxy [puppet] - https://gerrit.wikimedia.org/r/391806 (https://phabricator.wikimedia.org/T177393) [12:26:05] (PS1) Alexandros Kosiaris: Remove unused cluster_dns_ip kubelet parameter [puppet] - https://gerrit.wikimedia.org/r/391807 [12:27:54] !log Updated operations/dumps/dcat (ea4e75..7734e04) on snapshot1007 [12:27:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:14] (CR) Hoo man: "Change has been deployed" [dumps/dcat] - https://gerrit.wikimedia.org/r/391489
(https://phabricator.wikimedia.org/T117534) (owner: Tim Starling) [12:30:57] (PS1) Hoo man: Snapshot: Use canonical php (= hhvm) for dcat [puppet] - https://gerrit.wikimedia.org/r/391808 (https://phabricator.wikimedia.org/T117534) [12:31:03] (CR) MarcoAurelio: "You should schedule this for SWAT if you want this merged." [mediawiki-config] - https://gerrit.wikimedia.org/r/370791 (https://phabricator.wikimedia.org/T101983) (owner: TerraCodes) [12:31:18] (CR) jerkins-bot: [V: -1] Snapshot: Use canonical php (= hhvm) for dcat [puppet] - https://gerrit.wikimedia.org/r/391808 (https://phabricator.wikimedia.org/T117534) (owner: Hoo man) [12:32:12] (PS2) Hoo man: Snapshot: Use canonical php (= hhvm) for dcat [puppet] - https://gerrit.wikimedia.org/r/391808 (https://phabricator.wikimedia.org/T117534) [12:33:17] Operations, Dumps-Generation, HHVM, Patch-For-Review: Convert snapshot hosts to use HHVM and trusty - https://phabricator.wikimedia.org/T94277#3766346 (hoo) [12:34:41] !log installing openssl updates on memcached/redis clusters [12:34:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:55] RECOVERY - puppet last run on maps-test2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:51:23] Operations, wikidiff2, Patch-For-Review, User-Addshore, and 2 others: Update and use php-wikidiff2 1.5.1 & MovedParagraphDetectionCutoff in production - https://phabricator.wikimedia.org/T177891#3674094 (matmarex) >>! In T177891#3765457, @Legoktm wrote: > Also the RSS feeds just show the dots sin...
[12:54:30] !log rebooting radium (tor relay) for update to 4.9.51 [12:54:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:07] (PS1) Marostegui: db-eqiad.php: Increase traffic for db1034 [mediawiki-config] - https://gerrit.wikimedia.org/r/391810 [12:59:25] (CR) Marostegui: [C: 2] db-eqiad.php: Increase traffic for db1034 [mediawiki-config] - https://gerrit.wikimedia.org/r/391810 (owner: Marostegui) [13:00:36] (Merged) jenkins-bot: db-eqiad.php: Increase traffic for db1034 [mediawiki-config] - https://gerrit.wikimedia.org/r/391810 (owner: Marostegui) [13:00:45] (CR) jenkins-bot: db-eqiad.php: Increase traffic for db1034 [mediawiki-config] - https://gerrit.wikimedia.org/r/391810 (owner: Marostegui) [13:01:58] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase db1034 weight - T178359 (duration: 00m 49s) [13:02:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:05] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359 [13:07:12] Operations, wikidiff2, Patch-For-Review, User-Addshore, and 2 others: Update and use php-wikidiff2 1.5.1 & MovedParagraphDetectionCutoff in production - https://phabricator.wikimedia.org/T177891#3766467 (WMDE-Fisch) > > Also the RSS feeds just show the dots since no styles are applied: https://ww...
[13:07:45] !log restart aqs on aqs100[5-9] to apply localQuorum (https://gerrit.wikimedia.org/r/391765) - T164348 [13:07:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:51] T164348: Investigate the use of local_quorum for AQS - https://phabricator.wikimedia.org/T164348 [13:12:19] !log installing openssl updates on pc* hosts [13:12:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:19] (CR) ArielGlenn: [C: 2] Snapshot: Use canonical php (= hhvm) for dcat [puppet] - https://gerrit.wikimedia.org/r/391808 (https://phabricator.wikimedia.org/T117534) (owner: Hoo man) [13:23:58] (PS4) Elukey: [WIP] profile::redis::jobqueue: stagger redis slave restarts [puppet] - https://gerrit.wikimedia.org/r/391798 [13:24:24] (CR) jerkins-bot: [V: -1] [WIP] profile::redis::jobqueue: stagger redis slave restarts [puppet] - https://gerrit.wikimedia.org/r/391798 (owner: Elukey) [13:24:32] !log rebooting dubnium/pollux (openldap corp mirror) for update to 4.9.51 [13:24:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:28] (PS5) Elukey: [WIP] profile::redis::jobqueue: stagger redis slave restarts [puppet] - https://gerrit.wikimedia.org/r/391798 [13:33:04] !log installing openssl updates on es* hosts [13:33:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:44] (CR) Elukey: [C: 1] "One note about a eventlogging graph but the analytics part LGTM! Thanks!"
(1 comment) [puppet] - https://gerrit.wikimedia.org/r/391236 (https://phabricator.wikimedia.org/T170353) (owner: Volans) [13:46:56] !log rebooting bohrium (piwik host) for update to 4.9.51 [13:47:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:05] (PS1) Muehlenhoff: Extend Cumin alias with analytics_cluster::coordinator [puppet] - https://gerrit.wikimedia.org/r/391822 [13:59:05] (PS2) Muehlenhoff: Extend Cumin alias with analytics_cluster::coordinator [puppet] - https://gerrit.wikimedia.org/r/391822 [14:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: (Dis)respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171116T1400). Please do the needful. [14:00:04] dcausse: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:13] o/ [14:00:55] (CR) Muehlenhoff: [C: 2] Extend Cumin alias with analytics_cluster::coordinator [puppet] - https://gerrit.wikimedia.org/r/391822 (owner: Muehlenhoff) [14:02:54] !log starting upgrade of elasticsearch eqiad - T178411 [14:03:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:01] T178411: Upgrade cirrus elasticsearch clusters to 5.5.x - https://phabricator.wikimedia.org/T178411 [14:03:43] I can SWAT [14:03:59] zeljkof: no objections?
^ [14:07:46] (PS4) Rush: openstack2: no Icinga paging (SMS) if on labtest [puppet] - https://gerrit.wikimedia.org/r/384892 (https://phabricator.wikimedia.org/T178008) (owner: Dzahn) [14:08:03] ok starting to swat my patch unless someone objects [14:08:36] (CR) DCausse: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/389986 (https://phabricator.wikimedia.org/T177913) (owner: DCausse) [14:09:48] (Merged) jenkins-bot: [cirrus] Add overridden iw prefix for svwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/389986 (https://phabricator.wikimedia.org/T177913) (owner: DCausse) [14:12:52] (CR) jenkins-bot: [cirrus] Add overridden iw prefix for svwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/389986 (https://phabricator.wikimedia.org/T177913) (owner: DCausse) [14:16:44] !log dcausse@tin Synchronized wmf-config/InitialiseSettings.php: T177913: [cirrus] Add overridden iw prefix for svwiki (1/2) (duration: 00m 50s) [14:16:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:51] T177913: Cannot click svwikisource search results when running a crosswiki search on svwiki - https://phabricator.wikimedia.org/T177913 [14:17:24] (PS1) Alexandros Kosiaris: k8s_infrastructure_users: Set kubelet instead of node [labs/private] - https://gerrit.wikimedia.org/r/391823 [14:18:29] (CR) Rush: [C: 2] openstack2: no Icinga paging (SMS) if on labtest [puppet] - https://gerrit.wikimedia.org/r/384892 (https://phabricator.wikimedia.org/T178008) (owner: Dzahn) [14:18:54] !log dcausse@tin Synchronized wmf-config/CirrusSearch-common.php: T177913: [cirrus] Add overridden iw prefix for svwiki (2/2) (duration: 00m 48s) [14:19:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:18] (CR) Rush: "ran this through puppet compiler and went ahead, I didn't think you would mind Daniel. Seems fine."
[puppet] - 10https://gerrit.wikimedia.org/r/384892 (https://phabricator.wikimedia.org/T178008) (owner: 10Dzahn) [14:20:23] (03PS2) 10Herron: puppet: point codfw mediawiki canary appservers at puppetmaster2001 [puppet] - 10https://gerrit.wikimedia.org/r/391646 (https://phabricator.wikimedia.org/T177254) [14:23:05] (03CR) 10Giuseppe Lavagetto: [C: 031] puppet: point codfw mediawiki canary appservers at puppetmaster2001 [puppet] - 10https://gerrit.wikimedia.org/r/391646 (https://phabricator.wikimedia.org/T177254) (owner: 10Herron) [14:23:11] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] k8s_infrastructure_users: Set kubelet instead of node [labs/private] - 10https://gerrit.wikimedia.org/r/391823 (owner: 10Alexandros Kosiaris) [14:23:38] !log rebooting kubetcd* to 4.9.51 [14:23:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:37] (03PS1) 10Rush: labstore: rsync server on misc (dumps hosting) [puppet] - 10https://gerrit.wikimedia.org/r/391824 (https://phabricator.wikimedia.org/T165136) [14:26:52] (03PS2) 10Rush: labstore: rsync server on misc (dumps hosting) [puppet] - 10https://gerrit.wikimedia.org/r/391824 (https://phabricator.wikimedia.org/T165136) [14:26:55] (03CR) 10jerkins-bot: [V: 04-1] labstore: rsync server on misc (dumps hosting) [puppet] - 10https://gerrit.wikimedia.org/r/391824 (https://phabricator.wikimedia.org/T165136) (owner: 10Rush) [14:26:59] !log EU swat done [14:27:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:39] (03CR) 10Lokal Profil: [C: 032] Add PHP linter [dumps/dcat] - 10https://gerrit.wikimedia.org/r/390999 (https://phabricator.wikimedia.org/T180328) (owner: 10Hashar) [14:28:03] (03CR) 10Lokal Profil: [C: 032] Add linters for i18n json files [dumps/dcat] - 10https://gerrit.wikimedia.org/r/390994 (https://phabricator.wikimedia.org/T180328) (owner: 10Hashar) [14:28:05] (03PS3) 10Rush: labstore: rsync server on misc (dumps hosting) [puppet] - 
10https://gerrit.wikimedia.org/r/391824 (https://phabricator.wikimedia.org/T165136) [14:29:53] (03CR) 10Rush: "self note: need to remember to add rules for drbd things before this lands on labstore1004/5" [puppet] - 10https://gerrit.wikimedia.org/r/353508 (https://phabricator.wikimedia.org/T165136) (owner: 10Muehlenhoff) [14:33:01] (03PS2) 10Dzahn: icinga: reorder includes, use profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/391739 [14:33:36] (03CR) 10Dzahn: "i don't mind at all, thanks for merging :)" [puppet] - 10https://gerrit.wikimedia.org/r/384892 (https://phabricator.wikimedia.org/T178008) (owner: 10Dzahn) [14:33:49] (03CR) 10Rush: "I'm confusing with https://phabricator.wikimedia.org/T167114#3673749" [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/357369 (https://phabricator.wikimedia.org/T167114) (owner: 10Ladsgroup) [14:34:16] (03PS3) 10Rush: rabbitmq: add a giant default config [puppet] - 10https://gerrit.wikimedia.org/r/375822 (https://phabricator.wikimedia.org/T170492) (owner: 10Andrew Bogott) [14:34:56] (03PS1) 10Marostegui: db-eqiad.php: Increase traffic for db1034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391826 [14:35:02] (03CR) 10Rush: "I /think/ with all of the changes in procurement this is an unused code path now. 
I'll ask robh" [puppet] - 10https://gerrit.wikimedia.org/r/357354 (https://phabricator.wikimedia.org/T159043) (owner: 1020after4) [14:37:05] (03PS3) 10Dzahn: toollabs/icinga: no paging if on labtest [puppet] - 10https://gerrit.wikimedia.org/r/384893 (https://phabricator.wikimedia.org/T178008) [14:37:54] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase traffic for db1034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391826 (owner: 10Marostegui) [14:38:43] (03CR) 10Dzahn: [C: 032] "ok, cool, i'll merge this one too per "doesn't mean it's not a good idea to do" :)" [puppet] - 10https://gerrit.wikimedia.org/r/384893 (https://phabricator.wikimedia.org/T178008) (owner: 10Dzahn) [14:39:07] (03Merged) 10jenkins-bot: db-eqiad.php: Increase traffic for db1034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391826 (owner: 10Marostegui) [14:39:16] (03CR) 10jenkins-bot: db-eqiad.php: Increase traffic for db1034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391826 (owner: 10Marostegui) [14:39:47] (03CR) 10Rush: "I think https://grafana.wikimedia.org/dashboard/db/labs-monitoring?refresh=5m&orgId=1&var-LabstoreServers=labstore1003 is an orphaned dash" [puppet] - 10https://gerrit.wikimedia.org/r/391236 (https://phabricator.wikimedia.org/T170353) (owner: 10Volans) [14:40:48] (03PS3) 10Dzahn: icinga: reorder includes, use profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/391739 [14:40:57] (03CR) 10Muehlenhoff: labstore: rsync server on misc (dumps hosting) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/391824 (https://phabricator.wikimedia.org/T165136) (owner: 10Rush) [14:41:16] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase db1034 weight - T178359 (duration: 00m 49s) [14:41:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:24] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359 [14:42:33] 10Operations, 10Phabricator, 10Traffic, 10Zero: 
Missing IP addresses for Maroc Telecom - https://phabricator.wikimedia.org/T174342#3766633 (10Aklapper) 105.66.130.22 is another Moroccan mobile (?) IP that could register on Phab. Might welcome investigation too. [14:43:40] (03PS1) 10Elukey: profile::mariadb::misc::eventlogging:replication: add EL sanitization cron [puppet] - 10https://gerrit.wikimedia.org/r/391828 (https://phabricator.wikimedia.org/T156933) [14:44:05] (03CR) 10jerkins-bot: [V: 04-1] profile::mariadb::misc::eventlogging:replication: add EL sanitization cron [puppet] - 10https://gerrit.wikimedia.org/r/391828 (https://phabricator.wikimedia.org/T156933) (owner: 10Elukey) [14:45:27] uff [14:45:34] (03PS1) 10Alexandros Kosiaris: kubernetes::master: add interface::add_ip6_mapped [puppet] - 10https://gerrit.wikimedia.org/r/391829 [14:45:38] (03PS2) 10Elukey: profile::mariadb::misc::eventlogging:replication: add EL sanitization cron [puppet] - 10https://gerrit.wikimedia.org/r/391828 (https://phabricator.wikimedia.org/T156933) [14:46:14] (03CR) 10jerkins-bot: [V: 04-1] kubernetes::master: add interface::add_ip6_mapped [puppet] - 10https://gerrit.wikimedia.org/r/391829 (owner: 10Alexandros Kosiaris) [14:48:04] (03CR) 10Dzahn: [C: 032] "no-op, just cosmetic http://puppet-compiler.wmflabs.org/8815/" [puppet] - 10https://gerrit.wikimedia.org/r/391739 (owner: 10Dzahn) [14:48:25] 10Operations, 10Gerrit, 10Readers-Web-Backlog, 10Patch-For-Review, and 3 others: [spike] Temporarily allow pushing large objects - https://phabricator.wikimedia.org/T178189#3766646 (10akosiaris) > If we're all agreed with this as a way forward, then this task and {T178570} can be resolved. Agreed on my pa... [14:50:00] !log updating puppet compiler's facts (following https://wikitech.wikimedia.org/w/index.php?title=Nova_Resource:Puppet3-diffs#FAQ) [14:50:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:51] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] "I am overriding jenkins on this. 
This -1 is from the wmf style guide about interface::add_ip6_mapped { 'main': } being used in site.pp, wh" [puppet] - 10https://gerrit.wikimedia.org/r/391829 (owner: 10Alexandros Kosiaris) [14:50:58] (03PS2) 10Alexandros Kosiaris: kubernetes::master: add interface::add_ip6_mapped [puppet] - 10https://gerrit.wikimedia.org/r/391829 [14:51:01] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] kubernetes::master: add interface::add_ip6_mapped [puppet] - 10https://gerrit.wikimedia.org/r/391829 (owner: 10Alexandros Kosiaris) [14:51:18] (03PS1) 10Marostegui: db-eqiad.php: Depool db1051 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391830 (https://phabricator.wikimedia.org/T177208) [14:52:31] (03PS3) 10Dzahn: cumin: update aliases for "nonprod" testing roles [puppet] - 10https://gerrit.wikimedia.org/r/391721 [14:53:05] (03CR) 10Dzahn: [C: 032] "checked with traffic team, these count as non-prod" [puppet] - 10https://gerrit.wikimedia.org/r/391721 (owner: 10Dzahn) [14:53:58] (03CR) 10Jcrespo: [C: 031] db-eqiad.php: Depool db1051 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391830 (https://phabricator.wikimedia.org/T177208) (owner: 10Marostegui) [14:54:05] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1051 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391830 (https://phabricator.wikimedia.org/T177208) (owner: 10Marostegui) [14:55:59] !log rebooting serpens (slapd) for update to 4.9.51 [14:56:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:29] (03PS4) 10Rush: labstore: rsync server on misc (dumps hosting) [puppet] - 10https://gerrit.wikimedia.org/r/391824 (https://phabricator.wikimedia.org/T165136) [14:58:20] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler02/8818/" [puppet] - 10https://gerrit.wikimedia.org/r/391828 (https://phabricator.wikimedia.org/T156933) (owner: 10Elukey) [14:58:28] (03PS5) 10Rush: labstore: rsync server on misc (dumps hosting) [puppet] - 
10https://gerrit.wikimedia.org/r/391824 (https://phabricator.wikimedia.org/T165136) [14:58:40] (03PS6) 10Rush: labstore: rsync server on misc (dumps hosting) [puppet] - 10https://gerrit.wikimedia.org/r/391824 (https://phabricator.wikimedia.org/T165136) [15:00:26] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1051 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391830 (https://phabricator.wikimedia.org/T177208) (owner: 10Marostegui) [15:00:29] (03PS2) 10Alexandros Kosiaris: Add k8s::kubeconfig define [puppet] - 10https://gerrit.wikimedia.org/r/391804 (https://phabricator.wikimedia.org/T177393) [15:00:31] (03PS2) 10Alexandros Kosiaris: Add parameter for kubelet's kubeconfig [puppet] - 10https://gerrit.wikimedia.org/r/391805 (https://phabricator.wikimedia.org/T177393) [15:00:33] (03PS2) 10Alexandros Kosiaris: Add kubeconfig parameter to k8s::proxy [puppet] - 10https://gerrit.wikimedia.org/r/391806 (https://phabricator.wikimedia.org/T177393) [15:00:35] (03PS2) 10Alexandros Kosiaris: Remove unused cluster_dns_ip kubelet parameter [puppet] - 10https://gerrit.wikimedia.org/r/391807 [15:00:37] (03CR) 10Herron: [C: 032] puppet: point codfw mediawiki canary appservers at puppetmaster2001 [puppet] - 10https://gerrit.wikimedia.org/r/391646 (https://phabricator.wikimedia.org/T177254) (owner: 10Herron) [15:00:41] (03PS3) 10Herron: puppet: point codfw mediawiki canary appservers at puppetmaster2001 [puppet] - 10https://gerrit.wikimedia.org/r/391646 (https://phabricator.wikimedia.org/T177254) [15:00:43] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1051 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391830 (https://phabricator.wikimedia.org/T177208) (owner: 10Marostegui) [15:00:43] !log beginning cut over of codfw canary appservers to puppet 4 master puppetmaster2001 [15:00:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:42] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1051 - T177208 (duration: 00m 48s) 
[15:01:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:49] T177208: Provide dedicated database resources for wikidata - https://phabricator.wikimedia.org/T177208 [15:02:03] (03CR) 10Dzahn: [C: 04-2] misc PHP apps: convert roles to profiles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/391610 (owner: 10Dzahn) [15:02:13] !log milimetric@tin Started deploy [analytics/refinery@4ef15d3]: Mainly deploying the interlanguage navigation dataset [15:02:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:16] (03PS1) 10Marostegui: db-eqiad.php: Increase weight for db1034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391831 [15:06:11] 10Operations, 10DBA, 10MediaWiki-Configuration, 10Wikidata: Test testwikidatawiki on s8 - https://phabricator.wikimedia.org/T180694#3766687 (10jcrespo) [15:06:58] !log installing postgres security updates on labsdb1006/1007 [15:07:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:24] 10Operations, 10DBA, 10MediaWiki-Configuration, 10Wikidata: Test moving testwikidatawiki database to s8 replica set on Wikimedia - https://phabricator.wikimedia.org/T180694#3766687 (10jcrespo) [15:08:22] (03PS7) 10Dzahn: misc PHP apps: convert roles to profiles [puppet] - 10https://gerrit.wikimedia.org/r/391610 [15:09:06] (03CR) 10jerkins-bot: [V: 04-1] misc PHP apps: convert roles to profiles [puppet] - 10https://gerrit.wikimedia.org/r/391610 (owner: 10Dzahn) [15:09:39] (03PS3) 10Alexandros Kosiaris: Add k8s::kubeconfig define [puppet] - 10https://gerrit.wikimedia.org/r/391804 (https://phabricator.wikimedia.org/T177393) [15:09:41] (03PS3) 10Alexandros Kosiaris: Add parameter for kubelet's kubeconfig [puppet] - 10https://gerrit.wikimedia.org/r/391805 (https://phabricator.wikimedia.org/T177393) [15:09:43] (03PS3) 10Alexandros Kosiaris: Add kubeconfig parameter to k8s::proxy [puppet] - 10https://gerrit.wikimedia.org/r/391806 
(https://phabricator.wikimedia.org/T177393) [15:09:45] (03PS3) 10Alexandros Kosiaris: Remove unused cluster_dns_ip kubelet parameter [puppet] - 10https://gerrit.wikimedia.org/r/391807 [15:09:47] (03PS1) 10Alexandros Kosiaris: kubernetes:node Add AAAA record resolving for masters [puppet] - 10https://gerrit.wikimedia.org/r/391833 [15:10:04] (03CR) 10Alexandros Kosiaris: [C: 032] "Noop per https://puppet-compiler.wmflabs.org/compiler03/8816/" [puppet] - 10https://gerrit.wikimedia.org/r/391804 (https://phabricator.wikimedia.org/T177393) (owner: 10Alexandros Kosiaris) [15:10:18] (03CR) 10Alexandros Kosiaris: [C: 032] "Noop per https://puppet-compiler.wmflabs.org/compiler03/8816/" [puppet] - 10https://gerrit.wikimedia.org/r/391805 (https://phabricator.wikimedia.org/T177393) (owner: 10Alexandros Kosiaris) [15:10:20] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase weight for db1034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391831 (owner: 10Marostegui) [15:10:26] (03CR) 10Alexandros Kosiaris: "Noop per https://puppet-compiler.wmflabs.org/compiler03/8816/" [puppet] - 10https://gerrit.wikimedia.org/r/391806 (https://phabricator.wikimedia.org/T177393) (owner: 10Alexandros Kosiaris) [15:10:31] (03PS1) 10Filippo Giunchedi: prometheus: drop addr/alias redis_exporter labels [puppet] - 10https://gerrit.wikimedia.org/r/391834 (https://phabricator.wikimedia.org/T148637) [15:10:35] PROBLEM - puppet last run on argon is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/local/share/ca-certificates/wmf_ca_2017_2020.crt] [15:10:35] (03CR) 10Alexandros Kosiaris: [C: 032] "Noop per https://puppet-compiler.wmflabs.org/compiler03/8816/" [puppet] - 10https://gerrit.wikimedia.org/r/391807 (owner: 10Alexandros Kosiaris) [15:10:39] !log upgrade prometheus-redis-exporter to 0.13-1 - T148637 [15:10:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:45] T148637: Port redis statistics to Prometheus - https://phabricator.wikimedia.org/T148637 [15:10:45] (03CR) 10jerkins-bot: [V: 04-1] prometheus: drop addr/alias redis_exporter labels [puppet] - 10https://gerrit.wikimedia.org/r/391834 (https://phabricator.wikimedia.org/T148637) (owner: 10Filippo Giunchedi) [15:10:50] (03CR) 10Alexandros Kosiaris: [C: 032] Add kubeconfig parameter to k8s::proxy [puppet] - 10https://gerrit.wikimedia.org/r/391806 (https://phabricator.wikimedia.org/T177393) (owner: 10Alexandros Kosiaris) [15:11:43] (03PS8) 10Dzahn: misc PHP apps: convert roles to profiles [puppet] - 10https://gerrit.wikimedia.org/r/391610 [15:12:15] (03Merged) 10jenkins-bot: db-eqiad.php: Increase weight for db1034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391831 (owner: 10Marostegui) [15:12:28] (03CR) 10jenkins-bot: db-eqiad.php: Increase weight for db1034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391831 (owner: 10Marostegui) [15:12:47] (03PS2) 10Filippo Giunchedi: prometheus: drop addr/alias redis_exporter labels [puppet] - 10https://gerrit.wikimedia.org/r/391834 (https://phabricator.wikimedia.org/T148637) [15:13:18] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase db1034 weight (duration: 00m 48s) [15:13:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:32] (03PS6) 10Elukey: [WIP] profile::redis::jobqueue: stagger redis slave restarts [puppet] - 10https://gerrit.wikimedia.org/r/391798 [15:13:51] (03PS1) 10Jcrespo: Setup s8 replica set on codfw 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/391835 (https://phabricator.wikimedia.org/T177208) [15:14:34] (03PS1) 10Marostegui: Revert "Revert "db-codfw.php: Repool db2038"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391836 [15:14:43] (03CR) 10jerkins-bot: [V: 04-1] Revert "Revert "db-codfw.php: Repool db2038"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391836 (owner: 10Marostegui) [15:14:47] (03Abandoned) 10Marostegui: Revert "Revert "db-codfw.php: Repool db2038"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391836 (owner: 10Marostegui) [15:15:34] RECOVERY - puppet last run on argon is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:16:17] (03PS9) 10Dzahn: misc PHP apps: convert roles to profiles [puppet] - 10https://gerrit.wikimedia.org/r/391610 [15:16:20] (03CR) 10jerkins-bot: [V: 04-1] Setup s8 replica set on codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391835 (https://phabricator.wikimedia.org/T177208) (owner: 10Jcrespo) [15:16:22] (03PS1) 10Marostegui: db-codfw.php: Repool db2038 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391837 (https://phabricator.wikimedia.org/T178359) [15:16:42] !log milimetric@tin Finished deploy [analytics/refinery@4ef15d3]: Mainly deploying the interlanguage navigation dataset (duration: 14m 29s) [15:16:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:01] (03CR) 10Jcrespo: [C: 031] db-codfw.php: Repool db2038 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391837 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [15:19:41] (03CR) 10Marostegui: [C: 032] db-codfw.php: Repool db2038 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391837 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [15:20:59] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: drop addr/alias redis_exporter labels [puppet] - 10https://gerrit.wikimedia.org/r/391834 
(https://phabricator.wikimedia.org/T148637) (owner: 10Filippo Giunchedi) [15:21:22] (03Merged) 10jenkins-bot: db-codfw.php: Repool db2038 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391837 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [15:21:31] (03CR) 10jenkins-bot: db-codfw.php: Repool db2038 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391837 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [15:21:57] (03CR) 10Muehlenhoff: "Why do we need this to begin with? We don't have any precise hosts for a while now. Adding DBAs for comments." [puppet] - 10https://gerrit.wikimedia.org/r/385119 (owner: 10Dzahn) [15:22:15] /query _joe_ [15:22:18] nope [15:22:24] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2038 - T178359 (duration: 00m 49s) [15:22:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:29] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359 [15:27:36] (03PS10) 10Dzahn: misc PHP apps: convert roles to profiles [puppet] - 10https://gerrit.wikimedia.org/r/391610 [15:28:00] (03CR) 10Dzahn: "one of the few things using this is quarry" [puppet] - 10https://gerrit.wikimedia.org/r/385119 (owner: 10Dzahn) [15:29:19] (03CR) 10Dzahn: "and it's that often people on cloud VPS instances would like to just get a mysql server installed by applying a role and historically it's" [puppet] - 10https://gerrit.wikimedia.org/r/385119 (owner: 10Dzahn) [15:33:34] !log rebooting seaborgium (slapd) for update to 4.9.51 [15:33:38] (03PS2) 10Jcrespo: Setup s8 replica set on codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391835 (https://phabricator.wikimedia.org/T177208) [15:33:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:50] dcausse: sorry, traveling [15:34:00] Just saw your message [15:34:02] zeljkof: np! 
[15:35:00] (03PS1) 10Marostegui: db-eqiad.php: Increase traffic for db1034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391839 [15:35:09] (03PS3) 10Jcrespo: mariadb: Setup s8 replica set on codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391835 (https://phabricator.wikimedia.org/T177208) [15:36:44] (03PS11) 10Dzahn: misc PHP apps: convert roles to profiles [puppet] - 10https://gerrit.wikimedia.org/r/391610 [15:39:19] (03CR) 10Marostegui: mariadb: Setup s8 replica set on codfw (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391835 (https://phabricator.wikimedia.org/T177208) (owner: 10Jcrespo) [15:39:41] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase traffic for db1034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391839 (owner: 10Marostegui) [15:39:44] (03PS2) 10Addshore: Add seperate ensure for testwiki in wikidata crons [puppet] - 10https://gerrit.wikimedia.org/r/372525 (https://phabricator.wikimedia.org/T173357) [15:40:15] (03CR) 10jerkins-bot: [V: 04-1] Add seperate ensure for testwiki in wikidata crons [puppet] - 10https://gerrit.wikimedia.org/r/372525 (https://phabricator.wikimedia.org/T173357) (owner: 10Addshore) [15:40:54] (03Merged) 10jenkins-bot: db-eqiad.php: Increase traffic for db1034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391839 (owner: 10Marostegui) [15:40:57] (03CR) 10Muehlenhoff: "I don't think that's a good idea. 
precise-wikimedia is suited _only_ for use with precise and anyone using precise is without security sup" [puppet] - 10https://gerrit.wikimedia.org/r/385119 (owner: 10Dzahn) [15:41:04] (03CR) 10jenkins-bot: db-eqiad.php: Increase traffic for db1034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391839 (owner: 10Marostegui) [15:41:52] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase db1034 weight (duration: 00m 49s) [15:41:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:06] (03CR) 10Rush: [C: 031] apt: unattended upgrades for wikimedia packages by default [puppet] - 10https://gerrit.wikimedia.org/r/389480 (https://phabricator.wikimedia.org/T177920) (owner: 10Arturo Borrero Gonzalez) [15:43:11] (03PS9) 10Rush: apt: unattended upgrades for wikimedia packages by default [puppet] - 10https://gerrit.wikimedia.org/r/389480 (https://phabricator.wikimedia.org/T177920) (owner: 10Arturo Borrero Gonzalez) [15:43:27] (03PS12) 10Dzahn: misc PHP apps: convert roles to profiles [puppet] - 10https://gerrit.wikimedia.org/r/391610 [15:44:18] (03CR) 10Jcrespo: mariadb: Setup s8 replica set on codfw (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391835 (https://phabricator.wikimedia.org/T177208) (owner: 10Jcrespo) [15:45:56] (03CR) 10Arturo Borrero Gonzalez: [C: 032] apt: unattended upgrades for wikimedia packages by default [puppet] - 10https://gerrit.wikimedia.org/r/389480 (https://phabricator.wikimedia.org/T177920) (owner: 10Arturo Borrero Gonzalez) [15:46:03] why is the hieradata structure "role/common" but "common/profile" and not "profile/common"? was the old one wrong in the first place? 
[15:47:05] 10Operations, 10ops-eqiad, 10DBA: Rack and setup db1109 and db1110 - https://phabricator.wikimedia.org/T180700#3766852 (10Marostegui) [15:47:41] !log rebooting ores1* for update to 4.9.51 [15:47:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:58] (03PS13) 10Dzahn: misc PHP apps: convert roles to profiles [puppet] - 10https://gerrit.wikimedia.org/r/391610 [15:50:58] (03PS1) 10Marostegui: install_server: Allow install db1109 and db1110 [puppet] - 10https://gerrit.wikimedia.org/r/391845 (https://phabricator.wikimedia.org/T180700) [15:54:36] (03CR) 10Marostegui: mariadb: Setup s8 replica set on codfw (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391835 (https://phabricator.wikimedia.org/T177208) (owner: 10Jcrespo) [15:54:47] (03PS4) 10Jcrespo: mariadb: Setup s8 replica set on codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391835 (https://phabricator.wikimedia.org/T177208) [15:55:39] (03PS5) 10Jcrespo: mariadb: Setup s8 replica set on codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391835 (https://phabricator.wikimedia.org/T177208) [15:57:53] (03CR) 10Jcrespo: "I do not think the mysql package works anymore in stretch. But sure, go ahead." 
[puppet] - 10https://gerrit.wikimedia.org/r/385119 (owner: 10Dzahn) [15:59:28] (03PS1) 10Marostegui: db-eqiad.php: Increase traffic for db1034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391847 [15:59:39] (03PS1) 10Alexandros Kosiaris: Add AAAA/PTR records for kubernetes masters [dns] - 10https://gerrit.wikimedia.org/r/391848 [15:59:50] (03CR) 10Marostegui: [C: 032] install_server: Allow install db1109 and db1110 [puppet] - 10https://gerrit.wikimedia.org/r/391845 (https://phabricator.wikimedia.org/T180700) (owner: 10Marostegui) [16:00:02] 10Operations, 10Cloud-Services, 10Community-Wikimetrics, 10DBA, and 2 others: Evaluate future of wmf puppet module "mysql" - https://phabricator.wikimedia.org/T165625#3766900 (10jcrespo) [16:00:50] (03Abandoned) 10Herron: puppet: point codfw mediawiki::appservers at puppetmaster2001 [puppet] - 10https://gerrit.wikimedia.org/r/391627 (https://phabricator.wikimedia.org/T177254) (owner: 10Herron) [16:02:11] (03PS1) 10Jcrespo: Remove mysql module from WMF [puppet] - 10https://gerrit.wikimedia.org/r/391849 (https://phabricator.wikimedia.org/T162070) [16:02:43] (03CR) 10Jcrespo: "This is my proposed patch instead: https://gerrit.wikimedia.org/r/391849" [puppet] - 10https://gerrit.wikimedia.org/r/385119 (owner: 10Dzahn) [16:02:59] (03CR) 10Marostegui: mariadb: Setup s8 replica set on codfw (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391835 (https://phabricator.wikimedia.org/T177208) (owner: 10Jcrespo) [16:04:05] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase traffic for db1034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391847 (owner: 10Marostegui) [16:05:16] (03Merged) 10jenkins-bot: db-eqiad.php: Increase traffic for db1034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391847 (owner: 10Marostegui) [16:05:55] (03CR) 10Jcrespo: mariadb: Setup s8 replica set on codfw (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391835 
(https://phabricator.wikimedia.org/T177208) (owner: 10Jcrespo) [16:06:24] (03CR) 10jenkins-bot: db-eqiad.php: Increase traffic for db1034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391847 (owner: 10Marostegui) [16:06:28] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase db1034 weight (duration: 00m 48s) [16:06:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:28] (03CR) 10Marostegui: [C: 031] mariadb: Setup s8 replica set on codfw (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391835 (https://phabricator.wikimedia.org/T177208) (owner: 10Jcrespo) [16:08:02] (03PS1) 10Alexandros Kosiaris: Remove cp1056, cp1057 IPv6 PTR records [dns] - 10https://gerrit.wikimedia.org/r/391852 [16:09:21] (03CR) 10Alexandros Kosiaris: "@cmjohnson, adding you in case you remember what Ib9dc1758edf was for and this needs to be amended" [dns] - 10https://gerrit.wikimedia.org/r/391852 (owner: 10Alexandros Kosiaris) [16:10:17] (03CR) 10Alexandros Kosiaris: [C: 032] Add AAAA/PTR records for kubernetes masters [dns] - 10https://gerrit.wikimedia.org/r/391848 (owner: 10Alexandros Kosiaris) [16:10:24] (03CR) 10Addshore: [C: 031] Wikidata dispatcher: Choose a better value for --randomness [puppet] - 10https://gerrit.wikimedia.org/r/387282 (owner: 10Hoo man) [16:10:42] (03CR) 10Addshore: [C: 031] "Running with a low randomness is essentially how I made the queue recover last time." [puppet] - 10https://gerrit.wikimedia.org/r/387282 (owner: 10Hoo man) [16:13:01] (03CR) 10Thiemo Mättig (WMDE): [C: 031] Wikidata dispatcher: Choose a better value for --randomness [puppet] - 10https://gerrit.wikimedia.org/r/387282 (owner: 10Hoo man) [16:13:58] (03CR) 10Dzahn: "> I don't think that's a good idea. 
precise-wikimedia is suited _only_ for use with precise and anyone using precise is without security s" [puppet] - 10https://gerrit.wikimedia.org/r/385119 (owner: 10Dzahn) [16:14:45] (03CR) 10Dzahn: ">I do not think the mysql package works anymore in stretch" [puppet] - 10https://gerrit.wikimedia.org/r/385119 (owner: 10Dzahn) [16:15:31] (03CR) 10Dzahn: "also: https://phabricator.wikimedia.org/T165625 https://phabricator.wikimedia.org/T162070" [puppet] - 10https://gerrit.wikimedia.org/r/385119 (owner: 10Dzahn) [16:15:59] (03CR) 10Alexandros Kosiaris: mysql: don't install precise sources.list if on stretch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/385119 (owner: 10Dzahn) [16:16:22] (03PS1) 10Herron: puppet: point codfw mw systems at puppet 4 master puppetmaster2001 [puppet] - 10https://gerrit.wikimedia.org/r/391856 (https://phabricator.wikimedia.org/T177254) [16:17:02] (03CR) 10Dzahn: mysql: don't install precise sources.list if on stretch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/385119 (owner: 10Dzahn) [16:17:12] (03Abandoned) 10Dzahn: mysql: don't install precise sources.list if on stretch [puppet] - 10https://gerrit.wikimedia.org/r/385119 (owner: 10Dzahn) [16:17:17] (03CR) 10Muehlenhoff: mysql: don't install precise sources.list if on stretch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/385119 (owner: 10Dzahn) [16:17:49] (03CR) 10Dzahn: "quarry uses this, afaict" [puppet] - 10https://gerrit.wikimedia.org/r/391849 (https://phabricator.wikimedia.org/T162070) (owner: 10Jcrespo) [16:18:02] 10Operations, 10Phabricator, 10Traffic, 10Zero: Missing IP addresses for Maroc Telecom - https://phabricator.wikimedia.org/T174342#3766948 (10Dispenser) 05Open>03Invalid The team squandered a perfect opportunity where a WP0 pirate broke the ISP blackholing, registered an account on mediawiki.org, and f... 
[16:19:15] 10Operations, 10Phabricator, 10Traffic, 10Zero: Missing IP addresses for Maroc Telecom - https://phabricator.wikimedia.org/T174342#3766950 (10Aklapper) @Dispenser: See T174342#3737108 - I don't know who "the team" and why this task would be invalid. [16:21:50] !log restarting apache on labs puppet masters to pick up openssl updates [16:21:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:14] (03PS6) 10Jcrespo: mariadb: Setup s8 replica set on codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391835 (https://phabricator.wikimedia.org/T177208) [16:24:16] (03PS1) 10Jcrespo: mariadb: Depool db1071, pool db1104 as api [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391857 (https://phabricator.wikimedia.org/T177208) [16:24:32] (03PS14) 10Dzahn: misc PHP apps: convert roles to profiles [puppet] - 10https://gerrit.wikimedia.org/r/391610 [16:24:58] (03CR) 10Marostegui: [C: 031] mariadb: Depool db1071, pool db1104 as api [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391857 (https://phabricator.wikimedia.org/T177208) (owner: 10Jcrespo) [16:25:22] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db1071, pool db1104 as api [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391857 (https://phabricator.wikimedia.org/T177208) (owner: 10Jcrespo) [16:25:38] (03CR) 10Cmjohnson: [C: 031] Remove cp1056, cp1057 IPv6 PTR records [dns] - 10https://gerrit.wikimedia.org/r/391852 (owner: 10Alexandros Kosiaris) [16:25:57] (03PS2) 10Jcrespo: mariadb: Depool db1071, pool db1104 as api [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391857 (https://phabricator.wikimedia.org/T177208) [16:27:02] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of Knud Winckelmann → KnudW: supervision needed - https://phabricator.wikimedia.org/T180703#3766973 (10alanajjar) [16:27:24] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of Knud Winckelmann → KnudW: supervision needed - 
https://phabricator.wikimedia.org/T180703#3766985 (10alanajjar) [16:27:48] (03CR) 10Ayounsi: [WIP] Puppetize Netbox (0318 comments) [puppet] - 10https://gerrit.wikimedia.org/r/387880 (https://phabricator.wikimedia.org/T170144) (owner: 10Ayounsi) [16:27:58] (03CR) 10jenkins-bot: mariadb: Depool db1071, pool db1104 as api [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391857 (https://phabricator.wikimedia.org/T177208) (owner: 10Jcrespo) [16:28:28] (03PS17) 10Ayounsi: Puppetize Netbox [puppet] - 10https://gerrit.wikimedia.org/r/387880 (https://phabricator.wikimedia.org/T170144) [16:29:22] 10Operations, 10Phabricator, 10Traffic, 10Zero: Missing IP addresses for Maroc Telecom - https://phabricator.wikimedia.org/T174342#3766994 (10Dispenser) @Aklapper The Checkuser information is irrecoverably gone and thus the task can no longer be completed and Invalid. You can change it to decline if you t... [16:29:32] !log jynus@tin Synchronized wmf-config/db-eqiad.php: mariadb: Depool db1071, pool db1104 as api (duration: 00m 49s) [16:29:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:25] :Q! [16:30:28] grr [16:31:41] BTW there is a spike of database errors since 16:20 [16:31:44] (03CR) 10Dzahn: [C: 031] keyholder: Use systemd::tmpfile [puppet] - 10https://gerrit.wikimedia.org/r/386621 (owner: 10Muehlenhoff) [16:32:13] (03PS1) 10Marostegui: db-eqiad.php: Restore db1034 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391860 [16:32:16] (03CR) 10Dzahn: [C: 031] "http://puppet-compiler.wmflabs.org/8825/krypton.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/391610 (owner: 10Dzahn) [16:32:33] User::loadFromDatabase Lock wait timeout exceeded; try restarting transaction [16:33:02] which host? 
[16:33:12] it is not a single host [16:33:16] (03PS15) 10Dzahn: grafana,racktables,scholarships,iegreview: role to profiles [puppet] - 10https://gerrit.wikimedia.org/r/391610 [16:33:18] it is an api call [16:33:41] SELECT user_id,user_name,user_real_name,user_email,user_touched,user_token,user_email_authenticated,user_email_token,user_email_token_expires,user_registration,user_editcount FROM `user` WHERE user_id = 'xxxxxx' LIMIT 1 FOR UPDATE [16:34:51] it is always db1054 from what I can see, s2 master [16:35:03] yeah, it is a select for update [16:35:28] always and only ptwiki… [16:37:02] it is the same user, so moving on [16:37:26] looks gone [16:37:48] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Restore db1034 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391860 (owner: 10Marostegui) [16:42:01] (03Merged) 10jenkins-bot: db-eqiad.php: Restore db1034 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391860 (owner: 10Marostegui) [16:42:03] (03CR) 10Jcrespo: "That is why the patch is not finished :-) But I do not think the previous one would work on stretch nor it should be used without reimport" [puppet] - 10https://gerrit.wikimedia.org/r/391849 (https://phabricator.wikimedia.org/T162070) (owner: 10Jcrespo) [16:42:16] (03CR) 10jenkins-bot: db-eqiad.php: Restore db1034 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391860 (owner: 10Marostegui) [16:43:12] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Restore db1034 and db1079 original weight (duration: 00m 48s) [16:43:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:25] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler02/8824/" [puppet] - 10https://gerrit.wikimedia.org/r/391856 (https://phabricator.wikimedia.org/T177254) (owner: 10Herron) [16:48:16] !log upgrade hpsa firmware to 6.06 on restbase2006 - T141756 [16:48:24] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:24] T141756: audit / test / upgrade hp smartarray P840 firmware - https://phabricator.wikimedia.org/T141756 [16:48:46] !log beginning gradual cutover of codfw mw systems to puppet 4 master puppetmaster2001 [16:48:49] !log stop and restart db1071 for upgrade and reconfiguration [16:48:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:52] (03Draft1) 10Paladox: Gerrit: Set auth.gitBasicAuth to true [puppet] - 10https://gerrit.wikimedia.org/r/391865 [16:48:55] (03PS2) 10Paladox: Gerrit: Set auth.gitBasicAuth to true [puppet] - 10https://gerrit.wikimedia.org/r/391865 [16:48:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:59] (03CR) 10Herron: [C: 032] puppet: point codfw mw systems at puppet 4 master puppetmaster2001 [puppet] - 10https://gerrit.wikimedia.org/r/391856 (https://phabricator.wikimedia.org/T177254) (owner: 10Herron) [16:51:13] (03CR) 10Dzahn: [C: 032] "not touching analytics::burrow, just all others. still "wmf-style: total violations delta -6" and no-op on krypton." [puppet] - 10https://gerrit.wikimedia.org/r/391610 (owner: 10Dzahn) [16:51:24] 10Operations, 10ops-eqiad, 10DBA: Decommission db1022 (Was: db1022 broke while changing topology on s6- evaluate if to fix or directly decommission) - https://phabricator.wikimedia.org/T163778#3209308 (10MoritzMuehlenhoff) JFTR: The host was still showing up in puppetdb (e.g. via https://servermon.wikimedia.... [16:51:26] (03PS16) 10Dzahn: grafana,racktables,scholarships,iegreview: role to profiles [puppet] - 10https://gerrit.wikimedia.org/r/391610 [16:51:59] (03PS3) 10Paladox: Gerrit: Set auth.gitBasicAuth to true [puppet] - 10https://gerrit.wikimedia.org/r/391865 [16:52:06] (03CR) 10Alexandros Kosiaris: [C: 032] "thanks!" 
[dns] - 10https://gerrit.wikimedia.org/r/391852 (owner: 10Alexandros Kosiaris) [16:53:26] (03CR) 10Alexandros Kosiaris: [C: 032] kubernetes:node Add AAAA record resolving for masters [puppet] - 10https://gerrit.wikimedia.org/r/391833 (owner: 10Alexandros Kosiaris) [16:53:31] (03PS2) 10Alexandros Kosiaris: kubernetes:node Add AAAA record resolving for masters [puppet] - 10https://gerrit.wikimedia.org/r/391833 [16:53:33] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] kubernetes:node Add AAAA record resolving for masters [puppet] - 10https://gerrit.wikimedia.org/r/391833 (owner: 10Alexandros Kosiaris) [16:53:54] (03CR) 10Dzahn: "Notice: /Stage[main]/Role::Webserver_misc_apps/System::Role[webserver_misc_apps]/Motd::Script[role-webserver_misc_apps]/File[/etc/update-m" [puppet] - 10https://gerrit.wikimedia.org/r/391610 (owner: 10Dzahn) [16:54:30] !log reimage restbase2006 - T179422 [16:54:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:37] T179422: Reshape RESTBase Cassandra clusters - https://phabricator.wikimedia.org/T179422 [16:54:40] !log demon@tin Started scap: consistency [16:54:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:05] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [16:57:35] PROBLEM - Check systemd state on kubernetes1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[16:59:05] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1001 is OK: OK ferm input default policy is set [16:59:28] 10Operations, 10DBA, 10MediaWiki-Configuration, 10Wikidata: Test moving testwikidatawiki database to s8 replica set on Wikimedia - https://phabricator.wikimedia.org/T180694#3767076 (10Ladsgroup) I can help [16:59:35] RECOVERY - Check systemd state on kubernetes1001 is OK: OK - running: The system is fully operational [17:00:05] godog, moritzm, and _joe_: How many deployers does it take to do Puppet SWAT(Max 8 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171116T1700). [17:00:05] hoo: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [17:00:55] 10Operations, 10Puppet, 10Wikimedia-Apache-configuration, 10Mobile, 10Readers-Web-Backlog (Tracking): On mobile, http://wikipedia.org/wiki/Foo redirects to https://www.m.wikipedia.org/wiki/Foo which does not exist - https://phabricator.wikimedia.org/T154026#3767077 (10Train2104) This is breaking numerous... [17:01:57] (03PS1) 10Filippo Giunchedi: cassandra: reprovision restbase2006 with cassandra 3 [puppet] - 10https://gerrit.wikimedia.org/r/391867 (https://phabricator.wikimedia.org/T179422) [17:02:18] 10Operations, 10Gerrit, 10ORES, 10Scoring-platform-team, and 3 others: Support git-lfs files in gerrit - https://phabricator.wikimedia.org/T171758#3767081 (10Paladox) LFS is now supported and enabled on a testing repo. As i found out today, lfs dosent seem to support digest which is enabled by default for... 
[17:03:52] (03CR) 10Mobrovac: [C: 031] cassandra: reprovision restbase2006 with cassandra 3 [puppet] - 10https://gerrit.wikimedia.org/r/391867 (https://phabricator.wikimedia.org/T179422) (owner: 10Filippo Giunchedi) [17:05:29] !log Converting 'others mobile' to size-tiered compaction (T179422) [17:05:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:38] T179422: Reshape RESTBase Cassandra clusters - https://phabricator.wikimedia.org/T179422 [17:06:03] (03PS1) 10Jcrespo: Set db1071 to STATEMENT binlog in preparation to be master of s8 [puppet] - 10https://gerrit.wikimedia.org/r/391868 (https://phabricator.wikimedia.org/T177208) [17:06:33] (03CR) 10Filippo Giunchedi: [C: 032] cassandra: reprovision restbase2006 with cassandra 3 [puppet] - 10https://gerrit.wikimedia.org/r/391867 (https://phabricator.wikimedia.org/T179422) (owner: 10Filippo Giunchedi) [17:07:32] (03PS2) 10Jcrespo: mariadb: Set db1071 to STATEMENT binlog in preparation to be a master [puppet] - 10https://gerrit.wikimedia.org/r/391868 (https://phabricator.wikimedia.org/T177208) [17:08:52] (03CR) 10Rush: "Let's talk about this in the next cloud admin meeting" [puppet] - 10https://gerrit.wikimedia.org/r/390431 (https://phabricator.wikimedia.org/T180254) (owner: 10Arturo Borrero Gonzalez) [17:10:02] 10Operations, 10DBA, 10MediaWiki-Configuration, 10Wikidata: Test moving testwikidatawiki database to s8 replica set on Wikimedia - https://phabricator.wikimedia.org/T180694#3767125 (10jcrespo) Thanks, Ladsgroup, for starters we were thinking of preparing the topology changes for s8 on codfw (which if it br... [17:10:51] 10Operations, 10Puppet, 10Wikimedia-Apache-configuration, 10Mobile, 10Readers-Web-Backlog (Tracking): On mobile, http://wikipedia.org/wiki/Foo redirects to https://www.m.wikipedia.org/wiki/Foo which does not exist - https://phabricator.wikimedia.org/T154026#3767127 (10BBlack) Do we have answers about wha... 
[17:12:47] (03PS3) 10Jcrespo: mariadb: Set db1071 to STATEMENT binlog in preparation to be a master [puppet] - 10https://gerrit.wikimedia.org/r/391868 (https://phabricator.wikimedia.org/T177208) [17:13:23] (03PS3) 10BBlack: DNS: Only send eqsin countries to ulsfo [dns] - 10https://gerrit.wikimedia.org/r/391357 (owner: 10Ayounsi) [17:14:17] (03CR) 10Jcrespo: [C: 032] mariadb: Set db1071 to STATEMENT binlog in preparation to be a master [puppet] - 10https://gerrit.wikimedia.org/r/391868 (https://phabricator.wikimedia.org/T177208) (owner: 10Jcrespo) [17:14:50] (03PS4) 10Paladox: Gerrit: Set auth.gitBasicAuth to true [puppet] - 10https://gerrit.wikimedia.org/r/391865 (https://phabricator.wikimedia.org/T171758) [17:15:04] (03PS5) 10Paladox: Gerrit: Set auth.gitBasicAuth to true [puppet] - 10https://gerrit.wikimedia.org/r/391865 (https://phabricator.wikimedia.org/T171758) [17:15:18] 10Operations, 10Analytics, 10DBA, 10Patch-For-Review, 10User-Elukey: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3767146 (10elukey) I am currently reviewing what tables to drop on db1047 and which ones to copy over to db1108, and this is what I gath... [17:15:48] !log Converting 'commons mobile' to size-tiered compaction (T179422) [17:15:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:54] T179422: Reshape RESTBase Cassandra clusters - https://phabricator.wikimedia.org/T179422 [17:16:02] 10Operations, 10monitoring, 10netops, 10User-fgiunchedi: Backfill librenms data in graphite with historical RRDs - https://phabricator.wikimedia.org/T173698#3767149 (10ayounsi) Looks like the tools from the python-whisper package are good enough to tackle the conversion and backfilling ( https://github.com... [17:16:53] !log reimage restbase2006 - T179422 [17:16:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:09] pupetswat still in-progress? 
[17:17:28] 10Operations, 10Analytics, 10DBA, 10Patch-For-Review, 10User-Elukey: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3767179 (10jcrespo) ops are ours, we can handle that- just leave things as you found them. test is probably a mistake and probably shoul... [17:18:46] !log Converting 'wikipedia parsoid' to size-tiered compaction (T179422) [17:18:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:58] !log Temporary GeoDNS routing changes (eqsin traffic simulation using ulsfo) - https://gerrit.wikimedia.org/r/#/c/391357/ - expecting ~24h, West Asia latencies will probably increase, spike in cache misses, etc... [17:19:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:52] !log demon@tin Finished scap: consistency (duration: 25m 12s) [17:19:53] bblack: no, completely forgot [17:19:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:58] it's ok [17:20:05] PROBLEM - SSH on db1063 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:20:16] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 5 others: Select candidate jobs for transferring to the new infrastucture - https://phabricator.wikimedia.org/T175210#3767196 (10Pchelolo) Out of the IRC discussion we've got 3 candidates for the next migration: - `wikibase-UpdateUsagesForPage` - sup... [17:20:21] I just didn't want to stomp on the middle of some puppetswat change deploying [17:20:27] (03CR) 10Ayounsi: [C: 032] DNS: Only send eqsin countries to ulsfo [dns] - 10https://gerrit.wikimedia.org/r/391357 (owner: 10Ayounsi) [17:20:37] (03CR) 10Ayounsi: [C: 032] DNS: Only send eqsin countries to ulsfo [dns] - 10https://gerrit.wikimedia.org/r/391357 (owner: 10Ayounsi) [17:21:28] do we have a network issue? [17:21:38] some things are broken [17:21:39] jynus: not that I know of, why do you ask? 
[17:22:07] lag on multiple s5 hosts [17:22:20] is wikidata up? [17:22:42] it is up, but no recentchanges [17:23:07] is the master down? [17:23:29] wikidata is in read only [17:23:43] 17:20 < icinga-wm> PROBLEM - SSH on db1063 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:23:46] ^ ? [17:23:56] cannot ssh [17:24:00] did it crash? [17:24:55] rip [17:25:39] I don't know, do you need me to look? [17:25:41] PROBLEM - MariaDB Slave Lag: s5 on db1087 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 609.26 seconds [17:25:41] PROBLEM - MariaDB Slave Lag: s5 on db2085 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 610.64 seconds [17:25:42] PROBLEM - MariaDB Slave Lag: s5 on db2045 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 611.36 seconds [17:25:42] PROBLEM - MariaDB Slave Lag: s5 on db2075 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 611.45 seconds [17:25:45] PROBLEM - MariaDB Slave Lag: s5 on db2059 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 621.11 seconds [17:25:45] yes [17:25:55] PROBLEM - MariaDB Slave Lag: s5 on db2083 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 624.20 seconds [17:25:56] PROBLEM - MariaDB Slave Lag: s5 on db2080 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 627.14 seconds [17:25:56] PROBLEM - MariaDB Slave Lag: s5 on db2082 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 629.88 seconds [17:26:05] PROBLEM - MariaDB Slave Lag: s5 on db2089 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 636.43 seconds [17:26:11] PROBLEM - MariaDB Slave Lag: s5 on db1104 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 637.07 seconds [17:26:11] PROBLEM - MariaDB Slave Lag: s5 on db2038 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 637.53 seconds [17:26:12] PROBLEM - MariaDB Slave Lag: s5 on db2084 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 640.65 seconds [17:26:18] whoever is logged into 63, can you tell me what is going on? 
[17:26:26] PROBLEM - MariaDB Slave Lag: s5 on db2081 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 657.87 seconds [17:26:26] PROBLEM - MariaDB Slave Lag: s5 on db2023 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 660.23 seconds [17:26:31] I am here [17:26:31] I'm logging into the console now [17:26:35] PROBLEM - MariaDB Slave Lag: s5 on db2086 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 666.53 seconds [17:26:35] How can i help [17:26:36] PROBLEM - MariaDB Slave Lag: s5 on db2079 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 667.15 seconds [17:26:50] we need to know what is going on with db1063 [17:26:51] no more sessions, so someone's already on the console.... [17:26:58] who? [17:26:58] 10Operations, 10Traffic: VCL: handling of uncacheable responses in wikimedia-common - https://phabricator.wikimedia.org/T180712#3767222 (10ema) [17:27:07] 10Operations, 10Traffic: VCL: handling of uncacheable responses in wikimedia-common - https://phabricator.wikimedia.org/T180712#3767237 (10ema) p:05Triage>03Normal [17:27:15] ssh root@db1063.mgmt.eqiad.wmnet [...] No more sessions are available for this type of connection! [17:27:15] we need to failover [17:27:28] I don't know who [17:27:35] should we wait for the master to come back? [17:27:44] from what? [17:27:50] we do not know what is going on with it [17:27:52] we don't know if it rebooted? 
[17:27:53] right [17:28:00] i just saw that no one can log in to the idrac [17:28:00] it could have the network down [17:28:01] PROBLEM - MariaDB Slave IO: s5 on db1087 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1063.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1063.eqiad.wmnet (110 Connection timed out) [17:28:10] PROBLEM - MariaDB Slave IO: s5 on db1104 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1063.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1063.eqiad.wmnet (110 Connection timed out) [17:28:23] I see kernel stack traces in the remote syslog server [17:28:34] !log Converting 'enwiki parsoid' to size-tiered compaction (T179422) [17:28:37] Nov 16 17:15:23 db1063 kernel: [17198753.219192] megaraid_sas 0000:03:00.0: Iop2SysDoorbellIntfor scsi0 [17:28:40] Nov 16 17:15:24 db1063 kernel: [17198754.220718] megaraid_sas 0000:03:00.0: Found FW in FAULT state, will reset adapter scsi0. [17:28:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:41] T179422: Reshape RESTBase Cassandra clusters - https://phabricator.wikimedia.org/T179422 [17:28:43] Nov 16 17:15:24 db1063 kernel: [17198754.220724] megaraid_sas 0000:03:00.0: resetting fusion adapter scsi0. [17:28:48] and I cannot pool db1071 because it was in the middle of a reboot [17:28:53] so it doesn't have the latest data [17:30:08] and I suppose no other slave we could promote [17:30:11] and the idrac is totally gone, awesome [17:30:20] marostegui: it won't allow login at all? [17:30:31] or it won't allow serial console? 
[17:30:35] PROBLEM - MariaDB Slave Lag: s5 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 905.20 seconds [17:30:36] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] [17:30:45] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1063.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1063.eqiad.wmnet (110 Connection timed out) [17:31:27] Hi! How long will the dewiki database been locked? [17:31:38] see topic [17:31:40] doctaxon: no ETA, we got problems [17:31:46] PROBLEM - Host db1063 is DOWN: PING CRITICAL - Packet loss = 100% [17:32:01] can we get cmjohnson1 to give it a look or a reboot? [17:32:22] cmjohnson1: are you at the dc? [17:32:23] 10Operations, 10DBA: db1063 crashed - https://phabricator.wikimedia.org/T180714#3767261 (10jcrespo) [17:32:23] <_joe_> marostegui: ack [17:32:31] PROBLEM - MariaDB Slave Lag: s5 on db1099 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1019.77 seconds [17:32:33] at least give it a look [17:32:35] it seems pretty peculiar that 1063 died while 1071 was rebooting? 
[17:32:45] PROBLEM - MariaDB Slave IO: s5 on db2023 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1063.eqiad.wmnet:3306 - retry-time: 60 retries: 86400 message: Cant connect to MySQL server on db1063.eqiad.wmnet (110 Connection timed out) [17:32:47] seems unlikely to be a totally random coincidence anyways [17:32:49] (03PS7) 10Jcrespo: mariadb: Setup s8 replica set on codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391835 (https://phabricator.wikimedia.org/T177208) [17:32:52] (03PS1) 10Jcrespo: mariadb: Failover db1063 to db1070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391869 (https://phabricator.wikimedia.org/T180714) [17:32:53] mutante: i am out for lunch ...i will head back now [17:33:12] PROBLEM - MariaDB Slave IO: s5 on db1070 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1063.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1063.eqiad.wmnet (110 Connection timed out) [17:33:36] PROBLEM - MariaDB Slave IO: s5 on dbstore1002 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1063.eqiad.wmnet:3306 - retry-time: 60 retries: 86400 message: Cant connect to MySQL server on db1063.eqiad.wmnet (110 Connection timed out) [17:33:45] PROBLEM - SSH on labtestservices2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:33:52] PROBLEM - MariaDB Slave IO: s5 on db1106 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1063.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1063.eqiad.wmnet (110 Connection timed out) [17:34:01] PROBLEM - MariaDB Slave IO: s5 on db1099 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master 
repl@db1063.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1063.eqiad.wmnet (110 Connection timed out) [17:34:05] can someone silence the alerts? [17:34:10] got it [17:34:20] tries to call chris [17:34:27] chris is already aware [17:34:29] mutante: he's aware [17:34:30] ok [17:34:35] PROBLEM - MariaDB Slave Lag: s5 on db2066 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1144.77 seconds [17:34:38] thanks akosiaris [17:34:44] I am checking all slaves to see which one is the one more ahead [17:35:03] (I have no idea why labtestservices2003 dropped off but it was unreachable and now seems back in case that seems relevant ever) [17:35:43] I think they are all at 234464548 [17:36:18] marostegui: I can repoint the replicas to db1070 [17:36:22] but I want your ok [17:36:24] it is row [17:36:25] jynus: yes, that matches what I am seeeing too [17:36:33] let me check [17:36:46] 10Operations, 10DBA, 10Patch-For-Review: db1063 crashed - https://phabricator.wikimedia.org/T180714#3767261 (10fgiunchedi) I retrieved the kernel logs from syslog servers at the time of the incident: {P6332} [17:36:52] !log stopping slave on db1070 [17:36:56] RECOVERY - SSH on labtestservices2003 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8 (protocol 2.0) [17:36:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:09] S5 related alerts silenced for the next 2 hours [17:37:18] jynus: i will submit the patch for STATEMENT [17:37:25] position on db1070 is db1070-bin.001476:361022356 [17:37:28] no need [17:37:32] we will go all in [17:37:35] sure [17:37:41] let's stop puppet [17:37:54] I will stop slave and repoint all other slaves to the above position [17:37:56] (03PS18) 10Ayounsi: Puppetize Netbox [puppet] - 10https://gerrit.wikimedia.org/r/387880 (https://phabricator.wikimedia.org/T170144) [17:38:08] db1070-bin.001476:361022356 -> confirmed [17:38:22] puppet disabled [17:38:51] cmjohnson1: before powering 
off db1063 let us know, and remove the network cable please [17:39:19] marostegui: assuming he's still otw back from lunch, want me to shut off the switch port? [17:39:26] bblack: yeah, sounds good [17:39:36] <_joe_> marostegui: should someone prepare the mw-config patch for you? [17:39:40] at least till we are good here, to avoid brain splits [17:39:49] (03PS6) 10Ema: vcl: distinguish between hfp and hfm [puppet] - 10https://gerrit.wikimedia.org/r/391171 (https://phabricator.wikimedia.org/T180434) [17:39:51] (03CR) 10Ayounsi: [C: 032] Puppetize Netbox [puppet] - 10https://gerrit.wikimedia.org/r/387880 (https://phabricator.wikimedia.org/T170144) (owner: 10Ayounsi) [17:40:03] !log Decommissioning Cassandra, restbase1014-b.eqiad.wmnet (T179422) [17:40:05] PROBLEM - SSH on labtestservices2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:40:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:10] _joe_: not much needed now, jynus did it - I will merge when he is good with it [17:40:11] T179422: Reshape RESTBase Cassandra clusters - https://phabricator.wikimedia.org/T179422 [17:40:17] <_joe_> ok [17:40:32] <_joe_> can we stop merging things? 
we're in the middle of an outage [17:41:17] we need the patch for the heartbeat [17:41:21] (03CR) 10Ema: vcl: distinguish between hfp and hfm (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/391171 (https://phabricator.wikimedia.org/T180434) (owner: 10Ema) [17:41:22] and the one for mediawiki [17:41:24] (03PS1) 10Marostegui: db1070: New master in s5 [puppet] - 10https://gerrit.wikimedia.org/r/391870 (https://phabricator.wikimedia.org/T180714) [17:41:29] jynus: I am doing the puppet changes [17:41:36] RECOVERY - MariaDB Slave IO: s5 on db1087 is OK: OK slave_io_state Slave_IO_Running: Yes [17:41:54] I thought I had those silenced :-( [17:41:56] !log disable port ge-5/0/39 on asw-c-eqiad (db1063) [17:42:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:14] akosiaris: the recoveries come in regardless iirc [17:42:48] Physical interface: ge-5/0/39, Administratively down, Physical link is Down [17:43:00] (03CR) 10Ladsgroup: "I'm not sure, if this repo is going to be used at one point in future, it's needed but if it's obsolete, feel free to abandon this patch s" [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/357369 (https://phabricator.wikimedia.org/T167114) (owner: 10Ladsgroup) [17:43:04] so you should be safe from a surprise reboot into service [17:43:29] bblack: thanks! 
[17:43:42] (03PS2) 10Marostegui: db1070: New master in s5 [puppet] - 10https://gerrit.wikimedia.org/r/391870 (https://phabricator.wikimedia.org/T180714) [17:43:50] (03PS3) 10Marostegui: db1070: New master in s5 [puppet] - 10https://gerrit.wikimedia.org/r/391870 (https://phabricator.wikimedia.org/T180714) [17:44:16] RECOVERY - MariaDB Slave IO: s5 on db1099 is OK: OK slave_io_state Slave_IO_Running: Yes [17:44:19] jynus: that is the puppet change - puppet is stopped on db1070 anyways, i will wait for a review before merging [17:44:26] I will merge mediawiki-patch once we are also done [17:44:27] grrrr [17:44:35] !log merging netbox CR [17:44:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:55] (03PS1) 10Ayounsi: Netbox deployment followup [puppet] - 10https://gerrit.wikimedia.org/r/391871 [17:45:19] XioNoX: would be possible to leave those merges for a bit later? we have the outage going on [17:45:42] (03CR) 10Ayounsi: [C: 032] Netbox deployment followup [puppet] - 10https://gerrit.wikimedia.org/r/391871 (owner: 10Ayounsi) [17:46:45] RECOVERY - MariaDB Slave IO: s5 on db1104 is OK: OK slave_io_state Slave_IO_Running: Yes [17:47:09] PROBLEM - puppet last run on netmon1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:47:17] marostegui: I merged the main one :/ but can hold off on the followups fixes [17:47:22] cheers [17:47:22] 10Operations, 10CirrusSearch, 10Discovery, 10Elasticsearch, and 2 others: Created dedicated elastic component in our APT repository - https://phabricator.wikimedia.org/T179964#3767363 (10debt) 05Open>03Resolved a:03debt [17:47:24] RECOVERY - MariaDB Slave IO: s5 on db1106 is OK: OK slave_io_state Slave_IO_Running: Yes [17:47:35] I 've followed a different approach in icinga now, hopefully this time I got them all [17:47:40] we are good [17:47:46] oh ? nice! 
[17:47:53] we can deploy 70 as master [17:47:56] ok [17:47:57] ACKNOWLEDGEMENT - puppet last run on netmon1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues Ayounsi deploying netbox [17:47:59] review my change if you like [17:48:19] RECOVERY - MariaDB Slave IO: s5 on db2023 is OK: OK slave_io_state Slave_IO_Running: Yes [17:48:29] (03CR) 10Marostegui: "https://puppet-compiler.wmflabs.org/compiler02/8826/" [puppet] - 10https://gerrit.wikimedia.org/r/391870 (https://phabricator.wikimedia.org/T180714) (owner: 10Marostegui) [17:48:31] link? [17:48:42] https://gerrit.wikimedia.org/r/391870 [17:48:45] https://gerrit.wikimedia.org/r/#/c/391870/3 [17:49:07] I do not care about pupet right now [17:49:11] we need mediawiki back [17:49:18] ok, let me deploy your change [17:49:21] https://gerrit.wikimedia.org/r/#/c/391869/ [17:49:39] we will do puppet later [17:49:40] we should remove it from the vslow service [17:49:42] but we can do that later [17:49:55] (03CR) 10Jcrespo: [C: 032] mariadb: Failover db1063 to db1070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391869 (https://phabricator.wikimedia.org/T180714) (owner: 10Jcrespo) [17:50:00] (03PS2) 10Jcrespo: mariadb: Failover db1063 to db1070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391869 (https://phabricator.wikimedia.org/T180714) [17:50:10] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] [17:50:20] PROBLEM - puppet last run on netmon2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:51:08] (03CR) 10Jcrespo: [C: 032] db1070: New master in s5 [puppet] - 10https://gerrit.wikimedia.org/r/391870 (https://phabricator.wikimedia.org/T180714) (owner: 10Marostegui) [17:51:39] PROBLEM - Host mw2251 is DOWN: PING CRITICAL - Packet loss = 100% [17:51:59] is network maintenance going on? 
[17:52:19] RECOVERY - Host mw2251 is UP: PING OK - Packet loss = 0%, RTA = 36.08 ms [17:52:19] ACKNOWLEDGEMENT - puppet last run on netmon2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues Ayounsi deploying netbox [17:52:29] jynus: no [17:52:41] * akosiaris looking into that mw2251 thing [17:53:09] box rebooted [17:53:30] (03CR) 10jenkins-bot: mariadb: Failover db1063 to db1070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391869 (https://phabricator.wikimedia.org/T180714) (owner: 10Jcrespo) [17:53:42] heh talking about coincidences [17:53:48] Nov 16 11:04:45 mw2251 kernel: [501074.432068] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4 [17:53:51] marostegui: you changed the socket [17:53:55] logs are full of this and related stuff [17:53:58] can you commit the change without removing the socket? [17:54:02] I need heartbeat up [17:54:03] yep [17:54:09] and it doesn't work without it [17:54:17] I will merge mediawiki [17:54:41] go for it [17:55:22] awesome git review failing for me now [17:55:59] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Failover db1063 to db1070 (duration: 00m 46s) [17:56:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:19] PROBLEM - High lag on wdqs1003 is CRITICAL: CRITICAL: 34.48% of data above the critical threshold [1800.0] [17:56:21] PROBLEM - puppet last run on db1070 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[pt-heartbeat] [17:56:29] PROBLEM - High lag on wdqs2003 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1800.0] [17:56:30] PROBLEM - High lag on wdqs1005 is CRITICAL: CRITICAL: 34.48% of data above the critical threshold [1800.0] [17:56:34] we need heartbeat back for edits to come back [17:56:42] wdqs is likely related to wikidata readonly status [17:56:48] jynus: let's start it manually? 
[17:56:59] PROBLEM - High lag on wdqs1004 is CRITICAL: CRITICAL: 36.67% of data above the critical threshold [1800.0] [17:57:00] PROBLEM - High lag on wdqs2002 is CRITICAL: CRITICAL: 36.67% of data above the critical threshold [1800.0] [17:57:08] yeah, wdqs is just checking for last change time [17:57:09] PROBLEM - puppet last run on naos is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Scap_source[netbox/deploy] [17:57:10] !log silencing wdqs [17:57:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:21] it doesn't have any replication position [17:57:41] well, it has… but that's not used here [17:58:00] PROBLEM - High lag on wdqs2001 is CRITICAL: CRITICAL: 34.48% of data above the critical threshold [1800.0] [17:58:13] really? [17:58:54] (03PS1) 10Marostegui: db1070: old socket location backdb1070: old socket location backdb1070: old socket location backdb1070: old socket location backdb1070: old socket location backdb1070: old socket location backdb1070: old socket location backdb1070: old socket location bac [puppet] - 10https://gerrit.wikimedia.org/r/391872 [17:58:57] what? XD [17:59:17] (03CR) 10jerkins-bot: [V: 04-1] db1070: old socket location backdb1070: old socket location backdb1070: old socket location backdb1070: old socket location backdb1070: old socket location backdb1070: old socket location backdb1070: old socket location backdb1070: old socket location bac [puppet] - 10https://gerrit.wikimedia.org/r/391872 (owner: 10Marostegui) [17:59:31] (03PS2) 10Marostegui: db1070: Old socket location back [puppet] - 10https://gerrit.wikimedia.org/r/391872 [17:59:36] jynus: ^ [17:59:49] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Scap_source[netbox/deploy] [18:00:04] gwicke, cscott, arlolra, subbu, halfak, and Amir1: (Dis)respected human, time to deploy Services – Graphoid / Parsoid / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171116T1800). Please do the needful. [18:00:04] No GERRIT patches in the queue for this window AFAICS. [18:00:04] (03CR) 10Jcrespo: [C: 032] db1070: Old socket location back [puppet] - 10https://gerrit.wikimedia.org/r/391872 (owner: 10Marostegui) [18:00:06] what on earth what commit message about ? [18:00:20] akosiaris: my vim did somethjing weird [18:00:29] jynus: merged, we should be good to enable puppet back [18:00:31] in db1070 [18:01:00] RECOVERY - MariaDB Slave Lag: s5 on db2089 is OK: OK slave_sql_lag Replication lag: 2.32 seconds [18:01:09] RECOVERY - MariaDB Slave Lag: s5 on db2038 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [18:01:15] RECOVERY - MariaDB Slave Lag: s5 on db1099 is OK: OK slave_sql_lag Replication lag: 0.34 seconds [18:01:15] RECOVERY - MariaDB Slave Lag: s5 on db2084 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [18:01:22] ok, I am going to set the master as read-write [18:01:35] RECOVERY - MariaDB Slave Lag: s5 on db1087 is OK: OK slave_sql_lag Replication lag: 0.11 seconds [18:01:35] RECOVERY - puppet last run on db1070 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [18:01:36] RECOVERY - MariaDB Slave Lag: s5 on db2081 is OK: OK slave_sql_lag Replication lag: 0.22 seconds [18:01:36] ah nice [18:01:39] RECOVERY - MariaDB Slave Lag: s5 on db2023 is OK: OK slave_sql_lag Replication lag: 0.02 seconds [18:01:42] jynus: ok! 
[18:01:49] RECOVERY - MariaDB Slave Lag: s5 on db2086 is OK: OK slave_sql_lag Replication lag: 0.20 seconds [18:01:50] RECOVERY - MariaDB Slave Lag: s5 on db2045 is OK: OK slave_sql_lag Replication lag: 0.35 seconds [18:01:50] RECOVERY - MariaDB Slave Lag: s5 on db2075 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [18:01:50] RECOVERY - MariaDB Slave Lag: s5 on db2085 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [18:01:59] RECOVERY - MariaDB Slave Lag: s5 on db2059 is OK: OK slave_sql_lag Replication lag: 0.30 seconds [18:02:00] RECOVERY - MariaDB Slave Lag: s5 on db2083 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [18:02:00] RECOVERY - MariaDB Slave Lag: s5 on db2080 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [18:02:09] RECOVERY - MariaDB Slave Lag: s5 on db2079 is OK: OK slave_sql_lag Replication lag: 0.31 seconds [18:02:09] RECOVERY - MariaDB Slave Lag: s5 on db2066 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [18:02:09] RECOVERY - MariaDB Slave Lag: s5 on db2082 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [18:02:14] let's monitor replication now [18:02:15] RECOVERY - MariaDB Slave Lag: s5 on db1104 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [18:02:30] why on earth is icinga still informing us of s5 is beyond me... [18:02:51] I 've scheduled extended downtime for anything that had s5 .... [18:03:24] Column 8 of table 'wikidatawiki.recentchanges' cannot be converted from type 'tinyint' to type 'bigint [18:03:29] No ORES deployment today. [18:03:58] is there any schema change ongoing? [18:04:00] awight: there would'nt be any anyway. got an outage currently [18:04:12] (03PS1) 10Marostegui: db-eqiad.php: Pool db1100 as vslow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391873 (https://phabricator.wikimedia.org/T180714) [18:04:15] damn icinga... this makes nosense "This service has been scheduled for fixed downtime from 2017-11-16 17:47:08 to 2017-11-16 19:47:08. 
Notifications for the service will not be sent out during that time period." [18:04:15] jynus: yes, that is part of the schema change from refactored comments [18:04:19] crap [18:04:19] ha [18:04:21] more fun [18:04:28] akosiaris: alright, “perfect” ;-) [18:04:30] which server crashed? [18:04:32] can we survive with only 4 servers? [18:04:40] db1063 crashed [18:04:50] db1071 was in the middle of a reboot [18:04:51] no, i mean replication [18:04:53] so it stopped [18:04:55] all [18:05:00] except 82 [18:05:04] 87 [18:05:07] and 95 [18:05:18] yeah, the ones that didn't get the schema change [18:05:34] so what is the way to move forward? [18:05:40] do we apply the schema change? [18:05:48] recentchanges is fast [18:05:50] do we reboot the master? [18:05:51] but logging table will fail [18:05:55] and it takes 8 hours to alter it [18:06:21] if we reboot the server, it has lost writes [18:06:29] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1948 bytes in 0.085 second response time [18:06:34] should we update that we're writable now? [18:06:47] (do we expect to stay that way at this point?) [18:06:47] bblack: not sure for how long we will be [18:07:06] we are not in a good state [18:07:12] ok [18:07:24] jynus: we can try one thing….serve queries from codfw [18:07:25] basically we are only with 3 out of 11 servers [18:07:29] no [18:07:46] let's do the schema change on the hosts that failed [18:07:53] on recentchanges [18:07:55] so change them back? [18:07:57] yes [18:08:15] https://gerrit.wikimedia.org/r/#/c/357892/14/maintenance/archives/patch-comment-table.sql [18:08:20] <_joe_> can I help in any way people? [18:08:48] let's compare the tables db1082 is not altered [18:08:58] which is the table that complains? 
[18:09:02] marostegui: i tried just replacing the battery but the error about memory/battery problems were detected persisted...i'm going to have to replace the entire card [18:09:07] ADD COLUMN rc_comment_id bigint ? [18:09:12] I should be able to import the configuration to the new card [18:09:25] jynus: recentchanges and logging will complain [18:09:40] Yeah, editing seems possible but not recent changes. [18:09:42] we cannot go back [18:10:02] the changes are already in ROW [18:10:10] we have to commit to that [18:10:42] the other option would be to rollback the edits that happened in the last 30 minutes [18:11:05] <_joe_> jynus: what's the first option? [18:11:19] <_joe_> to perform the schema change on the current master? [18:11:26] do the schema change that is going to take 8 hours and "maybe fail/complain" [18:11:27] ? [18:12:06] so there's a partially-applied schema change across this cluster? [18:12:14] yes, we have to alter the master [18:12:17] were the reboots related to the schema change, or just kernel stuff? [18:12:28] we didn't reboot anything [18:12:29] bblack: nope, nothing to do with the schema changes [18:12:43] the only option is to alter the master I am afraid [18:12:57] that doesn't fix the events retroactively [18:12:58] <_joe_> marostegui: won't that mean we will be mostly unable to write for those 8 hours? [18:13:15] we have to revert the schema changes on the slaves [18:13:21] I meant the earlier db1071 reboot [18:13:21] and catch up [18:13:25] jynus: the changes will go thru replication [18:13:29] no [18:13:32] <_joe_> jynus: that seems the safest option [18:13:39] that will not work [18:13:44] why not? 
[18:13:46] the changes are already on the binary log [18:13:50] with the current schema [18:13:55] <_joe_> reverting the schema change on the slaves seems the best option to me [18:13:56] altering the master doesn't change [18:14:03] past events, only new [18:14:08] ah, right [18:14:09] we could have had this conversation [18:14:10] yes [18:14:11] so our option is basically readonly for hours, or rollback wikidata/dewiki by 30 mins of changes from before the incident? [18:14:14] 30 minutes ago [18:14:16] no [18:14:20] we keep read-write [18:14:27] in degraded mode [18:14:37] and perform the schema change rollback on the other hosts [18:14:49] <_joe_> ok that seems the best option to me [18:14:54] degraded as in fewer read slaves and higher load than usual? [18:14:58] yes [18:15:02] yes [18:15:04] or whatever is happening now [18:15:05] <_joe_> go with that :) [18:15:06] +1 [18:15:11] how mad is wikidata right now? [18:15:13] *bad [18:15:21] <_joe_> define bad [18:15:28] laggy? [18:15:32] what is broken from the user point of view? [18:15:36] recentchanges? [18:15:37] maybe dropping the column will be faster? [18:15:38] replication is 9mins back [18:15:45] we maybe can do that [18:15:49] already told #wikipedia-de to hold their breath and that it might go back to readonly for a bit. they were all like "yay, it works again" [18:15:50] what? [18:16:05] replication is 9mins back ? [18:16:05] (Abandoned) Rush: Whitelist term_full_entity_id in wb_terms table [software/labsdb-auditor] - https://gerrit.wikimedia.org/r/357369 (https://phabricator.wikimedia.org/T167114) (owner: Ladsgroup) [18:16:22] so we have the master, db1070 [18:16:28] db1082 [18:16:29] jynus: yeah sorry, straight translation from greek. 
I mean we are lagging already 9 mins [18:16:32] (03CR) 10Rush: [C: 04-1] "just to show we are intentionally holding off" [puppet] - 10https://gerrit.wikimedia.org/r/390431 (https://phabricator.wikimedia.org/T180254) (owner: 10Arturo Borrero Gonzalez) [18:16:33] db1087 [18:16:40] db1095 up [18:16:47] that should be enough to keep wikidata up [18:16:56] with api probably failing from time to time [18:17:11] then we revert the schema changes [18:17:25] yeah, let's check what to revert [18:17:34] logging in and browsing around reading wikidata items, things seem sane, but I'm not really a wikidata power user or anything [18:17:35] and as fast as they come up we pool them [18:17:43] do we have any wikidata folks in here right now? [18:17:48] o/ [18:17:50] hoo: around? [18:17:54] Or dev folks? [18:17:54] I think that is better than 8 hours of read only [18:17:59] apergos: yes [18:18:08] what's up? [18:18:16] marostegui: let's pool db1095 [18:18:19] (08:17:43 μμ) bblack: do we have any wikidata folks in here right now? [18:18:21] sjoerddebruin: either that can give us a realstic estimate of current damage to wikidata user experience [18:18:23] sounds good [18:18:27] and try to fix the configuration with the 4 hosts we have [18:18:29] he was asking so I pinged.... [18:18:31] and then we revert [18:18:32] let me check the reverts [18:18:46] we can do comments on one [18:18:50] and see how bad it is [18:19:05] s/comments/recentchanges/ [18:19:15] recentchagnes and logging will fail for sure [18:19:22] i had this issue on db1102 some days ago [18:19:23] we can do those [18:19:26] so I am looking for the logs [18:19:29] to see what failed exactly [18:19:32] Editing is currently giving timeouts. [18:19:54] editing doesn't work? [18:19:56] strange [18:20:18] item editing, that mostly goes through the API I think? [18:20:27] <_joe_> it doesn't work *at all* or just sometimes? 
[18:20:38] on wikidata, my watchlist tells me "Due to high database server lag, changes newer than 1,085 seconds may not be shown in this list." [18:20:48] ok [18:20:55] if editing doesn't work [18:21:01] that is a different option [18:21:08] I'm kind of assuming dewiki is similarly impacted as well and we're just talking about wikidata as 1/2 problem sites/projects [18:21:09] we can failover to another host [18:21:22] jynus: from the logs: recentchanges and logging only [18:21:23] that has the schema change and lose edits [18:21:28] Editing non-mainspace on Wikidata seems to work. [18:21:37] why? [18:22:11] <_joe_> should we remove the servers that fail to replicate from mediawiki-config? [18:22:15] <_joe_> or is that already done? [18:22:19] nope, not done [18:22:28] so maybe that is the issue? [18:22:33] <_joe_> that is the issue imho [18:22:42] <_joe_> we have waitforslaves() in a ton of places [18:22:52] error log currently being flooded with things waiting for rep [18:22:58] the thing is- now it is the moment to think [18:23:02] <_joe_> that's what I was saying [18:23:03] that probably kills editing [18:23:18] if we go with the servers that failed or the ones that didn't [18:23:23] The edit seems to save, but the UI gives a timeout error. [18:23:25] and revert everything in the last hour [18:23:30] When you retry, you get an edit conflict. [18:23:41] That is really weird. 
[18:23:54] I sure hope it won't publish 3 times the message I was trying to post at de.wp :( [18:24:00] jynus: i would go for the ones that didn't, and revert the others [18:24:04] <_joe_> Elitre: it's possible [18:24:09] RECOVERY - High lag on wdqs2002 is OK: OK: Less than 30.00% above the threshold [600.0] [18:24:09] 10Operations, 10ops-codfw: mw2251 hardware error - https://phabricator.wikimedia.org/T180724#3767521 (10akosiaris) [18:24:10] RECOVERY - High lag on wdqs2003 is OK: OK: Less than 30.00% above the threshold [600.0] [18:24:10] RECOVERY - High lag on wdqs1004 is OK: OK: Less than 30.00% above the threshold [600.0] [18:24:11] RECOVERY - High lag on wdqs1005 is OK: OK: Less than 30.00% above the threshold [600.0] [18:24:22] -_- [18:24:25] so ok to drop the edits in the last hour? [18:24:46] <_joe_> jynus: I don't think so, but if we have no other option... [18:24:47] !log akosiaris@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=mw2251.codfw.wmnet [18:24:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:02] !log set mw2251 to inactive. T180724 [18:25:07] <_joe_> I'd start by setting s5 read-only now? [18:25:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:08] T180724: mw2251 hardware error - https://phabricator.wikimedia.org/T180724 [18:25:17] _joe_: agreed [18:25:18] <_joe_> but tbh, I'd depool the servers that can't replicate [18:25:19] Number of connections on the appservers seems rising [18:25:30] so ro sounds good [18:25:30] <_joe_> and see if we can live with what we have now [18:25:36] so we pool db1092 instead? [18:25:42] Not sure how to check the actual number of edits going on on Wikidata in the last hour. 
[18:26:03] https://grafana.wikimedia.org/dashboard/db/wikidata-edits?refresh=1m&orgId=1 [18:26:09] RECOVERY - High lag on wdqs1003 is OK: OK: Less than 30.00% above the threshold [600.0] [18:26:10] RECOVERY - High lag on wdqs2001 is OK: OK: Less than 30.00% above the threshold [600.0] [18:26:24] sjoerddebruin: you can see if in grafana [18:26:27] ok, I take the decision [18:26:31] based on recentchanges [18:26:36] let's do what manuel says [18:26:40] let's failover again [18:26:42] Yeah, but recentchanges isn't updating addshore [18:26:48] to the servers with the schema change [18:26:49] <_joe_> hey [18:26:52] sjoerddebruin: then that will not be accurate [18:27:00] or we can check schema changes on the database [18:27:07] <_joe_> I think we need to get some permission before we rollback one hour of edits [18:27:14] I have set s5 master db1070 in read only to start with [18:27:18] we do not lose them, _joe_ [18:27:25] let's check dewiki, too [18:27:37] <_joe_> we have to re-insert them somehow later? [18:27:40] yes [18:27:44] <_joe_> ok [18:27:44] I don't think we should failover to the server with the schema change, I said we should revert the schema change [18:28:05] what? [18:28:15] applying a schema change takes 8 hours, you said [18:28:29] yeah, but I am worried about re-inserting the edits back [18:28:35] will that work as expected? [18:28:59] the edits are ongoing [18:29:01] from a mediawiki consistency point of view I mean [18:29:05] ok, I retract myself [18:29:09] and how much time will it take to happen ? if it's going to take another 8 hours... damn [18:29:10] so let's keep the current state [18:29:10] <_joe_> I would prefer us to try to exclude from mediawiki-config the servers with the schema change, see if we can hold the fort that way, and revert the schema change on those machines. 
But I trust your judgement if you think it would not work [18:29:18] and do what joe says [18:29:21] that will fix recentchanges [18:29:25] 10Operations, 10ops-codfw: mw2251 hardware error - https://phabricator.wikimedia.org/T180724#3767548 (10RobH) @papaul: This system has been depooled (by Alex) so it can be powered down (via os commands) at any time for you to troubleshoot. Please reboot it into the Dell ePSA tests (they are built into the sy... [18:30:02] jynus: we can also try to avoid INPLACE and do locking, that might be faster for the alter table [18:30:18] mmm [18:30:25] there are only 555 edits in the last hour [18:30:32] <_joe_> uhm [18:30:41] 456 on dewiki [18:30:46] we can literally reinsert them all [18:30:58] remember i set db1070 to read_only a few minutes ago [18:31:03] till we decide what to do [18:31:09] <_joe_> marostegui: yeah that was a sensible thing to do [18:31:19] <_joe_> maybe set s5 to read-only in mediawiki-config is gentler [18:31:30] shall I try to revert db1104 for instance? [18:31:31] mediawiki detects it [18:31:47] the schema change is irrelevant now [18:32:03] either we go with one group of servers or the other [18:32:39] I do not know what to do, but we cannot keep it as now [18:32:55] so, what is going to be faster to take us to a normal state [18:33:08] which groups of servers and what actions to follow up with [18:33:09] and avoid losing data [18:33:35] keeping the ones that are now is easier and doesn't lose data [18:33:36] jynus: the 555 edits you are talking about, could they be easily replayed on 1 of the "other" nodes ? 
[18:33:56] yes, but not right now [18:34:00] jynus: we can shutdown a host that is good now and start cloning others [18:34:01] assuming those "other" nodes actually provide us with some benefit [18:34:03] <_joe_> so my advice would be: if we want to failover again, we need to stay read only until we've reinserted the edits [18:34:24] marostegui: start the schema change reverting on one of the failed nodes [18:34:35] jynus: ok, will take db1104 [18:34:36] start on recentchanges [18:34:44] depending on how it goes, we will choose one or another [18:34:46] as soon as an insert happens on logging it will fail too [18:34:53] <_joe_> jynus: seems reasonable [18:35:28] ok, doing it [18:36:24] wikidatawiki.recentchanges is now being done, as it is the one that first failed, it is only 8GB, so it shouldn't take too long [18:36:38] I will prepare the depool while in read only [18:36:44] how long is not long (ballpark)? [18:37:02] 10 minutes or so I guess [18:37:22] <_joe_> apergos: let's not give ETAs right now [18:37:50] <_joe_> we're in a delicate situation, and data integrity is more important than time to finish the outage right now imho [18:37:52] not giving them elsewhere, this is for me to understand what's going on [18:39:46] * volans|off around still reading backlog [18:39:54] should we !log the switch back to readonly? 
people are reading the Twitter feed [18:39:59] half way thru the alter [18:40:15] (03PS1) 10Jcrespo: mariadb: Pool only the servers with working replication [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391876 (https://phabricator.wikimedia.org/T180714) [18:40:33] once it is done, i will start replication and it should fail on another table, either dewiki.logging or dewiki.recentchanges or wikidata.logging but not wikidata.recentchanges again [18:40:37] <_joe_> mutante: yeah it should've been logged already, if it's not, let's [18:40:40] sure [18:40:46] but I want to test it [18:40:53] i didn't log it, i forgot [18:41:00] !log Set s5 master read_only [18:41:05] I am deploying https://gerrit.wikimedia.org/r/#/c/391876/ [18:41:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:30] (03CR) 10Marostegui: [C: 031] mariadb: Pool only the servers with working replication [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391876 (https://phabricator.wikimedia.org/T180714) (owner: 10Jcrespo) [18:41:31] !log dewikipedia and wikidata currently back to readonly-mode while wikidata is being worked on [18:41:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:05] (03CR) 10Jcrespo: [V: 032 C: 032] mariadb: Pool only the servers with working replication [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391876 (https://phabricator.wikimedia.org/T180714) (owner: 10Jcrespo) [18:42:17] <_joe_> jynus: let's see if they hold the load [18:42:20] (03CR) 10jenkins-bot: mariadb: Pool only the servers with working replication [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391876 (https://phabricator.wikimedia.org/T180714) (owner: 10Jcrespo) [18:42:29] _joe_: that was my fear [18:42:29] it will be a good stress test for that HW [18:42:38] and why potentially the other option [18:43:00] jynus: as you mentioned, db1095 might be able to help with some small load too [18:43:17] <_joe_> jynus: if we don't 
handle the load, we can go the other route [18:43:21] * marostegui would have loved to see a why to disable services: ie: disable vslow [18:43:30] db1095 is sanitarim [18:43:36] ah yes [18:43:46] we can pool db1070 when we fix it [18:43:55] but it would fail like the others [18:43:58] <_joe_> db1071 I guess? [18:43:58] we can pool the master [18:44:02] yes, sorry [18:44:03] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Pool only 'good' servers (duration: 00m 48s) [18:44:05] alter done [18:44:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:24] but without partitioned slaves [18:44:25] and replication advanced till wikidatawiki.logging [18:44:28] it will be hard [18:44:31] and that is 300G table [18:44:35] the one that takes 8 hours [18:45:08] recentchanges works now [18:45:24] should I start reverting all the tables across all the hosts that failed? [18:45:30] so either we reeenable read write as is [18:45:38] and do what you just said [18:45:46] or we failover, now is the moment to decide [18:45:55] PROBLEM - cassandra-c service on restbase2006 is CRITICAL: NRPE: Command check_cassandra-c-state not defined [18:46:30] I would not failover, continue on degraded state and start reverting all the tables at once and leave a start slave at the end, so as soon as they are done, they will start catching up [18:46:37] but that is my opinion [18:46:39] anyone else? [18:46:44] ok, let's do that [18:46:49] let's reenable read-write [18:46:53] restbase2006 is me - freshly reimaged and I've silenced it [18:46:55] and see that edits work [18:46:58] if we have an issue with load, what's the consequence for service? [18:47:07] given we may have that issue for 8 hours [18:47:19] Anyone able to post an update on @Wikidata on Twitter? 
[18:47:30] jynus: i will set read_only OFF [18:47:33] <_joe_> pigsonthewing: not anyone on this channel [18:47:44] right now it is 15516 QPS [18:47:49] that is more than doable [18:48:02] !log dewikipedia and wikidata currently back to writable [18:48:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:33] edits coming back on rcs [18:48:42] api will likely be failing a lot [18:48:45] PROBLEM - Host mw2251 is DOWN: PING CRITICAL - Packet loss = 100% [18:49:02] things like "give me contributions of this user between feb and march" [18:49:09] dewiki write works well [18:49:14] @ _joe_ OK, thanks. [18:49:36] i am preparing the revert all at once, will share it here for a second review [18:49:37] probably this was the worst time for the failover to happen [18:49:50] of course... murphy strikes again [18:50:00] because we had lots of things in the air for s8 [18:50:14] lots of new servers, ongoing schema changes [18:50:28] and the best server to failover to [18:50:34] was being rebooted [18:50:42] Editing on Wikidata seems to be working again, still slowly or "failing" due to most actions going through the API. 
[18:51:04] pigsonthewing: that !log line above goes to Twitter though [18:51:05] fwiw I prefer api issues rather than the possibility of losing the hour's worth of edits, even if maybe the inserts would go ok, maybe they wouldn't, unforseen complication etc [18:51:08] yeah we should probably look at all that context, after we're done getting back to truly-normal [18:51:09] sjoerddebruin: This is not using the API servers mentioned above [18:51:20] so editing through API should actually work as well as editing via EditPage [18:51:34] So: set session sql_log_bin=0; alter table dewiki.recentchanges drop column rc_comment_id; alter table dewiki.logging drop column log_comment_id; alter table wikidatawiki.logging drop column log_comment_id ; alter table wikidatawiki.recentchanges drop column rc_comment_id ; start slave ; [18:51:44] That works on my local environment without any syntax error or anything [18:52:15] hoo: why is it throwing timeouts then? [18:52:24] yeah, that looks ok [18:52:52] do we still need to remove some servers from mediawiki-config as well, to stop the slave-checking stuff at that level? [18:53:16] err no, I think I see the commit above that already did that [18:53:22] ok, to be run at: db1092 db1096 db1099 db1100 db1104 db1106 [18:53:29] I will levae codfw master for tomorrow [18:53:42] sjoerddebruin: i didnt mean to remove your comment on meta, i added my own but that part was by accident [18:53:55] :( [18:54:16] BTW, what is the deal with db1063? [18:54:30] did it come back? [18:54:35] jynus: chris replaced the raid card, it's back up, network unplugged [18:54:54] sjoerddebruin: fixed..i hope [18:56:31] (also don't forget I disabled the switch port too, we'll have to re-enable that when we plug net back in) [18:56:58] marostegui probably you can run those in parallel so it goes faster? 
[18:57:53] jynus: i tried that on some hosts that were already done and it wasn't much faster :( [18:58:00] when I was deploying the change I mean [18:58:09] the tables apart from wikidata logging, are pretty small [18:58:17] ok [18:58:32] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Rack and setup db1109 and db1110 - https://phabricator.wikimedia.org/T180700#3767641 (10Cmjohnson) [18:59:56] (03PS1) 10Cmjohnson: Adding mgmt dns for db1109 and 1110 T180700 [dns] - 10https://gerrit.wikimedia.org/r/391879 [18:59:58] so current status: we didn't have to lose edits, dewiki+wikidata are editable, we're perf-degraded while working on the rest of the recovery, but expecting to pull through? [19:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Your horoscope predicts another unfortunate Morning SWAT (Max 8 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171116T1900). [19:00:04] No GERRIT patches in the queue for this window AFAICS. [19:00:05] ok, so it is now running on all the broken servers [19:00:10] is that roughly-accurate? [19:00:31] bblack: yep [19:00:45] (03CR) 10Cmjohnson: [C: 032] Adding mgmt dns for db1109 and 1110 T180700 [dns] - 10https://gerrit.wikimedia.org/r/391879 (owner: 10Cmjohnson) [19:00:57] what do we think about deployment windows? proceed, not? [19:01:07] apergos: i would say no [19:01:10] I'd say we hold everything till this is sorted out [19:01:16] +1 for holding [19:01:18] marostegui: also db1071 ? [19:01:22] let's notify folks [19:01:26] yep [19:01:31] jynus: is it up already? [19:01:36] it is down [19:01:45] no_justification: see above^^, can you ping your folks and tell em hold it all? 
[19:01:50] but delayed even before the crash [19:02:02] also it has you listed as rainbowsprinkles :-P [19:02:05] I can put it up [19:02:06] It's my window :p [19:02:10] jynus: let's start mysql without replication and do the revert then [19:02:12] but I remember you saying [19:02:18] exactly [19:02:33] There's nothing on the SWAT window, and I'm holding the train [19:02:38] great [19:03:10] jynus: i will do it [19:03:21] I am starting it [19:03:25] started now [19:03:39] !log demon@tin Locking from deployment [ALL REPOSITORIES]: Dealing with outage, no deploys for now (planned duration: 60m 00s) [19:03:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:53] jynus: we did it at the same time, we have now 2 processes :) [19:03:58] what? [19:04:12] wow [19:04:21] nice race condition [19:04:22] impossible [19:04:40] transactions ftw :P [19:04:46] <_joe_> yeah [19:04:49] it cannot bind to the same socket [19:04:50] wut [19:04:50] port [19:04:52] etc [19:04:58] root@db1071:~# netstat -putan | grep 3306 [19:04:58] tcp6 0 0 :::3306 :::* LISTEN 26845/mysqld [19:05:01] tcp6 0 0 10.64.48.26:3306 10.64.0.15:56063 ESTABLISHED 26845/mysqld [19:05:02] !log demon@tin Unlocked for deployment [ALL REPOSITORIES]: Dealing with outage, no deploys for now (duration: 01m 22s) [19:05:04] tcp6 0 0 10.64.48.26:3306 10.64.0.15:56209 ESTABLISHED 26845/mysqld [19:05:07] XD [19:05:07] bug [19:05:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:09] wtf [19:05:10] did we just find a big? [19:05:59] !log demon@tin Locking from deployment [ALL REPOSITORIES]: No deploys, recovering from downtime (planned duration: 360m 00s) [19:06:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:31] 71116 19:05:30 [ERROR] mysqld: Can't lock aria control file '/srv/sqldata/aria_log_control' for exclusive use, error: 11. 
Will retry for 30 seconds Killed [19:06:35] there is only one [19:06:39] the other is fake [19:06:51] actually it no longers appear on the processlist [19:06:58] I locked deployments for the next ~6h to prevent accidental mistakes by people. If anyone needs to deploy, lemme know and I'll unlock it. [19:06:59] but it does on netstat [19:06:59] PID 27215 seemd the one failing to me [19:07:04] when I looked at it [19:07:17] it did ok [19:07:21] I saw 26845 and 27215 [19:07:25] it just retried for a while [19:07:29] but it did not start start [19:07:44] going to start the revert there then [19:09:46] db1082 is doing 12qps [19:09:59] which is more than ok [19:10:09] and db1087 16k [19:10:09] (03PS1) 10Cmjohnson: Adding production dns for db1109 and db1110 [dns] - 10https://gerrit.wikimedia.org/r/391882 [19:10:42] should we try to get db1063 back? [19:10:56] it needs to be cloned anyways, but it would be good to know what happened [19:11:34] the problem is those api servers [19:11:39] let me see the error rate [19:12:26] cmjohnson1: what's the status of db1063? is it off now? [19:12:31] quite large, although on cebwiki [19:12:35] db1063 is on but network cable is unplugged [19:12:45] and switch port closed [19:13:10] (and the faulty raid card was replaced) [19:13:20] ah nice [19:13:24] I can loging via idrac now [19:13:36] !log reset slave all on db1070 @ db1063-bin.001382:234464548 [19:13:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:36] RECOVERY - MariaDB Slave IO: s5 on db1070 is OK: OK slave_io_state not a slave [19:14:41] mysql is down, so it should be fine to replug the cable and enable the switch port back I would say, cmjohnson1 and bblack [19:14:56] okay, I will do both now [19:14:59] no_justification: Any chance to either revert group1 Wikipedias back or to deploy a JS related hot fix? [19:15:01] thanks! 
[19:16:53] hoo: Consensus was no deploys right now [19:17:41] no_justification: Ok, so either tomorrow or sit it out over the weekend [19:17:56] We'll have to play it by ear, yeah [19:22:20] RECOVERY - SSH on db1063 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0) [19:22:30] RECOVERY - Host db1063 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms [19:22:58] marostegui: can i take it down more time [19:23:05] the console2 option was not working [19:23:14] cmjohnson1: yes [19:23:16] all yours [19:23:30] I did: chmod -x /etc/init.d/mysql just in case [19:23:35] to avoid any stuff [19:24:35] (PS2) Ayounsi: Netbox deployment followup [puppet] - https://gerrit.wikimedia.org/r/391871 [19:25:31] I wonder if chmod -x on a legacy initscript even does what we think it does with systemd in control [19:26:04] these probably aren't jessie though I guess [19:26:22] actually, it is jessie [19:26:24] so good one [19:26:55] but we are not using systemd on those [19:27:18] we disabled the backwards compatibility on jessie, too [19:27:30] so it is not a systemd but really runs init.d [19:27:41] ah! [19:27:44] we embraced fully systemd from jessie [19:27:52] as we needed 10.1 compatibility [19:27:57] *stretch [19:28:30] well, I think 10.1 on jessie also uses it, but we do not have any of those [19:28:46] marostegui all yours..thx [19:28:51] cmjohnson1: thanks! [19:29:14] jynus: we should revert the change on codfw master too [19:29:30] i am thinking about actually doing it now, no need to wait for ir [19:29:32] it [19:29:46] shouldn't we do it on all s5 servers ? [19:29:53] (CR) Cmjohnson: [C: 2] Adding production dns for db1109 and db1110 [dns] - https://gerrit.wikimedia.org/r/391882 (owner: Cmjohnson) [19:30:28] all that were applied, I mean [19:30:50] jynus: yeah yeah, sorry, for codfw master I meant with replication enabled, sorry I wasn't clear [19:32:00] PROBLEM - puppet last run on eventlog2001 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [19:32:30] RECOVERY - SSH on labtestservices2003 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8 (protocol 2.0) [19:33:18] ah ok [19:33:57] I will do it tomorrow probably [19:38:11] RECOVERY - cassandra-c service on restbase2006 is OK: OK - cassandra-c is active [19:40:00] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [19:41:43] hi there [19:42:04] Is there a sysadmin who can check the status of 2 global renames, it seems they are stuck [19:42:20] PROBLEM - mysqld processes on db1063 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [19:42:24] PROBLEM - HHVM rendering on mw2123 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:42:25] ah right... [19:42:31] I will silence it [19:42:40] Operations, Trending-Service, Reading-Infrastructure-Team-Backlog (Kanban), Services (designing): Turn off Trending Service - https://phabricator.wikimedia.org/T180384#3767834 (Jdlrobson) @Pchelolo asked me a few questions > are you up for being a maintainer of it? I am, although one of the big...
[19:43:15] RECOVERY - HHVM rendering on mw2123 is OK: HTTP OK: HTTP/1.1 200 OK - 74076 bytes in 0.305 second response time [19:45:27] (PS1) Ayounsi: Add CNAME for netbox.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/391887 [19:46:40] !log Bootstrapping Cassandra restbase2006-a (T179422) [19:46:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:46] T179422: Reshape RESTBase Cassandra clusters - https://phabricator.wikimedia.org/T179422 [19:47:19] PROBLEM - MariaDB Slave Lag: s5 on db1104 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 6301.56 seconds [19:47:20] PROBLEM - MariaDB Slave Lag: s5 on db2066 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 6303.65 seconds [19:47:20] PROBLEM - MariaDB Slave Lag: s5 on db2083 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 6304.74 seconds [19:47:20] (CR) Ayounsi: [C: 2] Add CNAME for netbox.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/391887 (owner: Ayounsi) [19:47:26] PROBLEM - MariaDB Slave Lag: s5 on db1106 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 6305.23 seconds [19:47:26] PROBLEM - MariaDB Slave SQL: s5 on db2023 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1677, Errmsg: Column 8 of table wikidatawiki.recentchanges cannot be converted from type tinyint to type bigint(20) unsigned [19:47:26] PROBLEM - MariaDB Slave Lag: s5 on db2052 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 6309.28 seconds [19:47:27] PROBLEM - MariaDB Slave Lag: s5 on db2080 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 6309.34 seconds [19:47:36] PROBLEM - MariaDB Slave Lag: s5 on db2038 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 6326.70 seconds [19:47:37] PROBLEM - MariaDB Slave Lag: s5 on db2082 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 6327.58 seconds [19:47:37] PROBLEM - MariaDB Slave Lag: s5 on db2084 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 6330.10 seconds [19:47:37] PROBLEM - MariaDB
Slave Lag: s5 on db2089 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 6330.12 seconds [19:47:40] they are getting enabled back? [19:47:43] PROBLEM - MariaDB Slave SQL: s5 on db1104 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1677, Errmsg: Column 10 of table wikidatawiki.logging cannot be converted from type tinyblob to type bigint(20) unsigned [19:47:52] yeah the 2 hours probably passed [19:47:56] PROBLEM - MariaDB Slave Lag: s5 on db2081 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 6342.34 seconds [19:48:02] PROBLEM - MariaDB Slave SQL: s5 on db1099 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1677, Errmsg: Column 8 of table wikidatawiki.recentchanges cannot be converted from type tinyint to type bigint(20) unsigned [19:48:02] PROBLEM - MariaDB Slave Lag: s5 on db2023 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 6348.21 seconds [19:48:11] I 'll reschedule them for .... what ? 20 hours ? [19:48:16] 12 hours ? [19:48:16] PROBLEM - MariaDB Slave Lag: s5 on db2086 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 6364.34 seconds [19:48:17] PROBLEM - MariaDB Slave Lag: s5 on db2079 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 6367.51 seconds [19:48:17] PROBLEM - MariaDB Slave Lag: s5 on db2045 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 6368.41 seconds [19:48:22] yeah, let's give them 20h [19:48:24] PROBLEM - MariaDB Slave SQL: s5 on db1106 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1677, Errmsg: Column 8 of table wikidatawiki.recentchanges cannot be converted from type tinyint to type bigint(20) unsigned [19:48:24] PROBLEM - MariaDB Slave Lag: s5 on db2075 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 6369.35 seconds [19:48:28] PROBLEM - MariaDB Slave Lag: s5 on db1099 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 6369.91 seconds [19:48:38] PROBLEM - MariaDB Slave Lag: s5 on db2085 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 6386.52 
seconds [19:48:38] PROBLEM - MariaDB Slave Lag: s5 on db2059 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 6387.88 seconds [19:49:02] Operations, Phabricator, Traffic, Zero: Missing IP addresses for Maroc Telecom - https://phabricator.wikimedia.org/T174342#3767879 (Aklapper) Invalid>Open My point was that Checkuser info isn't the only source as I've posted IP ranges in T174342#3737108 and T174342#3766633 which makes thi... [19:49:37] !log schedule extra downtime for s5 slaves [19:49:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:39] RECOVERY - puppet last run on netmon1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:54:39] !log ayounsi@tin Started deploy [netbox/deploy@19f4f65]: (no justification provided) [19:54:43] !log ayounsi@tin Finished deploy [netbox/deploy@19f4f65]: (no justification provided) (duration: 00m 04s) [19:54:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:27] !log demon@tin Unlocked for deployment [ALL REPOSITORIES]: No deploys, recovering from downtime (duration: 50m 28s) [19:56:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:05] no_justification: How many deployers does it take to do MediaWiki train deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171116T2000). [20:00:05] No GERRIT patches in the queue for this window AFAICS.
[20:01:33] (PS1) Jcrespo: mariadb: Depool full all delayed servers [mediawiki-config] - https://gerrit.wikimedia.org/r/391889 (https://phabricator.wikimedia.org/T180714) [20:01:59] RECOVERY - puppet last run on eventlog2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [20:02:46] (PS2) Jcrespo: mariadb: Depool fullyw all delayed servers [mediawiki-config] - https://gerrit.wikimedia.org/r/391889 (https://phabricator.wikimedia.org/T180714) [20:02:47] !log deploy a9613b44b70cae79aefdd3422bba627c6516e1a0 to hotfix T180706 [20:02:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:53] T180706: Phabricator search hugely degraded in quality - https://phabricator.wikimedia.org/T180706 [20:03:18] (PS3) Jcrespo: mariadb: Depool fully all delayed servers [mediawiki-config] - https://gerrit.wikimedia.org/r/391889 (https://phabricator.wikimedia.org/T180714) [20:03:22] !log demon@tin Locking from deployment [operations/mediawiki-config]: No deploys, recovering from downtime (more narrow locking) (planned duration: 360m 00s) [20:03:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:01] (CR) Volans: [C: 1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/391889 (https://phabricator.wikimedia.org/T180714) (owner: Jcrespo) [20:08:40] (CR) Jcrespo: [C: 2] mariadb: Depool fully all delayed servers [mediawiki-config] - https://gerrit.wikimedia.org/r/391889 (https://phabricator.wikimedia.org/T180714) (owner: Jcrespo) [20:08:56] (CR) jenkins-bot: mariadb: Depool fully all delayed servers [mediawiki-config] - https://gerrit.wikimedia.org/r/391889 (https://phabricator.wikimedia.org/T180714) (owner: Jcrespo) [20:10:12] !log demon@tin Unlocked for deployment [operations/mediawiki-config]: No deploys, recovering from downtime (more narrow locking) (duration: 06m 49s) [20:10:17] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:17] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Pool only 'good' servers - second try (duration: 00m 49s) [20:11:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:38] (PS1) Madhuvishy: public_dumps: Define directory for xmldatadumps [puppet] - https://gerrit.wikimedia.org/r/391892 (https://phabricator.wikimedia.org/T171541) [20:14:12] (CR) Hashar: "recheck" [dumps/dcat] - https://gerrit.wikimedia.org/r/390994 (https://phabricator.wikimedia.org/T180328) (owner: Hashar) [20:14:43] (CR) Madhuvishy: [C: 2] public_dumps: Define directory for xmldatadumps [puppet] - https://gerrit.wikimedia.org/r/391892 (https://phabricator.wikimedia.org/T171541) (owner: Madhuvishy) [20:15:49] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Pool only 'good' servers - third try (duration: 00m 49s) [20:15:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:43] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1953 bytes in 0.107 second response time [20:25:18] (Merged) jenkins-bot: Add linters for i18n json files [dumps/dcat] - https://gerrit.wikimedia.org/r/390994 (https://phabricator.wikimedia.org/T180328) (owner: Hashar) [20:25:35] (PS1) Ayounsi: Netbox - fix static path [puppet] - https://gerrit.wikimedia.org/r/391893 [20:25:44] (CR) Hashar: "There is some delay in the merge because I screwed up permissions in Gerrit."
[dumps/dcat] - https://gerrit.wikimedia.org/r/390994 (https://phabricator.wikimedia.org/T180328) (owner: Hashar) [20:26:01] (PS4) ArielGlenn: enable dump rsyncs to/from labstore1006 [puppet] - https://gerrit.wikimedia.org/r/391263 (https://phabricator.wikimedia.org/T171541) [20:26:18] (CR) Ayounsi: [C: 2] Netbox - fix static path [puppet] - https://gerrit.wikimedia.org/r/391893 (owner: Ayounsi) [20:27:23] RECOVERY - puppet last run on naos is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [20:27:58] Editing on Wikidata seems to be back to normal. [20:28:16] (CR) Hashar: "recheck" [dumps/dcat] - https://gerrit.wikimedia.org/r/390999 (https://phabricator.wikimedia.org/T180328) (owner: Hashar) [20:39:02] (PS5) ArielGlenn: enable dump rsyncs to/from labstore1006 [puppet] - https://gerrit.wikimedia.org/r/391263 (https://phabricator.wikimedia.org/T171541) [20:40:03] (CR) ArielGlenn: [C: 2] enable dump rsyncs to/from labstore1006 [puppet] - https://gerrit.wikimedia.org/r/391263 (https://phabricator.wikimedia.org/T171541) (owner: ArielGlenn) [20:46:29] (PS1) ArielGlenn: add labstore1006 to list of hosts for rolling rsync of xml/sql dumps [puppet] - https://gerrit.wikimedia.org/r/391905 (https://phabricator.wikimedia.org/T171541) [20:51:35] (PS1) Madhuvishy: Revert "nfsmount: Add temporary exception to the block-for-export check" [puppet] - https://gerrit.wikimedia.org/r/391906 [20:51:42] (PS2) Madhuvishy: Revert "nfsmount: Add temporary exception to the block-for-export check" [puppet] - https://gerrit.wikimedia.org/r/391906 [20:51:45] !log one more catchup rsync from ms1001 to labstore1006 kicking off [20:51:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:43] PROBLEM - puppet last run on labstore1007 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [20:57:54] going to check that just in case [20:58:41] me, fixing [21:00:54] ACKNOWLEDGEMENT - Host mw2251 is DOWN: PING CRITICAL - Packet loss = 100% Volans Hardware issue: https://phabricator.wikimedia.org/T180724 [21:01:11] (PS1) ArielGlenn: add rsync settings for labstore1007, fallback web/rsyncer [puppet] - https://gerrit.wikimedia.org/r/391909 [21:01:34] Operations, ops-codfw: mw2251 hardware error - https://phabricator.wikimedia.org/T180724#3767521 (Volans) I've ack'ed the Icinga host down alarm with a link to this task [21:01:53] (CR) ArielGlenn: [C: 2] add rsync settings for labstore1007, fallback web/rsyncer [puppet] - https://gerrit.wikimedia.org/r/391909 (owner: ArielGlenn) [21:02:45] Operations, Release Pipeline, Release-Engineering-Team (Watching / External): Update Debian package for Blubber - https://phabricator.wikimedia.org/T179984#3768097 (dduvall) [21:06:43] RECOVERY - puppet last run on labstore1007 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [21:44:49] Operations, ops-codfw: mw2251 hardware error - https://phabricator.wikimedia.org/T180724#3768152 (Papaul) Step 1; login to the IDRAC to check log files, log file is showing some memory error on DIMM_A1 " Correctable memory error rate exceeded for DIMM_A1" {F10833214} step 2: 1st ePSA test came out for e... [21:53:24] PROBLEM - HHVM rendering on mw2120 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:54:14] RECOVERY - HHVM rendering on mw2120 is OK: HTTP OK: HTTP/1.1 200 OK - 74042 bytes in 0.311 second response time [22:00:25] Operations, Ops-Access-Requests, Patch-For-Review, Reading-Infrastructure-Team-Backlog (Kanban): All Reading Infrastructure engineers should have deploy rights for all services Readers engineering maintains - https://phabricator.wikimedia.org/T180366#3755301 (RobH) So while this doesn't have a su...
[22:14:30] (PS3) Madhuvishy: Revert "nfsmount: Add temporary exception to the block-for-export check" [puppet] - https://gerrit.wikimedia.org/r/391906 [22:15:37] (CR) Madhuvishy: [C: 2] Revert "nfsmount: Add temporary exception to the block-for-export check" [puppet] - https://gerrit.wikimedia.org/r/391906 (owner: Madhuvishy) [22:40:11] (PS1) Rush: phab: remove obsolete portions of email handler [puppet] - https://gerrit.wikimedia.org/r/391969 [22:40:47] (CR) jerkins-bot: [V: -1] phab: remove obsolete portions of email handler [puppet] - https://gerrit.wikimedia.org/r/391969 (owner: Rush) [22:41:58] (PS1) Madhuvishy: nfs_mount: Remove showmount based blocking check for nfs shares [puppet] - https://gerrit.wikimedia.org/r/391970 (https://phabricator.wikimedia.org/T171508) [22:49:08] Operations, Ops-Access-Requests, Patch-For-Review, Reading-Infrastructure-Team-Backlog (Kanban): All Reading Infrastructure engineers should have deploy rights for all services Readers engineering maintains - https://phabricator.wikimedia.org/T180366#3768421 (Tgr) Thanks for setting this up @Mhol...
[22:49:44] (PS2) Rush: phab: remove obsolete portions of email handler [puppet] - https://gerrit.wikimedia.org/r/391969 [22:50:10] (CR) jerkins-bot: [V: -1] phab: remove obsolete portions of email handler [puppet] - https://gerrit.wikimedia.org/r/391969 (owner: Rush) [23:10:45] (CR) Madhuvishy: [C: 2] nfs_mount: Remove showmount based blocking check for nfs shares [puppet] - https://gerrit.wikimedia.org/r/391970 (https://phabricator.wikimedia.org/T171508) (owner: Madhuvishy) [23:15:44] PROBLEM - Router interfaces on cr1-eqdfw is CRITICAL: CRITICAL: No response from remote host 208.80.153.198 for 1.3.6.1.2.1.2.2.1.3 with snmp version 2 [23:16:34] RECOVERY - Router interfaces on cr1-eqdfw is OK: OK: host 208.80.153.198, interfaces up: 35, down: 0, dormant: 0, excluded: 0, unused: 0 [23:21:40] Operations, DBA, Patch-For-Review: db1063 crashed - https://phabricator.wikimedia.org/T180714#3768544 (greg) [23:23:06] (PS2) Dzahn: base::firewall: rename to profile::base::firewall [puppet] - https://gerrit.wikimedia.org/r/383519 (owner: Giuseppe Lavagetto) [23:25:46] (CR) Dzahn: "meanwhile a whole bunch of these had been done, classes were deleted/converted to profile etc, needed massive rebase.
PS2 was my attempt t" [puppet] - https://gerrit.wikimedia.org/r/383519 (owner: Giuseppe Lavagetto) [23:27:24] (PS3) Dzahn: base::firewall: rename to profile::base::firewall [puppet] - https://gerrit.wikimedia.org/r/383519 (owner: Giuseppe Lavagetto) [23:28:37] (CR) Dzahn: "after https://gerrit.wikimedia.org/r/#/q/status:merged+project:operations/puppet+branch:production+topic:%22profile::base::firewall%22 an" [puppet] - https://gerrit.wikimedia.org/r/383519 (owner: Giuseppe Lavagetto) [23:31:28] (CR) Dzahn: "also https://gerrit.wikimedia.org/r/#/q/topic:profile-firewall+(status:open+OR+status:merged)" [puppet] - https://gerrit.wikimedia.org/r/383519 (owner: Giuseppe Lavagetto) [23:37:27] Operations, DBA, Patch-For-Review, Wikimedia-Incident: db1063 crashed - https://phabricator.wikimedia.org/T180714#3768579 (greg) [23:38:01] Operations, DBA, Patch-For-Review, Wikimedia-Incident: db1063 crashed - https://phabricator.wikimedia.org/T180714#3768584 (greg) p:Triage>Unbreak! [23:39:34] PROBLEM - rsyslog TLS listener on port 6514 on wezen is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer [23:44:05] RECOVERY - Long running screen/tmux on iron is OK: OK: No SCREEN or tmux processes detected.
[23:44:24] eh, the wezen thing is unusual [23:45:03] rsyslog service is running [23:45:55] and is listening on 6514 despite what icinga says [23:46:11] well, listening is one thing, SSL handshake another [23:47:05] Operations, Traffic, Patch-For-Review: Renew unified certificates 2017 - https://phabricator.wikimedia.org/T178173#3768594 (BBlack) [23:48:29] (CR) Krinkle: "See https://phabricator.wikimedia.org/T179093#3765016" [puppet] - https://gerrit.wikimedia.org/r/391705 (https://phabricator.wikimedia.org/T179093) (owner: Ottomata) [23:48:57] peer did not provide a certificate, not permitted to talk to it [23:49:04] PROBLEM - rsyslog TLS listener on port 6514 on wezen is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer [23:50:14] RECOVERY - rsyslog TLS listener on port 6514 on wezen is OK: SSL OK - Certificate wezen.codfw.wmnet valid until 2021-08-21 20:09:05 +0000 (expires in 1373 days) [23:50:18] !log wezen - systemctl restart rsyslog [23:50:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log