[00:00:22] (03Merged) 10jenkins-bot: Allow translationadmin self-add for beta.wikiversity admins [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344060 (https://phabricator.wikimedia.org/T160120) (owner: 10Dereckson) [00:00:31] (03CR) 10jenkins-bot: Allow translationadmin self-add for beta.wikiversity admins [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344060 (https://phabricator.wikimedia.org/T160120) (owner: 10Dereckson) [00:03:07] 06Operations, 10Analytics, 10Analytics-Cluster, 10hardware-requests: EQIAD: stat1002 replacement - https://phabricator.wikimedia.org/T159838#3080331 (10faidon) What's the status of this (and stat1003 too)? Who is this waiting on? Is it @ottomata or @RobH? [00:04:32] (03PS7) 10Dzahn: change MX records for wikimedia.ee from elkdata.ee to Google [dns] - 10https://gerrit.wikimedia.org/r/341359 (https://phabricator.wikimedia.org/T158638) [00:04:42] 344060 on mwdebug1002 / 343962 still pending tests [00:04:47] (03CR) 10jerkins-bot: [V: 04-1] change MX records for wikimedia.ee from elkdata.ee to Google [dns] - 10https://gerrit.wikimedia.org/r/341359 (https://phabricator.wikimedia.org/T158638) (owner: 10Dzahn) [00:05:11] 344060 works [00:05:24] 00:04:37 ERROR: Error cloning remote repo 'origin' [00:05:57] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Allow translationadmin self-add for beta.wikiversity admins (T160120) (duration: 00m 43s) [00:06:00] bblack: ^ is this possibly due to work on the way DNS-lint works? [00:06:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:06:03] T160120: Install Extension:Translate on beta.wikiversity - https://phabricator.wikimedia.org/T160120 [00:06:16] https://integration.wikimedia.org/ci/job/operations-dns-tabs/2866/console [00:06:49] (03CR) 10Dzahn: [C: 031] "recheck" [dns] - 10https://gerrit.wikimedia.org/r/341359 (https://phabricator.wikimedia.org/T158638) (owner: 10Dzahn) [00:07:33] bblack: well, ignore that. 
it just had a hiccup i guess [00:07:44] recheck and things are different [00:08:08] RoanKattouw: so it's finally merged and on mwdebug1002, you want it too on terbium? [00:08:23] (03CR) 10Dzahn: [C: 032] change MX records for wikimedia.ee from elkdata.ee to Google [dns] - 10https://gerrit.wikimedia.org/r/341359 (https://phabricator.wikimedia.org/T158638) (owner: 10Dzahn) [00:09:54] Yes, that's where I need to test it [00:10:12] RoanKattouw: done, on terbium too [00:10:30] 06Operations, 10Domains, 10Traffic, 06WMF-Legal, 13Patch-For-Review: Using wikimedia.ee mail address as Google account - https://phabricator.wikimedia.org/T158638#3120414 (10Dzahn) @Beetlebeard I have made the change just now (OKed by legal). It will change in about an hour due to caching. [00:11:52] 06Operations, 10Gerrit, 06Release-Engineering-Team, 10hardware-requests, 13Patch-For-Review: Requesting 1 spare misc box for Gerrit in codfw - https://phabricator.wikimedia.org/T148187#3120420 (10faidon) [00:12:03] Dereckson: Urgh, looks like testwiki was rolled back to wmf.16 and this is a wmf.17 cherry-pick, so it can't easily be tested [00:12:10] But it's only a new maintenance script so it should be OK [00:12:21] (03CR) 10Jforrester: [C: 031] errorpages: Restyle 404.php to be like other error pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343819 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle) [00:12:40] 06Operations, 10Domains, 10Traffic, 06WMF-Legal, 13Patch-For-Review: Using wikimedia.ee mail address as Google account - https://phabricator.wikimedia.org/T158638#3120422 (10Dzahn) ``` ;; QUESTION SECTION: ;wikimedia.ee. IN MX ;; ANSWER SECTION: wikimedia.ee. 3600 IN MX 5 alt1.aspmx.l.google.com. wik...
[00:12:48] (03CR) 10Jforrester: [C: 031] errorpages: Sync with changes to puppet:///varnish/errorpages.html [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343821 (owner: 10Krinkle) [00:13:51] RoanKattouw: ok [00:15:45] !log dereckson@tin Synchronized php-1.29.0-wmf.16/extensions/CirrusSearch/: CompSuggest: Increase default limit from 50 to 255 + speed optimization ([[Gerrit:343962]] + [[Gerrit:343966]]) (duration: 00m 55s) [00:15:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:15:57] !log SWAT done. [00:16:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:25:36] mutante: yeah just random CI-infra fail: [00:25:37] 00:04:37 stdout: [00:25:38] 00:04:37 stderr: fatal: Could not read from remote repository [00:29:02] yep, *nod* [00:35:41] (03PS2) 10Andrew Bogott: Nova scheduler: Prefer virthosts with lower CPU usage [puppet] - 10https://gerrit.wikimedia.org/r/344051 (https://phabricator.wikimedia.org/T161006) [00:35:43] (03PS1) 10Andrew Bogott: nova.conf: Collect cpu metrics [puppet] - 10https://gerrit.wikimedia.org/r/344063 [00:39:51] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 0.263 second response time [00:40:31] (03CR) 10Andrew Bogott: [C: 032] nova.conf: Collect cpu metrics [puppet] - 10https://gerrit.wikimedia.org/r/344063 (owner: 10Andrew Bogott) [00:40:31] RECOVERY - Disk space on labtestnet2001 is OK: DISK OK [00:41:31] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [00:42:32] (03PS3) 10Andrew Bogott: Nova scheduler: Prefer virthosts with lower CPU usage [puppet] - 10https://gerrit.wikimedia.org/r/344051 (https://phabricator.wikimedia.org/T161006) [00:44:31] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 
11.11% of data above the critical threshold [1000.0] [00:45:51] RECOVERY - puppet last run on labtestnet2001 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [00:49:51] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.421 second response time [00:50:53] (03PS1) 10Chad: Contint: Allow new host gerrit2001 connections as well [puppet] - 10https://gerrit.wikimedia.org/r/344065 [00:51:31] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [00:55:29] (03PS1) 10Chad: Gerrit: Add ipv4 address for codfw slave [puppet] - 10https://gerrit.wikimedia.org/r/344066 [00:59:11] PROBLEM - puppet last run on stat1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:01:55] (03PS3) 10Chad: Gerrit: Move master/slave detection to profile [puppet] - 10https://gerrit.wikimedia.org/r/344053 [01:08:13] 06Operations, 10Ops-Access-Requests, 10Gerrit: archiva-deploy password for Chad H. - https://phabricator.wikimedia.org/T161067#3120509 (10demon) [01:09:51] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 0.405 second response time [01:12:47] 06Operations, 10ORES, 10Revision-Scoring-As-A-Service-Backlog: [spec] Active-active setup for ORES across datacenters (eqiad, codfw) - https://phabricator.wikimedia.org/T159615#3120524 (10Halfak) @mobrobac, I can't see how this would solve any of the problems we've been discussing. Can you clarify what, exa... 
[01:16:52] 06Operations, 06Parsing-Team, 06Release-Engineering-Team, 07HHVM, 07Wikimedia-Incident: API cluster failure / OOM - https://phabricator.wikimedia.org/T151702#3120540 (10Krinkle) [01:19:51] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.642 second response time [01:22:41] PROBLEM - puppet last run on labsdb1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:25:23] (03PS1) 10Chad: Gerrit: Double size of projects cache [puppet] - 10https://gerrit.wikimedia.org/r/344068 [01:28:11] RECOVERY - puppet last run on stat1003 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [01:29:40] (03PS1) 10Chad: Gerrit: Ensure ops always has admin rights [puppet] - 10https://gerrit.wikimedia.org/r/344069 [01:33:39] (03CR) 10Dzahn: [C: 032] Contint: Allow new host gerrit2001 connections as well [puppet] - 10https://gerrit.wikimedia.org/r/344065 (owner: 10Chad) [01:36:44] (03PS2) 10Dzahn: Gerrit: Add ipv4 address for codfw slave [puppet] - 10https://gerrit.wikimedia.org/r/344066 (owner: 10Chad) [01:37:29] (03PS1) 10BBlack: cache_misc: support WMF-Last-Access-Global cookie [puppet] - 10https://gerrit.wikimedia.org/r/344071 (https://phabricator.wikimedia.org/T138027) [01:37:35] mutante: We should get the ipv6 in codfw first? [01:37:40] Do it all at once on that 2nd change? 
[01:46:14] RainbowSprinkles: ok, not that important if in one change , but sure [01:46:34] RainbowSprinkles: then it's time to add gerrit2001 to site.pp but separate stanza [01:46:57] we add interface::add_ip6_mapped { 'main': } on the node [01:47:25] and then we get the right IP and add the AAAA record [01:47:46] does that [01:50:41] RECOVERY - puppet last run on labsdb1001 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [01:51:58] (03PS1) 10Dzahn: site.pp: add gerrit2001 with just standard and IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/344072 (https://phabricator.wikimedia.org/T152525) [01:52:31] mutante: Oh, I assumed we got it from dns.git :) [01:52:36] That's where I snagged the v4 [01:53:02] it has not been added yet [01:53:24] it doesnt matter which is first, but this way you can copy/paste without room for error [01:53:33] from what is actually on the interface [01:54:12] Ahh ok [01:54:13] but without this in site.pp we are not getting the "mapped" address, only the "random" one based on MAC. we dont want that [01:54:14] (03PS2) 10Krinkle: errorpages: Sync with changes to puppet:///varnish/errorpages.html [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343821 [01:55:07] (03CR) 10Dzahn: [C: 032] site.pp: add gerrit2001 with just standard and IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/344072 (https://phabricator.wikimedia.org/T152525) (owner: 10Dzahn) [01:55:14] (03CR) 10Chad: "The duplication here is mind-numbingly dumb :(" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343821 (owner: 10Krinkle) [01:56:10] + pre-up /sbin/ip token set ::208:80:153:106 dev eth0 [01:56:18] there is the nice IP now [01:57:12] (03CR) 10Krinkle: "Yeah, small steps forward though :) At least they're all in once place now (except for the Varnish one). Next we can perhaps reduce the on" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343821 (owner: 10Krinkle) [01:58:54] RainbowSprinkles: Any immediate ideas? 
[01:59:01] (03CR) 10Krinkle: [C: 032] errorpages: Sync with changes to puppet:///varnish/errorpages.html [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343821 (owner: 10Krinkle) [01:59:09] Not particularly. Just every time I look at it I twitch [01:59:18] Docroots almost cleaned, been avoiding the error pages [01:59:53] Ori and I cleaned up the error pages [01:59:58] they were everywhere, literally [02:00:13] no two were in the same repository or configured in a remotely similar way [02:00:16] the latter is still the case [02:00:27] That's the part that bugs me ^ [02:00:28] but oh well. I suppose that makes sense (apache vs hhvm vs php vs varnish) [02:00:41] (03Merged) 10jenkins-bot: errorpages: Sync with changes to puppet:///varnish/errorpages.html [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343821 (owner: 10Krinkle) [02:00:44] They all have different configuration formats and will need to be hooked up separately [02:00:49] but perhaps we could puppetize that [02:00:50] (03CR) 10jenkins-bot: errorpages: Sync with changes to puppet:///varnish/errorpages.html [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343821 (owner: 10Krinkle) [02:00:53] using ERB instead [02:00:58] not in this repo [02:01:07] Yeah I think having them in puppet makes far more sense [02:01:12] Then they can live close to what uses them [02:02:24] !log krinkle@tin Synchronized errorpages/: minor tweaks - I60344bd519d (duration: 00m 54s) [02:02:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:02:51] mutante: Yeah that was the v4 address I had, just needed the v6 :) [02:03:24] RainbowSprinkles: that is the v6, with the prefix added it looks like "2620:0:860:4:208:80:153:106" that makes it "mapped" [02:03:29] mapped to the v4 [02:03:35] adding to DNS [02:03:49] Ok, I'll amend my puppet bit too so it gets set there too [02:04:10] ok [02:04:27] eh, this is row D??
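Editor's note: the "mapped" address described above can be reproduced with a small sketch. It assumes the host's v4 address is 208.80.153.106 (inferred from the `::208:80:153:106` token in the log, not stated outright) and that the subnet prefix is 2620:0:860:4::/64. The function name `mapped_ipv6` is hypothetical and is not part of Puppet or `interface::add_ip6_mapped`; it only illustrates the arithmetic of writing the v4 octets as the low four groups of the v6 address.

```python
import ipaddress


def mapped_ipv6(prefix: str, ipv4: str) -> ipaddress.IPv6Address:
    """Combine a /64 prefix with an interface token built from IPv4 octets.

    Hypothetical helper: each decimal octet of the IPv4 address is written
    verbatim as one 16-bit group (so 208 becomes the group "208", i.e. 0x208).
    """
    network = ipaddress.IPv6Network(prefix)
    ipaddress.IPv4Address(ipv4)  # raises ValueError if the v4 address is malformed
    # Build the token "::a:b:c:d" from the dotted quad "a.b.c.d".
    token = ipaddress.IPv6Address("::" + ":".join(ipv4.split(".")))
    # OR the token into the host part of the prefix.
    return ipaddress.IPv6Address(int(network.network_address) | int(token))


if __name__ == "__main__":
    print(mapped_ipv6("2620:0:860:4::/64", "208.80.153.106"))
```

The point made in the channel is that this address is predictable from the v4 address, unlike an autoconfigured address derived from the MAC.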
[02:04:48] looks at something because this would be the first in that network [02:05:56] (03PS3) 10Chad: Gerrit: Add ipv4 & v6 address for codfw slave [puppet] - 10https://gerrit.wikimedia.org/r/344066 [02:10:02] RainbowSprinkles: Hm.. We'll need a place to reference them from, though. The good thing is that both HHVM and Apache allow configuring a 4xx/5xx handler by file path (not needed to be a public url). Although right now it is configured as url for some of them (e.g. indirect dependency on docroot and document root, evil) [02:10:37] Do we have an appropriate directory for app servers to put files that are not from mediawiki-config? [02:10:51] e.g. not apache config itself but files we refer to in it. [02:11:09] I mean anywhere, technically. [02:11:33] Something other than /srv/mediawiki or /home/krinkle, which is odd to puppetize :P [02:11:48] Could deploy them with scap... [02:11:53] /srv/deployment/* [02:11:59] hehe, yeah [02:12:27] I like having them in puppet, allows nice re-use of templates without doing it at either run-time, nor commit time [02:12:40] Puppet works too [02:12:45] They don't change super often [02:13:11] (03PS6) 10Krinkle: errorpages: Restyle 404.php to be like other error pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343819 (https://phabricator.wikimedia.org/T113114) [02:13:30] (03PS1) 10Dzahn: add IPv6 for gerrit2001.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/344074 (https://phabricator.wikimedia.org/T152525) [02:13:47] (03CR) 10Krinkle: "Pending further feedback from Volker."
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/343819 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle) [02:14:11] the Apache config is inside mediawiki puppet module as well, so i would use modules/mediawiki/files/ [02:14:22] and put them into /srv/ somewhere [02:14:28] Yeah, I was just wondering where to put them on the server [02:14:33] (03CR) 10Chad: [C: 031] add IPv6 for gerrit2001.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/344074 (https://phabricator.wikimedia.org/T152525) (owner: 10Dzahn) [02:14:51] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 0.232 second response time [02:15:00] Do we have a /var/lib/mediawiki or /var/lib/wmf or some such? [02:15:36] Nope [02:15:46] i think /srv/mediawiki/ [02:15:56] Right, but that's a git clone [02:16:10] Oooooh [02:16:12] I know [02:16:14] :P [02:16:19] /srv/deployment/mediawiki/errorpages [02:16:34] Since /srv/deployment/mediawiki/mediawiki is going to be what you expect :) [02:16:46] Meh, no, not if puppet [02:16:47] ignore me [02:17:08] * RainbowSprinkles sips beer [02:17:10] Let me guess, mediawiki/mediawiki does not have enough mediawiki in it for it to be actual mediawiki [02:17:43] it's gonna be /srv/deployment/mediawiki/mediawiki/mediawiki-1.29.0-wmf.19/ [02:17:53] No, that'd be silly [02:18:00] Okay :D [02:18:03] /srv/deployment/mediawiki/mediawiki/php-1.29.0-wmf.19/ [02:18:10] but... [02:18:35] checked puppet mediawiki module for file resources, yea nothing like that yet [02:18:45] Yeah I know. [02:18:50] I tried this morning during puppetswat [02:18:52] broke everything [02:18:54] so reverted [02:19:31] I imagine deployment/mediawiki/mediawiki would be reserved for, I dont know, scap-infused dockerized actual 'mediawiki' [02:19:35] introduce /var/lib/mediawiki by adding it to puppet?
[02:20:44] I have the best idea. [02:20:48] Let's just leave it as it is. [02:20:51] And pretend I didn't complain [02:21:30] hehe, /me sips too [02:21:42] RainbowSprinkles: :) [02:21:56] RainbowSprinkles: As part of T113114 I will most likely eventually move errorpages/ to puppet [02:21:56] T113114: Make all wiki-facing error pages consistent - https://phabricator.wikimedia.org/T113114 [02:22:54] Also, if you could help figure out where https://en.m.wikipedia.org/foo.php is served from, that would be the last puzzle piece [02:23:32] (03CR) 10Dzahn: [C: 032] add IPv6 for gerrit2001.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/344074 (https://phabricator.wikimedia.org/T152525) (owner: 10Dzahn) [02:24:51] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.781 second response time [02:25:05] gerrit2001.wikimedia.org has IPv6 address 2620:0:860:4:208:80:153:106 [02:25:10] 6.0.1.0.3.5.1.0.0.8.0.0.8.0.2.0.4.0.0.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa domain name pointer gerrit2001.wikimedia.org. [02:25:14] works both ways now [02:26:26] (03CR) 10Dzahn: [C: 032] Gerrit: Add ipv4 & v6 address for codfw slave [puppet] - 10https://gerrit.wikimedia.org/r/344066 (owner: 10Chad) [02:28:05] RainbowSprinkles: alright, i think this works now and is a good place to take a break. more tomorrow. have a good rest of the night [02:28:09] Yeah [02:28:12] Good stopping place [02:28:14] tyvm [02:28:19] yw, cya tomorrow [02:34:10] 06Operations, 10ops-codfw, 06DC-Ops, 13Patch-For-Review, 05codfw-rollout: rack new mw log host - sinistra - https://phabricator.wikimedia.org/T128796#3120637 (10Dzahn) a:05ori>03fgiunchedi @fgiunchedi given that you did fluorine as mwlog1001 wondering if you know if this ticket for mwlog2001 is also... 
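Editor's note: the forward and reverse lookups quoted above ("works both ways now") relate to each other mechanically. The PTR name is the v6 address expanded to 32 hex nibbles, reversed, and suffixed with `ip6.arpa`. A minimal sketch using Python's standard `ipaddress` module follows; the wrapper name `ptr_name` is only illustrative, and the sketch does not query DNS, it just derives the name a PTR record would live under.

```python
import ipaddress


def ptr_name(address: str) -> str:
    """Return the reverse-DNS owner name for an IP address.

    Thin wrapper around the standard library's `reverse_pointer` property;
    for IPv6 this yields the nibble-reversed name under ip6.arpa.
    """
    return ipaddress.ip_address(address).reverse_pointer


if __name__ == "__main__":
    # Address taken from the log above.
    print(ptr_name("2620:0:860:4:208:80:153:106"))
```

For the address in the log, the derived name matches the `ip6.arpa` owner name shown in the 02:25:10 message.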
[02:35:13] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.16) (duration: 13m 20s) [02:35:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:40:42] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Mar 22 02:40:42 UTC 2017 (duration 5m 29s) [02:40:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:05:31] PROBLEM - puppet last run on cp1052 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:19:51] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 0.427 second response time [03:22:41] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 663.96 seconds [03:24:51] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.645 second response time [03:25:41] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 212.04 seconds [03:33:31] RECOVERY - puppet last run on cp1052 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [03:35:51] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 1806.505837 Seconds [03:36:31] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 1850.360429 Seconds [03:36:31] PROBLEM - puppet last run on maps1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [03:36:51] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 0.0 Seconds [03:42:41] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 2220.864023 Seconds [03:43:41] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 0.0 Seconds [03:43:51] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 2286.255927 Seconds [03:44:31] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds [03:45:51] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 0.0 Seconds [03:46:41] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 2461.059542 Seconds [03:50:31] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 2690.339153 Seconds [03:50:51] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 2706.29266 Seconds [03:51:31] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds [03:51:51] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 0.0 Seconds [03:52:41] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 0.0 Seconds [03:55:31] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 2990.277465 Seconds [03:55:41] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 3001.225311 Seconds [03:56:51] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 3066.232123 Seconds [03:57:31] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds [03:58:51] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 0.0 Seconds [04:00:41] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 0.0 Seconds [04:01:51] PROBLEM - Postgres Replication Lag on 
maps1003 is CRITICAL: CRITICAL - Rep Delay is: 3366.484655 Seconds [04:03:41] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 3481.147057 Seconds [04:04:41] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 0.0 Seconds [04:04:51] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 0.0 Seconds [04:05:31] RECOVERY - puppet last run on maps1001 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [04:43:31] PROBLEM - puppet last run on ganeti1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:43:32] PROBLEM - puppet last run on wdqs1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:46:24] (03PS1) 10Chad: $wgWhitelistRead: grantwiki is in private.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344078 [04:46:52] RECOVERY - BGP status on cr2-knams is OK: BGP OK - up: 11, down: 0, shutdown: 0 [05:11:31] RECOVERY - puppet last run on wdqs1003 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [05:12:31] RECOVERY - puppet last run on ganeti1004 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [05:16:00] (03PS3) 10Giuseppe Lavagetto: realm: remove graphoid_site [puppet] - 10https://gerrit.wikimedia.org/r/340995 [05:18:28] <_joe_> !log depooling temporarily citoid in eqiad from dns discovery [05:18:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:33:41] PROBLEM - puppet last run on kafka1003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [05:44:57] <_joe_> !log finished tests on citoid/dns discovery; restbase successfully detects the change [05:45:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:01:41] RECOVERY - puppet last run on kafka1003 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [06:23:21] PROBLEM - Disk space on labcontrol1001 is CRITICAL: DISK CRITICAL - free space: / 1766 MB (3% inode=81%) [06:28:11] (03PS1) 10Marostegui: db-eqiad.php: Restore db1092 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344080 (https://phabricator.wikimedia.org/T137191) [06:28:21] RECOVERY - Disk space on labcontrol1001 is OK: DISK OK [06:29:31] PROBLEM - carbon-local-relay metric drops on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [100.0] [06:30:04] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Restore db1092 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344080 (https://phabricator.wikimedia.org/T137191) (owner: 10Marostegui) [06:31:24] (03Merged) 10jenkins-bot: db-eqiad.php: Restore db1092 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344080 (https://phabricator.wikimedia.org/T137191) (owner: 10Marostegui) [06:31:33] (03CR) 10jenkins-bot: db-eqiad.php: Restore db1092 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344080 (https://phabricator.wikimedia.org/T137191) (owner: 10Marostegui) [06:33:19] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Restore db1092 weight - T137191 (duration: 00m 49s) [06:33:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:33:26] T137191: Defragment db1070, db1082, db1087, db1092 - https://phabricator.wikimedia.org/T137191 [06:33:31] RECOVERY - carbon-local-relay metric drops on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [06:36:48] (03PS1) 10Marostegui: db-codfw.php: Repool db2037 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/344081 (https://phabricator.wikimedia.org/T73563) [06:40:44] (03CR) 10Marostegui: [C: 032] db-codfw.php: Repool db2037 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344081 (https://phabricator.wikimedia.org/T73563) (owner: 10Marostegui) [06:41:59] (03Merged) 10jenkins-bot: db-codfw.php: Repool db2037 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344081 (https://phabricator.wikimedia.org/T73563) (owner: 10Marostegui) [06:42:08] (03CR) 10jenkins-bot: db-codfw.php: Repool db2037 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344081 (https://phabricator.wikimedia.org/T73563) (owner: 10Marostegui) [06:43:28] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2037 T160415 - T73563 (duration: 00m 43s) [06:43:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:43:35] T160415: Review schema changes for T125071 - Add index to image table on all wikis - https://phabricator.wikimedia.org/T160415 [06:43:35] T73563: *_minor_mime are varbinary(32) on WMF sites, out of sync with varbinary(100) in MW core - https://phabricator.wikimedia.org/T73563 [06:50:58] (03PS1) 10Marostegui: db-eqiad.php: Depool db1087 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344083 (https://phabricator.wikimedia.org/T137191) [06:53:08] (03PS2) 10Marostegui: db-eqiad.php: Depool db1087 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344083 (https://phabricator.wikimedia.org/T137191) [06:55:11] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1087 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344083 (https://phabricator.wikimedia.org/T137191) (owner: 10Marostegui) [06:57:24] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1087 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344083 (https://phabricator.wikimedia.org/T137191) (owner: 10Marostegui) [06:57:33] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1087 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344083 
(https://phabricator.wikimedia.org/T137191) (owner: 10Marostegui) [06:58:29] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1087 - T137191 (duration: 00m 43s) [06:58:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:58:36] T137191: Defragment db1070, db1082, db1087, db1092 - https://phabricator.wikimedia.org/T137191 [06:58:54] (03PS1) 10Giuseppe Lavagetto: conftool: add TTL to the discovery schema [puppet] - 10https://gerrit.wikimedia.org/r/344084 [06:58:56] (03PS1) 10Giuseppe Lavagetto: authdns::statefile: add TTL for entries [puppet] - 10https://gerrit.wikimedia.org/r/344085 [07:01:10] (03CR) 10Giuseppe Lavagetto: [C: 032] conftool: add TTL to the discovery schema [puppet] - 10https://gerrit.wikimedia.org/r/344084 (owner: 10Giuseppe Lavagetto) [07:03:51] RECOVERY - MariaDB Slave Lag: s5 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 89738.03 seconds [07:05:11] !log Stop MySQL db1087 - T137191 [07:05:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:05:19] T137191: Defragment db1070, db1082, db1087, db1092 - https://phabricator.wikimedia.org/T137191 [07:10:45] !log oblivian@puppetmaster1001 conftool action : set/ttl=300; selector: dnsdisc=.* [07:10:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:59] (03CR) 10Giuseppe Lavagetto: [C: 032] realm: remove graphoid_site [puppet] - 10https://gerrit.wikimedia.org/r/340995 (owner: 10Giuseppe Lavagetto) [07:13:05] (03PS4) 10Giuseppe Lavagetto: realm: remove graphoid_site [puppet] - 10https://gerrit.wikimedia.org/r/340995 [07:25:31] 06Operations, 10ops-codfw, 10DBA: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3120835 (10Marostegui) I think it has been wiped as it doesn't even show the GRUB after selecting to boot from disk. As I said, the hard disks are being show in the RAID and BIOS menu. ``` 0 Non-RAID Di... 
[07:30:32] 06Operations, 10ops-codfw, 10DBA: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3120836 (10Marostegui) Also tried to reinstall grub just in case it was the only thing deleted, but also failed on that. So maybe it was indeed reimaged and when I stopped it, was already half way thru... [07:33:29] (03PS1) 10Marostegui: db-eqiad.php: Depool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344086 (https://phabricator.wikimedia.org/T73563) [07:37:41] (03PS1) 10Marostegui: sanitarium2.my.cnf: Set slave_type_conversions [puppet] - 10https://gerrit.wikimedia.org/r/344087 (https://phabricator.wikimedia.org/T73563) [07:38:09] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344086 (https://phabricator.wikimedia.org/T73563) (owner: 10Marostegui) [07:39:29] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344086 (https://phabricator.wikimedia.org/T73563) (owner: 10Marostegui) [07:39:37] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344086 (https://phabricator.wikimedia.org/T73563) (owner: 10Marostegui) [07:39:51] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 0.214 second response time [07:40:36] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1091 T160415 - T73563 (duration: 00m 43s) [07:40:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:46] T160415: Review schema changes for T125071 - Add index to image table on all wikis - https://phabricator.wikimedia.org/T160415 [07:40:46] T73563: *_minor_mime are varbinary(32) on WMF sites, out of sync with varbinary(100) in MW core - https://phabricator.wikimedia.org/T73563 
[07:44:51] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.574 second response time [07:45:40] I'm running a brief LDAP maintenance, authentications against cn=wmf may fail intermittently for a few minutes [07:48:16] (03PS1) 10Giuseppe Lavagetto: Add missing discovery entries [puppet] - 10https://gerrit.wikimedia.org/r/344088 [07:48:36] LDAP maintenance completed [07:52:25] 06Operations, 10ops-codfw, 10DBA: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3120884 (10Marostegui) The new mainboard is configured to always boot from PXE. ``` System BIOS Settings > Boot Settings > BIOS Boot Settings Boot Sequence... [07:53:27] 06Operations: Enhance account handling (meta bug) - https://phabricator.wikimedia.org/T142815#3120887 (10MoritzMuehlenhoff) [07:53:29] 06Operations, 07LDAP, 13Patch-For-Review: Enhance group membership visibility using the memberof LDAP overlay - https://phabricator.wikimedia.org/T142817#3120885 (10MoritzMuehlenhoff) 05Open>03Resolved All the privileged LDAP groups have been rewritten and most of labs usage should have trickled in as we... [07:53:34] !log rebuilding ttmserver index in elastic@eqiad to catchup lost writes during es5 upgrade [07:53:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:57:59] 06Operations, 06Performance-Team, 06Reading-Web-Backlog, 10Traffic, and 3 others: Performance review #2 of Hovercards (Popups extension) - https://phabricator.wikimedia.org/T70861#3120888 (10Peter) Where can I find how we reason about the extra delay of 500 ms and 200 ms fade to view as described in https:... 
[08:07:21] 06Operations, 06Performance-Team, 06Reading-Web-Backlog, 10Traffic, and 3 others: Performance review #2 of Hovercards (Popups extension) - https://phabricator.wikimedia.org/T70861#3120893 (10phuedx) ^ /cc @Nirzar [08:21:03] 06Operations, 10ops-codfw, 10DBA: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3120905 (10Marostegui) p:05Normal>03High a:05Marostegui>03Papaul Before doing a proper reimage, we need to change the boot sequence to boot first from disk and if not, from the NIC. I am not bei... [08:37:33] 06Operations, 10ops-codfw, 10DBA: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3120914 (10Marostegui) @MoritzMuehlenhoff kindly helped and suggested: `racadm config -g cfgServerInfo -o cfgServerFirstBootDevice HDD` Which I tried, but it had no effect on the boot order: ``` /admin1-> r... [08:46:16] !log Stop MySQL db1070 to clone db1087 from it - T137191 [08:46:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:22] T137191: Defragment db1070, db1082, db1087, db1092 - https://phabricator.wikimedia.org/T137191 [08:53:03] (03Abandoned) 10Giuseppe Lavagetto: discovery: add parsoid entry [puppet] - 10https://gerrit.wikimedia.org/r/340992 (owner: 10Giuseppe Lavagetto) [08:55:31] (03Abandoned) 10Giuseppe Lavagetto: discovery: add more DNS entries [puppet] - 10https://gerrit.wikimedia.org/r/340994 (owner: 10Giuseppe Lavagetto) [08:55:57] (03Abandoned) 10Giuseppe Lavagetto: discovery: add api endpoint [puppet] - 10https://gerrit.wikimedia.org/r/340998 (owner: 10Giuseppe Lavagetto) [08:59:12] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3120948 (10Gehel) Investigation will continue with @Papaul and @Gehel on Thursday March 23 4pm CET (8am PT) [09:09:33] 06Operations, 10ops-eqiad, 13Patch-For-Review, 15User-fgiunchedi: Rack and Setup ms-be1028-ms-1039 - 
https://phabricator.wikimedia.org/T160640#3120965 (10fgiunchedi) p:05Triage>03High @Cmjohnson is there an ETA to have the servers with OS installed online? thanks! [09:10:25] (03PS1) 10Muehlenhoff: Enable experimental section on mwdebug* and mw1261 [puppet] - 10https://gerrit.wikimedia.org/r/344091 [09:15:11] (03PS2) 10Muehlenhoff: Enable experimental section on mwdebug* and mw1261 [puppet] - 10https://gerrit.wikimedia.org/r/344091 [09:20:26] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=prometheus2002.codfw.wmnet [09:20:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:38] (03CR) 10Muehlenhoff: [C: 032] Enable experimental section on mwdebug* and mw1261 [puppet] - 10https://gerrit.wikimedia.org/r/344091 (owner: 10Muehlenhoff) [09:25:27] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1091" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344092 [09:26:29] (03PS1) 10Giuseppe Lavagetto: Add new discovery entries [dns] - 10https://gerrit.wikimedia.org/r/344093 [09:27:25] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1091" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344092 (owner: 10Marostegui) [09:28:31] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [09:28:38] (03PS6) 10Gehel: maps - cleartables osm replication [puppet] - 10https://gerrit.wikimedia.org/r/341563 (https://phabricator.wikimedia.org/T157613) [09:28:53] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1091" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344092 (owner: 10Marostegui) [09:29:01] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1091" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344092 (owner: 10Marostegui) [09:29:47] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1091 T160415 - T73563 (duration: 00m 43s) [09:29:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:54] T160415: Review schema changes for T125071 - Add index to image table on all wikis - https://phabricator.wikimedia.org/T160415 [09:29:54] T73563: *_minor_mime are varbinary(32) on WMF sites, out of sync with varbinary(100) in MW core - https://phabricator.wikimedia.org/T73563 [09:30:24] 06Operations, 10hardware-requests, 05codfw-rollout: Log host for codfw (fluorine's equivalent) - https://phabricator.wikimedia.org/T126988#3121015 (10fgiunchedi) [09:30:28] 06Operations, 10ops-codfw, 06DC-Ops, 13Patch-For-Review, 05codfw-rollout: rack new mw log host - sinistra - https://phabricator.wikimedia.org/T128796#3121010 (10fgiunchedi) 05Open>03declined @Dzahn @Krinkle yeah sinistra was reinstalled as mwlog2001 in {T153384} [09:35:44] (03PS1) 10Marostegui: db-eqiad.php: Depool db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344094 (https://phabricator.wikimedia.org/T73563) [09:37:29] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344094 (https://phabricator.wikimedia.org/T73563) (owner: 10Marostegui) [09:38:43] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344094 (https://phabricator.wikimedia.org/T73563) (owner: 
10Marostegui) [09:38:51] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344094 (https://phabricator.wikimedia.org/T73563) (owner: 10Marostegui) [09:39:40] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1084 T160415 - T73563 (duration: 00m 43s) [09:39:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:46] T160415: Review schema changes for T125071 - Add index to image table on all wikis - https://phabricator.wikimedia.org/T160415 [09:39:46] T73563: *_minor_mime are varbinary(32) on WMF sites, out of sync with varbinary(100) in MW core - https://phabricator.wikimedia.org/T73563 [09:40:05] !log Deploy alter table s4 (commonswiki) db1084 - T73563 T160415 [09:40:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:20] (03PS5) 10Gehel: postgresql: Only set user password if different [puppet] - 10https://gerrit.wikimedia.org/r/329328 (owner: 10Tim Landscheidt) [09:43:09] (03CR) 10Gehel: [C: 032] postgresql: Only set user password if different [puppet] - 10https://gerrit.wikimedia.org/r/329328 (owner: 10Tim Landscheidt) [09:46:11] RECOVERY - puppet last run on labsdb1006 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [09:50:27] !log upgrading mwdebug* to HHVM 3.18.1 [09:50:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:32] !log upgrading mw1261 to HHVM 3.18.1 [09:55:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:31] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [10:00:06] 06Operations, 10ops-eqiad, 06Labs, 10Labs-Infrastructure: Labvirt1001 has insanely slow IO - https://phabricator.wikimedia.org/T159835#3080282 (10hashar) Might well be related to T161006 which suggest the Scheduler prioritize mostly based on RAM usage. 
So we end up with Nodepool instances spawning mostly... [10:03:34] 06Operations, 10ops-codfw, 10DBA: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3121113 (10jcrespo) Probably what happened is that on board change, BIOS was reset and not changed to the default "boot from disk"- a problem I think we had with some of the servers in the past. The r... [10:04:51] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 0.241 second response time [10:09:44] !log cr1-eqiad: set ae4 and members to disable. T133387 [10:09:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:50] T133387: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387 [10:09:52] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.583 second response time [10:11:11] PROBLEM - puppet last run on labsdb1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[pg_basebackup-labsdb1007.eqiad.wmnet] [10:11:27] jynus: ^ labsdb1006 [10:11:32] jynus: need any help ? [10:11:58] I had no time to look it up again [10:12:01] 06Operations, 10ops-codfw, 10DBA: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3121204 (10Marostegui) >>! In T160242#3121113, @jcrespo wrote: > Probably what happened is that on board change, BIOS was reset and not changed to the default "boot from disk"- a problem I think we had... 
[10:12:19] jynus: ok, I can't either right now, but it's not urgent [10:12:20] it fails to connect due to something ssl, but I just saw the error [10:12:30] I intend to work as soon as I can [10:12:39] ok [10:13:56] what do you think about enabling and disabling PXE on dhcp on a per-server basis? [10:14:33] you as in "everybody that has an opinion about that here" [10:14:41] (03PS1) 10DCausse: [cirrus] Enable the completion suggester [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344099 [10:15:11] (03CR) 10Jcrespo: [C: 031] sanitarium2.my.cnf: Set slave_type_conversions [puppet] - 10https://gerrit.wikimedia.org/r/344087 (https://phabricator.wikimedia.org/T73563) (owner: 10Marostegui) [10:15:55] !log Upgrading asw2-d-eqiad to JunOS 14.1X53 (T133387) [10:16:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:01] T133387: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387 [10:16:06] jynus: I am not sure I understand [10:16:40] I want to lock certain servers against accidental reimage [10:16:42] [Mar 22 10:15:21]: Retrieving software images. This process can take several minutes. Please be patient.. [10:17:08] manual accidental reimage ? [10:17:13] so that if you want to reimage them [10:17:20] like someone by mistake doing a PXE boot ? [10:17:23] you have to change something on puppet first [10:17:47] akosiaris: a mainboard was replaced yesterday and it wasn't changed to boot from disk first, so it booted from PXE and…bye bye data [10:17:47] akosiaris, like a new board is changed, BIOS is reset so it forces PXE boot and we lose data [10:18:01] ah [10:18:29] so in general brandon had a very good idea that I have implemented in the past and I am very happy with [10:18:39] we had discussed at the offsite [10:18:43] yeah [10:18:46] at some of the round tables [10:18:47] yes, but I wasn't present [10:18:54] so what was the idea? 
[10:19:03] it's always PXE, but have the PXE configuration always do "append" [10:19:12] append? [10:19:13] which translates to "boot from the first disk" [10:19:22] yes, that would work [10:19:47] I had cooked up a simple script (like 5 lines?) at some point in time [10:19:59] that would change that flag [10:20:18] and it's a state stored on the tftpboot server [10:20:31] I do not even need anything so sophisticated [10:21:05] what would happen if I commented out the entry on the current config, is there a default boot that happens? [10:21:45] as far as TFTP goes, yes [10:21:50] reinstall as jessie [10:22:00] maybe I can change the default [10:22:07] effectively "append" would be that default TFTP action [10:22:37] now or with your changes? [10:23:09] that would be the implementation I mean [10:23:30] I need this now, and when you have the time, do it better like you say [10:23:43] do you ? [10:23:47] I will try to set a hacky way [10:23:52] I mean.. do we really need that ? [10:24:00] it has happened like 2 times in 5 years ? [10:24:11] granted this time around it was not good [10:24:12] yes, because I can now tested [10:24:24] *test it [10:24:38] and in a few days I will not be able to test it [10:24:41] ? [10:25:22] actually my point was, let's do this in a more orchestrated way. talk a bit more about it, make sure we indeed want to change that [10:25:28] some people may be against [10:25:32] yeah [10:25:38] I wouldn't, but I don't know about others [10:25:40] that is why I only want to implement it [10:25:51] for my servers, leave the rest alone [10:26:09] most servers are stateless, and probably do not need that [10:26:27] only dbs and swift [10:27:13] also I have now the motivation to do something about that [10:27:18] I won't have it next week [10:27:26] and cassandra and elasticsearch and postgres and... 
[10:27:32] those are dbs [10:27:38] lesser dbs [10:27:48] but dbs nonetheless :-) [10:28:03] * volans takes note that jynus said that postgres is a db :D [10:28:12] :-P [10:28:56] FWIW I'd be in favor of the change, right now it is a liability [10:29:07] if I have godog with me [10:29:33] we are probably a majority on stateful services, I would do something about it on those servers [10:29:35] test it [10:29:53] and then the rest of the people can decide if they want it or not for their services [10:30:06] I can see why they wouldn't want it [10:30:26] that is why I said I need a way to "lock" certain servers [10:30:27] jynus: I think most of us agree to have this protection, a reboot should not reinstall unless explicitly stated [10:30:55] not "I need to implement this to all servers" [10:31:40] the reimage scripts are one of the best pieces of code we have gotten [10:31:51] !log cirrus: rebuilding comp suggest indices in elastic@eqiad [10:31:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:16] but I cannot sleep at night if someone tells me "sorry. I reimaged an es1* server instead of an elastic search server" [10:32:37] by accident [10:33:31] I am not that worried by that tbh. We obviously need to minimize the chances of that happening, but again it has happened like what ? 2 times in the past 5 years ? 
[10:33:51] akosiaris, it will take me less time to implement that [10:33:56] than to recover es2015 [10:34:04] the second will take me 7 hours of work [10:34:14] that's an argument for doing it, I'll grant you that [10:34:23] jynus: I think an email to ops@ and a link to a phab task with the issue would be enough to see if there's opposition and flip the default, worst case we flip it back or turn the old behaviour for selected services [10:34:38] what godog said [10:34:50] +1 [10:34:54] I would do it differently [10:35:06] 06Operations: Uninitialized string offset warnings with HHVM 3.18 in LanguageAz.php and LanguageKk.php - https://phabricator.wikimedia.org/T161095#3121255 (10MoritzMuehlenhoff) [10:35:09] there might actually be some good ideas about how to implement anything from such a discussion [10:35:17] implement it on es* servers only, then send an email asking who wants that, too? [10:35:19] as I've already said, brandon has a pretty sound idea about this [10:35:36] who can do it better? 
This is the task to do it properly [10:36:32] I am thinking like commenting the recipe, nothing too big [10:36:42] the least amount of work [10:37:33] or creating a recipe that doesn't write the partition by default [10:37:40] that is also needed [10:37:49] and that would be helpful [10:38:13] nothing that modifies dhcp or tftp substantially [10:41:02] !log reboot asw2-d T133387 [10:41:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:08] T133387: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387 [10:41:31] jynus: I'm not sure that works if you mean commenting the recipe in modules/install_server/files/dhcpd/linux-host-entries.ttyS1-115200 [10:41:39] (03CR) 10Giuseppe Lavagetto: [C: 031] Add MediaWiki config tasks for ro/rw mode [switchdc] - 10https://gerrit.wikimedia.org/r/343858 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [10:42:09] s/commenting/replacing it with do_not_reinstall recipe/ :-) [10:42:55] I think replacing it with sda_with_no_srv_deletion would be enough [10:43:23] which is what we want anyway for the upcoming stretch [10:44:11] PROBLEM - Host asw2-d-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [10:44:24] 06Operations, 13Patch-For-Review: Package the next LTS kernel (4.9) - https://phabricator.wikimedia.org/T154934#3121271 (10ema) I've noticed the following warning on cp4011. It could potentially be related to the upgrade to 4.9 given that I haven't seen the same message on any other machine. `TCP: eth0: Driv... 
[10:44:33] (03CR) 10Volans: [C: 032] Add MediaWiki config tasks for ro/rw mode [switchdc] - 10https://gerrit.wikimedia.org/r/343858 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [10:47:15] 06Operations: confctl no longer logs a non-changing state change - https://phabricator.wikimedia.org/T161096#3121275 (10MoritzMuehlenhoff) [10:49:31] RECOVERY - Host asw2-d-eqiad is UP: PING WARNING - Packet loss = 86%, RTA = 4.40 ms [10:49:41] PROBLEM - Host mc1033 is DOWN: PING CRITICAL - Packet loss = 100% [10:49:51] PROBLEM - Host mc1034 is DOWN: PING CRITICAL - Packet loss = 100% [10:49:51] PROBLEM - Host mc1035 is DOWN: PING CRITICAL - Packet loss = 100% [10:49:51] PROBLEM - Host mc1036 is DOWN: PING CRITICAL - Packet loss = 100% [10:50:31] RECOVERY - Host mc1034 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [10:50:41] RECOVERY - Host mc1033 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [10:50:41] RECOVERY - Host mc1035 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [10:50:41] RECOVERY - Host mc1036 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [10:50:46] (03PS2) 10Alexandros Kosiaris: sync_icinga_state: stop/start the service [puppet] - 10https://gerrit.wikimedia.org/r/343889 [10:50:57] <_joe_> wat [10:50:59] ok asw2 rebooted... [10:51:03] _joe_: read backlog [10:51:03] <_joe_> oh the switch rebooted [10:51:07] :P [10:51:13] <_joe_> yeah I thought we were done already [10:51:18] you 'd love to [10:51:35] now I have to check that they actually fixed the bug [10:51:45] too many mc* on the place, isn't it? [10:51:58] oh, it is only 4 [10:52:01] I read 8 [10:52:41] PROBLEM - puppet last run on mc1036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:52:51] PROBLEM - puppet last run on mc1035 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:53:02] !log cr1-eqiad: set ae4 and members to enable again. 
T133387 [10:53:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:09] T133387: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387 [10:53:09] 06Operations, 10MediaWiki-Internationalization, 07HHVM: Uninitialized string offset warnings with HHVM 3.18 in LanguageAz.php and LanguageKk.php - https://phabricator.wikimedia.org/T161095#3121308 (10Aklapper) p:05Triage>03High https://phabricator.wikimedia.org/source/mediawiki/browse/wmf%252F1.29.0-wmf.... [10:53:26] making some coffee and checking to see if we are still affected by T133387 [10:56:34] (03PS3) 10Volans: Check that core DBs replica is in sync [switchdc] - 10https://gerrit.wikimedia.org/r/343627 (https://phabricator.wikimedia.org/T160178) [11:00:36] 06Operations, 10MediaWiki-Internationalization, 07HHVM: Uninitialized string offset warnings with HHVM 3.18 in LanguageAz.php and LanguageKk.php - https://phabricator.wikimedia.org/T161095#3121342 (10MoritzMuehlenhoff) [11:00:38] 06Operations, 07HHVM, 13Patch-For-Review, 07Upstream: Build / migrate to HHVM 3.18 - https://phabricator.wikimedia.org/T158176#3121341 (10MoritzMuehlenhoff) [11:01:31] 06Operations, 07HHVM, 13Patch-For-Review, 07Upstream: Build / migrate to HHVM 3.18 - https://phabricator.wikimedia.org/T158176#3028990 (10MoritzMuehlenhoff) HHVM 3.18.1 has been upgraded on mwdebug* and mw1261 (but currently depooled for T161095) [11:04:14] (03CR) 10Giuseppe Lavagetto: Check that core DBs replica is in sync (031 comment) [switchdc] - 10https://gerrit.wikimedia.org/r/343627 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [11:04:16] !log upgrade maps to nodejs 6 - T150354 [11:04:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:22] T150354: Implement Node6 support for Kartotherian/Tilerator - https://phabricator.wikimedia.org/T150354 [11:05:03] !log disabling puppet on all maps servers - T150354 
[11:05:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:11] (03Draft1) 10Paladox: Gerrit: Make gerritbot report the branch in the comment [puppet] - 10https://gerrit.wikimedia.org/r/344102 (https://phabricator.wikimedia.org/T161078) [11:06:14] (03PS2) 10Paladox: Gerrit: Make gerritbot report the branch in the comment [puppet] - 10https://gerrit.wikimedia.org/r/344102 (https://phabricator.wikimedia.org/T161078) [11:06:21] PROBLEM - Disk space on tungsten is CRITICAL: DISK CRITICAL - free space: /srv 66671 MB (3% inode=99%) [11:07:13] (03PS7) 10Gehel: maps - cleartables osm replication [puppet] - 10https://gerrit.wikimedia.org/r/341563 (https://phabricator.wikimedia.org/T157613) [11:08:31] PROBLEM - puppet last run on mw1237 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:08:39] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: [epic] System level upgrade for cirrus / elasticsearch - https://phabricator.wikimedia.org/T151324#3121406 (10Deskana) [11:08:58] 06Operations, 05DC-Switchover-Prep-Q3-2016-17, 07Epic, 07Wikimedia-Multiple-active-datacenters: Prepare and improve the datacenter switchover procedure - https://phabricator.wikimedia.org/T154658#3121410 (10Deskana) [11:09:09] (03CR) 10Volans: Check that core DBs replica is in sync (031 comment) [switchdc] - 10https://gerrit.wikimedia.org/r/343627 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [11:09:19] (03CR) 10Paladox: "Tested here https://phab-01.wmflabs.org/T20#2856" [puppet] - 10https://gerrit.wikimedia.org/r/344102 (https://phabricator.wikimedia.org/T161078) (owner: 10Paladox) [11:09:23] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I'll be honest: I don't like this that much, but I suspect it's hard to do anything better:" [switchdc] - 10https://gerrit.wikimedia.org/r/343627 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [11:11:02] (03PS2) 10Ema: cache_misc: support 
WMF-Last-Access-Global cookie [puppet] - 10https://gerrit.wikimedia.org/r/344071 (https://phabricator.wikimedia.org/T138027) (owner: 10BBlack) [11:11:12] (03PS1) 10Gehel: maps - remove the pinning of nodejs version [puppet] - 10https://gerrit.wikimedia.org/r/344107 (https://phabricator.wikimedia.org/T150354) [11:11:24] 06Operations, 13Patch-For-Review: Package the next LTS kernel (4.9) - https://phabricator.wikimedia.org/T154934#3121431 (10MoritzMuehlenhoff) I had a look at that GRO error message yesterday: The check was only introduced in 4.9 with https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id... [11:11:30] (03PS1) 10Jcrespo: install_server: Create db-no-srv-format install recipe [puppet] - 10https://gerrit.wikimedia.org/r/344108 [11:12:55] (03CR) 10MaxSem: [C: 031] maps - remove the pinning of nodejs version [puppet] - 10https://gerrit.wikimedia.org/r/344107 (https://phabricator.wikimedia.org/T150354) (owner: 10Gehel) [11:15:20] !log Enable IGMP snooping for private1-d-eqiad. T133387 [11:15:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:26] T133387: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387 [11:15:27] !log Enable IGMP snooping for private1-d-eqiad on asw2-d. 
T133387 [11:15:28] (03CR) 10Jcrespo: Check that core DBs replica is in sync (031 comment) [switchdc] - 10https://gerrit.wikimedia.org/r/343627 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [11:15:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:53] (03CR) 10Gehel: [C: 032] maps - remove the pinning of nodejs version [puppet] - 10https://gerrit.wikimedia.org/r/344107 (https://phabricator.wikimedia.org/T150354) (owner: 10Gehel) [11:17:13] the bug seems to have been partially fixed [11:18:41] RECOVERY - puppet last run on mc1036 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [11:18:51] RECOVERY - puppet last run on mc1035 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [11:19:25] !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps2001.codfw.wmnet [11:19:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:00] akosiaris, if this works, this would be enough for me: https://gerrit.wikimedia.org/r/344108 [11:22:13] (03PS4) 10Volans: Check that core DBs replica is in sync [switchdc] - 10https://gerrit.wikimedia.org/r/343627 (https://phabricator.wikimedia.org/T160178) [11:23:43] (03PS5) 10Volans: Check that core DBs replica is in sync [switchdc] - 10https://gerrit.wikimedia.org/r/343627 (https://phabricator.wikimedia.org/T160178) [11:25:43] (03PS3) 10Ema: cache_misc: support WMF-Last-Access-Global cookie [puppet] - 10https://gerrit.wikimedia.org/r/344071 (https://phabricator.wikimedia.org/T138027) (owner: 10BBlack) [11:25:51] (03CR) 10Ema: [V: 032 C: 032] cache_misc: support WMF-Last-Access-Global cookie [puppet] - 10https://gerrit.wikimedia.org/r/344071 (https://phabricator.wikimedia.org/T138027) (owner: 10BBlack) [11:27:58] !log maps2001.codfw.wmnet upgraded to nodejs6 [11:28:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:51] !log gehel@puppetmaster1001 
conftool action : set/pooled=yes; selector: name=maps2001.codfw.wmnet [11:33:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:34] !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps2002.codfw.wmnet [11:34:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:32] RECOVERY - puppet last run on mw1237 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [11:38:51] PROBLEM - puppet last run on logstash1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:39:00] 06Operations, 10netops: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387#3121483 (10akosiaris) 05stalled>03Open The upgrade to 14.1X53-D42.3 seems to have resolved the issue, at least partially. That is, during tcpdumps before e... [11:41:37] 06Operations, 10Monitoring, 10Traffic: Performance impact evaluation of enabling nginx-lua and nginx-lua-prometheus on tlsproxy - https://phabricator.wikimedia.org/T161101#3121488 (10ema) [11:41:55] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps2002.codfw.wmnet [11:42:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:02] 06Operations, 10Monitoring, 10Traffic: Performance impact evaluation of enabling nginx-lua and nginx-lua-prometheus on tlsproxy - https://phabricator.wikimedia.org/T161101#3121502 (10ema) p:05Triage>03Normal [11:43:32] 06Operations, 10netops: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387#3121505 (10akosiaris) 05Open>03stalled Resetting to stalled status for 24 hours after some discussions on IRC [11:45:50] (03PS1) 10Muehlenhoff: Remove access credentials for tomasz [puppet] - 10https://gerrit.wikimedia.org/r/344111 [11:46:07] !log gehel@puppetmaster1001 
conftool action : set/pooled=no; selector: name=maps2004.codfw.wmnet [11:46:07] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Another couple of comments." (032 comments) [switchdc] - 10https://gerrit.wikimedia.org/r/343627 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [11:46:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:47] (03PS1) 10Filippo Giunchedi: librenms: use base::service_unit [puppet] - 10https://gerrit.wikimedia.org/r/344112 (https://phabricator.wikimedia.org/T125020) [11:46:49] (03PS1) 10Filippo Giunchedi: apache 2.4 compat for librenms/servermon/smokeping [puppet] - 10https://gerrit.wikimedia.org/r/344113 (https://phabricator.wikimedia.org/T125020) [11:46:57] (03CR) 10Alexandros Kosiaris: install_server: Create db-no-srv-format install recipe (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/344108 (owner: 10Jcrespo) [11:49:20] (03CR) 10Jcrespo: install_server: Create db-no-srv-format install recipe (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/344108 (owner: 10Jcrespo) [11:50:27] (03PS2) 10Muehlenhoff: Remove access credentials for tomasz [puppet] - 10https://gerrit.wikimedia.org/r/344111 [11:51:04] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps2004.codfw.wmnet [11:51:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:37] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps2003.codfw.wmnet [11:51:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:32] !log maps codfw fully upgraded to nodejs 6, starting upgrade on maps eqiad - T150354 [11:53:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:38] T150354: Implement Node6 support for Kartotherian/Tilerator - https://phabricator.wikimedia.org/T150354 [11:54:07] (03CR) 10Muehlenhoff: [C: 032] Remove access credentials for tomasz [puppet] - 
10https://gerrit.wikimedia.org/r/344111 (owner: 10Muehlenhoff) [11:54:42] !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps1001.eqiad.wmnet [11:54:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:16] (03CR) 10Alexandros Kosiaris: [C: 04-1] apache 2.4 compat for librenms/servermon/smokeping (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/344113 (https://phabricator.wikimedia.org/T125020) (owner: 10Filippo Giunchedi) [11:55:39] (03CR) 10Alexandros Kosiaris: "ok, fine by me then" [puppet] - 10https://gerrit.wikimedia.org/r/344108 (owner: 10Jcrespo) [11:57:39] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps1001.eqiad.wmnet [11:57:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:27] !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps1002.eqiad.wmnet [11:58:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:06] 06Operations, 10netops: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387#3121541 (10faidon) So the situation is very unclear and quite messy, which is why I wanted to try this (thanks @akosiaris for tackling this): - We previously h... 
[12:01:50] (03CR) 10Marostegui: [C: 031] "Very nice and safe idea" [puppet] - 10https://gerrit.wikimedia.org/r/344108 (owner: 10Jcrespo) [12:01:55] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps1002.eqiad.wmnet [12:02:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:05] !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps1003.eqiad.wmnet [12:02:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:38] (03PS2) 10Filippo Giunchedi: apache 2.4 compat for librenms/servermon/smokeping [puppet] - 10https://gerrit.wikimedia.org/r/344113 (https://phabricator.wikimedia.org/T125020) [12:02:50] (03CR) 10Jcrespo: "> Very nice and safe idea" [puppet] - 10https://gerrit.wikimedia.org/r/344108 (owner: 10Jcrespo) [12:02:59] (03CR) 10Filippo Giunchedi: apache 2.4 compat for librenms/servermon/smokeping (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/344113 (https://phabricator.wikimedia.org/T125020) (owner: 10Filippo Giunchedi) [12:03:36] (03PS1) 10Marostegui: db-eqiad.php: Repooli db1087 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344117 (https://phabricator.wikimedia.org/T137191) [12:04:52] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps1003.eqiad.wmnet [12:04:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:00] !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps1004.eqiad.wmnet [12:05:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:51] RECOVERY - puppet last run on logstash1006 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [12:07:20] 06Operations, 10ops-codfw, 10DBA: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3121607 (10jcrespo) Can I reimage the server? 
https://gerrit.wikimedia.org/r/344108 [12:08:25] (03CR) 10Jcrespo: [C: 032] install_server: Create db-no-srv-format install recipe [puppet] - 10https://gerrit.wikimedia.org/r/344108 (owner: 10Jcrespo) [12:08:31] (03PS2) 10Jcrespo: install_server: Create db-no-srv-format install recipe [puppet] - 10https://gerrit.wikimedia.org/r/344108 [12:08:32] PROBLEM - puppet last run on mw1237 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:08:34] 06Operations, 10ops-codfw, 10DBA: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3121612 (10Marostegui) Go ahead [12:09:21] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps1004.eqiad.wmnet [12:09:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:43] !log maps upgrade to nodejs 6 completed - T150354 [12:09:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:49] T150354: Implement Node6 support for Kartotherian/Tilerator - https://phabricator.wikimedia.org/T150354 [12:13:12] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repooli db1087 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344117 (https://phabricator.wikimedia.org/T137191) (owner: 10Marostegui) [12:13:54] (03PS2) 10Marostegui: sanitarium2.my.cnf: Set slave_type_conversions [puppet] - 10https://gerrit.wikimedia.org/r/344087 (https://phabricator.wikimedia.org/T73563) [12:14:37] (03Merged) 10jenkins-bot: db-eqiad.php: Repooli db1087 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344117 (https://phabricator.wikimedia.org/T137191) (owner: 10Marostegui) [12:14:49] (03CR) 10jenkins-bot: db-eqiad.php: Repooli db1087 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344117 (https://phabricator.wikimedia.org/T137191) (owner: 10Marostegui) [12:15:45] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1087 with low weight - T137191 
(duration: 00m 43s) [12:15:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:50] T137191: Defragment db1070, db1082, db1087, db1092 - https://phabricator.wikimedia.org/T137191 [12:17:24] (03CR) 10Alexandros Kosiaris: [C: 031] apache 2.4 compat for librenms/servermon/smokeping [puppet] - 10https://gerrit.wikimedia.org/r/344113 (https://phabricator.wikimedia.org/T125020) (owner: 10Filippo Giunchedi) [12:18:16] !log installing latest mapnik version on maps servers [12:18:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:37] 06Operations, 10Domains, 10Traffic, 06WMF-Legal, 13Patch-For-Review: Using wikimedia.ee mail address as Google account - https://phabricator.wikimedia.org/T158638#3121623 (10Beetlebeard) Hello again. The thing is, since the last verification process completion took too long, G Suite blocked and later del... [12:18:48] (03PS1) 10Marostegui: db-eqiad.php: Repool db1084,depool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344122 (https://phabricator.wikimedia.org/T73563) [12:19:33] (03CR) 10Marostegui: [C: 032] "Looks good so deploying: https://puppet-compiler.wmflabs.org/5859/" [puppet] - 10https://gerrit.wikimedia.org/r/344087 (https://phabricator.wikimedia.org/T73563) (owner: 10Marostegui) [12:20:21] !log maps restarting kartotherian - T150354 [12:20:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:28] T150354: Implement Node6 support for Kartotherian/Tilerator - https://phabricator.wikimedia.org/T150354 [12:21:05] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1084,depool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344122 (https://phabricator.wikimedia.org/T73563) (owner: 10Marostegui) [12:22:27] 06Operations, 10ops-codfw, 10DBA: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3092963 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by jynus on neodymium.eqiad.wmnet 
for hosts: ``` ['es2015.codfw.wmnet'] ``` The log can be found in `/var/log/wmf-auto-reimage/2017032... [12:22:29] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1084,depool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344122 (https://phabricator.wikimedia.org/T73563) (owner: 10Marostegui) [12:22:47] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1084,depool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344122 (https://phabricator.wikimedia.org/T73563) (owner: 10Marostegui) [12:23:34] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1084, depool db1081 T160415 - T73563 (duration: 00m 43s) [12:23:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:41] T160415: Review schema changes for T125071 - Add index to image table on all wikis - https://phabricator.wikimedia.org/T160415 [12:23:41] T73563: *_minor_mime are varbinary(32) on WMF sites, out of sync with varbinary(100) in MW core - https://phabricator.wikimedia.org/T73563 [12:26:37] !log Deploy schema change on s4 to db1081 and labsdb1011 - T160415 T73563 [12:26:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:01] !log cirrus: reindexing lost writes (2017-03-21T13:30:00Z to 2017-03-21T17:50:00Z) during es5 upgrade in elastic@eqiad (T157479) [12:29:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:07] T157479: Put together a production migration plan for ES 2 -> ES 5 - https://phabricator.wikimedia.org/T157479 [12:31:26] (03CR) 10Muehlenhoff: [C: 04-1] Fix some Debian lintian warnnings for the gerrit package (032 comments) [debs/gerrit] - 10https://gerrit.wikimedia.org/r/343297 (owner: 10Paladox) [12:37:31] RECOVERY - puppet last run on mw1237 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [12:41:04] (03PS1) 10Marostegui: db-eqiad.php: Enable db1087 in API service [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344126 
(https://phabricator.wikimedia.org/T137191) [12:45:02] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Enable db1087 in API service [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344126 (https://phabricator.wikimedia.org/T137191) (owner: 10Marostegui) [12:45:30] (03PS1) 10Muehlenhoff: Change email address for Yuvi [puppet] - 10https://gerrit.wikimedia.org/r/344133 [12:47:33] (03Merged) 10jenkins-bot: db-eqiad.php: Enable db1087 in API service [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344126 (https://phabricator.wikimedia.org/T137191) (owner: 10Marostegui) [12:47:41] (03CR) 10jenkins-bot: db-eqiad.php: Enable db1087 in API service [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344126 (https://phabricator.wikimedia.org/T137191) (owner: 10Marostegui) [12:48:41] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Enable db1087 for API - T137191 (duration: 00m 42s) [12:48:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:47] T137191: Defragment db1070, db1082, db1087, db1092 - https://phabricator.wikimedia.org/T137191 [12:50:35] jouncebot: next [12:50:35] In 0 hour(s) and 9 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170322T1300) [12:50:56] dcausse: your patch is the only one for eu swat, are you deploying it yourself? 
[12:55:01] PROBLEM - Disk space on labtestnet2001 is CRITICAL: DISK CRITICAL - free space: / 347 MB (3% inode=74%) [12:55:16] ^looking [12:56:44] (03PS3) 10Volans: utils: add create_ecdsa_cert [puppet] - 10https://gerrit.wikimedia.org/r/340107 (https://phabricator.wikimedia.org/T158757) (owner: 10Giuseppe Lavagetto) [12:58:01] RECOVERY - Disk space on labtestnet2001 is OK: DISK OK [12:58:03] 06Operations, 10MediaWiki-Internationalization, 07HHVM: Uninitialized string offset warnings with HHVM 3.18 in LanguageAz.php and LanguageKk.php - https://phabricator.wikimedia.org/T161095#3121710 (10Nikerabbit) I'm leaving this open in case someone is expected to do backports. [12:58:10] 06Operations, 10Analytics, 10ChangeProp, 10Citoid, and 11 others: Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#3121707 (10MaxSem) [12:58:15] (03PS1) 10BBlack: discovery: set TTL range to 300/10 [dns] - 10https://gerrit.wikimedia.org/r/344135 [12:58:18] zeljkof: ok will swat my patch [12:58:35] dcausse: thanks! [12:59:17] (03CR) 10BBlack: [C: 032] discovery: set TTL range to 300/10 [dns] - 10https://gerrit.wikimedia.org/r/344135 (owner: 10BBlack) [12:59:46] (03CR) 10BBlack: [C: 032] authdns::statefile: add TTL for entries [puppet] - 10https://gerrit.wikimedia.org/r/344085 (owner: 10Giuseppe Lavagetto) [12:59:53] (03PS2) 10BBlack: authdns::statefile: add TTL for entries [puppet] - 10https://gerrit.wikimedia.org/r/344085 (owner: 10Giuseppe Lavagetto) [13:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170322T1300). [13:00:04] dcausse: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. 
[13:00:16] I can swat [13:00:40] 06Operations, 10DBA, 10Monitoring, 10media-storage: icinga hp raid check timeout on busy ms-be machines - https://phabricator.wikimedia.org/T141252#3121712 (10jcrespo) [13:00:55] (03CR) 10Hashar: [C: 031] [cirrus] Enable the completion suggester [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344099 (owner: 10DCausse) [13:01:09] (03CR) 10BBlack: [V: 032 C: 032] authdns::statefile: add TTL for entries [puppet] - 10https://gerrit.wikimedia.org/r/344085 (owner: 10Giuseppe Lavagetto) [13:01:30] 06Operations, 10DBA, 10Monitoring, 10media-storage: icinga hp raid check timeout on busy ms-be machines - https://phabricator.wikimedia.org/T141252#2491529 (10jcrespo) [13:01:54] 06Operations, 10DBA, 10Monitoring, 10media-storage: icinga hp raid check timeout on busy ms-be and db machines - https://phabricator.wikimedia.org/T141252#2491529 (10jcrespo) [13:02:02] (03CR) 10DCausse: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344099 (owner: 10DCausse) [13:03:45] (03Merged) 10jenkins-bot: [cirrus] Enable the completion suggester [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344099 (owner: 10DCausse) [13:04:20] 06Operations, 10DBA, 10Monitoring, 10media-storage: icinga hp raid check timeout on busy ms-be and db machines - https://phabricator.wikimedia.org/T141252#2491529 (10Marostegui) Good example of a db server where that happens with big alter tables: dbstore2001 [13:05:00] o/ [13:05:08] looks straight forward isn't it ? 
:-} [13:05:16] I guess you can proceed dcausse [13:05:32] hashar: yes, already live on mwdebug1002, testing [13:07:00] !log bblack@puppetmaster1001 conftool action : set/ttl=275; selector: dnsdisc=appservers-rw [13:07:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:07] !log dcausse@tin Synchronized wmf-config/InitialiseSettings.php: [cirrus] Enable the completion suggester (duration: 00m 43s) [13:09:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:17] fatalmonitor look good so far, will wait few minutes and monitor a bit more in logstash [13:11:21] 06Operations, 10Analytics, 10Analytics-Cluster, 10hardware-requests: EQIAD: stat1002 replacement - https://phabricator.wikimedia.org/T159838#3121732 (10Ottomata) stat1003 ticket is T159839. This stat1002 replacement ticket is waiting on feedback from @dartar and @halfak about acceptable GPU specs. [13:15:31] (03CR) 10jenkins-bot: [cirrus] Enable the completion suggester [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344099 (owner: 10DCausse) [13:15:34] ok logs look good, I can't any failures related to prefixsearch/opensearch api calls [13:15:43] can't see [13:15:57] !log eu swat done [13:16:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:01] PROBLEM - nova-network process on labtestnet2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-network [13:19:26] (03PS1) 10BBlack: discovery failoid: real IPs, TTL=10 [puppet] - 10https://gerrit.wikimedia.org/r/344140 [13:19:33] (03PS16) 10Paladox: Fix some Debian lintian warnnings for the gerrit package [debs/gerrit] - 10https://gerrit.wikimedia.org/r/343297 [13:19:50] (03CR) 10BBlack: [V: 032 C: 032] discovery failoid: real IPs, TTL=10 [puppet] - 10https://gerrit.wikimedia.org/r/344140 (owner: 10BBlack) [13:24:02] (03PS2) 10BBlack: Add new discovery entries [dns] - 10https://gerrit.wikimedia.org/r/344093 (owner: 
10Giuseppe Lavagetto) [13:24:08] (03CR) 10Muehlenhoff: [C: 031] "Looks fine now." [debs/gerrit] - 10https://gerrit.wikimedia.org/r/343297 (owner: 10Paladox) [13:24:33] (03PS6) 10Volans: Check that core DBs replica is in sync [switchdc] - 10https://gerrit.wikimedia.org/r/343627 (https://phabricator.wikimedia.org/T160178) [13:27:16] ^me [13:32:33] (03PS3) 10DCausse: [es5 upgrade] step 5: restore normal operations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342034 (https://phabricator.wikimedia.org/T157479) [13:32:51] PROBLEM - Auth DNS on eeden is CRITICAL: CRITICAL - Plugin timed out while executing system call [13:33:05] PROBLEM - Auth DNS on ns2-v4 is CRITICAL: CRITICAL - Plugin timed out while executing system call [13:33:11] PROBLEM - Check systemd state on eeden is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:33:17] bblack: ^^^ [13:33:32] yeah I see [13:33:56] PROBLEM - Auth DNS on ns2-v6 is CRITICAL: CRITICAL - Plugin timed out while executing system call [13:34:11] RECOVERY - Check systemd state on eeden is OK: OK - running: The system is fully operational [13:34:13] whoops [13:34:15] eeden again? [13:34:29] not sure [13:34:40] <_joe_> ns2 is codfw? [13:34:42] Stale template error files present for '/var/lib/gdnsd/discovery-appservers-rw.state' [13:34:52] that's fine [13:34:54] ok [13:35:01] <_joe_> jynus: that's fine [13:35:03] the core issue is my last authdns patch is faulty [13:35:15] (but not caught, because puppet/dns split and thus no good linting) [13:35:25] but there was no restart of the gdnsd servers to pick it up yet, either [13:35:34] so I don't know why gdnsd tried to restart on just eeden [13:36:00] maybe some non-trivial dependency [13:36:08] I happened to me sometime [13:36:10] Phabricator is throwing errors at me. [13:36:24] > Unable to establish a connection to any database host (while trying "phabricator_search"). All masters and replicas are completely unreachable. 
[13:36:41] PROBLEM - Auth DNS on baham is CRITICAL: CRITICAL - Plugin timed out while executing system call [13:36:47] PROBLEM - Auth DNS on ns1-v6 is CRITICAL: CRITICAL - Plugin timed out while executing system call [13:36:52] PROBLEM - Auth DNS on ns1-v4 is CRITICAL: CRITICAL - Plugin timed out while executing system call [13:36:53] all NSes are down I think [13:36:55] it seems dns may have created a small issue [13:37:02] PROBLEM - Recursive DNS on 2620:0:862:1:91:198:174:106 is CRITICAL: Domain www.wikipedia.org was not found by the server [13:37:02] PROBLEM - Check systemd state on baham is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:37:11] yep, gerrit for instance is down to me [13:37:12] <_joe_> I think so, yes [13:37:12] PROBLEM - Recursive DNS on 2620:0:860:1:208:80:153:12 is CRITICAL: Domain www.wikipedia.org was not found by the server [13:37:12] PROBLEM - Check systemd state on eeden is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:37:13] ah how nice [13:37:14] yeah, I can't resolve [13:37:16] bblack: ^^^ [13:37:25] all dns is down [13:37:29] wtf? [13:37:33] PROBLEM - Auth DNS on radon is CRITICAL: CRITICAL - Plugin timed out while executing system call [13:37:33] PROBLEM - puppet last run on mc1015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:37:36] "Woe! This request had its journey cut short by unexpected circumstances (Can Not Connect to MySQL)." [13:37:42] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:37:43] from phabricator [13:37:49] PROBLEM - Auth DNS on ns0-v6 is CRITICAL: CRITICAL - Plugin timed out while executing system call [13:37:53] RECOVERY - Auth DNS on ns2-v6 is OK: DNS OK: 0.093 seconds response time. 
www.wikipedia.org returns 208.80.154.224 [13:37:53] PROBLEM - Recursive DNS on 91.198.174.106 is CRITICAL: Domain www.wikipedia.org was not found by the server [13:37:59] PROBLEM - Auth DNS on ns0-v4 is CRITICAL: CRITICAL - Plugin timed out while executing system call [13:37:59] RECOVERY - Auth DNS on eeden is OK: DNS OK: 0.102 seconds response time. www.wikipedia.org returns 208.80.154.224 [13:38:09] PROBLEM - Check systemd state on radon is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:38:15] RECOVERY - Auth DNS on ns2-v4 is OK: DNS OK: 0.093 seconds response time. www.wikipedia.org returns 208.80.154.224 [13:38:16] PROBLEM - puppet last run on cp2008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:38:19] PROBLEM - puppet last run on bast2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:38:24] anyways, I'm fixing them [13:38:29] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:38:29] RECOVERY - Check systemd state on eeden is OK: OK - running: The system is fully operational [13:38:29] RECOVERY - Recursive DNS on 2620:0:860:1:208:80:153:12 is OK: DNS OK: 0.163 seconds response time. www.wikipedia.org returns 208.80.153.224 [13:38:29] PROBLEM - puppet last run on restbase2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:38:29] PROBLEM - puppet last run on cp4004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:38:30] RECOVERY - Auth DNS on radon is OK: DNS OK: 0.010 seconds response time. www.wikipedia.org returns 208.80.154.224 [13:38:30] PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[13:38:31] PROBLEM - restbase endpoints health on restbase2007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:38:31] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:38:32] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:38:33] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:38:33] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:38:33] PROBLEM - restbase endpoints health on restbase-test2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:38:34] PROBLEM - restbase endpoints health on restbase2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:38:45] PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:38:45] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:38:46] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /data/citation/{format}/{query} (Get citation for Darth Vader) is CRITICAL: Test Get citation for Darth Vader returned the unexpected status 520 (expecting: 200) [13:38:46] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:38:47] PROBLEM - puppet last run on es1011 is CRITICAL: CRITICAL: Puppet has 8 failures. Last run 2 minutes ago with 8 failures. Failed resources (up to 3 shown): File[/home/aaron],File[/home/jgreen],File[/home/gehel],File[/home/andrew] [13:38:49] PROBLEM - puppet last run on elastic1025 is CRITICAL: CRITICAL: Puppet has 18 failures. Last run 2 minutes ago with 18 failures. Failed resources (up to 3 shown) [13:38:53] RECOVERY - Auth DNS on ns0-v6 is OK: DNS OK: 0.015 seconds response time. 
www.wikipedia.org returns 208.80.154.224 [13:38:56] RECOVERY - Auth DNS on ns0-v4 is OK: DNS OK: 0.012 seconds response time. www.wikipedia.org returns 208.80.154.224 [13:38:59] RECOVERY - Recursive DNS on 91.198.174.106 is OK: DNS OK: 0.179 seconds response time. www.wikipedia.org returns 91.198.174.192 [13:39:11] I'm still trying to login grumble grumble [13:39:13] !log stopped ircecho to avoid the message spam [13:39:17] they're all fixed [13:39:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:19] Yvette, paravoid it should be ok now [13:39:20] (the authdns) [13:39:25] yeah, thanks [13:39:25] gerrit is back [13:39:29] puppet is disabled on them all, too [13:39:33] because I did the fix manually [13:39:36] And so is phabricator [13:39:39] 06Operations, 10Monitoring, 10Traffic: Performance impact evaluation of enabling nginx-lua and nginx-lua-prometheus on tlsproxy - https://phabricator.wikimedia.org/T161101#3121792 (10ema) === Methodology === We have benchmarked nginx performance on pinkunicorn (cp1008). An additional nginx location block has... [13:39:49] ...and this is why I've been worried about touching authdns. [13:39:52] paravoid, sorry, I meant paladox [13:39:58] I know you know :-) [13:40:03] paravoid: and this is why I've been worried about deploying it in a hacky way :P [13:40:04] thanks :) [13:40:08] I am checking impact now [13:40:23] foo => 192.0.2.1 [13:40:31] https://grafana.wikimedia.org/dashboard/db/dns?from=now-1h&to=now [13:40:35] we've never had a full DNS outage in at least the last 5 years [13:40:43] <_joe_> something I can do? [13:40:50] there's nothing currently wrong [13:41:03] bblack, I am checkin past impact :-) [13:41:15] this was scary and recovery time would have been much longer if bblack wasn't around [13:41:26] yep [13:41:36] paravoid: stop drawing conclusions before analysis? 
[13:41:46] :-) [13:41:53] either that or just step up and say revert everything related because this is all a horrible idea [13:42:14] we'll see after the analysis [13:42:57] the initial spike is 28K 503s/minute [13:43:34] for 2 minutes only [13:43:49] <_joe_> I see a 4 minute interval on radon [13:44:11] it could be- [13:44:46] 13:39-13:42 for me [13:44:57] <_joe_> Mar 22 13:34:43 radon systemd[1]: Stopping gdnsd... [13:45:00] I am going to check the proxies [13:45:11] yeah so the sequence that led to this is this: [13:45:20] in case they had failed over [13:45:21] <_joe_> Mar 22 13:38:21 radon systemd[1]: Started gdnsd. [13:45:23] (well, sequence of faulty things and human actions): [13:45:55] I am doing indirect checks only: https://grafana.wikimedia.org/dashboard/file/varnish-http-errors.json?panelId=7&fullscreen&from=now-1h&to=now [13:46:21] 1. The discovery authdns work already applied to the puppet repo (the current stuff that leaves things split) has a service dep for the config files => restart gdnsd [13:46:36] (that's wrong, I think, and unexpected at least by me in the present) [13:47:09] 2. I applied a simple-but-faulty patch: https://gerrit.wikimedia.org/r/#/c/344140/1 , which causes gdnsd to not start up due to config error [13:47:10] <_joe_> bblack: that's surely wrong [13:47:24] <_joe_> the restart from the new files [13:47:45] 3. I expected to have time to move around and check things and authdns-update before it hit, but random puppet agent runs restarted gdnsd, which then failed to start up [13:48:20] there's a 0. there -- the init script did a config check and refused to stop on a restart, but the systemd unit does not (cannot) do that [13:48:57] <_joe_> paravoid: yeah there is no way not to do that in systemd [13:49:00] IIRC [13:49:10] 4.
the puppet crontab are too close to each other (less than 4 minutes) [13:49:10] yeah it's a fairly typical issue [13:49:12] 06Operations, 03Interactive-Sprint, 06Maps (Tilerator): Increase frequency of OSM replication - https://phabricator.wikimedia.org/T137939#3121886 (10Deskana) [13:49:23] (03PS1) 10Marostegui: db-eqiad.php: Increase weight db1087 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344143 (https://phabricator.wikimedia.org/T137191) [13:49:31] yesterday I also got hit by a non-outage version of the split-repos linting issue, which was that authdns-lint gave +2 to a dns patch that then failed (without outage) at authdns-update time (for lack of matching puppet-side update for the mocked part) [13:49:32] <_joe_> so let's resolve the more pressing issue [13:49:59] puppet crontab: 29,59 - 3,33 - 4-34 :/ [13:50:02] 06Operations, 10MediaWiki-Internationalization, 07HHVM, 05MW-1.29-release (WMF-deploy-2017-03-28_(1.29.0-wmf.18)), and 2 others: Uninitialized string offset warnings with HHVM 3.18 in LanguageAz.php and LanguageKk.php - https://phabricator.wikimedia.org/T161095#3121891 (10hashar) [13:50:13] 06Operations, 10MediaWiki-Internationalization, 07HHVM, 05MW-1.29-release (WMF-deploy-2017-03-28_(1.29.0-wmf.18)), and 3 others: Uninitialized string offset warnings with HHVM 3.18 in LanguageAz.php and LanguageKk.php - https://phabricator.wikimedia.org/T161095#3121255 (10hashar) I am handling the backport... [13:50:16] in this case, I didn't even bother waiting for jenkins on the simple-but-faulty patch, because it can't check that file anyways and puppet shouldn't be deploying it [13:50:18] <_joe_> volans: yeah but then there is splay that should help a bit [13:50:30] I could/should have validated the syntax manually offline first :/ [13:50:55] <_joe_> bblack: well the issue is the way it was restarted by puppet, right? 
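The cron timing point above ("29,59 - 3,33 - 4-34") can be checked with a little arithmetic. This is a minimal sketch, assuming the three entries are the minute-of-hour puppet schedules of three separate hosts and that "4-34" was meant as "4,34"; the host labels are hypothetical, and puppet's splay (mentioned by _joe_) is ignored.

```python
from itertools import combinations

# Minute-of-hour schedules as quoted in the log. Host labels are hypothetical,
# and reading "4-34" as "4,34" is an assumption.
SCHEDULES = {
    "host_a": (29, 59),
    "host_b": (3, 33),
    "host_c": (4, 34),
}


def circular_gap(a, b, period=60):
    """Distance in minutes between two minute marks, wrapping around the hour."""
    d = abs(a - b) % period
    return min(d, period - d)


def pairwise_gaps(schedules):
    """Smallest gap between any run of one host and any run of another."""
    return {
        (x, y): min(circular_gap(a, b) for a in schedules[x] for b in schedules[y])
        for x, y in combinations(sorted(schedules), 2)
    }


def smallest_window(schedules, period=60):
    """Shortest span of minutes in which every host runs at least once."""
    starts = {m for runs in schedules.values() for m in runs}
    return min(
        max(min((m - start) % period for m in runs) for runs in schedules.values())
        for start in starts
    )


if __name__ == "__main__":
    print(pairwise_gaps(SCHEDULES))
    print(smallest_window(SCHEDULES))
```

Under these assumptions every host runs puppet within a few minutes of the others twice an hour, so a change that breaks a service on agent run can reach all of them before anyone has time to react.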
[13:50:56] 06Operations, 06DC-Ops: Change cdentinger's icinga sms gateway to Sprint - https://phabricator.wikimedia.org/T161112#3121897 (10cwdent) [13:51:08] well the issue is that I wasn't expecting puppet to restart it [13:51:23] I mean, the manifests make it obvious that it does, but I think that was unintentional [13:51:49] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase weight db1087 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344143 (https://phabricator.wikimedia.org/T137191) (owner: 10Marostegui) [13:52:32] there's some notify in there that should be "before" at best [13:52:49] (03PS1) 10Giuseppe Lavagetto: authdns: do not restart gdnsd on file changes [puppet] - 10https://gerrit.wikimedia.org/r/344144 [13:52:53] <_joe_> bblack: I just removed them [13:53:08] <_joe_> but if you prefer to have a "before", that's ok [13:53:16] well if they're not handled another way it's a problem on provisioning a new authdns [13:53:18] <_joe_> it won't be racy when you install a new server [13:53:34] (03Merged) 10jenkins-bot: db-eqiad.php: Increase weight db1087 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344143 (https://phabricator.wikimedia.org/T137191) (owner: 10Marostegui) [13:53:46] (03CR) 10jenkins-bot: db-eqiad.php: Increase weight db1087 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344143 (https://phabricator.wikimedia.org/T137191) (owner: 10Marostegui) [13:54:00] we have to have it not-racy when reinstalling, or we can serve wrong answers [13:54:26] <_joe_> bblack: /etc/gdnsd/discovery-map doesn't notify atm [13:54:27] anyways, there's time to contemplate that [13:54:28] that's part of the issue [13:54:29] er [13:54:32] sorry [13:54:40] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase db1087 weight - T137191 (duration: 00m 47s) [13:54:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:46] T137191: Defragment db1070, db1082, db1087, db1092 - 
https://phabricator.wikimedia.org/T137191 [13:54:51] that's one of the N issues addressed by the patch to move to puppet repo, actually [13:55:25] how would that have been avoided/addressed by that? [13:55:29] (03PS2) 10Giuseppe Lavagetto: authdns: do not restart gdnsd on file changes [puppet] - 10https://gerrit.wikimedia.org/r/344144 [13:55:32] ircecho coming back by puppet, let me know if you want it to be stopped a bit more untile the failed puppet runs recovers [13:56:47] probably for now the simplest fix is just to add the new files to the dep list for the authdns-local-update exec, in place of those notify [13:58:03] did we always have a notify on the main config file? [13:58:48] heh I did that, in 2014 [13:58:57] we just don't update the main config file often [13:59:04] also what I mentioned above [13:59:08] pre-jessie, that wasn't an issue [13:59:55] hm, or maybe it was [14:00:04] I don't see checkconf in the init script either [14:00:11] so maybe I'm just wrong there [14:00:18] dcausse: your swat change went all fine wasn't it? [14:01:07] I thought we did have something like that before [14:01:47] the init script called /usr/sbin/gdnsd restart [14:02:15] ah [14:02:16] there we go [14:02:30] "/usr/sbin/gdnsd restart" doesn't fail on bad config, intrinsically [14:02:35] yup [14:02:59] while under systemd that's not the case [14:03:16] so yeah, your 2014 change was probably pre-jessie/pre-systemd and at the time was fine [14:03:23] <_joe_> yeah [14:03:56] then implicitly jessie made "service gdnsd restart" kill gdnsd if the new config is faulty [14:04:15] ExecStop=/usr/sbin/gdnsd stop [14:04:20] from the systemd unit [14:04:31] there is only start and stop [14:04:38] yeah with systemd you don't get the option of having the daemon manage its own restart [14:04:40] so I guess it does stop+start? 
[14:04:43] <_joe_> volans: execrestart doesn't exist [14:04:59] it's a big historical pain-point between gdnsd+systemd [14:05:01] <_joe_> because lennart decided you don't know how to do it [14:05:19] yeah I remember the discussion with brandon about it [14:05:28] the native "/usr/sbin/gdnsd restart" starts the new daemon in parallel with the old until it's ready to serve queries (or fails due to bad config, etc), then hands off service and exits the old [14:05:29] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [14:05:39] RECOVERY - IPsec on cp1067 is OK: Strongswan OK - 44 ESP OK [14:05:46] at the same time there isn't an ExecRestartPre [14:05:54] in which you could do "gdnsd checkconf" and abort [14:05:59] RECOVERY - puppet last run on es1011 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [14:06:08] <_joe_> paravoid: yeah, no options for that [14:06:10] RECOVERY - puppet last run on cp2008 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [14:06:13] that's a pain point with other software too, I've experienced it when rewriting init scripts to systemd units in multiple packages of mine [14:06:17] I think nutcracker is one too [14:06:24] radsecproxy is another one, etc. 
[14:06:27] there's ExecReload which does not carry of course the same semantics [14:06:29] RECOVERY - puppet last run on bast2001 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [14:06:29] RECOVERY - puppet last run on cp3038 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [14:06:30] RECOVERY - puppet last run on cp4004 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [14:06:30] RECOVERY - puppet last run on mc2021 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [14:06:39] RECOVERY - puppet last run on mc1015 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [14:06:52] several upstreams provide a "check config" option but systemd does not have a way to use it :( [14:06:59] RECOVERY - puppet last run on elastic1025 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [14:07:09] RECOVERY - puppet last run on restbase1016 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [14:07:12] https://github.com/systemd/systemd/issues/2175 [14:07:19] the problem with ExecReload=/usr/sbin/gdnsd restart (or just doing /usr/sbin/gdnsd restart from outside of systemd) is that systemd then doesn't expect the daemon's PID to swap out, and kills the new daemon when the old one exits (tried before) [14:07:29] RECOVERY - puppet last run on restbase2001 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [14:07:29] RECOVERY - puppet last run on cp4001 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [14:07:29] RECOVERY - puppet last run on ms-be3004 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [14:07:29] RECOVERY - puppet last run on cp4011 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [14:07:39] RECOVERY - puppet last run on tin is OK: OK: Puppet is 
currently enabled, last run 55 seconds ago with 0 failures [14:07:43] bblack: ah, good point, hadn't thought of that [14:07:49] RECOVERY - IPsec on cp2019 is OK: Strongswan OK - 56 ESP OK [14:07:49] RECOVERY - IPsec on cp2004 is OK: Strongswan OK - 56 ESP OK [14:07:52] I'm not sure if it only applies to start, but also to restart, but maybe the linter could create /var/run/gdns-valid-config and then we have a ConditionPathExists= /var/run/gdns-valid-config in the systemd unit? [14:08:09] RECOVERY - puppet last run on elastic1052 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [14:08:09] RECOVERY - IPsec on cp2010 is OK: Strongswan OK - 56 ESP OK [14:08:10] RECOVERY - puppet last run on cp2012 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [14:08:10] I was referring to semantics change from the user calling the restart script [14:08:17] that is having to do systemctl reload [14:08:29] RECOVERY - puppet last run on mw2135 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [14:08:29] RECOVERY - IPsec on cp3035 is OK: Strongswan OK - 54 ESP OK [14:08:29] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 56 ESP OK [14:08:30] RECOVERY - puppet last run on cp3044 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [14:08:37] ExecStartPre does exist btw [14:08:39] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 56 ESP OK [14:08:49] akosiaris: by then it's too late [14:08:55] you already stopped [14:08:58] as stop has been called already [14:08:59] RECOVERY - puppet last run on mc1005 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [14:08:59] ExecRestartPre is new to me [14:08:59] RECOVERY - puppet last run on aqs1006 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [14:09:09] RECOVERY - puppet last run on elastic1024 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 
failures [14:09:10] RECOVERY - IPsec on cp2021 is OK: Strongswan OK - 36 ESP OK [14:09:15] I don't think that exists? [14:09:20] no it does not [14:09:25] oh someone mentioned it above [14:09:29] RECOVERY - puppet last run on mw2165 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:09:29] RECOVERY - puppet last run on cp3043 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [14:09:30] ExecStartPre [14:09:30] RECOVERY - puppet last run on cp3033 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [14:09:31] <_joe_> it doesn't [14:09:36] not RestartPre [14:09:39] 16:05 < paravoid> at the same time there isn't an ExecRestartPre [14:09:40] <_joe_> or it would've solved our problems [14:09:43] yes [14:09:50] but lennart [14:10:12] anyways, the two easiest fixes in the present are (1) fix the bad recent commit and (2) get rid of all notify=>gdnsd, replace where necc with requires deps on the initial authdns-local-update. [14:10:13] maybe we could abuse ExecStopPre ? [14:10:26] which would have very nice implications [14:10:35] like not being able to shutdown the daemon if the config is bad [14:10:36] :P [14:11:01] bblack: ack [14:11:09] what else though? [14:11:14] akosiaris: I doesn't exists, does it? [14:11:14] which, the way I wrote it sounds like a benefit, not a disadvantage [14:11:16] I'd like another layer of protection for this kind of outage :) [14:11:23] ah [14:11:25] only post [14:11:28] ExecStopPre= [14:11:28] systemd.socket(5) [14:11:35] grrrr only for sockets ? [14:11:37] really ? 
[14:11:39] ExecStopPost [14:12:01] stop may just be stop, not as part of a restart [14:12:36] so having that failed because a new on-disk config is not valid is also not the right thing to do [14:12:50] (03PS1) 10Marostegui: db-eqiad.php: Repool db1081, depool db1068 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344147 (https://phabricator.wikimedia.org/T73563) [14:13:01] yeah, define "right" in this sense [14:13:22] it might be argued that it is "right" to not deny the service if the config is bad [14:13:33] <_joe_> paravoid: that issue is appalling, I hope the denier there is not a systemd dev [14:13:49] (03PS1) 10BBlack: authdns: load static plugin, bugfix for dd3b4fd71 [puppet] - 10https://gerrit.wikimedia.org/r/344148 [14:14:45] <_joe_> btw, I think we can make puppet reload instead of restarting gdnsd [14:14:58] there is no reload, only reload-zones [14:15:18] <_joe_> oh I see [14:15:36] <_joe_> let me look at the systemd unit [14:15:54] that would be ok though as an action for ExecReload= right ? [14:15:57] hmm [14:15:57] well, puppet should never restart the daemon, only authdns scripts should (which always run checkconf) [14:16:09] yeah semantics... maybe you actually want to reload the entire config [14:16:11] not just the zones [14:16:12] damn [14:16:19] PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 698205 [14:16:45] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1081, depool db1068 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344147 (https://phabricator.wikimedia.org/T73563) (owner: 10Marostegui) [14:16:49] yeah I think bblack is right [14:16:51] <_joe_> bblack: I guess you tried to have ExecReload=/usr/sbin/gdnsd restart in systemd and it doesn't work? [14:17:05] <_joe_> yeah I agree, my patch earlier was doing that [14:17:18] <_joe_> puppet won't touch gdnsd upon config changes [14:17:20] bblack: did you figure out why you added that notify 3 years ago? 
[14:17:33] paravoid: in any case, back on the more contentious topic - the "dns in puppet" patch introduces a layer of separation here. puppet doesn't ever directly touch gdnsd's input files. it stages them to a staging directory, and the authdns-update scripts are the only ones that move config or zonefiles into place and checkconf -> restart/reload [14:17:49] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1081, depool db1068 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344147 (https://phabricator.wikimedia.org/T73563) (owner: 10Marostegui) [14:17:58] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1081, depool db1068 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344147 (https://phabricator.wikimedia.org/T73563) (owner: 10Marostegui) [14:18:00] paravoid: probably they weren't jessie at the time and it seemed saner than letting a config change go un-restarted after a merge accidentally for a long time [14:18:52] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1081, depool db1068 T160415 - T73563 (duration: 00m 43s) [14:18:56] right [14:18:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:59] T160415: Review schema changes for T125071 - Add index to image table on all wikis - https://phabricator.wikimedia.org/T160415 [14:18:59] T73563: *_minor_mime are varbinary(32) on WMF sites, out of sync with varbinary(100) in MW core - https://phabricator.wikimedia.org/T73563 [14:19:07] yeah that actually makes sense too [14:19:39] the process of pushing a config change into puppet, then running an authdns-update to have it take effect (without any zone changes) is also not very intuitive [14:20:27] in the "dns in puppet" model, puppet's really just managing the templating/creation of the set of data that compromises the intended config+zonefiles [14:20:37] and authdns-update is in charge of deploying all things authdns [14:20:38] <_joe_> ok if I'm not needed for this, I'll go back to try to decouple redis 
replication and puppet [14:21:15] (without puppet's help, if warranted by the situation) [14:21:30] PROBLEM - puppet last run on elastic1035 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:21:38] (03PS1) 10Jcrespo: install_server: Test the new db recipe db-no-srv-format on es2015 [puppet] - 10https://gerrit.wikimedia.org/r/344149 (https://phabricator.wikimedia.org/T160242) [14:22:35] (03CR) 10BBlack: [C: 032] authdns: load static plugin, bugfix for dd3b4fd71 [puppet] - 10https://gerrit.wikimedia.org/r/344148 (owner: 10BBlack) [14:22:43] (03PS1) 10Marostegui: db-eqiad.php: Restore db1087 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344150 (https://phabricator.wikimedia.org/T137191) [14:23:53] (03CR) 10Marostegui: "Shouldn't it point to: db-no-srv-format.cfg ?" [puppet] - 10https://gerrit.wikimedia.org/r/344149 (https://phabricator.wikimedia.org/T160242) (owner: 10Jcrespo) [14:24:05] anyways, if it hadn't been for the notify+systemd issues, this would've been caught today by authdns-update before it tried a restart [14:24:31] but in the absence of other changes, it wouldn't have tried a restart at all, since these config files aren't managed by it [14:24:42] !log Deploy schema change s4 to db1068 - https://phabricator.wikimedia.org/T160415 https://phabricator.wikimedia.org/T73563 [14:24:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:54] right, I was just thinking of that [14:25:10] also, the codepath of checkconf in authdns-update is pretty much untested [14:25:22] what do you mean? [14:25:44] oh, that lint always catches the issue first? [14:25:47] yes [14:25:54] it's been recently tested [14:25:57] so that code path probably hasn't been exercised much [14:25:58] oh? 
[14:25:59] (it still works) [14:26:11] heh, ok, good [14:26:50] with the current split (where dns repo mocks the config to pass lint), the hieradata change to puppet repo has to be merged and agent-deployed to authdns, before you should merge->deploy the related dns patch (with the zonefile entry + mock test config) [14:27:19] I deploy a pair of such patches backwards yesterday accidentally, meaning I hadn't merged the puppet side at all. But authdns-lint said +2 to the dns side, so I tried to run authdns-update [14:27:32] and it failed at the checkconf step for lack of the config inputs missing from the puppet commit [14:27:45] right ok [14:30:51] I wonder if there are broken resolvers out there that tried to resolve us, failed while the 3 NSes were down, then cached the negative result [14:31:11] it sounds like a /very/ broken behavior, so hopefully it's not happening much [14:31:50] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Restore db1087 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344150 (https://phabricator.wikimedia.org/T137191) (owner: 10Marostegui) [14:31:55] bblack: do you have any sense on whether that's something we would need to worry about? [14:32:29] PROBLEM - Postgres Replication Lag on maps-test2002 is CRITICAL: CRITICAL - Rep Delay is: 18730.56468 Seconds [14:33:09] PROBLEM - Postgres Replication Lag on maps-test2004 is CRITICAL: CRITICAL - Rep Delay is: 18768.632646 Seconds [14:33:09] PROBLEM - Postgres Replication Lag on maps-test2003 is CRITICAL: CRITICAL - Rep Delay is: 18771.38542 Seconds [14:33:20] second day mark is away, second outage [14:33:33] it's the curse again! [14:33:52] (03PS1) 10BBlack: authdns: never restart gdnsd, keep initial install working right [puppet] - 10https://gerrit.wikimedia.org/r/344152 [14:34:35] paravoid: third day actually ;) [14:34:37] paravoid: that would be pretty broken. was there an actual window when all 3 were down? 
when I was first editing->fixing eeden radon was still up I thought. I haven't looked at graphs yet [14:34:39] (03Merged) 10jenkins-bot: db-eqiad.php: Restore db1087 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344150 (https://phabricator.wikimedia.org/T137191) (owner: 10Marostegui) [14:34:51] (03CR) 10jenkins-bot: db-eqiad.php: Restore db1087 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344150 (https://phabricator.wikimedia.org/T137191) (owner: 10Marostegui) [14:35:01] bblack: it doesn't hurt, but you don't need the last require [14:35:17] the git::clone one yeah I thought so [14:35:21] no [14:35:26] since it's already notify [14:35:38] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Restore db1087 original weight - T137191 (duration: 00m 44s) [14:35:39] oh you mean on the file paths [14:35:41] right :) [14:35:43] the File['/etc/gdnsd/discovery-map'] -> File['/etc/gdnsd'] [14:35:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:46] T137191: Defragment db1070, db1082, db1087, db1092 - https://phabricator.wikimedia.org/T137191 [14:35:50] puppet implicitly creates that one [14:36:09] notify implies requires too, though [14:37:14] (03PS1) 10BBlack: authdns: remove redundant relationships [puppet] - 10https://gerrit.wikimedia.org/r/344153 [14:37:20] heh [14:37:21] bblack: yes, looks like for ~1m they were all down from the graph I can check from systemd logs [14:37:27] (03PS3) 10Ottomata: Load wiki project namespace map into HDFS weekly, sqoop mediawiki monthly [puppet] - 10https://gerrit.wikimedia.org/r/343753 (https://phabricator.wikimedia.org/T160083) [14:37:54] and the problem there is the puppet crontabs are so close for this 3 critical hosts, they should more distributed [14:37:58] (03CR) 10jerkins-bot: [V: 04-1] authdns: never restart gdnsd, keep initial install working right [puppet] - 10https://gerrit.wikimedia.org/r/344152 (owner: 10BBlack) [14:38:07] (03CR) 
10Jcrespo: "It does:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/344149 (https://phabricator.wikimedia.org/T160242) (owner: 10Jcrespo) [14:38:12] yeah... [14:38:39] (03PS2) 10Jcrespo: install_server: Test the new db recipe db-no-srv-format on es2015 [puppet] - 10https://gerrit.wikimedia.org/r/344149 (https://phabricator.wikimedia.org/T160242) [14:38:46] so, I have a solution for the cron timing thing, but I don't think it's quite good enough yet for solving the general-case problem like this [14:38:48] do we have an easy way to force the distribution? [14:39:09] (03CR) 10Marostegui: [C: 031] "Sorry - missed it!" [puppet] - 10https://gerrit.wikimedia.org/r/344149 (https://phabricator.wikimedia.org/T160242) (owner: 10Jcrespo) [14:39:11] ideally this "better solution" should be extended to take roles into account and such [14:39:21] it should have some sort of cluster weight, distribute amongst the cluster and amongst all hosts in the same dc [14:39:33] or better, that affers to the same puppetmaster [14:39:55] right [14:39:57] (03PS4) 10Ottomata: Load wiki project namespace map into HDFS weekly, sqoop mediawiki monthly [puppet] - 10https://gerrit.wikimedia.org/r/343753 (https://phabricator.wikimedia.org/T160083) [14:40:32] (03CR) 10Jcrespo: [C: 032] "No need to be sorry- I am always happy about a review." 
[puppet] - 10https://gerrit.wikimedia.org/r/344149 (https://phabricator.wikimedia.org/T160242) (owner: 10Jcrespo) [14:40:57] volans: this is how I fixed the cron-timing issue for a narrower case (than our big global agent cron): [14:41:01] volans: https://github.com/wikimedia/puppet/blob/production/modules/wmflib/lib/puppet/parser/functions/cron_splay.rb [14:41:21] that's smart enough, when applied to a single cluster which uses fooNNNN naming [14:41:40] the basic idea could be expanded to apply to a global cron like agent, and pay attention to roles to figure out clusters, etc [14:41:54] (and get $::site instead of relying on NNNN to know which DC) [14:42:47] yeah I remeber the splay [14:44:14] given the unique constraints of the agent cron (many many hosts, very short interval) maybe it's worth its own similar parser function that isn't fully generic, I donno [14:44:55] where it splits clusters based on role/profile-set and does a separate splay for each unique role/profile set paying attention to $::site [14:44:57] (03PS5) 10Ottomata: Load wiki project namespace map into HDFS weekly, sqoop mediawiki monthly [puppet] - 10https://gerrit.wikimedia.org/r/343753 (https://phabricator.wikimedia.org/T160083) [14:45:08] (03CR) 10Ottomata: [V: 032 C: 032] Load wiki project namespace map into HDFS weekly, sqoop mediawiki monthly [puppet] - 10https://gerrit.wikimedia.org/r/343753 (https://phabricator.wikimedia.org/T160083) (owner: 10Ottomata) [14:45:16] but I donno, when you think about it, how far off optimal were we anyways [14:45:52] with 30-minute puppet timing and 3x authdns, best-cast was :00, :10, :20. Which is only about twice as good as what we had in practice. 
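Note on the splay arithmetic above: the ":00, :10, :20" best case for three hosts on a 30-minute interval can be reproduced with a few lines. This is a minimal sketch of even spacing only, and it is not the logic of the linked cron_splay.rb, which also considers host naming and datacenter. The function names and the host names `ns-a`, `ns-b`, `ns-c` are made up for illustration.

```python
from typing import Dict, List


def even_splay(hosts: List[str], interval_minutes: int) -> Dict[str, int]:
    """Assign each host a start minute, evenly spaced across the interval."""
    if not hosts:
        return {}
    ordered = sorted(hosts)  # stable ordering so the assignment is repeatable
    step = interval_minutes / len(ordered)
    return {host: int(i * step) for i, host in enumerate(ordered)}


def spread_minutes(offsets: Dict[str, int]) -> int:
    """Minutes between the first and the last host's run within one interval."""
    values = list(offsets.values())
    return max(values) - min(values) if values else 0


if __name__ == "__main__":
    offsets = even_splay(["ns-a", "ns-b", "ns-c"], 30)
    print(offsets)                  # start minute per host
    print(spread_minutes(offsets))  # gap between first and last run
```

With three hosts and a 30-minute interval the spread is 20 minutes, which is the window a bad change would take to reach every host under ideal spacing.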
[14:47:23] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3122024 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` ['es2015.codfw.wmnet'] ``` The log can be found in `/var/log/wm... [14:47:49] PROBLEM - puppet last run on es1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:49:29] RECOVERY - puppet last run on elastic1035 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [14:49:47] (03PS1) 10Ottomata: s/day/monthday/ in sqoop_mediawiki job [puppet] - 10https://gerrit.wikimedia.org/r/344155 [14:49:49] PROBLEM - puppet last run on analytics1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:50:10] (03CR) 10Ottomata: [V: 032 C: 032] s/day/monthday/ in sqoop_mediawiki job [puppet] - 10https://gerrit.wikimedia.org/r/344155 (owner: 10Ottomata) [14:50:51] (03PS2) 10BBlack: authdns: never restart gdnsd, keep initial install working right [puppet] - 10https://gerrit.wikimedia.org/r/344152 [14:50:53] (03PS2) 10BBlack: authdns: remove redundant relationships [puppet] - 10https://gerrit.wikimedia.org/r/344153 [14:52:20] bblack: yes although we have now 4m between all of them, with the ideal 0,10,20 we'd have a 20m before all died [14:52:30] RECOVERY - Postgres Replication Lag on maps-test2002 is OK: OK - Rep Delay is: 0.0 Seconds [14:52:45] right, 20 vs 12, which almost certainly would help in this case [14:52:54] err wait, 20 vs 5 [14:52:56] 20 vs 4 [14:53:00] yeah 5 [14:53:09] RECOVERY - Postgres Replication Lag on maps-test2003 is OK: OK - Rep Delay is: 0.0 Seconds [14:53:14] 4m 13s to be precise [14:53:39] I thought it was :59 -> :04? 
(well, the cron times I mean) [14:53:41] 4m 07s, I cannot do a substraction apparently [14:53:49] RECOVERY - puppet last run on analytics1003 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [14:53:50] yes, from effective stop [14:54:02] right, I guess agent timing varies a lot too [14:54:09] !log rebooting elastic2001 to Linux 4.9 [14:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:22] 06Operations, 10Analytics, 10Analytics-Cluster, 10hardware-requests: EQIAD: stat1003 replacement - https://phabricator.wikimedia.org/T159839#3080357 (10Halfak) I was just told that this task is blocking on me to provide GPU specs. Is that right? [14:55:30] PROBLEM - Postgres Replication Lag on maps-test2002 is CRITICAL: CRITICAL - Rep Delay is: 20110.727887 Seconds [14:56:04] the problem of a general splay is that we want to spread across time to not overload the puppetmaster and at the same time spread a single cluster across time to avoid this kind of failure [14:56:49] and smaller the cluster more important it should be to spread it, with 100 hosts and 30m the general one should be ok [14:57:29] RECOVERY - Postgres Replication Lag on maps-test2002 is OK: OK - Rep Delay is: 0.0 Seconds [14:58:09] RECOVERY - Postgres Replication Lag on maps-test2004 is OK: OK - Rep Delay is: 0.0 Seconds [15:00:29] PROBLEM - Postgres Replication Lag on maps-test2002 is CRITICAL: CRITICAL - Rep Delay is: 20410.531059 Seconds [15:00:55] gehel: FYI ^^^ [15:01:09] PROBLEM - Postgres Replication Lag on maps-test2004 is CRITICAL: CRITICAL - Rep Delay is: 20449.188356 Seconds [15:01:10] PROBLEM - Postgres Replication Lag on maps-test2003 is CRITICAL: CRITICAL - Rep Delay is: 20451.56741 Seconds [15:01:19] volans: got it [15:01:40] going to deploy a couple mediawik patches for Moritz and HHVM [15:01:45] it's maps-test, will have a look in a few moment [15:01:58] gehel: sure, was just a FYI ;) [15:08:00] (03PS2) 10Filippo Giunchedi: 
librenms: use base::service_unit [puppet] - 10https://gerrit.wikimedia.org/r/344112 (https://phabricator.wikimedia.org/T125020) [15:11:39] PROBLEM - puppet last run on mw1269 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:12:32] (03PS1) 10Muehlenhoff: Enable experimental section on restbase-test* [puppet] - 10https://gerrit.wikimedia.org/r/344159 [15:13:03] (03CR) 10Filippo Giunchedi: [C: 032] librenms: use base::service_unit [puppet] - 10https://gerrit.wikimedia.org/r/344112 (https://phabricator.wikimedia.org/T125020) (owner: 10Filippo Giunchedi) [15:15:12] (03PS3) 10Filippo Giunchedi: apache 2.4 compat for librenms/servermon/smokeping [puppet] - 10https://gerrit.wikimedia.org/r/344113 (https://phabricator.wikimedia.org/T125020) [15:15:49] RECOVERY - puppet last run on es1014 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [15:17:21] (03CR) 10Filippo Giunchedi: [C: 032] apache 2.4 compat for librenms/servermon/smokeping [puppet] - 10https://gerrit.wikimedia.org/r/344113 (https://phabricator.wikimedia.org/T125020) (owner: 10Filippo Giunchedi) [15:22:16] (03PS2) 10Muehlenhoff: Enable experimental section on restbase-test* [puppet] - 10https://gerrit.wikimedia.org/r/344159 [15:25:45] !log cp*: removed linux-image-amd64, linux-image-3.16.0-4-amd64 and linux-image-4.4.0-1-amd64 to reduce churn [15:25:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:40] 06Operations, 10Analytics, 10Analytics-Cluster, 10hardware-requests: EQIAD: stat1002 replacement - https://phabricator.wikimedia.org/T159838#3122125 (10Halfak) OK I did some digging. My assessment is that the Nvidia Tesla K80 is most desirable, but the closed source drivers will be a problem. The AMD Fir... 
[15:28:48] (03CR) 10Muehlenhoff: [C: 032] Enable experimental section on restbase-test* [puppet] - 10https://gerrit.wikimedia.org/r/344159 (owner: 10Muehlenhoff) [15:29:43] ottomata & robh, https://phabricator.wikimedia.org/T159838#3122125 [15:29:57] I hope that is helpful. We're in a weird position with this hardware investment. [15:32:10] RECOVERY - Postgres Replication Lag on maps-test2003 is OK: OK - Rep Delay is: 0.0 Seconds [15:32:11] 06Operations, 10ORES, 10Revision-Scoring-As-A-Service-Backlog: [spec] Active-active setup for ORES across datacenters (eqiad, codfw) - https://phabricator.wikimedia.org/T159615#3122138 (10mobrovac) Ok, so the focus of this ticket seems to be on how to have both Redis instances warmed up with the same content... [15:33:00] moritzm: I am syncing the mediawiki changes for hhvm [15:33:07] wmf.17 is not deployed so syncing everywhere [15:33:35] for wmf.16 I guess we want to first try on mwdebug* and mw1261 [15:33:46] !log hashar@tin Synchronized php-1.29.0-wmf.17/languages/classes/LanguageAz.php: Check for string initialization in ucfirst() to make HHVM 3.18 happy - T161095 (duration: 00m 59s) [15:33:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:51] T161095: Uninitialized string offset warnings with HHVM 3.18 in LanguageAz.php and LanguageKk.php - https://phabricator.wikimedia.org/T161095 [15:34:42] 06Operations, 10ORES, 10Revision-Scoring-As-A-Service-Backlog, 06Services (designing), 15User-mobrovac: [spec] Active-active setup for ORES across datacenters (eqiad, codfw) - https://phabricator.wikimedia.org/T159615#3122142 (10mobrovac) [15:34:47] !log hashar@tin Synchronized php-1.29.0-wmf.17/languages/classes/LanguageKk.php: Check for string initialization in ucfirst() to make HHVM 3.18 happy - T161095 (duration: 00m 54s) [15:34:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:10] PROBLEM - Postgres Replication Lag on maps-test2003 is CRITICAL: CRITICAL - Rep 
Delay is: 22491.542704 Seconds [15:36:13] (03PS1) 10Jcrespo: install_server: Do not remove the lvm partition for db-no-srv-format [puppet] - 10https://gerrit.wikimedia.org/r/344160 (https://phabricator.wikimedia.org/T160242) [15:36:19] RECOVERY - Check Varnish expiry mailbox lag on cp1074 is OK: OK: expiry mailbox lag is 554 [15:36:29] !log Deploying LanguageAz.php and LanguageKk.php hotfix for HHVM 3.18 on mwdebug* and mw1261 - T161095 [15:36:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:09] RECOVERY - Postgres Replication Lag on maps-test2004 is OK: OK - Rep Delay is: 0.0 Seconds [15:38:19] (03CR) 10Marostegui: [C: 031] install_server: Do not remove the lvm partition for db-no-srv-format [puppet] - 10https://gerrit.wikimedia.org/r/344160 (https://phabricator.wikimedia.org/T160242) (owner: 10Jcrespo) [15:40:10] PROBLEM - Postgres Replication Lag on maps-test2004 is CRITICAL: CRITICAL - Rep Delay is: 22790.556092 Seconds [15:40:34] !log hashar@tin Synchronized php-1.29.0-wmf.16/languages/classes/LanguageAz.php: Check for string initialization in ucfirst() to make HHVM 3.18 happy - T161095 (duration: 00m 48s) [15:40:39] RECOVERY - puppet last run on mw1269 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [15:40:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:40] T161095: Uninitialized string offset warnings with HHVM 3.18 in LanguageAz.php and LanguageKk.php - https://phabricator.wikimedia.org/T161095 [15:41:31] !log hashar@tin Synchronized php-1.29.0-wmf.16/languages/classes/LanguageKk.php: Check for string initialization in ucfirst() to make HHVM 3.18 happy - T161095 (duration: 00m 44s) [15:41:36] 06Operations, 10Analytics, 10Analytics-Cluster, 10hardware-requests: EQIAD: stat1002 replacement - https://phabricator.wikimedia.org/T159838#3122155 (10RobH) I'd say the closed source driver is a blocker. 
We've blocked the use of some PCIe flash memory cards due to closed source driver usage in years past... [15:41:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:40] (03CR) 10Jcrespo: [C: 032] install_server: Do not remove the lvm partition for db-no-srv-format [puppet] - 10https://gerrit.wikimedia.org/r/344160 (https://phabricator.wikimedia.org/T160242) (owner: 10Jcrespo) [15:41:45] (03PS2) 10Jcrespo: install_server: Do not remove the lvm partition for db-no-srv-format [puppet] - 10https://gerrit.wikimedia.org/r/344160 (https://phabricator.wikimedia.org/T160242) [15:41:52] (03CR) 10Jcrespo: [V: 032 C: 032] install_server: Do not remove the lvm partition for db-no-srv-format [puppet] - 10https://gerrit.wikimedia.org/r/344160 (https://phabricator.wikimedia.org/T160242) (owner: 10Jcrespo) [15:41:55] halfak: I don't think you're going to like the answer I gave =[ [15:42:15] but I'll quote out the system with Dell using some open source options plus the closed source one [15:42:27] but im pretty sure closed source = disqualification [15:42:33] robh, no worries. I still have some distance from this given that I'm not fully invested in tensor flow yet. [15:42:48] but i'll get the quote for the other stat box replacement requested today [15:42:50] robh, I just wish we had a clear answer around the state of Tensorflow OpenCL support [15:42:55] as well as get that one started [15:43:09] RECOVERY - Postgres Replication Lag on maps-test2004 is OK: OK - Rep Delay is: 0.0 Seconds [15:43:15] was out sick yesterday, still not 100%, so it'll be a bit slower than I normally am (so today not this morning ;) [15:43:50] robh, glad you're feeling partially better. thanks for the quick responses :) [15:43:51] halfak: cool, i just hate telling folks 'nope cannot use that' cuz i know it sucks! 
[15:44:02] (03PS1) 10Dzahn: wikimedia.ee: update Google verification TXT record [dns] - 10https://gerrit.wikimedia.org/r/344161 (https://phabricator.wikimedia.org/T158638) [15:44:22] Yeah. We might end up spending 100-200 labor hours working around the open source experimental support stuff :S [15:44:35] But it does look like support will likely get better in the next few years [15:44:52] So if we find things are really bad in the short term, they' [15:44:57] 'll likely get better [15:45:02] 06Operations, 10Analytics, 10Analytics-Cluster, 10hardware-requests: EQIAD: stat1002 replacement - https://phabricator.wikimedia.org/T159838#3122164 (10RobH) I'll plan on getting some quotes generated with the Dell and HP options, with the more open source friendly GPU options to start. If the closed sour... [15:45:25] i felt bad cuz i said 'pick one of these' and then right away changed that to 'you cannot have some of these' heh [15:45:51] 06Operations, 10Analytics, 10Analytics-Cluster, 10hardware-requests: EQIAD: stat1002 replacement - https://phabricator.wikimedia.org/T159838#3122167 (10MoritzMuehlenhoff) Leaving our FLOSS policy aside, the proprietary Nvidia is going to be a significant problem both in terms of - maintainability (we tend... [15:46:06] (03CR) 10Dzahn: [C: 032] wikimedia.ee: update Google verification TXT record [dns] - 10https://gerrit.wikimedia.org/r/344161 (https://phabricator.wikimedia.org/T158638) (owner: 10Dzahn) [15:46:09] PROBLEM - Postgres Replication Lag on maps-test2004 is CRITICAL: CRITICAL - Rep Delay is: 23150.810285 Seconds [15:46:11] :) If only technology was simple -- well, I suppose we wouldn't have a job then [15:47:19] know the "primitive technology" youtube channel? 
that still looks really hard, heh [15:47:31] 06Operations, 10MediaWiki-Internationalization, 07HHVM, 05MW-1.29-release (WMF-deploy-2017-03-28_(1.29.0-wmf.18)), and 3 others: Uninitialized string offset warnings with HHVM 3.18 in LanguageAz.php and LanguageKk.php - https://phabricator.wikimedia.org/T161095#3122169 (10hashar) I have deployed the hotfix... [15:47:33] guy building stuff with just stone age tools [15:50:02] 06Operations, 10Domains, 10Traffic, 06WMF-Legal, 13Patch-For-Review: Using wikimedia.ee mail address as Google account - https://phabricator.wikimedia.org/T158638#3122174 (10Dzahn) @Beetlebeard Ok, no worries. This time it can be faster since we have a general ok and are just fixing it. Here you go: ht... [15:50:31] mutante: and in places where nature likely is prone to kill you, IIRC he's in queensland [15:51:54] godog: oooh, yea, doing that in Australia on top of it, very true [15:53:29] RECOVERY - Postgres Replication Lag on maps-test2002 is OK: OK - Rep Delay is: 0.0 Seconds [15:53:43] (03PS3) 10Dzahn: Gerrit: Make gerritbot report the branch in the comment [puppet] - 10https://gerrit.wikimedia.org/r/344102 (https://phabricator.wikimedia.org/T161078) (owner: 10Paladox) [15:56:29] PROBLEM - Postgres Replication Lag on maps-test2002 is CRITICAL: CRITICAL - Rep Delay is: 23770.626461 Seconds [15:57:30] (03CR) 10Dzahn: [C: 032] Gerrit: Make gerritbot report the branch in the comment [puppet] - 10https://gerrit.wikimedia.org/r/344102 (https://phabricator.wikimedia.org/T161078) (owner: 10Paladox) [15:58:46] (03CR) 10Paladox: "thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/344102 (https://phabricator.wikimedia.org/T161078) (owner: 10Paladox) [15:58:56] (03CR) 10Dzahn: [C: 031] Move all ssl certs to the module and out of files/ [puppet] - 10https://gerrit.wikimedia.org/r/341729 (owner: 10Chad) [16:01:12] (03PS3) 10BBlack: authdns: never restart gdnsd, keep initial install working right [puppet] - 10https://gerrit.wikimedia.org/r/344152 [16:01:14] (03PS3) 10BBlack: authdns: remove redundant relationships [puppet] - 10https://gerrit.wikimedia.org/r/344153 [16:03:09] RECOVERY - Postgres Replication Lag on maps-test2003 is OK: OK - Rep Delay is: 0.0 Seconds [16:06:09] PROBLEM - Postgres Replication Lag on maps-test2003 is CRITICAL: CRITICAL - Rep Delay is: 24351.606622 Seconds [16:09:00] (03CR) 10BBlack: [C: 032] authdns: never restart gdnsd, keep initial install working right [puppet] - 10https://gerrit.wikimedia.org/r/344152 (owner: 10BBlack) [16:09:08] (03CR) 10BBlack: [C: 032] authdns: remove redundant relationships [puppet] - 10https://gerrit.wikimedia.org/r/344153 (owner: 10BBlack) [16:23:29] PROBLEM - Disk space on labcontrol1001 is CRITICAL: DISK CRITICAL - free space: / 2976 MB (6% inode=81%): /srv 25372 MB (3% inode=99%) [16:25:14] (03PS1) 10Ottomata: Add analytics labsdb pw in hdfs, add logrotate for refinery logs [puppet] - 10https://gerrit.wikimedia.org/r/344165 (https://phabricator.wikimedia.org/T160083) [16:26:07] (03PS3) 10BryanDavis: Add a default Apache 2.0 license [puppet] - 10https://gerrit.wikimedia.org/r/183862 (https://phabricator.wikimedia.org/T67270) (owner: 10Rush) [16:26:14] (03PS2) 10Ottomata: Add analytics labsdb pw in hdfs, add logrotate for refinery logs [puppet] - 10https://gerrit.wikimedia.org/r/344165 (https://phabricator.wikimedia.org/T160083) [16:26:53] (03CR) 10Chad: [C: 032] Fix some Debian lintian warnnings for the gerrit package [debs/gerrit] - 10https://gerrit.wikimedia.org/r/343297 (owner: 10Paladox) [16:27:44] (03CR) 10Paladox: "thanks." 
[debs/gerrit] - 10https://gerrit.wikimedia.org/r/343297 (owner: 10Paladox) [16:28:31] (03PS3) 10Ottomata: Add analytics labsdb pw in hdfs, add logrotate for refinery logs [puppet] - 10https://gerrit.wikimedia.org/r/344165 (https://phabricator.wikimedia.org/T160083) [16:29:05] 06Operations, 06WMF-Legal, 10Wikimedia-General-or-Unknown, 07Documentation, and 2 others: Default license for operations/puppet - https://phabricator.wikimedia.org/T67270#3122269 (10bd808) I have BOLDly amended @chasemp's patch to: * Use Apache 2.0 as the default license * Add a NOTICE file that includes t... [16:29:42] ^ andrewbogott [16:29:54] scheduling debugging loggin? [16:30:05] oh nope [16:30:44] guessing andrewbogott this is you :) 330G migratetmp [16:31:44] 06Operations, 06WMF-Legal, 10Wikimedia-General-or-Unknown, 07Documentation, and 2 others: Default license for operations/puppet - https://phabricator.wikimedia.org/T67270#3122276 (10hashar) Patch is https://gerrit.wikimedia.org/r/#/c/183862/ [16:31:45] (03CR) 10Ottomata: [C: 032] Add analytics labsdb pw in hdfs, add logrotate for refinery logs [puppet] - 10https://gerrit.wikimedia.org/r/344165 (https://phabricator.wikimedia.org/T160083) (owner: 10Ottomata) [16:32:53] (03CR) 10Hashar: [C: 031] Add a default Apache 2.0 license [puppet] - 10https://gerrit.wikimedia.org/r/183862 (https://phabricator.wikimedia.org/T67270) (owner: 10Rush) [16:33:07] bd808: I am lighting that candle :D [16:34:55] (03CR) 10Rush: [C: 031] "2 years and 2 months later :)" [puppet] - 10https://gerrit.wikimedia.org/r/183862 (https://phabricator.wikimedia.org/T67270) (owner: 10Rush) [16:35:10] RECOVERY - Postgres Replication Lag on maps-test2004 is OK: OK - Rep Delay is: 0.0 Seconds [16:35:39] PROBLEM - puppet last run on analytics1027 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/logrotate.d/refinery] [16:37:31] (03PS1) 10Ottomata: Rename refinery-logrotate.conf.erb to refinery-logrotate.conf [puppet] - 10https://gerrit.wikimedia.org/r/344166 [16:37:41] (03CR) 10Ottomata: [V: 032 C: 032] Rename refinery-logrotate.conf.erb to refinery-logrotate.conf [puppet] - 10https://gerrit.wikimedia.org/r/344166 (owner: 10Ottomata) [16:37:49] PROBLEM - puppet last run on analytics1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/logrotate.d/refinery] [16:38:10] PROBLEM - Postgres Replication Lag on maps-test2004 is CRITICAL: CRITICAL - Rep Delay is: 26271.115079 Seconds [16:38:49] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/logrotate.d/refinery] [16:39:00] ^ me [16:39:49] RECOVERY - puppet last run on analytics1003 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [16:40:39] RECOVERY - puppet last run on analytics1027 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [16:40:39] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [16:41:09] RECOVERY - Postgres Replication Lag on maps-test2004 is OK: OK - Rep Delay is: 0.0 Seconds [16:41:10] RECOVERY - Postgres Replication Lag on maps-test2003 is OK: OK - Rep Delay is: 0.0 Seconds [16:42:49] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3122315 (10Marostegui) p:05High>03Normal a:05Papaul>03None [16:44:09] PROBLEM - Postgres Replication Lag on maps-test2004 is CRITICAL: CRITICAL - Rep Delay is: 26631.455835 Seconds [16:44:09] PROBLEM - Postgres Replication Lag on maps-test2003 is CRITICAL: CRITICAL - Rep Delay is: 26631.575722 Seconds [16:44:11] 
06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3092963 (10Marostegui) The server is now set up, and ready to get the data from es2014. Things that have been done: - Tested a new way to prevent a server to avoid wiping the part... [16:45:41] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3122330 (10Marostegui) a:03jcrespo [16:48:59] PROBLEM - puppet last run on elastic1041 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:55:35] !log shutting down es2016's mariadb to clone to es2015 [16:55:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:07] ^this will break es2 on codfw for a few hours [16:57:18] but it is the fastest way to do it [17:00:04] 06Operations, 10ops-eqiad, 13Patch-For-Review: db1057 does not react to powercycle/powerdown/powerup commands - https://phabricator.wikimedia.org/T160435#3122374 (10jcrespo) @Cmjohnson this is still blocked on you- db1057 does not start. Does it need a new PDU? [17:03:35] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3122398 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['es2015.codfw.wmnet'] ``` Of which those **FAILED**: ``` set(['es2015.codfw.wmnet']) ``` [17:04:43] (03PS4) 10BryanDavis: Add a default Apache 2.0 license [puppet] - 10https://gerrit.wikimedia.org/r/183862 (https://phabricator.wikimedia.org/T67270) (owner: 10Rush) [17:06:00] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3122421 (10jcrespo) I am going to use the codfw master **es2016** not es2014, because the latter does have compressed tables- something we have yet to fix, and not something we wan... 
[17:07:12] (03PS1) 10Filippo Giunchedi: install_server: switch netmon to jessie [puppet] - 10https://gerrit.wikimedia.org/r/344168 (https://phabricator.wikimedia.org/T125020) [17:09:49] (03CR) 10BryanDavis: "There has been some discussion on irc about the nicks which are present in the CONTRIBUTORS file for some contributors. The easy fix for t" [puppet] - 10https://gerrit.wikimedia.org/r/183862 (https://phabricator.wikimedia.org/T67270) (owner: 10Rush) [17:09:49] RECOVERY - OCG health on ocg1001 is OK: OK: ocg_job_status 702699 msg: ocg_render_job_queue 0 msg [17:10:00] <_joe_> !log restarted ocg on ogc1001, not serving http queries [17:10:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:25] (03CR) 10BryanDavis: "> Uploaded patch set 4." [puppet] - 10https://gerrit.wikimedia.org/r/183862 (https://phabricator.wikimedia.org/T67270) (owner: 10Rush) [17:16:59] RECOVERY - puppet last run on elastic1041 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [17:17:08] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3122439 (10jcrespo) I've started the transfer from es2016 to es2015, the transfer may take 11-12 hours, so it will finish by ~6-7 UTC. es2016 and es2015 will be down during it. I h... 
[17:18:39] RECOVERY - Postgres Replication Lag on maps-test2002 is OK: OK - Rep Delay is: 0.0 Seconds [17:21:39] PROBLEM - Postgres Replication Lag on maps-test2002 is CRITICAL: CRITICAL - Rep Delay is: 28878.971472 Seconds [17:22:17] (03PS3) 10Jcrespo: Set cron script to dump MediaWiki DB lag times into statsd [puppet] - 10https://gerrit.wikimedia.org/r/327667 (https://phabricator.wikimedia.org/T149210) (owner: 10Aaron Schulz) [17:22:22] (03PS1) 10Giuseppe Lavagetto: role::jobqueue_redis::master: depend on etcd, not mw_primary [puppet] - 10https://gerrit.wikimedia.org/r/344169 [17:22:58] (03CR) 10Jcrespo: [C: 032] "I said I was going to forget, and I didn't lie!" [puppet] - 10https://gerrit.wikimedia.org/r/327667 (https://phabricator.wikimedia.org/T149210) (owner: 10Aaron Schulz) [17:23:22] (03CR) 10jerkins-bot: [V: 04-1] role::jobqueue_redis::master: depend on etcd, not mw_primary [puppet] - 10https://gerrit.wikimedia.org/r/344169 (owner: 10Giuseppe Lavagetto) [17:24:09] RECOVERY - Postgres Replication Lag on maps-test2003 is OK: OK - Rep Delay is: 0.0 Seconds [17:24:10] RECOVERY - Postgres Replication Lag on maps-test2004 is OK: OK - Rep Delay is: 0.0 Seconds [17:25:03] (03PS2) 10Giuseppe Lavagetto: role::jobqueue_redis::master: depend on etcd, not mw_primary [puppet] - 10https://gerrit.wikimedia.org/r/344169 [17:25:52] (03CR) 10jerkins-bot: [V: 04-1] role::jobqueue_redis::master: depend on etcd, not mw_primary [puppet] - 10https://gerrit.wikimedia.org/r/344169 (owner: 10Giuseppe Lavagetto) [17:27:09] PROBLEM - Postgres Replication Lag on maps-test2003 is CRITICAL: CRITICAL - Rep Delay is: 29211.47089 Seconds [17:27:09] PROBLEM - Postgres Replication Lag on maps-test2004 is CRITICAL: CRITICAL - Rep Delay is: 29211.483108 Seconds [17:27:42] (03PS2) 10Dzahn: install_server: switch netmon to jessie [puppet] - 10https://gerrit.wikimedia.org/r/344168 (https://phabricator.wikimedia.org/T125020) (owner: 10Filippo Giunchedi) [17:29:15] (03CR) 10Dzahn: [C: 032] 
install_server: switch netmon to jessie [puppet] - 10https://gerrit.wikimedia.org/r/344168 (https://phabricator.wikimedia.org/T125020) (owner: 10Filippo Giunchedi) [17:29:48] 06Operations, 06WMF-Legal, 10Wikimedia-General-or-Unknown, 07Documentation, and 2 others: Default license for operations/puppet - https://phabricator.wikimedia.org/T67270#3122460 (10faidon) You're all very right that we should finally fix this. I like the latest patchset personally but I don't feel comfort... [17:32:43] (03PS3) 10Giuseppe Lavagetto: role::jobqueue_redis::master: depend on etcd, not mw_primary [puppet] - 10https://gerrit.wikimedia.org/r/344169 [17:34:12] (03PS1) 10DCausse: Revert "Disable updateSuggesterIndex cron take 2" [puppet] - 10https://gerrit.wikimedia.org/r/344171 [17:34:59] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 0.219 second response time [17:35:39] PROBLEM - puppet last run on analytics1027 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:38:36] (03PS4) 10Giuseppe Lavagetto: role::jobqueue_redis::master: depend on etcd, not mw_primary [puppet] - 10https://gerrit.wikimedia.org/r/344169 [17:39:59] I see BatchRowIterator running on terbium [17:39:59] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.575 second response time [17:41:14] is it for T157479 ? 
[17:41:14] T157479: Put together a production migration plan for ES 2 -> ES 5 - https://phabricator.wikimedia.org/T157479 [17:42:13] (03PS5) 10Giuseppe Lavagetto: role::jobqueue_redis::master: depend on etcd, not mw_primary [puppet] - 10https://gerrit.wikimedia.org/r/344169 [17:43:17] (03PS2) 10Dzahn: Gerrit: Double size of projects cache [puppet] - 10https://gerrit.wikimedia.org/r/344068 (owner: 10Chad) [17:44:49] it could be just a process blocked on the alter table, maybe? [17:45:12] (03CR) 10Dzahn: [C: 031] "..actually let's do that later this US afternoon, with a service restart" [puppet] - 10https://gerrit.wikimedia.org/r/344068 (owner: 10Chad) [17:46:05] 06Operations, 10Analytics, 10Analytics-Cluster, 10hardware-requests: EQIAD: stat1002 replacement - https://phabricator.wikimedia.org/T159838#3122489 (10faidon) Yeah, we had a similar conversation over email with Adam (@dr0ptp4kt) who was also inquiring about TensorFlow. I had the same considerations that w... [17:46:59] (03PS1) 10Ottomata: Add -k $num_processors arg to sqoop-mediawiki-job [puppet] - 10https://gerrit.wikimedia.org/r/344172 [17:47:39] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [17:47:52] (03CR) 10Ottomata: [V: 032 C: 032] Add -k $num_processors arg to sqoop-mediawiki-job [puppet] - 10https://gerrit.wikimedia.org/r/344172 (owner: 10Ottomata) [17:47:59] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [17:50:09] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [17:50:39] (03PS1) 10Jcrespo: Revert "Set cron script to dump MediaWiki DB lag times into statsd" [puppet] - 10https://gerrit.wikimedia.org/r/344174 [17:51:01] (03PS2) 10Jcrespo: Revert "Set cron script to dump MediaWiki DB lag times into statsd" [puppet] - 10https://gerrit.wikimedia.org/r/344174 [17:51:26] 
^ bblack trouble? [17:53:20] (03CR) 10Muehlenhoff: "Sounds good, but I guess the retroactive license application needs to be acked by everyone contributor (or at least by every contributor n" [puppet] - 10https://gerrit.wikimedia.org/r/183862 (https://phabricator.wikimedia.org/T67270) (owner: 10Rush) [17:53:45] (03CR) 10Jcrespo: [C: 032] Revert "Set cron script to dump MediaWiki DB lag times into statsd" [puppet] - 10https://gerrit.wikimedia.org/r/344174 (owner: 10Jcrespo) [17:54:30] 06Operations, 10Analytics, 10Analytics-Cluster, 10hardware-requests: EQIAD: stat1002 replacement - https://phabricator.wikimedia.org/T159838#3122500 (10dr0ptp4kt) I've emailed with a contact re: OpenCL support for the Nvidia Tesla `P100` (presumably facilitated by mainstreaming of OpenCL), and also shared... [17:55:39] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [17:56:09] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [17:58:49] PROBLEM - HHVM rendering on mw2235 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:58:50] PROBLEM - HHVM rendering on mw2218 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:58:50] PROBLEM - HHVM rendering on mw2128 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:58:50] PROBLEM - HHVM rendering on mw2117 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:58:59] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [17:59:23] (03PS2) 10Filippo Giunchedi: graphite: cleanup eventstreams rdkafka stale data [puppet] - 10https://gerrit.wikimedia.org/r/343609 (https://phabricator.wikimedia.org/T160644) [17:59:49] RECOVERY - HHVM rendering on mw2235 is OK: HTTP OK: HTTP/1.1 200 OK - 1505 bytes in 8.048 second response time [17:59:49] RECOVERY - HHVM rendering on mw2117 is OK: HTTP OK: HTTP/1.1 200 OK - 1505 bytes in 7.818 second 
response time [17:59:49] RECOVERY - HHVM rendering on mw2218 is OK: HTTP OK: HTTP/1.1 200 OK - 1505 bytes in 8.132 second response time [17:59:49] RECOVERY - HHVM rendering on mw2128 is OK: HTTP OK: HTTP/1.1 200 OK - 1505 bytes in 8.132 second response time [17:59:59] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170322T1800). Please do the needful. [18:01:39] PROBLEM - graphite.wikimedia.org on graphite1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:01:45] 06Operations, 10Ops-Access-Requests: Requesting access to reseachers, analytics-wmde, analytics-users for GoranSMilovanovic - https://phabricator.wikimedia.org/T160980#3122539 (10Tobi_WMDE_SW) @MoritzMuehlenhoff our IT is still working on getting an @wikimedia.de address for @GoranSMilovanovic but it will stil... 
[18:02:26] (03PS1) 10Gehel: postgresql - drop support for postgis 1.5 [puppet] - 10https://gerrit.wikimedia.org/r/344176 [18:02:29] RECOVERY - graphite.wikimedia.org on graphite1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1547 bytes in 0.004 second response time [18:02:39] RECOVERY - Disk space on labcontrol1001 is OK: DISK OK [18:04:39] RECOVERY - puppet last run on analytics1027 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [18:04:40] (03PS2) 10DCausse: Re-enable updateSuggesterIndex cron [puppet] - 10https://gerrit.wikimedia.org/r/344171 [18:06:51] PROBLEM - HHVM rendering on mw2134 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:06:51] PROBLEM - HHVM rendering on mw2095 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:06:51] PROBLEM - HHVM rendering on mw2100 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:06:51] PROBLEM - HHVM rendering on mw2142 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:07:49] RECOVERY - HHVM rendering on mw2142 is OK: HTTP OK: HTTP/1.1 200 OK - 1505 bytes in 7.803 second response time [18:07:49] RECOVERY - HHVM rendering on mw2095 is OK: HTTP OK: HTTP/1.1 200 OK - 1505 bytes in 8.195 second response time [18:07:49] RECOVERY - HHVM rendering on mw2134 is OK: HTTP OK: HTTP/1.1 200 OK - 1505 bytes in 8.409 second response time [18:07:49] RECOVERY - HHVM rendering on mw2100 is OK: HTTP OK: HTTP/1.1 200 OK - 1505 bytes in 8.817 second response time [18:07:59] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] [18:10:10] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3122655 (10jcrespo) https://grafana.wikimedia.org/dashboard/file/server-board.json?panelId=17&fullscreen&var-server=es2015&var-network=eth0&from=1490198400000&to=now [18:10:49] PROBLEM - HHVM rendering on mw2134 is CRITICAL: CRITICAL - 
Socket timeout after 10 seconds [18:11:49] RECOVERY - HHVM rendering on mw2134 is OK: HTTP OK: HTTP/1.1 200 OK - 1505 bytes in 9.529 second response time [18:17:03] 06Operations, 06Analytics-Kanban, 10Traffic, 06Wikipedia-iOS-App-Backlog, and 2 others: Periodic 500s from piwik.wikimedia.org - https://phabricator.wikimedia.org/T154558#2915651 (10JMinor) The version returning to 10% sampling rate is up on the app store. [18:18:09] RECOVERY - Postgres Replication Lag on maps-test2003 is OK: OK - Rep Delay is: 0.0 Seconds [18:18:10] RECOVERY - Postgres Replication Lag on maps-test2004 is OK: OK - Rep Delay is: 0.0 Seconds [18:20:39] RECOVERY - Postgres Replication Lag on maps-test2002 is OK: OK - Rep Delay is: 0.0 Seconds [18:21:10] PROBLEM - Postgres Replication Lag on maps-test2003 is CRITICAL: CRITICAL - Rep Delay is: 32451.605053 Seconds [18:21:10] PROBLEM - Postgres Replication Lag on maps-test2004 is CRITICAL: CRITICAL - Rep Delay is: 32451.655074 Seconds [18:23:40] PROBLEM - Postgres Replication Lag on maps-test2002 is CRITICAL: CRITICAL - Rep Delay is: 32598.817574 Seconds [18:23:59] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] [18:30:20] (03CR) 10Filippo Giunchedi: [C: 032] graphite: cleanup eventstreams rdkafka stale data [puppet] - 10https://gerrit.wikimedia.org/r/343609 (https://phabricator.wikimedia.org/T160644) (owner: 10Filippo Giunchedi) [18:31:24] 06Operations, 10Ops-Access-Requests: Requesting access to reseachers, analytics-wmde, analytics-users for GoranSMilovanovic - https://phabricator.wikimedia.org/T160980#3122774 (10MoritzMuehlenhoff) If that takes a while to setup, we can also use an interim address for the account; @GoranSMilovanovic can you pl... 
[18:34:33] bblack: i've another one of those google webmaster tools ("search console") dns record updates i'd like to do through noc@wikimedia.org for mediawiki.org (like https://gerrit.wikimedia.org/r/#/q/owner:%22Dr0ptp4kt+%253Cabaso%2540wikimedia.org%253E%22+project:operations/dns) [18:35:03] i'd then like the admin access delegated. could we do a hangout or something like that to step through it? [18:35:38] it's for mediawiki.org [18:35:59] PROBLEM - puppet last run on mw1193 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:36:11] alternatively i could do it directly in my abaso@ account, but thought better to centrally do it, then delegate [18:36:43] 06Operations, 10Ops-Access-Requests: Requesting access to reseachers, analytics-wmde, analytics-users for GoranSMilovanovic - https://phabricator.wikimedia.org/T160980#3122778 (10GoranSMilovanovic) @MoritzMuehlenhoff Please use goran.s.milovanovic@gmail.com for emergency cases anytime, and for any related purp...
[18:38:39] RECOVERY - Postgres Replication Lag on maps-test2002 is OK: OK - Rep Delay is: 0.0 Seconds [18:38:49] PROBLEM - HHVM rendering on mw2095 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:38:49] PROBLEM - HHVM rendering on mw2142 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:38:49] PROBLEM - HHVM rendering on mw2108 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:38:49] PROBLEM - HHVM rendering on mw2102 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:39:49] RECOVERY - HHVM rendering on mw2142 is OK: HTTP OK: HTTP/1.1 200 OK - 1505 bytes in 8.817 second response time [18:39:49] RECOVERY - HHVM rendering on mw2095 is OK: HTTP OK: HTTP/1.1 200 OK - 1505 bytes in 9.109 second response time [18:39:49] RECOVERY - HHVM rendering on mw2102 is OK: HTTP OK: HTTP/1.1 200 OK - 1505 bytes in 9.389 second response time [18:39:49] RECOVERY - HHVM rendering on mw2108 is OK: HTTP OK: HTTP/1.1 200 OK - 1505 bytes in 9.406 second response time [18:45:59] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] [18:49:59] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 0.298 second response time [18:53:57] ^ madhuvishy wah wah wah [18:54:59] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.843 second response time [18:58:47] 06Operations, 10ops-eqiad, 13Patch-For-Review, 15User-fgiunchedi: Rack and Setup ms-be1028-ms-1039 - https://phabricator.wikimedia.org/T160640#3122832 (10Cmjohnson) @fgiunchedi All the servers are racked, cabled, for the most part the ILO is setup. On-Site work still needed is last few ILO configs, and... [19:00:04] thcipriani: Dear anthropoid, the time has come. 
Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170322T1900). [19:00:10] chasemp: gah [19:00:36] 06Operations: Fix the general problem of randomly-bad puppet agent cron timings within redundancy clusters - https://phabricator.wikimedia.org/T161145#3122864 (10BBlack) [19:00:47] well jouncebot...I have some bad news. [19:01:36] (03PS2) 10Jcrespo: mariadb: Pool db1094 with full weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343955 (https://phabricator.wikimedia.org/T160832) [19:03:39] what happens with the train? [19:04:00] blocked? [19:04:13] jynus: yeah, got a handful of blockers at the moment [19:04:18] Yeah what thcipriani said [19:04:26] Couple of blockers, patches in flight for backport right now [19:04:28] https://phabricator.wikimedia.org/T160549#3122522 [19:04:31] ok, I will deploy then a quick db change [19:04:36] :-) [19:04:44] be my guest :) [19:04:59] RECOVERY - puppet last run on mw1193 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [19:06:12] jynus: Oh, quick question I had for you. Right now, gerrit in eqiad uses m2-master.eqiad.wmnet as its dbhost. I'm setting up a read-only slave in codfw. I don't see a m2-slave.* of any sorts, how would I point at db2011? Is there a stable way to refer to the slave? Or should I point codfw at eqiad too? [19:06:29] ok [19:06:35] that is actually a good question [19:06:56] let me deploy this and I will try to organize myself [19:07:00] (03CR) 10Jcrespo: [C: 032] mariadb: Pool db1094 with full weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343955 (https://phabricator.wikimedia.org/T160832) (owner: 10Jcrespo) [19:07:03] Ok, no rush thx [19:07:09] PROBLEM - MariaDB Slave Lag: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[19:07:28] normally, we would want to point to m2-slave.codfw.wmnet [19:07:49] but there are not misc dns entries yet [19:07:57] so it may be a good excuse to add it [19:08:26] (03Merged) 10jenkins-bot: mariadb: Pool db1094 with full weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343955 (https://phabricator.wikimedia.org/T160832) (owner: 10Jcrespo) [19:08:28] AFAIK there is no master on codfw, either [19:08:33] (dns-wise) [19:08:35] (03CR) 10jenkins-bot: mariadb: Pool db1094 with full weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343955 (https://phabricator.wikimedia.org/T160832) (owner: 10Jcrespo) [19:08:46] there is a db slave, however [19:08:48] jynus: Yeah I checked that too :) [19:09:03] Yeah, I saw there's db2011, but no master or slave dns entry for it yet [19:09:10] And I wanted to be future-proof and not use the host directly :) [19:09:12] so I would create those [19:09:20] in reality [19:09:25] it should point to the proxy [19:09:35] the problem is there is several scenarios [19:09:53] a master failure [19:09:59] and a dc failure [19:11:09] PROBLEM - MariaDB Slave Lag: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[19:11:41] (03PS1) 10Dzahn: temp copy netmon1001 app data to gerrit2001 for migration [puppet] - 10https://gerrit.wikimedia.org/r/344184 (https://phabricator.wikimedia.org/T125020) [19:12:35] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Pool db1094 with full weight (duration: 00m 43s) [19:12:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:20] (03PS2) 10Dzahn: temp copy netmon1001 app data to gerrit2001 for migration [puppet] - 10https://gerrit.wikimedia.org/r/344184 (https://phabricator.wikimedia.org/T125020) [19:16:52] so let me see [19:17:09] RECOVERY - Postgres Replication Lag on maps-test2003 is OK: OK - Rep Delay is: 0.0 Seconds [19:17:10] RECOVERY - Postgres Replication Lag on maps-test2004 is OK: OK - Rep Delay is: 0.0 Seconds [19:19:55] RainbowSprinkles, is there a chance the codfw gerrit wants to write to the database [19:20:15] and if it did, would we want it to go through or to fail? [19:20:28] It shouldn't want to. If it did, that's a bug [19:20:37] So we could let it fail hard [19:21:00] ok, so pointing to the local slave is probably ok [19:21:11] actually [19:21:17] we will point it to the local master [19:21:34] it is just a passive, read only master [19:21:44] that is a slave of the real master [19:22:02] (it makes sense when there are more servers involved, like mediawiki) [19:22:23] gerrit is m1 or m2, cannot remember? [19:22:50] 06Operations: Fix the general problem of randomly-bad puppet agent cron timings within redundant clusters - https://phabricator.wikimedia.org/T161145#3122926 (10BBlack) [19:24:17] m2 according to myself: https://wikitech.wikimedia.org/wiki/MariaDB/misc [19:25:00] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] [19:27:28] 06Operations, 10DNS, 10Traffic: AuthDNS CM/CI refactor - https://phabricator.wikimedia.org/T161148#3122942 (10BBlack) [19:27:52] RainbowSprinkles, do you have a ticket # ?
[19:28:05] it is for a CR [19:28:09] PROBLEM - MariaDB Slave Lag: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:28:16] Just a general "spin up gerrit failover in codfw" task [19:28:20] yeah [19:28:22] that works [19:28:29] T152525 [19:28:29] T152525: setup/install gerrit2001/WMF6408 - https://phabricator.wikimedia.org/T152525 [19:28:46] (03PS1) 10Jcrespo: Add m2 aliases for db2011- in the future that should be a proxy [dns] - 10https://gerrit.wikimedia.org/r/344187 (https://phabricator.wikimedia.org/T152525) [19:28:49] ^ [19:28:57] * RainbowSprinkles has a look [19:29:15] (03CR) 10Chad: [C: 031] Add m2 aliases for db2011- in the future that should be a proxy [dns] - 10https://gerrit.wikimedia.org/r/344187 (https://phabricator.wikimedia.org/T152525) (owner: 10Jcrespo) [19:29:22] don't need really, just using the new dns [19:29:39] If I point at m2-master there, it'll effectively be read-only. That would make my failover easier if you fail over. [19:29:40] those are read only [19:29:48] Then I'm just m2-master.$dc.wmnet [19:30:01] yes [19:30:04] the other model [19:30:20] is to point to a single dns [19:30:30] and with the proxy, failover at the same time [19:30:36] both have pros and cons [19:30:41] So -slave is more useful if it's like "I want a slave connection even if active DC" [19:30:51] yeah [19:30:51] Vs -master is "Master if I'm active, slave ok if not" [19:30:53] That works [19:30:54] :D [19:30:55] that is more or less the idea [19:31:01] definitely I would use the master [19:31:07] Sounds good! Thanks! 
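A minimal sketch of the alias convention discussed above, for readers following along. It assumes the `<section>-<role>.<dc>.wmnet` pattern implied by the names mentioned in the log (`m2-master.eqiad.wmnet`, `m2-slave.codfw.wmnet`, `m2-master.$dc.wmnet`). The function names `db_alias` and `gerrit_db_host` and the constant lists are hypothetical and are not taken from any Wikimedia repository; this is not the actual puppet or DNS change.

```python
# Hypothetical illustration of the "local master" model from the conversation.
# Only the alias pattern comes from the log; all names below are assumptions.

KNOWN_DCS = ("eqiad", "codfw")
KNOWN_ROLES = ("master", "slave")


def db_alias(section: str, role: str, dc: str) -> str:
    """Build a per-datacenter DNS alias such as 'm2-master.codfw.wmnet'."""
    if role not in KNOWN_ROLES:
        raise ValueError(f"unknown role: {role!r}")
    if dc not in KNOWN_DCS:
        raise ValueError(f"unknown datacenter: {dc!r}")
    return f"{section}-{role}.{dc}.wmnet"


def gerrit_db_host(local_dc: str) -> str:
    """Return the alias a Gerrit instance in `local_dc` would point at.

    Under the "local master" model each site uses its own DC's master alias.
    In the passive DC that alias is expected to resolve to a read-only
    replica of the real master, so any unexpected write fails hard.
    """
    return db_alias("m2", "master", local_dc)


if __name__ == "__main__":
    for dc in KNOWN_DCS:
        print(dc, "->", gerrit_db_host(dc))
```

The appeal of this model, as described in the log, is that the application config only varies by `$dc`, while which host actually sits behind the alias stays a DNS (or, later, proxy) concern.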
[19:31:10] if the active dc master [19:31:17] or the local dc master [19:31:28] it its argable [19:31:32] *arguable [19:31:44] local I think is easier [19:31:53] and that is the model we use for mediawiki [19:32:05] (03CR) 10Dzahn: [C: 032] temp copy netmon1001 app data to gerrit2001 for migration [puppet] - 10https://gerrit.wikimedia.org/r/344184 (https://phabricator.wikimedia.org/T125020) (owner: 10Dzahn) [19:32:09] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=1184.50 Read Requests/Sec=318.10 Write Requests/Sec=0.70 KBytes Read/Sec=34358.40 KBytes_Written/Sec=14.40 [19:36:02] (03CR) 10Jcrespo: [C: 032] Add m2 aliases for db2011- in the future that should be a proxy [dns] - 10https://gerrit.wikimedia.org/r/344187 (https://phabricator.wikimedia.org/T152525) (owner: 10Jcrespo) [19:37:50] !log deploying m2 dns additions on codfw [19:37:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:08] !log rsyncing /srv of netmon1001 to /srv/netmon1001 on gerrit2001 (T125020) [19:44:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:13] T125020: upgrade netmon1001 to jessie - https://phabricator.wikimedia.org/T125020 [19:45:34] !log thcipriani@tin Synchronized php-1.29.0-wmf.17/includes/Revision.php: [[gerrit:344077|Make Revision::getRevisionText() cache the converted text]] (duration: 00m 44s) [19:45:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:31] (03PS1) 10Chad: Gerrit: Set db host (m2-master) for codfw [puppet] - 10https://gerrit.wikimedia.org/r/344192 [19:48:09] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=154.20 Read Requests/Sec=141.30 Write Requests/Sec=57.70 KBytes Read/Sec=2040.80 KBytes_Written/Sec=325.60 [19:48:30] 06Operations, 06Analytics-Kanban, 06Performance-Team, 06Reading-Admin, 10Traffic: Preliminary Design document for A/B testing - 
https://phabricator.wikimedia.org/T143694#3123030 (10Nuria) 05Open>03Resolved [19:48:34] 06Operations, 10Analytics, 10Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#3123031 (10Nuria) [19:51:19] (03PS2) 10Chad: $wgWhitelistRead: grantwiki is in private.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344078 [19:54:09] PROBLEM - MariaDB Slave Lag: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:54:59] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [50.0] [19:55:44] !log thcipriani@tin Synchronized php-1.29.0-wmf.17/extensions/Flow: [[gerrit:344188|Make sure topiclist queries always join against workflow table]] T121644 (duration: 00m 59s) [19:55:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:50] T121644: Fail to load more topics above current topic due to pseudo-column being queried - https://phabricator.wikimedia.org/T121644 [19:59:26] 06Operations, 10ops-eqiad, 10DBA, 13Patch-For-Review: db1094 crash - https://phabricator.wikimedia.org/T160832#3123061 (10jcrespo) 05Open>03Resolved a:03jcrespo Resolved- we have to contact the vendor if it happens any other time. [19:59:45] (03CR) 10Chad: [C: 032] $wgWhitelistRead: grantwiki is in private.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344078 (owner: 10Chad) [20:00:04] gwicke, cscott, arlolra, subbu, bearND, halfak, and Amir1: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170322T2000). Please do the needful. 
[20:00:11] no parsoid deploy today [20:00:17] nor for ores [20:01:04] no mobileapps deploy today [20:01:08] (03Merged) 10jenkins-bot: $wgWhitelistRead: grantwiki is in private.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344078 (owner: 10Chad) [20:01:15] (03CR) 10jenkins-bot: $wgWhitelistRead: grantwiki is in private.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344078 (owner: 10Chad) [20:03:50] 06Operations, 10DBA, 10Wikimedia-General-or-Unknown: Spurious completely empty `image` table row on commonswiki - https://phabricator.wikimedia.org/T155769#3123076 (10jcrespo) p:05Triage>03Low [20:04:36] !log demon@tin Synchronized wmf-config/InitialiseSettings.php: Remove redundant whitelist read list for grantswiki (duration: 00m 44s) [20:04:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:42] (03PS3) 10Chad: Move mwdeploy home to /var/lib where it belongs, it's a system user [puppet] - 10https://gerrit.wikimedia.org/r/323867 (https://phabricator.wikimedia.org/T86971) [20:05:21] !log thcipriani@tin Synchronized php-1.29.0-wmf.17/extensions/ZeroPortal/includes/ApiZeroPortal.php: [[gerrit:344190|Failure to parse json config should result in a usable error]] T161036 (duration: 00m 42s) [20:05:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:28] T161036: Warning: ZeroAPI: Unable to parse json of page Zero: - https://phabricator.wikimedia.org/T161036 [20:06:56] alright, looks like all the blockers have some kind of movement for train, let's roll out to group0... 
[20:07:21] (03PS8) 10Ottomata: Drop mediawiki logs in HDFS after 90 days [puppet] - 10https://gerrit.wikimedia.org/r/335140 (owner: 10EBernhardson) [20:07:53] Rather interesting rendering of [[gerrit:344077{{!}}Make Revision::getRevisionText() cache the converted text]] [20:07:58] https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:59] (03PS1) 10Rush: labtest: convert labtestmetal to labtestvirt2001 [puppet] - 10https://gerrit.wikimedia.org/r/344193 [20:08:03] https://wikitech.wikimedia.org/wiki/Special:FormEdit/Tools%20Access%20Request/Edit/GetRevisionText()_cache_the_converted_text?redlink=1 [20:08:09] How does *that* happen? [20:09:34] (03PS1) 10Rush: labtest: labtestmetal convert labtestvirt2002 [dns] - 10https://gerrit.wikimedia.org/r/344194 [20:10:28] (03PS3) 10Rush: labstore: apply exportd monitoring to secondary role [puppet] - 10https://gerrit.wikimedia.org/r/343624 (https://phabricator.wikimedia.org/T160838) [20:10:30] (03PS1) 10Chad: Mariadb: Move remaining non-module files to the module [puppet] - 10https://gerrit.wikimedia.org/r/344195 [20:10:32] (03CR) 10Rush: [V: 032] labstore: apply exportd monitoring to secondary role [puppet] - 10https://gerrit.wikimedia.org/r/343624 (https://phabricator.wikimedia.org/T160838) (owner: 10Rush) [20:10:41] (03CR) 10Rush: [C: 032] labstore: keep archival copy of dynamic export.d contents [puppet] - 10https://gerrit.wikimedia.org/r/343623 (owner: 10Rush) [20:10:47] (03PS6) 10Rush: labstore: keep archival copy of dynamic export.d contents [puppet] - 10https://gerrit.wikimedia.org/r/343623 [20:10:51] (03CR) 10Rush: [V: 032 C: 032] labstore: keep archival copy of dynamic export.d contents [puppet] - 10https://gerrit.wikimedia.org/r/343623 (owner: 10Rush) [20:11:23] (03CR) 10Andrew Bogott: [C: 031] labtest: labtestmetal convert labtestvirt2002 [dns] - 10https://gerrit.wikimedia.org/r/344194 (owner: 10Rush) [20:11:36] (03CR) 10Ottomata: [C: 032] Drop mediawiki logs in HDFS after 90 days [puppet] - 
10https://gerrit.wikimedia.org/r/335140 (owner: 10EBernhardson) [20:11:43] (03PS9) 10Ottomata: Drop mediawiki logs in HDFS after 90 days [puppet] - 10https://gerrit.wikimedia.org/r/335140 (owner: 10EBernhardson) [20:11:47] (03CR) 10Ottomata: [V: 032 C: 032] Drop mediawiki logs in HDFS after 90 days [puppet] - 10https://gerrit.wikimedia.org/r/335140 (owner: 10EBernhardson) [20:11:53] (03PS4) 10Rush: labstore: apply exportd monitoring to secondary role [puppet] - 10https://gerrit.wikimedia.org/r/343624 (https://phabricator.wikimedia.org/T160838) [20:12:00] (03CR) 10Rush: [V: 032] labstore: apply exportd monitoring to secondary role [puppet] - 10https://gerrit.wikimedia.org/r/343624 (https://phabricator.wikimedia.org/T160838) (owner: 10Rush) [20:12:35] (03CR) 10Andrew Bogott: [C: 031] "No idea if the same partman script will work on that box, but it's worth a try!" [puppet] - 10https://gerrit.wikimedia.org/r/344193 (owner: 10Rush) [20:14:54] (03PS2) 10Chad: Mariadb: Move remaining non-module files to the module [puppet] - 10https://gerrit.wikimedia.org/r/344195 [20:16:50] 06Operations, 13Patch-For-Review, 15User-fgiunchedi: upgrade netmon1001 to jessie - https://phabricator.wikimedia.org/T125020#3123116 (10Dzahn) also tested on a seperate jessie labs instance: role rancid::server - no issues, just finishes run role librenms - package snmp-mibs-downloader can't be downloade... [20:18:35] (03PS1) 10Smalyshev: Direct LDF requests to single host to solve paging issues [puppet] - 10https://gerrit.wikimedia.org/r/344197 (https://phabricator.wikimedia.org/T159574) [20:19:38] 06Operations, 06Discovery, 10Traffic, 10Wikidata, and 3 others: LDF endpoint ordering is not stable between servers when paging - https://phabricator.wikimedia.org/T159574#3123124 (10Smalyshev) I don't think we need new external endpoint - it looks like our VCL routing allows switching by path (unless I mi... 
[20:20:18] (03PS1) 10EBernhardson: Fix typo in refinery data-drop for cirrussearchrequestset [puppet] - 10https://gerrit.wikimedia.org/r/344198 [20:20:33] (03CR) 10jerkins-bot: [V: 04-1] Direct LDF requests to single host to solve paging issues [puppet] - 10https://gerrit.wikimedia.org/r/344197 (https://phabricator.wikimedia.org/T159574) (owner: 10Smalyshev) [20:21:12] (03CR) 10Ottomata: [V: 032 C: 032] Fix typo in refinery data-drop for cirrussearchrequestset [puppet] - 10https://gerrit.wikimedia.org/r/344198 (owner: 10EBernhardson) [20:22:22] (03PS2) 10Smalyshev: Direct LDF requests to single host to solve paging issues [puppet] - 10https://gerrit.wikimedia.org/r/344197 (https://phabricator.wikimedia.org/T159574) [20:23:59] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] [20:24:11] (03PS3) 10Smalyshev: Direct LDF requests to single host to solve paging issues [puppet] - 10https://gerrit.wikimedia.org/r/344197 (https://phabricator.wikimedia.org/T159574) [20:27:15] (03PS4) 10Smalyshev: Direct LDF requests to single host to solve paging issues [puppet] - 10https://gerrit.wikimedia.org/r/344197 (https://phabricator.wikimedia.org/T159574) [20:29:59] (03CR) 10BBlack: [C: 031] Direct LDF requests to single host to solve paging issues [puppet] - 10https://gerrit.wikimedia.org/r/344197 (https://phabricator.wikimedia.org/T159574) (owner: 10Smalyshev) [20:30:29] (03PS1) 10Rush: Revert "labstore: keep archival copy of dynamic export.d contents" [puppet] - 10https://gerrit.wikimedia.org/r/344202 [20:30:37] (03PS2) 10Rush: Revert "labstore: keep archival copy of dynamic export.d contents" [puppet] - 10https://gerrit.wikimedia.org/r/344202 [20:32:29] (03CR) 10Rush: [C: 032] Revert "labstore: keep archival copy of dynamic export.d contents" [puppet] - 10https://gerrit.wikimedia.org/r/344202 (owner: 10Rush) [20:32:50] (03PS1) 10Rush: Revert "labstore: apply exportd monitoring to secondary role" [puppet] 
- 10https://gerrit.wikimedia.org/r/344204 [20:33:12] (03PS2) 10Rush: Revert "labstore: apply exportd monitoring to secondary role" [puppet] - 10https://gerrit.wikimedia.org/r/344204 [20:35:01] (03PS1) 10Thcipriani: Revert "Revert "Group0 to 1.29.0-wmf.17"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344206 [20:35:35] (03CR) 10Thcipriani: [C: 032] Revert "Revert "Group0 to 1.29.0-wmf.17"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344206 (owner: 10Thcipriani) [20:35:51] (03CR) 10Rush: [C: 032] Revert "labstore: apply exportd monitoring to secondary role" [puppet] - 10https://gerrit.wikimedia.org/r/344204 (owner: 10Rush) [20:35:56] 06Operations, 06Discovery, 10Elasticsearch, 10Wikimedia-Logstash, and 2 others: Update logstash on wikimedia to 5.x - https://phabricator.wikimedia.org/T154473#3123190 (10EBernhardson) Pulled the production .kibana index into my local vagrant and tested, looks like all the visualizations "just work". Going... [20:36:42] 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch, 10Wikimedia-Logstash: Import new kibana and logstash .debs to wikimedia experimental repository - https://phabricator.wikimedia.org/T160597#3123191 (10EBernhardson) @gehel could you push these into experimental for me? Will be trying to get... [20:37:32] (03Merged) 10jenkins-bot: Revert "Revert "Group0 to 1.29.0-wmf.17"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344206 (owner: 10Thcipriani) [20:37:41] (03CR) 10jenkins-bot: Revert "Revert "Group0 to 1.29.0-wmf.17"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344206 (owner: 10Thcipriani) [20:38:11] ^ is there something that was going on there? [20:39:44] with what? The revert revert? [20:40:09] more specifically, with what? *my* revert revert? [20:40:57] just resolved some blockers from yesterday reverting my revert now, try it again, such is the lot of train deployment. 
[20:41:10] 07Puppet, 10Deployment-Systems, 10Scap: Unify co-master sync - https://phabricator.wikimedia.org/T161156#3123212 (10demon) [20:41:20] 07Puppet, 10Deployment-Systems, 10Scap: Unify co-master sync - https://phabricator.wikimedia.org/T161156#3123224 (10demon) p:05Triage>03Low [20:42:47] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating [20:43:09] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: Group0 to php-1.29.0-wmf.17 [20:43:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:32] (03CR) 10Chad: [C: 032] Fix labs-specific Dashiki hack with generic enable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336446 (https://phabricator.wikimedia.org/T161038) (owner: 10Milimetric) [20:44:47] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is OK: OK - nfs-exportd is active [20:44:57] (03Merged) 10jenkins-bot: Fix labs-specific Dashiki hack with generic enable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336446 (https://phabricator.wikimedia.org/T161038) (owner: 10Milimetric) [20:45:23] RainbowSprinkles: that change needs this to be deployed too, I think: https://gerrit.wikimedia.org/r/#/c/344007/ [20:45:32] because it removes config that's been moved to extension.json [20:45:47] I used to link those from the commit message itself, but was told not to at some point [20:46:01] Bah [20:46:01] Who told you that? [20:46:11] Depends-On: I12345.... is wonderful [20:46:15] :) [20:46:32] oh, I didn't know that trick, I'll use it in the future [20:46:43] :D [20:47:02] Ok, so the labs bit is ok. If we make a quick follow-up, we can still keep CommonSettings as-is. 
One sec [20:47:45] (03CR) 10jenkins-bot: Fix labs-specific Dashiki hack with generic enable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336446 (https://phabricator.wikimedia.org/T161038) (owner: 10Milimetric) [20:48:23] (03PS1) 10Chad: Restore Dashiki config in CommonSettings for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344208 [20:49:52] milimetric: How's that follow-up look? [20:50:35] uh, sorry, I just saw the follow-up button for the first time, trying to figure out what it is :) [20:51:11] RainbowSprinkles: you mean I should push another change that puts back the config? [20:51:55] I just did [20:51:58] :) [20:52:01] oh! you mean just +1 it duh [20:52:04] (03CR) 10Milimetric: [C: 031] Restore Dashiki config in CommonSettings for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344208 (owner: 10Chad) [20:52:14] (03CR) 10Chad: [C: 032] Restore Dashiki config in CommonSettings for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344208 (owner: 10Chad) [20:54:47] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating [20:54:58] (03Merged) 10jenkins-bot: Restore Dashiki config in CommonSettings for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344208 (owner: 10Chad) [20:55:47] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is OK: OK - nfs-exportd is active [20:56:37] (03PS1) 10Milimetric: Revert "Restore Dashiki config in CommonSettings for now" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344209 [20:57:13] RainbowSprinkles: k, I added that with the depends-on for when the dashiki change gets merged [20:57:16] (03CR) 10jenkins-bot: Restore Dashiki config in CommonSettings for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344208 (owner: 10Chad) [20:57:37] 06Operations, 10Analytics, 10Analytics-Cluster, 10hardware-requests: EQIAD: stat1002 
replacement - https://phabricator.wikimedia.org/T159838#3123246 (10Halfak) To be clear, it's likely that using AMD/opencl could involve "a ton of extra effort", but I don't see a good alternative given what @MoritzMuehlenh... [20:57:38] RainbowSprinkles: btw, can I get added to wmf-deployments, should I ask in Phab? Wiki says to just ask [20:57:44] (the gerrit group) [20:58:04] and ideally someone else on our team so we don't have to self-merge [20:58:22] otto, luca, marcel, nuri-a [20:59:53] milimetric: Phab task [21:00:22] !log demon@tin Synchronized wmf-config/CommonSettings-labs.php: No-op, beta (duration: 00m 43s) [21:00:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:20] 06Operations, 10ops-eqiad, 06Labs, 10Labs-Infrastructure: Labvirt1001 has insanely slow IO - https://phabricator.wikimedia.org/T159835#3123251 (10Andrew) @hashar, it's nothing to do with load. there are no VMs running on labvirt1001 and it still has the problem. [21:03:36] 06Operations, 10Ops-Access-Requests: Add two Analytics team members to wmf-deployments - https://phabricator.wikimedia.org/T161157#3123255 (10Milimetric) [21:05:40] !log demon@tin Synchronized wmf-config/InitialiseSettings-labs.php: No-op, beta (duration: 00m 47s) [21:05:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:03] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [50.0] [21:11:49] thcipriani: twentyafterfour ^ is there a thing going on? [21:12:41] chasemp: these are errors not caused by any deployments, problems connecting to a database. [21:13:14] thcipriani: do jynus or marostegui know? 
[21:13:19] https://logstash.wikimedia.org/goto/363d1976685626748b370a71b4c41850 [21:13:24] I brought it up in _security [21:13:34] but I don't know [21:13:53] PROBLEM - MD RAID on ocg1001 is CRITICAL: CRITICAL: State: degraded, Active: 5, Working: 5, Failed: 1, Spare: 0 [21:13:54] ACKNOWLEDGEMENT - MD RAID on ocg1001 is CRITICAL: CRITICAL: State: degraded, Active: 5, Working: 5, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T161158 [21:13:57] 06Operations, 10ops-eqiad: Degraded RAID on ocg1001 - https://phabricator.wikimedia.org/T161158#3123279 (10ops-monitoring-bot) [21:14:07] * thcipriani makes task [21:14:45] I don't know what es2016.codfw.wmnet is but that's 10.192.48.41 [21:14:57] considering codfw it's not critical? [21:15:16] 06Operations, 10ops-eqiad, 06Labs, 10Labs-Infrastructure: Labvirt1001 has insanely slow IO - https://phabricator.wikimedia.org/T159835#3123283 (10Andrew) The one symptom I'm fixating on is puppet runs. A puppet run on labvirt1001 takes 811.99. The same run on labvirt1002 (which is actually doing useful t... [21:15:46] it may be some check considering the origin. This was ebernhardson's theory [21:16:56] the check should still probably work though, might be worth seeing what happened, but probably not worth delaying the train [21:17:42] I mean you run the risk of real issues getting masked by this since that fatals check is noise [21:18:47] !log rebooting labvirt1001 because it is being terrible. https://phabricator.wikimedia.org/T159835 [21:18:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:03] 06Operations, 10ops-eqiad, 06Labs, 10Labs-Infrastructure: Labvirt1001 has insanely slow IO - https://phabricator.wikimedia.org/T159835#3123314 (10Paladox) what about doing puppet agent -tv --debug --verbose to see what it is taking so long on? 
[21:21:01] (03PS1) 10Madhuvishy: nfs-exportd: Explicitly check if instance address is valid [puppet] - 10https://gerrit.wikimedia.org/r/344220 [21:21:05] (03PS1) 10Chad: Setup apache vhost on scap proxies as well [puppet] - 10https://gerrit.wikimedia.org/r/344221 [21:21:52] (03PS2) 10Chad: Setup apache vhost on scap proxies as well [puppet] - 10https://gerrit.wikimedia.org/r/344221 [21:24:21] 06Operations, 10ops-eqiad, 06Labs, 10Labs-Infrastructure: Labvirt1001 has insanely slow IO - https://phabricator.wikimedia.org/T159835#3080282 (10yuvipanda) @Paladox what kind of things should I be looking for when running `puppet agent -tv --debug --verbose`? [21:25:47] (03CR) 10jerkins-bot: [V: 04-1] Setup apache vhost on scap proxies as well [puppet] - 10https://gerrit.wikimedia.org/r/344221 (owner: 10Chad) [21:25:51] (03PS2) 10Madhuvishy: nfs-exportd: Explicitly check if instance address is valid [puppet] - 10https://gerrit.wikimedia.org/r/344220 [21:26:03] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] [21:26:46] 06Operations, 10ops-eqiad, 06Labs, 10Labs-Infrastructure: Labvirt1001 has insanely slow IO - https://phabricator.wikimedia.org/T159835#3123323 (10Andrew) And post-reboot it's fast again dammit [21:27:36] 06Operations, 10ops-eqiad, 06Labs, 10Labs-Infrastructure: Labvirt1001 has insanely slow IO - https://phabricator.wikimedia.org/T159835#3123325 (10Paladox) @yuvipanda hi, For example running it on gerrit-test3 returns P5113 So maybe it will tell us what bit it gets stuck on the longest. It will go through... [21:27:37] (03PS3) 10Chad: Setup apache vhost on scap proxies as well [puppet] - 10https://gerrit.wikimedia.org/r/344221 [21:28:15] /70/5 [21:28:18] 06Operations, 10ops-eqiad, 06Labs, 10Labs-Infrastructure: Labvirt1001 has insanely slow IO - https://phabricator.wikimedia.org/T159835#3123326 (10yuvipanda) @Paladox thank you. Do you know how to get timing information out of it? 
[21:29:21] 06Operations, 10ops-eqiad, 06Labs, 10Labs-Infrastructure: Labvirt1001 has insanely slow IO - https://phabricator.wikimedia.org/T159835#3123328 (10Paladox) @yuvipanda puppet agent -tv --debug --verbose --evaltrace https://ask.puppet.com/question/2755/howto-trace-execution-time-of-components-of-agent-run/ [21:30:28] 06Operations, 10ops-eqiad, 06Labs, 10Labs-Infrastructure: Labvirt1001 has insanely slow IO - https://phabricator.wikimedia.org/T159835#3123329 (10Paladox) It will show it like Debug: Finishing transaction 20116140 [21:31:04] 06Operations, 10ops-eqiad, 06Labs, 10Labs-Infrastructure: Labvirt1001 has insanely slow IO - https://phabricator.wikimedia.org/T159835#3123330 (10Paladox) or doing "If you have reports=true in your puppet.conf on the agent, you can see the time spent on each resource type. Reports are stored on the agent i... [21:31:57] 06Operations, 10ops-eqiad, 06Labs, 10Labs-Infrastructure: Labvirt1001 has insanely slow IO - https://phabricator.wikimedia.org/T159835#3123343 (10Paladox) actually the command is puppet agent -tv --debug --verbose --evaltrace -td [21:33:30] (03PS3) 10Madhuvishy: nfs-exportd: Explicitly check if instance address is valid [puppet] - 10https://gerrit.wikimedia.org/r/344220 [21:34:57] PROBLEM - puppet last run on maps1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:35:40] 06Operations, 10Ops-Access-Requests, 10Gerrit: archiva-deploy password for Chad H. - https://phabricator.wikimedia.org/T161067#3123353 (10Dzahn) Of course just sharing the password is easiest, but the perfect solution would be if we add another group to pwstore to handle this long-term. right, @Muehlenhoff ? 
[21:36:13] (03CR) 10Chad: "Puppet compiler ended how I'd hoped: https://puppet-compiler.wmflabs.org/5868/" [puppet] - 10https://gerrit.wikimedia.org/r/344221 (owner: 10Chad) [21:36:53] (03PS4) 10Madhuvishy: nfs-exportd: Explicitly check if instance address is valid [puppet] - 10https://gerrit.wikimedia.org/r/344220 [21:40:49] RainbowSprinkles there's now a gerrit-support plugin lol [21:40:50] https://github.com/GerritForge/gerrit-support [21:40:56] https://groups.google.com/forum/#!topic/repo-discuss/mdEbjBH0v3A [21:49:28] RainbowSprinkles mutante oh wow you can name your patches in polygerrit [21:49:33] they're calling it descriptions [21:50:18] 06Operations, 13Patch-For-Review, 15User-fgiunchedi: upgrade netmon1001 to jessie - https://phabricator.wikimedia.org/T125020#3123375 (10Dzahn) @fgiunchedi i rsynced /srv to gerrit2001. to update next week before shutdown, push: `root@netmon1001:/srv# rsync -avp /srv/ rsync://gerrit2001.wikimedia.org/netmon1... [21:51:13] paladox: eh.. as in "the commit message" ? [21:51:27] Nope, you can name your patches [21:51:31] It's different [21:53:35] (03CR) 10Rush: [C: 031] nfs-exportd: Explicitly check if instance address is valid [puppet] - 10https://gerrit.wikimedia.org/r/344220 (owner: 10Madhuvishy) [21:54:57] PROBLEM - puppet last run on mc1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:55:21] (03CR) 10Madhuvishy: [C: 032] nfs-exportd: Explicitly check if instance address is valid [puppet] - 10https://gerrit.wikimedia.org/r/344220 (owner: 10Madhuvishy) [21:57:09] 06Operations, 10Traffic, 10Wikimedia-General-or-Unknown: Disable caching on the main page for anonymous users - https://phabricator.wikimedia.org/T119366#1824561 (10StevenJ81) I don't know if it's related, but over the last two weeks my own .js scripts (global, local and gadget) only run sporadically, and I... 
[21:57:40] 06Operations, 06WMF-Legal, 10Wikimedia-General-or-Unknown, 07Documentation, and 2 others: Default license for operations/puppet - https://phabricator.wikimedia.org/T67270#3123407 (10hashar) For other repositories on which we wanted to set/change the license, we usually have done a list of non-wmf contribut... [21:58:28] (03CR) 10Hashar: [C: 031] "Yes Moritz. For other repositories we have done it via the Phabricator task and once some agreement is reached patch get merged." [puppet] - 10https://gerrit.wikimedia.org/r/183862 (https://phabricator.wikimedia.org/T67270) (owner: 10Rush) [22:02:20] (03PS1) 10Rush: Revert "Revert "labstore: apply exportd monitoring to secondary role"" [puppet] - 10https://gerrit.wikimedia.org/r/344241 [22:02:29] (03PS2) 10Rush: Revert "Revert "labstore: apply exportd monitoring to secondary role"" [puppet] - 10https://gerrit.wikimedia.org/r/344241 [22:02:57] (03PS2) 10Dzahn: Gerrit: Set db host (m2-master) for codfw [puppet] - 10https://gerrit.wikimedia.org/r/344192 (owner: 10Chad) [22:03:47] (03CR) 10Dzahn: [C: 032] "yep, has been discussed on IRC earlier. "< jynus> definitely I would use the master" etc." 
[puppet] - 10https://gerrit.wikimedia.org/r/344192 (owner: 10Chad) [22:04:01] RECOVERY - puppet last run on maps1003 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [22:04:49] (03CR) 10Rush: [C: 032] Revert "Revert "labstore: apply exportd monitoring to secondary role"" [puppet] - 10https://gerrit.wikimedia.org/r/344241 (owner: 10Rush) [22:08:43] (03PS1) 10Rush: Revert "Revert "labstore: keep archival copy of dynamic export.d contents"" [puppet] - 10https://gerrit.wikimedia.org/r/344246 [22:09:04] (03PS2) 10Rush: Revert "Revert "labstore: keep archival copy of dynamic export.d contents"" [puppet] - 10https://gerrit.wikimedia.org/r/344246 [22:10:13] (03CR) 10Rush: [C: 032] Revert "Revert "labstore: keep archival copy of dynamic export.d contents"" [puppet] - 10https://gerrit.wikimedia.org/r/344246 (owner: 10Rush) [22:13:01] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] [22:17:03] (03PS5) 10Madhuvishy: tools: Update maintain-dbusers to create labsdb accounts for tools users [puppet] - 10https://gerrit.wikimedia.org/r/343894 (https://phabricator.wikimedia.org/T158420) [22:22:10] (03CR) 10Madhuvishy: [C: 032] tools: Update maintain-dbusers to create labsdb accounts for tools users [puppet] - 10https://gerrit.wikimedia.org/r/343894 (https://phabricator.wikimedia.org/T158420) (owner: 10Madhuvishy) [22:24:01] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] [22:24:21] RECOVERY - puppet last run on mc1006 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [22:31:06] PROBLEM - puppet last run on praseodymium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:37:00] (03CR) 10Chad: [C: 031] "Um why didn't we do this already?" 
[puppet] - 10https://gerrit.wikimedia.org/r/255958 (owner: 10Reedy) [22:38:05] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] [22:38:25] (03CR) 10Reedy: ""But will it rebase?"" [puppet] - 10https://gerrit.wikimedia.org/r/255958 (owner: 10Reedy) [22:40:54] (03PS3) 10Dzahn: Gerrit: Set db host (m2-master) for codfw [puppet] - 10https://gerrit.wikimedia.org/r/344192 (owner: 10Chad) [22:41:17] RainbowSprinkles: It kinda looks like someone else did it... [22:41:24] Oh, heh [22:41:29] Abandon! [22:41:36] Just checking github [22:42:31] ah, no [22:46:23] (03PS3) 10Reedy: l10nupdate: Reduce code duplication in git clone operations [puppet] - 10https://gerrit.wikimedia.org/r/255958 [22:47:35] better [22:47:50] (03CR) 10Dzahn: [C: 032] Gerrit: Double size of projects cache [puppet] - 10https://gerrit.wikimedia.org/r/344068 (owner: 10Chad) [22:49:24] (03CR) 10Chad: [C: 031] "Manifest changes, but no actual changes: https://puppet-compiler.wmflabs.org/5869/cobalt.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/344053 (owner: 10Chad) [22:50:30] (03CR) 10Chad: [C: 031] l10nupdate: Reduce code duplication in git clone operations [puppet] - 10https://gerrit.wikimedia.org/r/255958 (owner: 10Reedy) [22:51:01] (03PS3) 10Dzahn: Gerrit: Double size of projects cache [puppet] - 10https://gerrit.wikimedia.org/r/344068 (owner: 10Chad) [22:54:40] (03PS2) 10Dzahn: Gerrit: Ensure ops always has admin rights [puppet] - 10https://gerrit.wikimedia.org/r/344069 (owner: 10Chad) [22:56:16] (03CR) 10Dzahn: [C: 031] Mariadb: Move remaining non-module files to the module [puppet] - 10https://gerrit.wikimedia.org/r/344195 (owner: 10Chad) [22:56:25] (03CR) 10Dzahn: [C: 032] Gerrit: Ensure ops always has admin rights [puppet] - 10https://gerrit.wikimedia.org/r/344069 (owner: 10Chad) [22:57:01] (03CR) 10Paladox: Gerrit: Ensure ops always has admin rights (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/344069 (owner: 10Chad) [22:57:51] (03CR) 10Dzahn: [C: 032] Gerrit: Move master/slave detection to profile [puppet] - 10https://gerrit.wikimedia.org/r/344053 (owner: 10Chad) [22:57:56] (03PS4) 10Dzahn: Gerrit: Move master/slave detection to profile [puppet] - 10https://gerrit.wikimedia.org/r/344053 (owner: 10Chad) [22:58:05] RECOVERY - puppet last run on praseodymium is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [22:59:16] !log gerrit: Quick service restart, picking up new config [22:59:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170322T2300). Please do the needful. [23:02:45] nothing to swat so [23:26:05] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] [23:35:50] (03CR) 10Dzahn: "@subbu let me know once this seems ok to do" [puppet] - 10https://gerrit.wikimedia.org/r/343948 (owner: 10Dzahn) [23:37:29] (03PS2) 10Dzahn: Monthly Phabricator stats email: Fix output for zero open UBN! tasks [puppet] - 10https://gerrit.wikimedia.org/r/343766 (https://phabricator.wikimedia.org/T159314) (owner: 10Aklapper) [23:38:49] (03CR) 10Dzahn: [C: 032] Monthly Phabricator stats email: Fix output for zero open UBN! 
tasks [puppet] - 10https://gerrit.wikimedia.org/r/343766 (https://phabricator.wikimedia.org/T159314) (owner: 10Aklapper) [23:44:05] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] [23:48:50] 06Operations, 10Analytics, 10ChangeProp, 10Citoid, and 11 others: Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#3123813 (10Pchelolo) Vagrant was updated to node 6 as well. [23:56:15] PROBLEM - puppet last run on labvirt1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues