[00:00:17] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [00:04:36] !log tstarling@deploy1001 Synchronized wmf-config/set-time-limit.php: T97192 (duration: 00m 52s) [00:04:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:04:42] T97192: HHVM request timeouts not working; support lowering the API request timeout per request - https://phabricator.wikimedia.org/T97192 [00:05:11] (03CR) 10jenkins-bot: Set PHP time limit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/458623 (https://phabricator.wikimedia.org/T97192) (owner: 10Tim Starling) [00:07:18] !log tstarling@deploy1001 Synchronized wmf-config/PhpAutoPrepend.php: T97192 (duration: 00m 49s) [00:07:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:07:28] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 22 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [00:10:57] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 20 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [00:12:04] !log tstarling@deploy1001 Synchronized w/infinite-loop.php: Testing for T97192 (duration: 00m 48s) [00:12:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:12:10] T97192: HHVM request timeouts not working; support lowering the API request timeout per request - https://phabricator.wikimedia.org/T97192 [00:15:57] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 17 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [00:17:38] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [00:24:58] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is 
CRITICAL: CRITICAL - failed 20 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [00:30:07] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [00:32:28] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 20 probes of 317 (alerts on 19) - https://atlas.ripe.net/measurements/11645088/#!map [00:37:37] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 18 probes of 317 (alerts on 19) - https://atlas.ripe.net/measurements/11645088/#!map [00:41:04] (03PS1) 10Tim Starling: Fix job runner timeout, use the new timeouts in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/459411 [00:42:17] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [00:42:38] (03PS2) 10Tim Starling: Fix job runner timeout, use the new timeouts in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/459411 [00:42:54] (03CR) 10Tim Starling: [C: 032] Fix job runner timeout, use the new timeouts in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/459411 (owner: 10Tim Starling) [00:44:48] (03Merged) 10jenkins-bot: Fix job runner timeout, use the new timeouts in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/459411 (owner: 10Tim Starling) [00:46:33] !log tstarling@deploy1001 Synchronized wmf-config/set-time-limit.php: (no justification provided) (duration: 00m 49s) [00:46:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:47:18] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [00:51:06] (03CR) 10jenkins-bot: Fix job runner timeout, use the new timeouts in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/459411 
(owner: 10Tim Starling) [00:51:58] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 20 probes of 317 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [00:56:58] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 16 probes of 317 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [00:59:38] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [01:04:38] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [01:11:57] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [01:16:58] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [01:24:17] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [01:34:27] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [01:39:07] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 20 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [01:41:47] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [01:44:07] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 18 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [01:51:59] RECOVERY - IPv6 ping to eqiad on 
ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [01:56:27] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 20 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [02:01:28] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 18 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [02:16:37] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [02:21:37] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [02:38:57] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [02:39:17] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [02:40:08] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.20) (duration: 13m 48s) [02:40:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:43:57] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [02:48:05] 08Warning Alert for device cr1-eqsin.wikimedia.org - Traffic on tunnel link [02:51:00] !log l10nupdate@deploy1001 ResourceLoader cache refresh completed at Mon Sep 10 02:51:00 UTC 2018 (duration 10m 52s) [02:51:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:54:05] 08̶W̶a̶r̶n̶i̶n̶g Device cr1-eqsin.wikimedia.org recovered from Traffic on 
tunnel link [02:56:18] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [03:01:18] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [03:08:38] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [03:17:27] PROBLEM - puppet last run on lvs5003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:18:48] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [03:20:28] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [03:26:07] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [03:26:32] (03CR) 10Krinkle: Fix job runner timeout, use the new timeouts in labs (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/459411 (owner: 10Tim Starling) [03:29:58] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 839.52 seconds [03:31:08] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [03:42:57] RECOVERY - puppet last run on lvs5003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [03:50:57] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 21 probes of 318 (alerts on 19) 
- https://atlas.ripe.net/measurements/1790947/#!map [03:51:38] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 248.38 seconds [03:56:51] (03PS3) 10Mathew.onipe: elasticsearch shard size check * Checks shard size and sends alert if more than 30gb. [puppet] - 10https://gerrit.wikimedia.org/r/458891 (https://phabricator.wikimedia.org/T203546) [03:57:44] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch shard size check * Checks shard size and sends alert if more than 30gb. [puppet] - 10https://gerrit.wikimedia.org/r/458891 (https://phabricator.wikimedia.org/T203546) (owner: 10Mathew.onipe) [03:58:47] (03CR) 10Mathew.onipe: elasticsearch shard size check * Checks shard size and sends alert if more than 30gb. (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/458891 (https://phabricator.wikimedia.org/T203546) (owner: 10Mathew.onipe) [04:00:58] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [04:01:15] (03CR) 10Mathew.onipe: elasticsearch shard size check * Checks shard size and sends alert if more than 30gb. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/458891 (https://phabricator.wikimedia.org/T203546) (owner: 10Mathew.onipe) [04:08:18] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [04:13:27] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [04:36:48] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [04:41:48] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [04:48:07] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [04:51:18] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [04:54:08] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 21 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [04:55:07] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [04:57:17] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [04:59:17] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [05:06:18] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [05:10:48] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [05:16:28] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [05:17:27] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [05:28:48] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [05:33:48] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is 
OK: OK - failed 18 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [05:39:17] (03Abandoned) 10KartikMistry: WIP: Beta: Use Restbase provided public API instead of CXServer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431526 (https://phabricator.wikimedia.org/T163203) (owner: 10KartikMistry) [05:41:07] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 21 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [05:46:07] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [05:49:39] ACKNOWLEDGEMENT - Device not healthy -SMART- on db2053 is CRITICAL: cluster=mysql device=cciss,11 instance=db2053:9100 job=node site=codfw Marostegui T203623 - The acknowledgement expires at: 2018-09-11 05:49:08. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2053&var-datasource=codfw%2520prometheus%252Fops [05:58:18] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [06:03:18] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [06:07:16] (03CR) 10Urbanecm: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/459383 (https://phabricator.wikimedia.org/T203909) (owner: 10Urbanecm) [06:11:58] PROBLEM - Host re0.cr1-eqsin is DOWN: PING CRITICAL - Packet loss = 100% [06:11:58] PROBLEM - Host cp5006.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [06:13:01] (03PS2) 10Urbanecm: New throttle rule for Czech school [mediawiki-config] - 10https://gerrit.wikimedia.org/r/459383 (https://phabricator.wikimedia.org/T203909) [06:15:37] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 318 (alerts 
on 19) - https://atlas.ripe.net/measurements/1790947/#!map [06:16:28] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 21 probes of 317 (alerts on 19) - https://atlas.ripe.net/measurements/11645088/#!map [06:17:08] RECOVERY - Host cp5006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 244.90 ms [06:17:08] RECOVERY - Host re0.cr1-eqsin is UP: PING OK - Packet loss = 0%, RTA = 240.44 ms [06:20:37] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [06:21:37] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 17 probes of 317 (alerts on 19) - https://atlas.ripe.net/measurements/11645088/#!map [06:32:47] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [06:37:48] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [06:42:33] Hello [06:50:16] There is an UBN throttle rule to deploy for a large scale educational project starting this morning. https://phabricator.wikimedia.org/T203909 [06:56:20] <_joe_> I don't think it's a correct classification [06:56:34] <_joe_> but I can deploy such a throttle rule if needed [06:57:11] Yes, high is more appropriate. [06:57:14] <_joe_> UBN! is reserved for things that are broken and need to be fixed with maximum priority. This is at best a config change requested very very late :) [06:57:39] <_joe_> Dereckson: are you working on the patch or should I? [06:57:46] as you wish [06:58:01] there is a change ready here by Urbanecm: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/459383/ [06:58:03] <_joe_> oh I see Urbanecm already created one [06:58:17] <_joe_> yes, I was reading the ticket backwards now [06:58:45] You deploy it or should I? 
[06:59:10] <_joe_> I'm deploying it, don't worry [06:59:14] Thanks [06:59:24] (03CR) 10Giuseppe Lavagetto: [C: 032] New throttle rule for Czech school [mediawiki-config] - 10https://gerrit.wikimedia.org/r/459383 (https://phabricator.wikimedia.org/T203909) (owner: 10Urbanecm) [06:59:57] PROBLEM - Check systemd state on analytics1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [07:00:01] <_joe_> btw, this is the kind of thing that should /not/ require a code deploy [07:00:21] (I am playing with an1003) [07:00:31] Indeed, but how do you want to prioritize a proper throttle extension? [07:00:40] (03Merged) 10jenkins-bot: New throttle rule for Czech school [mediawiki-config] - 10https://gerrit.wikimedia.org/r/459383 (https://phabricator.wikimedia.org/T203909) (owner: 10Urbanecm) [07:00:42] <_joe_> Dereckson: heh [07:02:15] There is a Phabricator task with 10 years history of people developing or intending to develop this extension, so rules could be added on meta, but it never has been a planned thing for a team. [07:02:38] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [07:03:04] and "configuration" wasn't really treated seriously as a product per se [07:03:23] more something between releng, and features producing teams [07:04:00] so on the next round of ideas for planning, we should raise the concerns about such long-term tasks [07:04:14] !log oblivian@deploy1001 Synchronized wmf-config/throttle.php: Deploy throttle rule for Czech School T203909 (duration: 00m 51s) [07:04:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:04:20] T203909: Allow IP for creating account for school project for 30 days - https://phabricator.wikimedia.org/T203909 [07:05:13] Thanks for the deploy. 
[07:05:22] <_joe_> Dereckson: counting about 1 hour of work per throttle rule, we should calculate the amount of man-hours spent over the years for this [07:05:25] <_joe_> :P [07:05:39] <_joe_> that might convince someone it's a good use of engineering time [07:05:42] <_joe_> or maybe not [07:07:38] _joe_, well, raise T27000's priority :) [07:07:38] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [07:07:38] T27000: Deploy ThrottleOverride extension to Wikimedia wikis - https://phabricator.wikimedia.org/T27000 [07:08:47] BTW, thank you for your deploy _joe_ :) [07:09:20] <_joe_> Urbanecm: it literally took me 5 minutes [07:09:22] <_joe_> :) [07:11:46] Well, still a reason to thank you :) [07:15:56] (03PS2) 10Muehlenhoff: Add Cumin aliases for ATS [puppet] - 10https://gerrit.wikimedia.org/r/458800 [07:18:08] RECOVERY - pdfrender on scb2004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.075 second response time [07:18:18] !log restarted pdfrender on scb2004 - T174916 [07:18:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:23] T174916: electron/pdfrender hangs - https://phabricator.wikimedia.org/T174916 [07:18:36] (03CR) 10Muehlenhoff: [C: 032] Add Cumin aliases for ATS [puppet] - 10https://gerrit.wikimedia.org/r/458800 (owner: 10Muehlenhoff) [07:20:08] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [07:20:58] PROBLEM - Disk space on elastic1017 is CRITICAL: DISK CRITICAL - free space: /srv 51087 MB (10% inode=99%) [07:25:08] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [07:25:58] 10Operations, 10ops-codfw: wtp2020 correctable memory errors - https://phabricator.wikimedia.org/T194176 
(10MoritzMuehlenhoff) [07:26:00] 10Operations: wtp2020 - Memory correctable errors -EDAC- - https://phabricator.wikimedia.org/T203265 (10MoritzMuehlenhoff) [07:26:22] 10Operations, 10ops-codfw: wtp2020 correctable memory errors - https://phabricator.wikimedia.org/T194176 (10MoritzMuehlenhoff) a:03Papaul [07:31:58] (03PS1) 10Elukey: Fix analytics Nagios plugin adding proper return codes [puppet] - 10https://gerrit.wikimedia.org/r/459454 (https://phabricator.wikimedia.org/T172532) [07:32:28] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [07:34:07] (03CR) 10Elukey: [C: 032] Fix analytics Nagios plugin adding proper return codes [puppet] - 10https://gerrit.wikimedia.org/r/459454 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [07:40:57] (03CR) 10jenkins-bot: New throttle rule for Czech school [mediawiki-config] - 10https://gerrit.wikimedia.org/r/459383 (https://phabricator.wikimedia.org/T203909) (owner: 10Urbanecm) [07:42:27] RECOVERY - Check systemd state on analytics1003 is OK: OK - running: The system is fully operational [07:42:38] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [07:46:58] !log installing ghostscript security updates [07:47:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:58] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [08:00:07] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [08:01:39] (03CR) 10Hashar: [C: 031] "8-)" [puppet] - 10https://gerrit.wikimedia.org/r/458751 (owner: 10Giuseppe Lavagetto) [08:03:48] PROBLEM - IPv6 ping to codfw 
on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 20 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [08:04:58] !log Drop unused root grants from core servers [08:05:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:38] RECOVERY - Disk space on elastic1017 is OK: DISK OK [08:08:57] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 19 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [08:14:26] (03PS1) 10Gilles: analytics-privatedata-users access for Dario Rossi [puppet] - 10https://gerrit.wikimedia.org/r/459500 (https://phabricator.wikimedia.org/T201196) [08:14:28] (03PS1) 10Gilles: analytics-privatedata-users access for Flavia Salutari [puppet] - 10https://gerrit.wikimedia.org/r/459501 (https://phabricator.wikimedia.org/T201199) [08:15:28] PROBLEM - Filesystem available is greater than filesystem size on ms-be1041 is CRITICAL: cluster=swift device=/dev/sdk1 fstype=xfs instance=ms-be1041:9100 job=node mountpoint=/srv/swift-storage/sdk1 site=eqiad https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be1041&var-datasource=eqiad%2520prometheus%252Fops [08:18:02] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: analytics-privatedata-users access for Dario Rossi (username drossi) - https://phabricator.wikimedia.org/T201196 (10elukey) @Rossi.dario.g Hi! When you have a moment please read https://wikitech.wikimedia.org/wiki/Analytics/Data_access#User_responsibil... [08:18:23] 10Operations, 10Puppet: Debian package or files managed my puppet for pt-kill-wmf - https://phabricator.wikimedia.org/T203674 (10jcrespo) [08:18:42] (03CR) 10Gehel: "This is looking good! Thanks! Very minor (and hopefully last) comment on the script itself." 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/458891 (https://phabricator.wikimedia.org/T203546) (owner: 10Mathew.onipe) [08:19:48] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 21 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [08:22:04] !log Drop users metric and wikilytics from core databases [08:22:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:17] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: analytics-privatedata-users access for Flavia Salutari - https://phabricator.wikimedia.org/T201199 (10elukey) Hi Flavia! When you have a moment please read https://wikitech.wikimedia.org/wiki/Analytics/Data_access#User_responsibilities [08:24:48] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [08:26:08] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 20 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [08:26:32] (03CR) 10Muehlenhoff: analytics-privatedata-users access for Flavia Salutari (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/459501 (https://phabricator.wikimedia.org/T201199) (owner: 10Gilles) [08:28:08] (03CR) 10Muehlenhoff: analytics-privatedata-users access for Dario Rossi (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/459500 (https://phabricator.wikimedia.org/T201196) (owner: 10Gilles) [08:31:17] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 19 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [08:31:36] (03PS2) 10Gilles: analytics-privatedata-users access for Dario Rossi [puppet] - 10https://gerrit.wikimedia.org/r/459500 (https://phabricator.wikimedia.org/T201196) [08:32:02] (03CR) 10Gilles: analytics-privatedata-users access for Dario Rossi (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/459500 (https://phabricator.wikimedia.org/T201196) (owner: 10Gilles) [08:32:07] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 21 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [08:35:31] (03PS2) 10Ema: cache_text: remove lab{,test}spice [puppet] - 10https://gerrit.wikimedia.org/r/458769 [08:36:33] (03CR) 10Ema: [C: 032] cache_text: remove lab{,test}spice [puppet] - 10https://gerrit.wikimedia.org/r/458769 (owner: 10Ema) [08:37:13] (03PS2) 10Gilles: analytics-privatedata-users access for Flavia Salutari [puppet] - 10https://gerrit.wikimedia.org/r/459501 (https://phabricator.wikimedia.org/T201199) [08:37:42] (03CR) 10Gilles: analytics-privatedata-users access for Flavia Salutari (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/459501 (https://phabricator.wikimedia.org/T201199) (owner: 10Gilles) [08:41:45] (03PS1) 10Elukey: Allow analytics-admins to use journalctl to inspect logs [puppet] - 10https://gerrit.wikimedia.org/r/459508 (https://phabricator.wikimedia.org/T172532) [08:42:18] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [08:44:22] (03CR) 10Muehlenhoff: [C: 031] "Looks good, no objections have been raised on the Phab task, good to merge." 
[puppet] - 10https://gerrit.wikimedia.org/r/459501 (https://phabricator.wikimedia.org/T201199) (owner: 10Gilles) [08:44:26] (03CR) 10Muehlenhoff: [C: 032] analytics-privatedata-users access for Flavia Salutari [puppet] - 10https://gerrit.wikimedia.org/r/459501 (https://phabricator.wikimedia.org/T201199) (owner: 10Gilles) [08:45:06] (03PS3) 10Muehlenhoff: analytics-privatedata-users access for Dario Rossi [puppet] - 10https://gerrit.wikimedia.org/r/459500 (https://phabricator.wikimedia.org/T201196) (owner: 10Gilles) [08:45:30] (03CR) 10Muehlenhoff: [C: 031] "Looks good, no objections have been raised, good to merge." [puppet] - 10https://gerrit.wikimedia.org/r/459500 (https://phabricator.wikimedia.org/T201196) (owner: 10Gilles) [08:46:33] (03PS3) 10Gilles: analytics-privatedata-users access for Flavia Salutari [puppet] - 10https://gerrit.wikimedia.org/r/459501 (https://phabricator.wikimedia.org/T201199) [08:46:44] (03CR) 10Muehlenhoff: [C: 032] analytics-privatedata-users access for Dario Rossi [puppet] - 10https://gerrit.wikimedia.org/r/459500 (https://phabricator.wikimedia.org/T201196) (owner: 10Gilles) [08:48:16] FYI: SREs will soon be doing a "live" test of the switchover process. It should be without any repercussions, but still keep it in mind. 
[08:48:47] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 20 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [08:53:48] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 19 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [09:06:19] 10Operations, 10SRE-Access-Requests: NDA access for Telecom Paristech Research Team - https://phabricator.wikimedia.org/T200800 (10MoritzMuehlenhoff) [09:06:25] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: analytics-privatedata-users access for Flavia Salutari - https://phabricator.wikimedia.org/T201199 (10MoritzMuehlenhoff) 05Open>03Resolved a:03MoritzMuehlenhoff [09:06:55] 10Operations, 10SRE-Access-Requests: NDA access for Telecom Paristech Research Team - https://phabricator.wikimedia.org/T200800 (10MoritzMuehlenhoff) [09:07:02] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: analytics-privatedata-users access for Dario Rossi (username drossi) - https://phabricator.wikimedia.org/T201196 (10MoritzMuehlenhoff) 05Open>03Resolved a:05Rossi.dario.g>03MoritzMuehlenhoff [09:08:59] (03PS1) 10Volans: sre.switchdc.mediawiki: ask for confirmation [cookbooks] - 10https://gerrit.wikimedia.org/r/459511 (https://phabricator.wikimedia.org/T199079) [09:09:37] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: rack/setup/install analytics-master100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T201939 (10elukey) @RobH we thought to schedule the maintenance window to swap analytics100[1,2] with analytics-master100[1,2] for Sept 22nd, and I'd like to sen... 
[09:10:52] (03CR) 10Alexandros Kosiaris: [C: 031] sre.switchdc.mediawiki: ask for confirmation [cookbooks] - 10https://gerrit.wikimedia.org/r/459511 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [09:10:52] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 21 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [09:11:08] (03CR) 10Volans: [C: 032] sre.switchdc.mediawiki: ask for confirmation [cookbooks] - 10https://gerrit.wikimedia.org/r/459511 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [09:11:49] <_joe_> did you guys just wipe out the caches in eqiad? :P [09:11:52] (03Merged) 10jenkins-bot: sre.switchdc.mediawiki: ask for confirmation [cookbooks] - 10https://gerrit.wikimedia.org/r/459511 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [09:12:16] _joe_: lol, ofc not, we just want to make sure we don't do that :-P [09:15:52] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [09:18:03] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 20 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [09:22:14] 10Operations, 10Operations-Software-Development, 10User-Joe, 10User-jijiki: Convert automation scripts to spicerack cookbooks - https://phabricator.wikimedia.org/T203943 (10Joe) [09:27:00] 10Operations, 10Operations-Software-Development, 10User-Joe, 10User-jijiki: Create a spicerack cookbook for restoring an etcd cluster from backups - https://phabricator.wikimedia.org/T203944 (10Joe) p:05Triage>03Normal [09:28:13] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 21 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [09:28:13] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 18 probes of 318 (alerts on 
19) - https://atlas.ripe.net/measurements/1791212/#!map [09:30:32] !log starting execution of "cookbook sre.switchdc.mediawiki --live-test codfw eqiad" - T199073 [09:30:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:39] T199073: Perform a datacenter switchover (2018-19 Q1) - https://phabricator.wikimedia.org/T199073 [09:31:06] expect SAL messages from the cookbooks, those are real, but we're "migrating" from codfw to eqiad, so mostly noop [09:31:43] !log START - Cookbook sre.switchdc.mediawiki.00-disable-puppet (volans@sarin) [09:31:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:50] "Those are real" is nice [09:31:51] :D [09:32:05] :) [09:32:10] !log END (PASS) - Cookbook sre.switchdc.mediawiki.00-disable-puppet (exit_code=0) (volans@sarin) [09:32:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:22] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [09:35:33] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 20 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [09:36:19] !log START - Cookbook sre.switchdc.mediawiki.00-reduce-ttl (volans@sarin) [09:36:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:25] !log END (PASS) - Cookbook sre.switchdc.mediawiki.00-reduce-ttl (exit_code=0) (volans@sarin) [09:36:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:22] 10Operations, 10ops-eqiad, 10DC-Ops: Replace wtp1043's sda - https://phabricator.wikimedia.org/T196886 (10faidon) 05Resolved>03Open We're still getting RAID alerts about this host. 
[09:40:43] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 22 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [09:44:24] (03PS1) 10Volans: sre.switchdc.mediawiki: remove memcache wipe [cookbooks] - 10https://gerrit.wikimedia.org/r/459517 (https://phabricator.wikimedia.org/T199079) [09:45:38] (03Abandoned) 10Volans: sre.switchdc.mediawiki: remove memcache wipe [cookbooks] - 10https://gerrit.wikimedia.org/r/459517 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [09:45:43] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 19 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [09:45:52] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [09:47:19] (03PS2) 10Volans: sre.switchdc.mediawiki: Do not restart memcached before warming up [cookbooks] - 10https://gerrit.wikimedia.org/r/458793 (owner: 10Giuseppe Lavagetto) [09:47:33] I've rebased it due to a conflict _joe_ ^^^ [09:48:09] (03CR) 10Volans: [C: 032] sre.switchdc.mediawiki: Do not restart memcached before warming up [cookbooks] - 10https://gerrit.wikimedia.org/r/458793 (owner: 10Giuseppe Lavagetto) [09:48:37] 10Operations, 10Goal, 10Patch-For-Review: Perform a datacenter switchover (2018-19 Q1) - https://phabricator.wikimedia.org/T199073 (10akosiaris) [09:48:39] 10Operations, 10Puppet, 10DBA: Remove all usages of $::mw_primary on puppet - https://phabricator.wikimedia.org/T199124 (10akosiaris) 05Open>03Resolved $::mw_primary is removed from puppet now. Resolving this. [09:48:46] (03Merged) 10jenkins-bot: sre.switchdc.mediawiki: Do not restart memcached before warming up [cookbooks] - 10https://gerrit.wikimedia.org/r/458793 (owner: 10Giuseppe Lavagetto) [09:50:37] akosiaris: 'git grep' disagrees with you :) [09:51:02] ?? [09:51:16] about ? $app_routes or $aqs_site ? 
[09:51:22] mw_primary [09:51:30] 10Operations, 10Puppet, 10DBA: Remove all usages of $::mw_primary on puppet - https://phabricator.wikimedia.org/T199124 (10jcrespo) Done at https://gerrit.wikimedia.org/r/457491 [09:51:39] paravoid: it's $::mw_primary [09:51:47] not $mw_primary [09:51:56] which is what git grep returns results for [09:52:06] well ok sure, but this approach is effectively the same as the old one isn't it? [09:52:14] we wanted to ditch the top scope variable, not the per module ones [09:52:24] $mw_primary is a local var [09:52:34] the top one is gotten from etcd [09:52:34] i.e. a puppet run is needed for this check to react to a MW primary change [09:52:59] sure, but puppet is only used for monitoring changes [09:53:03] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 20 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [09:53:11] !log START - Cookbook sre.switchdc.mediawiki.00-wipe-and-warmup-caches (volans@sarin) [09:53:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:20] in particular, for dbs is only used to make critical only the active dc [09:53:32] codfw caches, don't worry :) [09:53:40] paravoid: a better solution will be wanted for mwmaint [09:53:58] but in general, they are not on the critical path [09:54:07] (03CR) 10ArielGlenn: [C: 031] "everything looks good." [puppet] - 10https://gerrit.wikimedia.org/r/458877 (https://phabricator.wikimedia.org/T202708) (owner: 10Dzahn) [09:54:22] yeah, that' [09:54:30] that's exactly what we said when we introduced mw_primary :P [09:55:02] well, for dbs it is a non-issue, and will disappear anyway once we are active active [09:55:03] I don't particularly disagree with the change i.e. making this a better-scoped variable [09:55:03] OR [09:55:03] (03CR) 10Gehel: [C: 04-1] elasticsearch shard size check * Checks shard size and sends alert if more than 30gb. 
(033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/458891 (https://phabricator.wikimedia.org/T203546) (owner: 10Mathew.onipe) [09:55:11] we can set all dbs as critical [09:55:12] but it's not really what I expected this task to be about [09:55:17] note btw there is an improvement. The puppet run is still required, but not a puppet-merge [09:55:50] I had always thought of this change to be about not managing this kind of state within puppet [09:56:11] paravoid: I still have my patches to query etcd directly from icinga [09:56:16] maybe I had misunderstood the scope of this task [09:56:17] but icinga is a blocker [09:56:30] but even so, then maybe we need a task to that effect :) [09:56:33] because it doesn't allow dynamic state [09:56:42] PROBLEM - HHVM jobrunner on mw2267 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.073 second response time [09:56:46] so we would need an icinga replacement for that [09:57:14] (that was the main blocker of that issues) [09:57:35] I don't understand why an icinga replacement is needed, do you have a pointer? [09:57:52] RECOVERY - HHVM jobrunner on mw2267 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.074 second response time [09:58:12] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 19 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [09:58:13] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 22 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [09:59:17] what's up with the ripe atlas probes ? 
[09:59:18] paravoid: read my patches where I comment on the issue [09:59:22] PROBLEM - Nginx local proxy to apache on mw2289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:59:23] PROBLEM - HHVM rendering on mw2220 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:59:32] paravoid: one sec so I can give you a link [10:00:22] RECOVERY - Nginx local proxy to apache on mw2289 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.194 second response time [10:00:23] RECOVERY - HHVM rendering on mw2220 is OK: HTTP OK: HTTP/1.1 200 OK - 80155 bytes in 0.291 second response time [10:01:10] those alarms I guess were from the warmup script [10:01:25] paravoid: https://gerrit.wikimedia.org/r/345346 [10:01:55] icinga cannot change its internal state without puppet (or a full config reload, which is the same) [10:02:36] !log END (PASS) - Cookbook sre.switchdc.mediawiki.00-wipe-and-warmup-caches (exit_code=0) (volans@sarin) [10:02:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:23] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [10:04:32] <_joe_> we should change the way we manage the icinga config, more precisely. 
I didn't want to fiddle with contact groups with so little time before the switch, but I think something can be done by creating special contact group files that we can then manage via confd directly, bypassing puppet [10:05:13] <_joe_> similarly, there is something that can be done for the cronjobs by working on a wrapper for the scripts, but that requires much more time and verification too [10:06:12] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [10:08:23] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [10:09:25] <_joe_> jynus: so my idea was to create new contact groups called "db-core-$::site", and make it paging or not paging depending on the state in etcd [10:09:53] <_joe_> that could work, but I just had the idea over the weekend [10:09:53] _joe_: using a contact command that is not echo | mail I guess [10:10:11] but a small wrapper that checks etcd [10:10:16] (03CR) 10Muehlenhoff: Allow analytics-admins to use journalctl to inspect logs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/459508 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [10:10:16] <_joe_> volans: we would have two configs: one for paging, one for sending an email [10:10:23] <_joe_> and swap the two [10:10:36] <_joe_> no, I want to avoid depending strictly on etcd to send out pages [10:10:42] two full configs? [10:10:54] (03CR) 10Elukey: "Yep! I'll add it to today's agenda!" 
[puppet] - 10https://gerrit.wikimedia.org/r/459508 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [10:10:55] <_joe_> no, just of the contact groups [10:11:04] <_joe_> we can talk about this later [10:11:12] ah ok, but still need to reload icinga, sure later [10:11:14] <_joe_> maybe we can try before the switchback :) [10:11:17] <_joe_> yes [10:12:28] !log START - Cookbook sre.switchdc.mediawiki.01-stop-maintenance (volans@sarin) [10:12:29] !log END (FAIL) - Cookbook sre.switchdc.mediawiki.01-stop-maintenance (exit_code=99) (volans@sarin) [10:12:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:52] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 20 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [10:18:44] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review, 10User-Addshore: Requesting Access to view EventLogging data for gabriel-wmde / gbirke - https://phabricator.wikimedia.org/T202072 (10ArielGlenn) Pinging @gabriel-wmde, this is just waiting on your input. 
[10:21:35] (03PS1) 10Ema: ATS request routing: fix yarn, remove scolarships [puppet] - 10https://gerrit.wikimedia.org/r/459520 (https://phabricator.wikimedia.org/T199720) [10:22:53] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 19 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [10:23:03] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 22 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [10:25:18] (03CR) 10Muehlenhoff: ATS request routing: fix yarn, remove scolarships (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/459520 (https://phabricator.wikimedia.org/T199720) (owner: 10Ema) [10:25:20] !log START - Cookbook sre.switchdc.mediawiki.02-set-readonly (volans@sarin) [10:25:20] !log [DRY-RUN] MediaWiki read-only period starts at: 2018-09-10 10:25:20.558408 (volans@sarin) [10:25:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:32] !log END (PASS) - Cookbook sre.switchdc.mediawiki.02-set-readonly (exit_code=0) (volans@sarin) [10:25:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:30] !log uploaded linux-meta 1.20+deb9u2 for apt.wikimedia.org/stretch-wikimedia [10:27:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:13] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [10:28:47] 10Operations: Trying to install updated versions of "linux-meta linux-meta-4.9" fails - https://phabricator.wikimedia.org/T203851 (10MoritzMuehlenhoff) 05Open>03Resolved a:03MoritzMuehlenhoff This is fixed in 1.20+deb9u2 which only builds the "linux-meta-4.14" package for stretch, "linux-meta-4.9" isn't re... 
[10:28:54] !log START - Cookbook sre.switchdc.mediawiki.03-set-db-readonly (volans@sarin) [10:28:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:20] !log END (PASS) - Cookbook sre.switchdc.mediawiki.03-set-db-readonly (exit_code=0) (volans@sarin) [10:29:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:04] jan_drewniak: How many deployers does it take to do Wikimedia Portals Update deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180910T1030). [10:30:40] (03PS40) 10Gehel: Convert elasticsearch to systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/440498 (https://phabricator.wikimedia.org/T198351) (owner: 10EBernhardson) [10:30:42] (03PS5) 10Gehel: elasticsearch: disable the default elasticsearch unit [puppet] - 10https://gerrit.wikimedia.org/r/458464 (https://phabricator.wikimedia.org/T198351) [10:31:00] (03CR) 10jerkins-bot: [V: 04-1] Convert elasticsearch to systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/440498 (https://phabricator.wikimedia.org/T198351) (owner: 10EBernhardson) [10:31:09] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: disable the default elasticsearch unit [puppet] - 10https://gerrit.wikimedia.org/r/458464 (https://phabricator.wikimedia.org/T198351) (owner: 10Gehel) [10:31:18] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/459523 (https://phabricator.wikimedia.org/T128546) [10:31:58] (03PS1) 10ArielGlenn: fix typo in dump dirs base shell script [puppet] - 10https://gerrit.wikimedia.org/r/459524 [10:33:12] (03CR) 10ArielGlenn: [C: 032] fix typo in dump dirs base shell script [puppet] - 10https://gerrit.wikimedia.org/r/459524 (owner: 10ArielGlenn) [10:35:23] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 21 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [10:35:32] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad 
IPv6 is CRITICAL: CRITICAL - failed 23 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [10:36:13] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 20 probes of 317 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [10:37:05] (03CR) 10Jdrewniak: [C: 032] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/459523 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:37:17] (03CR) 10Hoo man: Create wikidata ntriples dump from ttl dump (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/447922 (https://phabricator.wikimedia.org/T144103) (owner: 10Smalyshev) [10:37:55] (03PS41) 10Gehel: Convert elasticsearch to systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/440498 (https://phabricator.wikimedia.org/T198351) (owner: 10EBernhardson) [10:37:57] (03PS6) 10Gehel: elasticsearch: disable the default elasticsearch unit [puppet] - 10https://gerrit.wikimedia.org/r/458464 (https://phabricator.wikimedia.org/T198351) [10:38:34] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/459523 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:38:44] !log START - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki (volans@sarin) [10:38:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:52] !log END (PASS) - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki (exit_code=0) (volans@sarin) [10:38:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:52] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:459523|Bumping portals to master (T128546)]] (duration: 00m 50s) [10:40:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:57] (03CR) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/459523 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:40:57] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [10:41:22] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 17 probes of 317 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [10:41:26] we should throw out icinga... [10:41:42] !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:459523|Bumping portals to master (T128546)]] (duration: 00m 50s) [10:41:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:32] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 18 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [10:48:40] !log START - Cookbook sre.switchdc.mediawiki.04-switch-traffic (volans@sarin) [10:48:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:55] !log END (PASS) - Cookbook sre.switchdc.mediawiki.04-switch-traffic (exit_code=0) (volans@sarin) [10:50:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:39] (03CR) 10Mholloway: [C: 031] maps: migrate maps2004 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/457408 (https://phabricator.wikimedia.org/T198622) (owner: 10Gehel) [10:52:49] !log START - Cookbook sre.switchdc.mediawiki.05-invert-redis-sessions (volans@sarin) [10:52:51] !log END (PASS) - Cookbook sre.switchdc.mediawiki.05-invert-redis-sessions (exit_code=0) (volans@sarin) [10:52:52] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 21 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [10:52:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:21] 
(03PS9) 10Marostegui: mariadb: Set pages for multi-instance hosts [puppet] - 10https://gerrit.wikimedia.org/r/449711 (https://phabricator.wikimedia.org/T200509) [10:57:52] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 19 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [10:57:54] !log START - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite (volans@sarin) [10:57:57] !log END (PASS) - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite (exit_code=0) (volans@sarin) [10:57:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:50] just observing and noticed that stop-maintenance failed? is that fine? is this a dry run? or? [11:00:05] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: (Dis)respected human, time to deploy European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180910T1100). Please do the needful. [11:00:05] Amir1: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:07] (03PS10) 10Marostegui: mariadb: Set pages for multi-instance hosts [puppet] - 10https://gerrit.wikimedia.org/r/449711 (https://phabricator.wikimedia.org/T200509) [11:00:23] mine is not testable at all [11:00:24] o/ [11:00:28] aaah, I see the first SAL entry is --live-test, so I guess this is a test [11:00:30] Amir1: you're deploying yourself? 
[11:00:40] if you want me, sure thing [11:00:45] Amir1: please do :) [11:00:48] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Set pages for multi-instance hosts [puppet] - 10https://gerrit.wikimedia.org/r/449711 (https://phabricator.wikimedia.org/T200509) (owner: 10Marostegui) [11:00:51] volans: ^^ i had a question about the cookbook, but I guess it is all fine [11:00:53] let's do [11:00:55] I'm around if help is needed for swat [11:01:36] (03CR) 10Ladsgroup: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/458770 (https://phabricator.wikimedia.org/T201009) (owner: 10Ladsgroup) [11:02:25] addshore: sure shoot [11:02:34] just observing and noticed that stop-maintenance failed? is that fine? is this a dry run? or? << volans [11:02:37] aaah, I see the first SAL entry is --live-test, so I guess this is a test << volans [11:03:07] yes it's a live test codfw->eqiad so basically a noop but doing stuff [11:03:14] aaaah, gotcha :) [11:03:21] it failed because in codfw there is already no crontab, but it shouldn't have [11:03:35] I'll patch the code to make sure it passes there too also in this live-test situation [11:03:41] (03Merged) 10jenkins-bot: Add $wgPasswordConfig['null'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/458770 (https://phabricator.wikimedia.org/T201009) (owner: 10Ladsgroup) [11:04:15] volans: cool :) [11:05:12] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 20 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [11:06:26] !log ladsgroup@deploy1001 Synchronized wmf-config/CommonSettings.php: [[gerrit:458770|Add ['null'] (T201009)]] (duration: 00m 50s) [11:06:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:32] T201009: Run deleteLocalPasswords.php in WMF prod (Central Auth wikis only!) 
after 1.32.0-wmf.16 is everywhere - https://phabricator.wikimedia.org/T201009 [11:06:35] addshore: thanks for checking ;) [11:07:06] np :) [11:08:18] !log uploaded a co-installable 4.14 kernel to apt.wikimedia.org (to be used for installer tests) [11:08:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:39] !log START - Cookbook sre.switchdc.mediawiki.07-set-readwrite (volans@sarin) [11:09:40] !log [DRY-RUN] MediaWiki read-only period ends at: 2018-09-10 11:09:40.587760 (volans@sarin) [11:09:40] !log END (PASS) - Cookbook sre.switchdc.mediawiki.07-set-readwrite (exit_code=0) (volans@sarin) [11:09:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:25] !log START - Cookbook sre.switchdc.mediawiki.08-restore-ttl (volans@sarin) [11:11:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:34] !log END (PASS) - Cookbook sre.switchdc.mediawiki.08-restore-ttl (exit_code=0) (volans@sarin) [11:11:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:40] (03CR) 10jenkins-bot: Add $wgPasswordConfig['null'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/458770 (https://phabricator.wikimedia.org/T201009) (owner: 10Ladsgroup) [11:13:24] (03PS4) 10Mathew.onipe: elasticsearch shard size check * Checks shard size and sends alert if more than 30gb. 
* Added puppet commands and plugin options [puppet] - 10https://gerrit.wikimedia.org/r/458891 (https://phabricator.wikimedia.org/T203546) [11:13:33] !log START - Cookbook sre.switchdc.mediawiki.08-start-maintenance (volans@sarin) [11:13:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:05] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch shard size check * Checks shard size and sends alert if more than 30gb. * Added puppet commands and plugin options [puppet] - 10https://gerrit.wikimedia.org/r/458891 (https://phabricator.wikimedia.org/T203546) (owner: 10Mathew.onipe) [11:14:40] !log END (PASS) - Cookbook sre.switchdc.mediawiki.08-start-maintenance (exit_code=0) (volans@sarin) [11:14:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:13] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 18 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [11:16:03] !log ladsgroup@deploy1001 Synchronized php-1.32.0-wmf.20/extensions/CentralAuth/maintenance/deleteLocalPasswords.php: [[gerrit:459504|SWAT: Fix typo (T201009)]] (duration: 00m 50s) [11:16:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:08] T201009: Run deleteLocalPasswords.php in WMF prod (Central Auth wikis only!) after 1.32.0-wmf.16 is everywhere - https://phabricator.wikimedia.org/T201009 [11:16:13] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [11:18:05] zeljkof: I'm done! 
[11:18:54] !log START - Cookbook sre.switchdc.mediawiki.08-update-tendril (volans@sarin) [11:18:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:09] !log END (PASS) - Cookbook sre.switchdc.mediawiki.08-update-tendril (exit_code=0) (volans@sarin) [11:19:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:32] 10Operations, 10Operations-Software-Development, 10User-Joe, 10User-jijiki: Covert deploy_apache_change.sh to a spicerack cookbook - https://phabricator.wikimedia.org/T203948 (10jijiki) p:05Triage>03Normal [11:21:16] Amir1: cool! [11:21:30] !log START - Cookbook sre.switchdc.mediawiki.08-restart-parsoid (volans@sarin) [11:21:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:33] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [11:27:58] !log END (FAIL) - Cookbook sre.switchdc.mediawiki.08-restart-parsoid (exit_code=99) (volans@sarin) [11:28:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:10] damn parsoid [11:28:41] <_joe_> volans: what's wrong there [11:29:02] wtp1043.eqiad.wmnet failed [11:29:11] <_joe_> ofc it's down for maintenance [11:29:14] <_joe_> right? [11:29:23] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 20 probes of 317 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [11:29:39] niah it's fine [11:29:52] well, I can ssh, that's what I mean [11:29:55] * akosiaris digging deeper [11:29:59] node=wtp1043.eqiad.wmnet, rc=1, command='restart-parsoid' [11:30:22] parsoid seems restarted fine though [11:30:45] some timeout ? 
[11:30:58] not on cumin side [11:31:05] if there is in the restart-parsoid command maybe [11:31:29] <_joe_> yes, if pybal fails to respond N times or something [11:32:02] (03PS1) 10Gehel: maps: migrate maps1004 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/459535 (https://phabricator.wikimedia.org/T198622) [11:32:04] (03PS1) 10Gehel: maps: change partitioning scheme for new SSDs in maps1004 [puppet] - 10https://gerrit.wikimedia.org/r/459536 (https://phabricator.wikimedia.org/T195285) [11:32:10] volans: it's inactive [11:32:18] that probably explains it [11:32:30] the /usr/local/bin/restart-parsoid script does a check though [11:32:37] but only for depool/repool [11:32:41] the restart is done anyway [11:32:47] ah it's the box with the broken disk ? [11:32:50] (03CR) 10Gehel: "With the DC switch, let's start by migrating eqiad instead of codfw." [puppet] - 10https://gerrit.wikimedia.org/r/459535 (https://phabricator.wikimedia.org/T198622) (owner: 10Gehel) [11:32:52] yes [11:32:59] the infamous one [11:33:08] still waiting for that disk ? 
[11:33:13] dunno [11:33:53] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 20 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [11:34:32] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 17 probes of 317 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [11:41:57] akosiaris: I guess we can deem the test completed as the restart parsoid is the last command of the cookbook [11:42:06] I'll work on the few improvements we agreed after lunch [11:43:09] (03CR) 10Mholloway: [C: 031] maps: migrate maps1004 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/459535 (https://phabricator.wikimedia.org/T198622) (owner: 10Gehel) [11:43:59] yeah I guess so [11:44:03] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 19 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [11:44:08] (03CR) 10Gehel: elasticsearch shard size check * Checks shard size and sends alert if more than 30gb. 
* Added puppet commands and plugin options (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/458891 (https://phabricator.wikimedia.org/T203546) (owner: 10Mathew.onipe) [11:44:09] !log completed execution of "cookbook sre.switchdc.mediawiki --live-test codfw eqiad" - T199073 [11:44:09] but keep in mind we will need to skip mw1043 tomorrow somehow [11:44:12] ack [11:44:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:15] T199073: Perform a datacenter switchover (2018-19 Q1) - https://phabricator.wikimedia.org/T199073 [11:44:25] ok I can update the cookbook to simply skip it [11:44:47] in my wishlist we'll have a confctl-backend in cumin ;) [11:45:57] <_joe_> volans: heh it should be easy to create [11:46:02] <_joe_> if only I had time [11:46:12] same here [11:49:03] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [11:56:23] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 21 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [11:56:33] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 20 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [11:57:14] !log installing chromium on proton* (tested on deployment-prep with the new release) [11:57:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:33] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [12:01:43] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 19 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [12:08:43] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 23 probes of 318 (alerts on 19) -
https://atlas.ripe.net/measurements/1790947/#!map [12:08:53] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 22 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [12:09:08] (03CR) 10Gehel: [C: 04-1] elasticsearch shard size check * Checks shard size and sends alert if more than 30gb. * Added puppet commands and plugin options (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/458891 (https://phabricator.wikimedia.org/T203546) (owner: 10Mathew.onipe) [12:09:19] !log installing lcms2 security updates [12:09:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:23] (03PS9) 10Banyek: Labs: Config template generation for pt-kill [puppet] - 10https://gerrit.wikimedia.org/r/458810 [12:10:56] (03CR) 10jerkins-bot: [V: 04-1] Labs: Config template generation for pt-kill [puppet] - 10https://gerrit.wikimedia.org/r/458810 (owner: 10Banyek) [12:14:02] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 19 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [12:14:04] (03PS10) 10Banyek: Labs: Config template generation for pt-kill [puppet] - 10https://gerrit.wikimedia.org/r/458810 [12:14:35] (03CR) 10jerkins-bot: [V: 04-1] Labs: Config template generation for pt-kill [puppet] - 10https://gerrit.wikimedia.org/r/458810 (owner: 10Banyek) [12:14:36] 10Operations, 10SRE-Access-Requests: Requesting access to researchers for kharlan - https://phabricator.wikimedia.org/T203847 (10ArielGlenn) Can we get manager sign-off on this please? Thanks! 
[12:15:58] (03PS1) 10Muehlenhoff: Add library hint for lcms2 [puppet] - 10https://gerrit.wikimedia.org/r/459541 [12:17:01] (03CR) 10Muehlenhoff: [C: 032] Add library hint for lcms2 [puppet] - 10https://gerrit.wikimedia.org/r/459541 (owner: 10Muehlenhoff) [12:18:53] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [12:23:02] !log restarting hhvm on mw1261-mw1265 to pick up lcms security update [12:23:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:13] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 22 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [12:26:23] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 20 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [12:31:22] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [12:31:32] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 19 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [12:41:43] (03CR) 10MSantos: [C: 031] "LGTM. I assume the new disks will be controlled via RAID10 (just making sure is not a typo for RAID1)" [puppet] - 10https://gerrit.wikimedia.org/r/459536 (https://phabricator.wikimedia.org/T195285) (owner: 10Gehel) [12:42:16] (03CR) 10Gehel: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/459536 (https://phabricator.wikimedia.org/T195285) (owner: 10Gehel) [12:43:05] (03CR) 10MSantos: "Duplicated?" 
[puppet] - 10https://gerrit.wikimedia.org/r/457409 (https://phabricator.wikimedia.org/T195285) (owner: 10Gehel) [12:43:43] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 22 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [12:43:53] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 21 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [12:44:36] (03CR) 10MSantos: [C: 031] maps: migrate maps1004 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/459535 (https://phabricator.wikimedia.org/T198622) (owner: 10Gehel) [12:45:04] (03PS5) 10Mathew.onipe: Elasticsearch shard size check [puppet] - 10https://gerrit.wikimedia.org/r/458891 (https://phabricator.wikimedia.org/T203546) [12:45:50] (03CR) 10jerkins-bot: [V: 04-1] Elasticsearch shard size check [puppet] - 10https://gerrit.wikimedia.org/r/458891 (https://phabricator.wikimedia.org/T203546) (owner: 10Mathew.onipe) [12:48:53] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [12:49:02] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 18 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [12:49:41] (03PS6) 10Mathew.onipe: Elasticsearch shard size check [puppet] - 10https://gerrit.wikimedia.org/r/458891 (https://phabricator.wikimedia.org/T203546) [12:50:10] (03CR) 10Mathew.onipe: Elasticsearch shard size check (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/458891 (https://phabricator.wikimedia.org/T203546) (owner: 10Mathew.onipe) [12:50:34] (03CR) 10jerkins-bot: [V: 04-1] Elasticsearch shard size check [puppet] - 10https://gerrit.wikimedia.org/r/458891 (https://phabricator.wikimedia.org/T203546) (owner: 10Mathew.onipe) [12:52:41] (03PS11) 10Banyek: Labs: Config template generation for pt-kill [puppet] - 
10https://gerrit.wikimedia.org/r/458810 [12:53:13] (03CR) 10jerkins-bot: [V: 04-1] Labs: Config template generation for pt-kill [puppet] - 10https://gerrit.wikimedia.org/r/458810 (owner: 10Banyek) [12:56:12] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 21 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [12:56:38] 10Operations: SRE quarterly goal: allow MediaWiki requests to be served by PHP7 alongside HHVM - https://phabricator.wikimedia.org/T203959 (10Joe) [12:58:09] 10Operations: SRE quarterly goal: allow MediaWiki requests to be served by PHP7 alongside HHVM - https://phabricator.wikimedia.org/T203959 (10Joe) We're probably not going to get to the stretch goals, but it should be noted that MediaWiki is still not ready to run on PHP 7.2 itself, so we don't really have an al... [13:00:21] (03PS12) 10Banyek: Labs: Config template generation for pt-kill [puppet] - 10https://gerrit.wikimedia.org/r/458810 [13:01:12] (03CR) 10Gehel: [C: 04-1] "Very minor alignment issue, otherwise LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/458891 (https://phabricator.wikimedia.org/T203546) (owner: 10Mathew.onipe) [13:01:14] (03CR) 10jerkins-bot: [V: 04-1] Labs: Config template generation for pt-kill [puppet] - 10https://gerrit.wikimedia.org/r/458810 (owner: 10Banyek) [13:02:22] (03PS2) 10Ema: ATS request routing: fix yarn and scolarships [puppet] - 10https://gerrit.wikimedia.org/r/459520 (https://phabricator.wikimedia.org/T199720) [13:02:59] (03PS13) 10Banyek: Labs: Config template generation for pt-kill [puppet] - 10https://gerrit.wikimedia.org/r/458810 [13:03:18] (03PS11) 10Marostegui: mariadb: Set pages for multi-instance hosts [puppet] - 10https://gerrit.wikimedia.org/r/449711 (https://phabricator.wikimedia.org/T200509) [13:03:29] (03CR) 10jerkins-bot: [V: 04-1] Labs: Config template generation for pt-kill [puppet] - 10https://gerrit.wikimedia.org/r/458810 (owner: 10Banyek) [13:03:55] 
(03CR) 10jerkins-bot: [V: 04-1] mariadb: Set pages for multi-instance hosts [puppet] - 10https://gerrit.wikimedia.org/r/449711 (https://phabricator.wikimedia.org/T200509) (owner: 10Marostegui) [13:05:08] (03PS12) 10Marostegui: mariadb: Set pages for multi-instance hosts [puppet] - 10https://gerrit.wikimedia.org/r/449711 (https://phabricator.wikimedia.org/T200509) [13:05:21] (03PS14) 10Banyek: Labs: Config template generation for pt-kill [puppet] - 10https://gerrit.wikimedia.org/r/458810 [13:05:48] (03CR) 10Ema: ATS request routing: fix yarn and scolarships (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/459520 (https://phabricator.wikimedia.org/T199720) (owner: 10Ema) [13:05:51] (03CR) 10jerkins-bot: [V: 04-1] Labs: Config template generation for pt-kill [puppet] - 10https://gerrit.wikimedia.org/r/458810 (owner: 10Banyek) [13:06:22] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [13:07:37] (03PS3) 10Ema: ATS request routing: fix yarn and scholarships [puppet] - 10https://gerrit.wikimedia.org/r/459520 (https://phabricator.wikimedia.org/T199720) [13:07:39] (03CR) 10Gehel: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/457409 (https://phabricator.wikimedia.org/T195285) (owner: 10Gehel) [13:08:18] (03PS15) 10Banyek: Labs: Config template generation for pt-kill [puppet] - 10https://gerrit.wikimedia.org/r/458810 [13:08:47] (03CR) 10jerkins-bot: [V: 04-1] Labs: Config template generation for pt-kill [puppet] - 10https://gerrit.wikimedia.org/r/458810 (owner: 10Banyek) [13:09:02] (03CR) 10Marostegui: "https://puppet-compiler.wmflabs.org/compiler1002/12394/" [puppet] - 10https://gerrit.wikimedia.org/r/449711 (https://phabricator.wikimedia.org/T200509) (owner: 10Marostegui) [13:10:04] (03PS16) 10Banyek: Labs: Config template generation for pt-kill [puppet] - 10https://gerrit.wikimedia.org/r/458810 [13:10:33] (03CR) 10jerkins-bot: [V: 04-1] 
Labs: Config template generation for pt-kill [puppet] - 10https://gerrit.wikimedia.org/r/458810 (owner: 10Banyek) [13:11:01] :/ [13:11:36] (03PS1) 10Muehlenhoff: Install backup2001 with 4.14 kernel [puppet] - 10https://gerrit.wikimedia.org/r/459546 [13:12:21] (03PS7) 10Mathew.onipe: Elasticsearch shard size check [puppet] - 10https://gerrit.wikimedia.org/r/458891 (https://phabricator.wikimedia.org/T203546) [13:13:33] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 21 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [13:13:52] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 20 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [13:18:35] (03CR) 10Gehel: Elasticsearch shard size check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/458891 (https://phabricator.wikimedia.org/T203546) (owner: 10Mathew.onipe) [13:18:42] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [13:18:53] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 18 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [13:20:33] 10Operations, 10cloud-services-team, 10Upstream: New anti-stackclash (4.9.25-1~bpo8+3 ) kernel super bad for NFS - https://phabricator.wikimedia.org/T169290 (10MoritzMuehlenhoff) 05Open>03Resolved This is resolved, the jessie-based labstore servers are running 4.9 since a few weeks. 
[13:22:46] 10Operations, 10ops-codfw: mw2182 crash - https://phabricator.wikimedia.org/T194835 (10MoritzMuehlenhoff) 05Open>03Resolved Server is running fine since a while, closing the task [13:31:02] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 22 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [13:31:22] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 20 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [13:36:23] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 19 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [13:38:32] (03CR) 10Gehel: "very, very minor naming issues (yes, I know, I have OCDs)." (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/458891 (https://phabricator.wikimedia.org/T203546) (owner: 10Mathew.onipe) [13:43:43] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 20 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [13:45:26] !log Drop user metrics and wikilytics from dbstore1002 [13:45:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:13] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [13:48:52] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 18 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [13:53:33] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 21 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [13:56:03] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 20 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [14:03:49] RECOVERY - IPv6 
ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [14:06:13] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 19 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [14:07:35] (03CR) 10DCausse: Elasticsearch shard size check (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/458891 (https://phabricator.wikimedia.org/T203546) (owner: 10Mathew.onipe) [14:10:19] (03PS1) 10Gehel: admins: add Mathew Onipe as member of elasticsearch-roots and wdqs-admins [puppet] - 10https://gerrit.wikimedia.org/r/459556 (https://phabricator.wikimedia.org/T202708) [14:10:42] (03CR) 10jerkins-bot: [V: 04-1] admins: add Mathew Onipe as member of elasticsearch-roots and wdqs-admins [puppet] - 10https://gerrit.wikimedia.org/r/459556 (https://phabricator.wikimedia.org/T202708) (owner: 10Gehel) [14:11:03] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 21 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [14:14:33] (03PS1) 10Muehlenhoff: Print group memberships which granted Hadoop access to check for HDFS cleanups [puppet] - 10https://gerrit.wikimedia.org/r/459558 (https://phabricator.wikimedia.org/T200312) [14:15:29] (03CR) 10Ottomata: "Nice!" 
[puppet] - 10https://gerrit.wikimedia.org/r/459558 (https://phabricator.wikimedia.org/T200312) (owner: 10Muehlenhoff) [14:18:44] (03PS8) 10Mathew.onipe: Elasticsearch shard size check [puppet] - 10https://gerrit.wikimedia.org/r/458891 (https://phabricator.wikimedia.org/T203546) [14:19:24] (03CR) 10jerkins-bot: [V: 04-1] Elasticsearch shard size check [puppet] - 10https://gerrit.wikimedia.org/r/458891 (https://phabricator.wikimedia.org/T203546) (owner: 10Mathew.onipe) [14:21:07] herron: so for https://phabricator.wikimedia.org/T203607 seems like the 'localhost' mx works properly based on my Friday test :] [14:21:27] I am going to switch it on the CI Jenkins, run a job that triggers an email and confirm it works [14:21:56] great, and fwiw we are using the same configuration in gerrit, phabricator, etc as well [14:22:13] sounds good [14:23:43] yeah that is great [14:24:32] so https://integration.wikimedia.org/ci/job/T203607-send-email/1/console should have sent us an email [14:24:36] using the current mx [14:26:02] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 20 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [14:26:03] !log Switching CI Jenkins mail server from mx1001 to localhost | T203607 [14:26:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:08] T203607: Ensure Jenkins mail configuration supports outbound smtp server failover - https://phabricator.wikimedia.org/T203607 [14:26:29] and https://integration.wikimedia.org/ci/job/T203607-send-email/2/console should have sent it via localhost mx [14:28:10] !log anomie@deploy1001 Synchronized php-1.32.0-wmf.20/includes/parser/ParserOutput.php: Backport for T203716 (duration: 00m 50s) [14:28:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:18] T203716: Duplicate mw-parser-output elements in action=parse API output - https://phabricator.wikimedia.org/T203716 [14:28:37] hashar:
received the email with Subject: Build failed in Jenkins: T203607-send-email #2 [14:28:44] (03CR) 10C. Scott Ananian: [C: 04-1] "Ie30c9174e6e3b60bce5a692296a9de1e30192e2c is merged now, but won't be finished riding the train until Sep 13." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443645 (https://phabricator.wikimedia.org/T175706) (owner: 10C. Scott Ananian) [14:29:05] herron: yeah that one went through localhost [14:29:17] Received: from localhost ([127.0.0.1]:50496 helo=contint1001.wikimedia.org) by contint1001.wikimedia.org with esmtp (Exim 4.84_2) [14:29:18] looks good! [14:29:27] apparently the localhost one relays using ipv6 :] [14:30:25] going to update the release jenkins on the other box [14:31:03] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 19 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [14:31:03] kk [14:31:44] (03CR) 10Gehel: [C: 04-1] "Still a few unaddressed comments" (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/458891 (https://phabricator.wikimedia.org/T203546) (owner: 10Mathew.onipe) [14:32:01] herron: they dont have an mx configured. So solved!!!!!!!!!!!! :] [14:32:14] ah! that’s easy [14:32:18] defaults to localhost right? [14:32:24] !log reboot analytics1003 for kernel + openjdk-8 upgrades [14:32:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:08] 10Operations, 10Continuous-Integration-Infrastructure, 10Mail, 10Jenkins, and 2 others: Ensure Jenkins mail configuration supports outbound smtp server failover - https://phabricator.wikimedia.org/T203607 (10hashar) 05Open>03Resolved Turns out releases1001 / releases2002 Jenkins do not have email confi... [14:33:12] herron: yeah it is now hardcoded to localhost [14:33:21] previously that was mx1001 which is very inconvenient [14:33:27] \o/ [14:33:30] thank you herron ! [14:33:32] wonderful! thanks hashar! 
[14:33:34] I have closed the task [14:33:46] perfect [14:34:12] (03CR) 10Alexandros Kosiaris: [C: 04-1] tor: add an additional relay instance (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/399972 (owner: 10Faidon Liambotis) [14:36:33] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [14:36:56] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/459520 (https://phabricator.wikimedia.org/T199720) (owner: 10Ema) [14:37:05] (03PS17) 10Banyek: Labs: Config template generation for pt-kill [puppet] - 10https://gerrit.wikimedia.org/r/458810 [14:37:43] (03CR) 10jerkins-bot: [V: 04-1] Labs: Config template generation for pt-kill [puppet] - 10https://gerrit.wikimedia.org/r/458810 (owner: 10Banyek) [14:42:12] PROBLEM - puppet last run on analytics-tool1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[init_superset] [14:42:41] ottomata: --^ [14:43:03] (03PS1) 10Herron: ripe-atlas: bump alert threshold from 19 to 25 failed probes [puppet] - 10https://gerrit.wikimedia.org/r/459559 [14:43:45] (03PS2) 10Hashar: zuul: allow email connection [puppet] - 10https://gerrit.wikimedia.org/r/376739 (https://phabricator.wikimedia.org/T93414) [14:43:49] 10Operations, 10Operations-Software-Development, 10User-Joe, 10User-jijiki: Convert makevm το spicerack cookbook - https://phabricator.wikimedia.org/T203963 (10akosiaris) [14:43:53] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 22 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [14:44:06] (03CR) 10Hashar: [C: 04-1] "That is a WIP" [puppet] - 10https://gerrit.wikimedia.org/r/376739 (https://phabricator.wikimedia.org/T93414) (owner: 10Hashar) [14:45:00] (03PS1) 10Mark Bergsma: Increase ATLAS probe unreachability threshold to 25 [puppet] - 10https://gerrit.wikimedia.org/r/459560 [14:45:46] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2053 - https://phabricator.wikimedia.org/T203623 (10Papaul) a:05Papaul>03Marostegui Disk replaced [14:46:03] (03CR) 10Jcrespo: [C: 031] mariadb: Set pages for multi-instance hosts [puppet] - 10https://gerrit.wikimedia.org/r/449711 (https://phabricator.wikimedia.org/T200509) (owner: 10Marostegui) [14:46:24] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2053 - https://phabricator.wikimedia.org/T203623 (10Marostegui) Thanks! I will close this once it has finished correctly [14:48:32] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 21 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [14:48:49] (03CR) 10C. Scott Ananian: [C: 04-1] "Hm. I neglected to notice the deploy freeze this week. 
Idf246d05d116f63a73105b50a1929a7721fbe7b9 won't be done riding the train until Se" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443645 (https://phabricator.wikimedia.org/T175706) (owner: 10C. Scott Ananian) [14:49:16] 10Operations, 10Operations-Software-Development, 10User-Joe, 10User-jijiki: Create a spicerack cookbook to empty a ganeti node from VMs - https://phabricator.wikimedia.org/T203964 (10akosiaris) [14:50:13] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: request to add phedenskog to perf-roots - https://phabricator.wikimedia.org/T202658 (10MoritzMuehlenhoff) @Peter Does the access work for you? [14:50:52] 10Operations, 10SRE-Access-Requests, 10wikidiff2, 10Patch-For-Review, 10User-Addshore: Give thiemowmde permission to upload wikidiff2 releases (releasers-wikidiff2) - https://phabricator.wikimedia.org/T202476 (10MoritzMuehlenhoff) @thiemowmde Does the access work for you? [14:51:32] (03CR) 10Ema: [C: 032] ATS request routing: fix yarn and scholarships [puppet] - 10https://gerrit.wikimedia.org/r/459520 (https://phabricator.wikimedia.org/T199720) (owner: 10Ema) [14:51:40] (03CR) 10Alexandros Kosiaris: [C: 031] Install backup2001 with 4.14 kernel [puppet] - 10https://gerrit.wikimedia.org/r/459546 (owner: 10Muehlenhoff) [14:53:42] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 19 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [14:57:03] RECOVERY - Device not healthy -SMART- on db2053 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2053&var-datasource=codfw%2520prometheus%252Fops [14:57:13] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet operation_type={create_container,start_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [14:57:16] (03CR) 10Marostegui: Labs: Config template generation for pt-kill (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/458810 (owner: 10Banyek) [14:57:34] (03CR) 10Gehel: [C: 031] "It looks like this code is used only on WMCS, not sure if there is an easy way to run puppet compiler on it. That being said, the structur" [puppet] - 10https://gerrit.wikimedia.org/r/458907 (owner: 10Bearloga) [14:57:42] (03CR) 10Jcrespo: Labs: Config template generation for pt-kill (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/458810 (owner: 10Banyek) [14:57:59] (03CR) 10Ayounsi: [C: 031] "Ah! I was about to open a task for the exact same thing!" [puppet] - 10https://gerrit.wikimedia.org/r/459560 (owner: 10Mark Bergsma) [14:58:22] RECOVERY - kubelet operational latencies on kubernetes1004 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [15:00:01] (03CR) 10Herron: "proposing this as a stopgap for related flapping alerts." [puppet] - 10https://gerrit.wikimedia.org/r/459559 (owner: 10Herron) [15:01:02] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 21 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [15:02:25] 10Operations, 10ops-codfw: wtp2020 correctable memory errors - https://phabricator.wikimedia.org/T194176 (10Papaul) @fgiunchedi this server is out of warranty. The IDRAC log also is not showing any memory errors and the firmware on the server is way out of date. The first options will be to upgrade the firmw... 
[15:05:22] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Root for Giovanni Tirloni - https://phabricator.wikimedia.org/T203494 (10bd808) Manager approval: +1 [15:06:07] (03PS3) 10Hashar: zuul: allow email connection [puppet] - 10https://gerrit.wikimedia.org/r/376739 (https://phabricator.wikimedia.org/T93414) [15:07:20] !log reboot analytics100[1,2] for kernel security upgrades [15:07:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:33] RECOVERY - puppet last run on analytics-tool1003 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [15:08:38] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Root for Giovanni Tirloni - https://phabricator.wikimedia.org/T203494 (10Andrew) [15:16:33] PROBLEM - Disk space on elastic1024 is CRITICAL: DISK CRITICAL - free space: /srv 52357 MB (10% inode=99%) [15:18:42] RECOVERY - Disk space on elastic1024 is OK: DISK OK [15:21:58] (03PS1) 10Joal: Update druid-public datasource for AQS [puppet] - 10https://gerrit.wikimedia.org/r/459565 [15:24:33] !log reducing elasticsearch low watermark to 75% on cirrus / eqiad cluster [15:24:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:44] 10Operations, 10ops-eqdfw: unrack/decom cr1-eqdfw - https://phabricator.wikimedia.org/T202700 (10Papaul) [15:28:26] 10Operations, 10Wikimedia-Mailing-lists: Open Foundation West Africa (OFWA) mailing list - https://phabricator.wikimedia.org/T203966 (10Flixtey) [15:28:30] 10Operations, 10ops-eqdfw: unrack/decom cr1-eqdfw - https://phabricator.wikimedia.org/T202700 (10Papaul) a:05Papaul>03ayounsi [15:34:33] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 20 probes of 317 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [15:36:22] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical 
threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [15:36:33] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 19 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [15:39:33] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 18 probes of 317 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [15:44:02] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 21 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [15:44:07] (03PS18) 10Banyek: Labs: Config template generation for pt-kill [puppet] - 10https://gerrit.wikimedia.org/r/458810 [15:44:48] (03CR) 10jerkins-bot: [V: 04-1] Labs: Config template generation for pt-kill [puppet] - 10https://gerrit.wikimedia.org/r/458810 (owner: 10Banyek) [15:45:13] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [15:46:53] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 20 probes of 317 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [15:47:24] (03Abandoned) 10Herron: ripe-atlas: bump alert threshold from 19 to 25 failed probes [puppet] - 10https://gerrit.wikimedia.org/r/459559 (owner: 10Herron) [15:48:06] (03CR) 10Jcrespo: Labs: Config template generation for pt-kill (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/458810 (owner: 10Banyek) [15:49:03] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 19 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [15:50:32] (03CR) 10Volans: "Few minor comments inline" (0311 comments) [puppet] - 10https://gerrit.wikimedia.org/r/459558 (https://phabricator.wikimedia.org/T200312) (owner: 
10Muehlenhoff) [15:51:50] 10Operations, 10SRE-Access-Requests: Requesting access to researchers for kharlan - https://phabricator.wikimedia.org/T203847 (10kaldari) Approved. [15:52:03] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 17 probes of 317 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [15:53:39] (03CR) 10EBernhardson: [C: 031] Convert elasticsearch to systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/440498 (https://phabricator.wikimedia.org/T198351) (owner: 10EBernhardson) [15:54:07] 10Operations, 10Operations-Software-Development, 10User-Joe, 10User-jijiki: Convert makevm to spicerack cookbook - https://phabricator.wikimedia.org/T203963 (10akosiaris) [15:55:27] 10Operations, 10Traffic, 10Patch-For-Review: Deploy initial ATS test clusters in core DCs - https://phabricator.wikimedia.org/T199720 (10ema) [15:55:47] (03CR) 10Muehlenhoff: Print group memberships which granted Hadoop access to check for HDFS cleanups (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/459558 (https://phabricator.wikimedia.org/T200312) (owner: 10Muehlenhoff) [15:56:22] 10Operations, 10Traffic, 10Patch-For-Review: Deploy initial ATS test clusters in core DCs - https://phabricator.wikimedia.org/T199720 (10ema) 05Open>03Resolved Request routing to all current applications added. Closing! 
[15:56:23] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 21 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [15:56:25] (03PS19) 10Banyek: Labs: Config template generation for pt-kill [puppet] - 10https://gerrit.wikimedia.org/r/458810 [15:57:07] (03CR) 10jerkins-bot: [V: 04-1] Labs: Config template generation for pt-kill [puppet] - 10https://gerrit.wikimedia.org/r/458810 (owner: 10Banyek) [15:58:25] (03PS20) 10Banyek: Labs: Config template generation for pt-kill [puppet] - 10https://gerrit.wikimedia.org/r/458810 [15:59:25] (03CR) 10DCausse: [C: 031] Convert elasticsearch to systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/440498 (https://phabricator.wikimedia.org/T198351) (owner: 10EBernhardson) [15:59:28] (03CR) 10jerkins-bot: [V: 04-1] Labs: Config template generation for pt-kill [puppet] - 10https://gerrit.wikimedia.org/r/458810 (owner: 10Banyek) [15:59:38] (03CR) 10Volans: Print group memberships which granted Hadoop access to check for HDFS cleanups (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/459558 (https://phabricator.wikimedia.org/T200312) (owner: 10Muehlenhoff) [16:01:41] (03CR) 10Elukey: [C: 032] Update druid-public datasource for AQS [puppet] - 10https://gerrit.wikimedia.org/r/459565 (owner: 10Joal) [16:02:12] 10Operations: Add which ldap groups can login on netbox login form - https://phabricator.wikimedia.org/T203840 (10MoritzMuehlenhoff) That would require code changes in Netbox and doesn't seem to warrant the overhead. Alex documented the access https://wikitech.wikimedia.org/wiki/LDAP/Groups#Specific_groups and I... 
[16:03:40] (03CR) 10Muehlenhoff: Print group memberships which granted Hadoop access to check for HDFS cleanups (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/459558 (https://phabricator.wikimedia.org/T200312) (owner: 10Muehlenhoff) [16:04:33] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 21 probes of 317 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [16:09:42] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 18 probes of 317 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [16:12:27] (03PS1) 10Arturo Borrero Gonzalez: cloudvps: eqiad1: add compat network link with main deployment network [puppet] - 10https://gerrit.wikimedia.org/r/459573 (https://phabricator.wikimedia.org/T202636) [16:13:33] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 20 probes of 317 (alerts on 19) - https://atlas.ripe.net/measurements/11645088/#!map [16:13:42] RECOVERY - HP RAID on db2053 is OK: OK: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Controller: OK - Battery/Capacitor: OK [16:15:14] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2053 - https://phabricator.wikimedia.org/T203623 (10Marostegui) 05Open>03Resolved All good! Thank you ``` root@db2053:~# hpssacli controller all show config Smart Array P420i in Slot 0 (Embedded) (sn: 0014380337E0BF0) Port Name: 1I Port Na... 
[16:17:02] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 20 probes of 317 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [16:18:43] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 19 probes of 317 (alerts on 19) - https://atlas.ripe.net/measurements/11645088/#!map [16:21:53] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 19 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [16:22:12] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 18 probes of 317 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [16:23:13] (03PS1) 10Ottomata: Blacklist WebClientError from EventLogging MySQL [puppet] - 10https://gerrit.wikimedia.org/r/459574 (https://phabricator.wikimedia.org/T203814) [16:24:25] (03CR) 10Ottomata: [C: 032] Blacklist WebClientError from EventLogging MySQL [puppet] - 10https://gerrit.wikimedia.org/r/459574 (https://phabricator.wikimedia.org/T203814) (owner: 10Ottomata) [16:28:15] (03PS2) 10Arturo Borrero Gonzalez: cloudvps: eqiad1: add compat network link with main deployment network [puppet] - 10https://gerrit.wikimedia.org/r/459573 (https://phabricator.wikimedia.org/T202636) [16:29:13] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 21 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [16:34:22] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 19 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [16:35:29] 10Operations, 10Puppet: Debian package or files managed my puppet for pt-kill-wmf - https://phabricator.wikimedia.org/T203674 (10Joe) >>! In T203674#4565565, @jcrespo wrote: >> This is not true if a binary debian package is built, as proposed. In fact, you can consider a binary-only package (built with dpkg-de... 
[16:37:24] !log rolling restart of aqs on aqs* to pick new druid backend settings [16:37:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:24] (03CR) 10Dzahn: "we have to add an end date because his contract has one" [puppet] - 10https://gerrit.wikimedia.org/r/458877 (https://phabricator.wikimedia.org/T202708) (owner: 10Dzahn) [16:46:43] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 21 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [16:50:38] (03PS1) 10Alex Monk: _trigger_dns_zone_update: Also pass in remote servers to sync to [software/certcentral] - 10https://gerrit.wikimedia.org/r/459581 [16:51:16] (03PS2) 10Elukey: Allow analytics-admins to use journalctl to inspect logs [puppet] - 10https://gerrit.wikimedia.org/r/459508 (https://phabricator.wikimedia.org/T172532) [16:51:32] (03PS2) 10Alex Monk: _trigger_dns_zone_update: Also pass in remote servers to sync to [software/certcentral] - 10https://gerrit.wikimedia.org/r/459581 [16:51:40] (03PS3) 10Andrew Bogott: Add key and root access for Giovanni Tirloni [puppet] - 10https://gerrit.wikimedia.org/r/457972 (https://phabricator.wikimedia.org/T203494) [16:52:02] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 21 probes of 317 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [16:52:08] (03CR) 10Elukey: [C: 032] Allow analytics-admins to use journalctl to inspect logs [puppet] - 10https://gerrit.wikimedia.org/r/459508 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [16:52:20] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Root for Giovanni Tirloni - https://phabricator.wikimedia.org/T203494 (10Andrew) [16:52:28] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Root for Giovanni Tirloni - https://phabricator.wikimedia.org/T203494 (10Andrew) Approved during today's SRE meeting [16:52:37] 
(03PS4) 10Andrew Bogott: Add key and root access for Giovanni Tirloni [puppet] - 10https://gerrit.wikimedia.org/r/457972 (https://phabricator.wikimedia.org/T203494) [16:53:09] 10Operations, 10SRE-Access-Requests, 10Discovery-Search (Current work), 10Patch-For-Review: Onboarding Mathew Onipe - https://phabricator.wikimedia.org/T202708 (10Gehel) Shell access and membership to elasticsearch-roots and wdqs-admins has been approved in weekly SRE meeting. [16:54:12] (03CR) 10jerkins-bot: [V: 04-1] _trigger_dns_zone_update: Also pass in remote servers to sync to [software/certcentral] - 10https://gerrit.wikimedia.org/r/459581 (owner: 10Alex Monk) [16:54:59] 10Operations, 10SRE-Access-Requests, 10Release-Engineering-Team (Watching / External): Add contint-roots to releases{1,2}001 - https://phabricator.wikimedia.org/T201470 (10Dzahn) a:05ArielGlenn>03None [16:56:43] PROBLEM - puppet last run on wtp2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:56:51] (03PS3) 10Alex Monk: _trigger_dns_zone_update: Also pass in remote servers to sync to [software/certcentral] - 10https://gerrit.wikimedia.org/r/459581 [16:57:02] PROBLEM - puppet last run on db2063 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:57:03] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 18 probes of 317 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [16:57:13] PROBLEM - puppet last run on pollux is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:57:23] PROBLEM - puppet last run on mw1307 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:57:27] <_joe_> elukey: roolback please [16:57:33] PROBLEM - puppet last run on analytics1071 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:57:33] PROBLEM - puppet last run on analytics1061 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:57:42] PROBLEM - puppet last run on ms-be1027 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:57:42] PROBLEM - puppet last run on labtestmetal2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:57:42] PROBLEM - puppet last run on authdns2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:57:43] PROBLEM - puppet last run on elastic2012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:57:43] PROBLEM - puppet last run on puppetdb2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:57:43] PROBLEM - puppet last run on kubestagetcd1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:57:52] PROBLEM - puppet last run on logstash1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:57:59] <_joe_> Evaluation Error: Error while evaluating a Function Call, gtirloni is not a valid ssh_keys array: [16:58:02] PROBLEM - puppet last run on mw1278 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:58:04] PROBLEM - puppet last run on labvirt1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:58:04] PROBLEM - puppet last run on cp1090 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:58:04] PROBLEM - puppet last run on mw2154 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:58:14] PROBLEM - puppet last run on labvirt1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:58:15] (03CR) 10jerkins-bot: [V: 04-1] _trigger_dns_zone_update: Also pass in remote servers to sync to [software/certcentral] - 10https://gerrit.wikimedia.org/r/459581 (owner: 10Alex Monk) [16:58:16] <_joe_> andrewbogott, elukey please fix that [16:58:23] that's me, I'm fixing it [16:58:25] sorry [16:58:31] <_joe_> how did CI not catch it? [16:58:32] PROBLEM - puppet last run on bohrium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:58:32] PROBLEM - puppet last run on sodium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:58:42] PROBLEM - puppet last run on debmonitor2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:58:43] PROBLEM - puppet last run on phab1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:58:43] PROBLEM - puppet last run on mw2161 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:58:44] <_joe_> andrewbogott: more CI's fail than yours honestly [16:58:53] PROBLEM - puppet last run on cobalt is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:58:53] PROBLEM - puppet last run on wdqs1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:58:53] PROBLEM - puppet last run on mw1319 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:59:02] PROBLEM - puppet last run on mw2173 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:59:03] PROBLEM - puppet last run on relforge1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:59:03] PROBLEM - puppet last run on elastic1033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:59:03] PROBLEM - puppet last run on cp2024 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:59:03] PROBLEM - puppet last run on dbmonitor2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:59:03] PROBLEM - puppet last run on ms-be2014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:59:04] PROBLEM - puppet last run on db2086 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:59:12] creating a user and adding it into groups must be 2 separate steps [16:59:14] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:59:14] PROBLEM - puppet last run on labstore1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:59:14] PROBLEM - puppet last run on mw1308 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:59:14] PROBLEM - puppet last run on dbproxy1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:59:15] PROBLEM - puppet last run on cp2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:59:21] <_joe_> admin runtests: commands[0] | nosetests modules/admin/data [16:59:27] <_joe_> CI has run [16:59:42] PROBLEM - puppet last run on mw2222 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:59:44] PROBLEM - puppet last run on cloudvirt1022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:59:44] PROBLEM - puppet last run on mw2248 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:59:45] PROBLEM - puppet last run on ganeti1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:59:45] PROBLEM - puppet last run on mw1305 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:59:47] <_joe_> mutante: that's not the issue here [16:59:52] PROBLEM - puppet last run on analytics1048 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:59:52] PROBLEM - puppet last run on deploy1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:59:52] PROBLEM - puppet last run on ms-be2028 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:59:52] PROBLEM - puppet last run on cp5011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:59:54] PROBLEM - puppet last run on labvirt1017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:59:54] PROBLEM - puppet last run on mc2035 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:59:54] <_joe_> the data format is invalid [16:59:56] (03PS1) 10Andrew Bogott: Fix formatting of gtirloni's key [puppet] - 10https://gerrit.wikimedia.org/r/459585 (https://phabricator.wikimedia.org/T203494) [17:00:03] PROBLEM - puppet last run on labpuppetmaster1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:00:04] gehel: My dear minions, it's time we take the moon! 
Just kidding. Time for Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180910T1700). [17:00:04] PROBLEM - puppet last run on cloudcontrol1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:00:04] PROBLEM - puppet last run on oresrdb1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:00:10] 10Operations, 10Phabricator, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), and 3 others: Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832 (10mmodell) a:05mmodell>03None Unassigning but I'm sti... [17:00:12] PROBLEM - puppet last run on ms-be2040 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:00:13] PROBLEM - puppet last run on mw1239 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:00:13] PROBLEM - puppet last run on db1079 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:00:13] PROBLEM - puppet last run on rdb1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:00:13] PROBLEM - puppet last run on mw1252 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:00:13] PROBLEM - puppet last run on elastic1047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:00:23] PROBLEM - puppet last run on mw1283 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:00:23] PROBLEM - puppet last run on scb2004 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:00:23] PROBLEM - puppet last run on ganeti1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:00:24] PROBLEM - puppet last run on mw1311 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:00:24] PROBLEM - puppet last run on lvs1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:00:34] PROBLEM - puppet last run on mc2031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:00:34] (03CR) 10Giuseppe Lavagetto: [C: 031] Fix formatting of gtirloni's key [puppet] - 10https://gerrit.wikimedia.org/r/459585 (https://phabricator.wikimedia.org/T203494) (owner: 10Andrew Bogott) [17:00:38] jouncebot: o/ [17:00:42] PROBLEM - puppet last run on analytics1028 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:00:43] PROBLEM - puppet last run on labvirt1013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:00:43] PROBLEM - puppet last run on seaborgium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:00:43] PROBLEM - puppet last run on labmon1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:00:44] PROBLEM - puppet last run on db2075 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:00:44] PROBLEM - puppet last run on mw2281 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:00:44] PROBLEM - puppet last run on db2090 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:00:45] PROBLEM - puppet last run on mw2182 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:00:52] PROBLEM - puppet last run on mw1314 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:00:52] PROBLEM - puppet last run on mw1248 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:00:52] PROBLEM - puppet last run on mw1247 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:00:52] PROBLEM - puppet last run on db1092 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:00:52] PROBLEM - puppet last run on mw1289 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:00:53] PROBLEM - puppet last run on sarin is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:00:53] PROBLEM - puppet last run on analytics1029 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:00:54] (03CR) 10Andrew Bogott: [C: 032] Fix formatting of gtirloni's key [puppet] - 10https://gerrit.wikimedia.org/r/459585 (https://phabricator.wikimedia.org/T203494) (owner: 10Andrew Bogott) [17:00:54] PROBLEM - puppet last run on db1117 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:00:54] PROBLEM - puppet last run on mw1326 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:00:55] PROBLEM - puppet last run on analytics1073 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:00:55] PROBLEM - puppet last run on aqs1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:00:56] PROBLEM - puppet last run on cp2023 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:01:03] PROBLEM - puppet last run on elastic1050 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:01:03] PROBLEM - puppet last run on cp5008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:01:03] PROBLEM - puppet last run on mwdebug2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:01:03] PROBLEM - puppet last run on mw2280 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:01:04] PROBLEM - puppet last run on mw2288 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:01:04] PROBLEM - puppet last run on cp2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:01:04] PROBLEM - puppet last run on ms-be1030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:01:04] PROBLEM - puppet last run on kafka1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:01:04] PROBLEM - puppet last run on mw1238 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:01:12] PROBLEM - puppet last run on mw1300 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:01:12] PROBLEM - puppet last run on wdqs1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:01:13] PROBLEM - puppet last run on cp2010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:01:13] PROBLEM - puppet last run on bast4001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:01:13] PROBLEM - puppet last run on maps2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:01:13] PROBLEM - puppet last run on wtp2012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:01:13] (03CR) 10Mathew.onipe: "> Patch Set 7:" (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/458891 (https://phabricator.wikimedia.org/T203546) (owner: 10Mathew.onipe) [17:01:22] PROBLEM - puppet last run on stat1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:01:32] PROBLEM - puppet last run on labtestvirt2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:01:33] PROBLEM - puppet last run on db2042 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:01:33] PROBLEM - puppet last run on db2050 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:01:33] PROBLEM - puppet last run on db2085 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:01:33] PROBLEM - puppet last run on cp1080 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:01:42] PROBLEM - puppet last run on wtp2005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:01:42] PROBLEM - puppet last run on db2045 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:01:42] PROBLEM - puppet last run on rdb1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:01:42] PROBLEM - puppet last run on etcd1003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:01:42] PROBLEM - puppet last run on mw2183 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:01:43] PROBLEM - puppet last run on mw2135 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:01:43] PROBLEM - puppet last run on mw2212 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:01:52] PROBLEM - puppet last run on mw2168 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:01:52] PROBLEM - puppet last run on es2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:01:52] PROBLEM - puppet last run on cp3044 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:01:53] PROBLEM - puppet last run on restbase1013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:01:53] PROBLEM - puppet last run on labsdb1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:02:03] PROBLEM - puppet last run on mw1303 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:02:03] PROBLEM - puppet last run on mw1323 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:02:03] PROBLEM - puppet last run on db1072 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:02:03] PROBLEM - puppet last run on ores1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:02:03] PROBLEM - puppet last run on boron is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:02:03] PROBLEM - puppet last run on elastic1045 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:02:03] PROBLEM - puppet last run on db1073 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:02:04] PROBLEM - puppet last run on rhodium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:02:12] PROBLEM - puppet last run on analytics1065 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:02:12] PROBLEM - puppet last run on dbproxy1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:02:13] PROBLEM - puppet last run on snapshot1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:02:15] PROBLEM - puppet last run on labmon1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:02:22] PROBLEM - puppet last run on restbase2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:02:23] PROBLEM - puppet last run on labtestneutron2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:02:23] PROBLEM - puppet last run on labstore2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:02:23] PROBLEM - puppet last run on mw2287 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:02:23] PROBLEM - puppet last run on mw2258 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:02:23] PROBLEM - puppet last run on ores2007 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:02:24] PROBLEM - puppet last run on db2039 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:02:32] PROBLEM - puppet last run on mw2234 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:02:32] PROBLEM - puppet last run on mw2165 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:02:32] PROBLEM - puppet last run on mw2172 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:02:32] PROBLEM - puppet last run on mw2157 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:02:33] PROBLEM - puppet last run on lvs3003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:02:33] andrewbogott: when the fix is deployed, consider running https://wikitech.wikimedia.org/wiki/Cumin#Run_Puppet_only_if_last_run_failed [17:02:41] (03PS4) 10Alex Monk: _trigger_dns_zone_update: Also pass in remote servers to sync to [software/certcentral] - 10https://gerrit.wikimedia.org/r/459581 [17:02:43] PROBLEM - puppet last run on kubestagetcd1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:02:43] PROBLEM - puppet last run on db2074 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:02:43] PROBLEM - puppet last run on mw2255 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:02:43] PROBLEM - puppet last run on mw2221 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:02:43] PROBLEM - puppet last run on deploy2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:02:43] PROBLEM - puppet last run on mw2152 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:02:43] PROBLEM - puppet last run on mw2195 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:02:44] PROBLEM - puppet last run on cp3049 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:02:44] PROBLEM - puppet last run on ms-be1028 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:02:52] PROBLEM - puppet last run on db1094 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:02:52] PROBLEM - puppet last run on db1099 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:02:52] PROBLEM - puppet last run on db1121 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:02:52] PROBLEM - puppet last run on analytics1064 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:02:53] PROBLEM - puppet last run on elastic1043 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:02:53] PROBLEM - puppet last run on lvs2010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:02:53] PROBLEM - puppet last run on mc1035 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:02:54] PROBLEM - puppet last run on es1017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:03:02] PROBLEM - puppet last run on elastic2011 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:03:02] PROBLEM - puppet last run on conf2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:03:03] PROBLEM - puppet last run on wtp2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:03:03] PROBLEM - puppet last run on elastic2030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:03:03] PROBLEM - puppet last run on ms-be2015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:03:03] PROBLEM - puppet last run on db2044 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:03:03] PROBLEM - puppet last run on elastic2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:03:04] PROBLEM - puppet last run on cloudservices1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:03:25] so long icinga-wm [17:03:27] !log stop ircecho to avoid excessive spamming [17:03:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:00] _joe_ sorry I stepped afk for 2 mins and came back only now [17:04:21] <_joe_> elukey: not your fault, and we already have a patch [17:04:34] <_joe_> go for your well-earned evening [17:04:39] volans: doing [17:04:51] (03CR) 10jerkins-bot: [V: 04-1] _trigger_dns_zone_update: Also pass in remote servers to sync to [software/certcentral] - 10https://gerrit.wikimedia.org/r/459581 (owner: 10Alex Monk) [17:05:31] 10Operations, 10Puppet: Debian package or files managed my puppet for pt-kill-wmf - https://phabricator.wikimedia.org/T203674 (10jcrespo) > There is really no disagreement; I would prefer a proper debian package to be built, but my order of preference is: > > 1 - Proper debian package (following the rules for... 
[17:07:28] andrewbogott: need to step away a sec from keyboard for a quick errand, ircecho and puppet are stopped/disabled on einstenium now [17:07:54] elukey: ok, I'll restart after my cumin run finishes [17:07:58] super thanks :) [17:10:03] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Root for Giovanni Tirloni - https://phabricator.wikimedia.org/T203494 (10Andrew) 05Open>03Resolved [17:14:22] (03PS2) 10Smalyshev: Enable dailies everywhere [puppet] - 10https://gerrit.wikimedia.org/r/456170 (https://phabricator.wikimedia.org/T201217) [17:15:54] (03CR) 10EBernhardson: [C: 031] search.wikimedia.org should properly handle multivalue separation char (0x1F) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446801 (owner: 10DCausse) [17:26:19] (03PS9) 10Mathew.onipe: Elasticsearch shard size check [puppet] - 10https://gerrit.wikimedia.org/r/458891 (https://phabricator.wikimedia.org/T203546) [17:26:58] (03CR) 10jerkins-bot: [V: 04-1] Elasticsearch shard size check [puppet] - 10https://gerrit.wikimedia.org/r/458891 (https://phabricator.wikimedia.org/T203546) (owner: 10Mathew.onipe) [17:35:53] RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [17:36:58] (03PS10) 10Mathew.onipe: Elasticsearch shard size check [puppet] - 10https://gerrit.wikimedia.org/r/458891 (https://phabricator.wikimedia.org/T203546) [17:42:47] !log trunk cloud-instances1-b-eqiad to cloudnet1003/4:eth1 - T202636 [17:42:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:54] T202636: Allow routing between eqiad and eqiad1 regions - https://phabricator.wikimedia.org/T202636 [17:45:23] 10Operations, 10CommRel-Specialists-Support (Jul-Sep-2018), 10User-Johan: Community Relations support for the 2018 data center switchover - https://phabricator.wikimedia.org/T199676 (10Elitre) Is there a public list of the other planned types of announcements? 
(prompted by https://meta.wikimedia.org/wiki/Tal... [17:47:02] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [17:48:54] (03CR) 10Volans: "Nice! I know you've already gone forth and back a bit. I'm sorry but I didn't had the time to read the whole backlog of comments, so excus" (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/458891 (https://phabricator.wikimedia.org/T203546) (owner: 10Mathew.onipe) [17:57:12] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [18:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: (Dis)respected human, time to deploy Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180910T1800). Please do the needful. [18:00:04] Ebe123: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:22] * Ebe123 is ready [18:00:45] 10Operations, 10ops-eqdfw: unrack/decom cr1-eqdfw - https://phabricator.wikimedia.org/T202700 (10ayounsi) a:05ayounsi>03Papaul [18:06:53] 10Operations, 10Performance-Team: Revisit Grafana/Icinga notification strategy - https://phabricator.wikimedia.org/T203485 (10Krinkle) [18:12:57] 10Operations, 10Discovery, 10Discovery-Search, 10Elasticsearch: Use SSL certificates with discovery entry for elasticsearch - https://phabricator.wikimedia.org/T162037 (10EBernhardson) FWIW, the transfer_to_es script (and everything else in analytics) no longer talks to elasticsearch directly and doesn't b... 
[18:15:12] (03PS1) 10Volans: mediawiki: ignore exit codes on stop_cronjobs [software/spicerack] - 10https://gerrit.wikimedia.org/r/459595 (https://phabricator.wikimedia.org/T199079) [18:17:36] Is the SWAT canceled? [18:19:52] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [18:20:02] Ebe123: looks like your patch hasn't been merged to master yet. Generally for SWAT we backport critical fixes to master to the current deployment branch. Once your code has been reviewed by someone who knows that area of the code and merges to the master branch, I can deploy to production during SWAT. [18:20:25] sorry for the delay with that response :( [18:22:19] It's a somewhat lesser-known extension :) [18:23:13] 10Operations, 10CommRel-Specialists-Support (Jul-Sep-2018), 10User-Johan: Community Relations support for the 2018 data center switchover - https://phabricator.wikimedia.org/T199676 (10Johan) Probably not. But this is public, so let's list it here. In short, dealt with: * Included in Tech News (twice) * Pos... [18:24:09] 10Operations, 10CommRel-Specialists-Support (Jul-Sep-2018), 10User-Johan: Community Relations support for the 2018 data center switchover - https://phabricator.wikimedia.org/T199676 (10Johan) And those involved in the work around the European Parliament copyright vote, to make sure we don't cause problems fo... [18:24:40] Didn't realize there was this requirement [18:24:53] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [18:32:13] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 21 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [18:33:31] jynus: hi! how's it going? 
Any perspective on how https://phabricator.wikimedia.org/T203925 might be affecting the whole cluster? I see a bunch of DB performance warnings, for example here: https://logstash.wikimedia.org/goto/906488374c9542b5cdfe63c5800552ca , with stuff like: Query returned 22224 row(s): [18:33:33] query: SELECT * FROM `translate_metadata` [18:33:42] thx in advance for your input! [18:35:51] Even if it wasnt negatively effecting other things...that sounds rather wrong to scan the whole table... [18:36:58] bawolff: heheh indeed [18:37:35] maybe it was from an old version of SQL before the added the WHERE clause [18:37:39] they [18:38:28] Lol [18:38:53] 8p [18:42:23] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [18:42:33] (03PS1) 10Volans: logging: minor improvements and a fix [software/spicerack] - 10https://gerrit.wikimedia.org/r/459606 (https://phabricator.wikimedia.org/T199079) [18:44:55] 10Operations, 10Performance-Team, 10Wikidata, 10Wikidata-Query-Service, 10User-Addshore: Wikidata produces a lot of failed requests for recentchanges API - https://phabricator.wikimedia.org/T202764 (10Krinkle) a:03Krinkle I'll investigate a bit to try to determine whether MediaWiki recentchanges may ha... 
[18:45:39] (03PS1) 10Volans: sre.switchdc.mediawiki: parsoid skip broken host [cookbooks] - 10https://gerrit.wikimedia.org/r/459607 (https://phabricator.wikimedia.org/T199079) [18:49:17] (03PS5) 10Alex Monk: _trigger_dns_zone_update: Also pass in remote servers to sync to [software/certcentral] - 10https://gerrit.wikimedia.org/r/459581 [18:51:00] (03CR) 10jerkins-bot: [V: 04-1] _trigger_dns_zone_update: Also pass in remote servers to sync to [software/certcentral] - 10https://gerrit.wikimedia.org/r/459581 (owner: 10Alex Monk) [18:54:28] 10Operations, 10ops-eqiad, 10Analytics: rack/setup/install stat1007.eqiad.wmnet (stat1005 user replacement) - https://phabricator.wikimedia.org/T203852 (10RobH) [19:02:03] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 21 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [19:03:53] !log stopped requeueTranscodes.php job on mwmaint1001.eqiad pending dc switch [19:03:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:13] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [19:10:43] PROBLEM - Disk space on elastic1030 is CRITICAL: DISK CRITICAL - free space: /srv 50389 MB (10% inode=99%) [19:14:32] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 23 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [19:20:52] PROBLEM - Disk space on elastic1030 is CRITICAL: DISK CRITICAL - free space: /srv 51949 MB (10% inode=99%) [19:21:53] RECOVERY - Disk space on elastic1030 is OK: DISK OK [19:25:18] (03PS2) 10Dzahn: wikistats (vps): remove scope.lookupvar from erb template [puppet] - 10https://gerrit.wikimedia.org/r/458338 [19:28:13] (03CR) 10Dzahn: [C: 032] wikistats (vps): remove scope.lookupvar from erb template [puppet] - 10https://gerrit.wikimedia.org/r/458338 (owner: 
10Dzahn) [19:29:46] (03CR) 10BryanDavis: quarry: Move the install into a venv and upgrade to Python 3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/451698 (https://phabricator.wikimedia.org/T192698) (owner: 10Zhuyifei1999) [19:29:52] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [19:33:44] (03CR) 10Dzahn: "don't take my comments as a blocker" [puppet] - 10https://gerrit.wikimedia.org/r/451698 (https://phabricator.wikimedia.org/T192698) (owner: 10Zhuyifei1999) [19:35:27] (03CR) 10Zhuyifei1999: quarry: Move the install into a venv and upgrade to Python 3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/451698 (https://phabricator.wikimedia.org/T192698) (owner: 10Zhuyifei1999) [19:37:03] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [19:38:03] where's that merge [19:38:23] PROBLEM - Disk space on elastic1030 is CRITICAL: DISK CRITICAL - free space: /srv 51833 MB (10% inode=99%) [19:38:35] 20 < 25, that's all 'm sayin [19:39:57] (03CR) 10Zhuyifei1999: "mw-vagrant's ::virtualenv is indeed very nice :)" [puppet] - 10https://gerrit.wikimedia.org/r/451698 (https://phabricator.wikimedia.org/T192698) (owner: 10Zhuyifei1999) [19:41:12] PROBLEM - High lag on wdqs1003 is CRITICAL: 3629 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [19:42:13] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [19:42:53] PROBLEM - Disk space on elastic1030 is CRITICAL: DISK CRITICAL - free space: /srv 51755 MB (10% inode=99%) [19:44:09] (03PS3) 10Dzahn: authdns::server: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/454448 [19:45:01] (03CR) 10Dzahn: [C: 032] 
"second to last change regarding this. all others already done with identical change. refactoring only." [puppet] - 10https://gerrit.wikimedia.org/r/454448 (owner: 10Dzahn) [19:49:33] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [19:50:06] (03CR) 10Dzahn: [C: 032] "noop on authdns1001/2001, eeden" [puppet] - 10https://gerrit.wikimedia.org/r/454448 (owner: 10Dzahn) [19:54:42] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [19:57:23] (03PS3) 10Anomie: Set MCR migration to write-both/read-new on testwiki. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454534 (https://phabricator.wikimedia.org/T198309) (owner: 10Daniel Kinzler) [20:00:35] (03PS3) 10Gehel: Enable dailies everywhere [puppet] - 10https://gerrit.wikimedia.org/r/456170 (https://phabricator.wikimedia.org/T201217) (owner: 10Smalyshev) [20:01:18] (03CR) 10Gehel: [C: 032] Enable dailies everywhere [puppet] - 10https://gerrit.wikimedia.org/r/456170 (https://phabricator.wikimedia.org/T201217) (owner: 10Smalyshev) [20:06:02] RECOVERY - Disk space on elastic1030 is OK: DISK OK [20:06:18] (03PS5) 10Herron: mtail: add exim tls ciphersuite metrics [puppet] - 10https://gerrit.wikimedia.org/r/458289 (https://phabricator.wikimedia.org/T203260) [20:06:53] (03CR) 10jerkins-bot: [V: 04-1] mtail: add exim tls ciphersuite metrics [puppet] - 10https://gerrit.wikimedia.org/r/458289 (https://phabricator.wikimedia.org/T203260) (owner: 10Herron) [20:07:03] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [20:07:22] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review: Deploy a scalable service for ACME (LetsEncrypt) certificate management - 
https://phabricator.wikimedia.org/T199711 (10Dzahn) [20:11:36] I downtimed that alert for 2 days until https://gerrit.wikimedia.org/r/#/c/459560/ is merged [20:12:12] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [20:12:41] (03PS6) 10Herron: mtail: add exim tls ciphersuite metrics [puppet] - 10https://gerrit.wikimedia.org/r/458289 (https://phabricator.wikimedia.org/T203260) [20:15:24] 10Operations: reinstall rdb100[56] with RAID - https://phabricator.wikimedia.org/T140442 (10Dzahn) @Joe can rdb1005, rdb1006 be reimaged without too many problems? [20:17:41] (03CR) 10Herron: mtail: add exim tls ciphersuite metrics (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/458289 (https://phabricator.wikimedia.org/T203260) (owner: 10Herron) [20:20:09] mutante: ^ [20:20:30] all the ripe-atlas-${site} [20:20:46] XioNoX: ok, yep! [20:21:47] XioNoX: some ipv6 probes were alerting earlier today (like ~ 8 hours ago or so) [20:21:53] but I guess you have the history [20:22:22] hashar: yeah, that's what I downtimed [20:22:32] we discussed it this morning, but thanks! [20:22:49] ;]] [20:22:50] will see to get it linked to https://wikitech.wikimedia.org/wiki/Network_monitoring#RIPE_alerts [20:23:00] have a good day! 
sleep & [20:26:57] (03PS1) 10Dzahn: icinga/ripeatlas: add wikitech link in Icinga alert [puppet] - 10https://gerrit.wikimedia.org/r/459626 (https://phabricator.wikimedia.org/T197873) [20:29:13] !log running warmup script against codfw appservers [20:29:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:14] (03PS1) 10Dzahn: netops::check: add 3 playbook links to Icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/459630 (https://phabricator.wikimedia.org/T197873) [20:41:35] (03PS21) 10Banyek: Labs: Config template generation for pt-kill [puppet] - 10https://gerrit.wikimedia.org/r/458810 [20:45:40] 10Operations, 10netops: Intermittent connectivity issues in eqiad's row C - https://phabricator.wikimedia.org/T201139 (10ayounsi) I looked at it some time ago, the spike of DDOS_PROTOCOL_VIOLATION matches spikes of broadcast/multicast traffic we observed on asw2-a {F25757789} Spike of syslog messages from pro... [20:57:25] (03CR) 10Dzahn: [C: 04-1] "not yet, can be better than this and use action_url or notes_url to not have the link in the service description field" [puppet] - 10https://gerrit.wikimedia.org/r/459626 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [20:57:37] (03CR) 10Dzahn: [C: 04-1] "not yet, can be better than this and use action_url or notes_url to not have the link in the service description field" [puppet] - 10https://gerrit.wikimedia.org/r/459630 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [21:00:04] bawolff and Reedy: Your horoscope predicts another unfortunate Weekly Security deployment window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180910T2100). [21:00:28] Right. Almost forgot i have to deploy things! [21:11:20] Hmm. If I want to deploy something to wikitech, do I have to do something special? Its not a normal wiki right? 
[21:12:02] https://wikitech.wikimedia.org/wiki/Wikitech seems to indicate I just deploy normally [21:12:42] bawolff_: a lot of work has gone into normalizing it as much as possible even tho not SUL. you could ask andrewbogott but I think it's deploy as normal [21:13:16] bawolff_: it should get any changes that are deployed to the regular wiki cluster [21:13:30] there are a few config hacks in place but they're all visible in the config repo [21:14:01] And I'm guessing the standard test on mwdebug1002 step won't work [21:16:53] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [21:20:10] (03PS2) 10Dzahn: icinga/ripeatlas: add playbook link as notes_url in Icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/459626 (https://phabricator.wikimedia.org/T197873) [21:20:43] (03CR) 10jerkins-bot: [V: 04-1] icinga/ripeatlas: add playbook link as notes_url in Icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/459626 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [21:21:22] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [21:24:57] !log bawolff@deploy1001 Synchronized php-1.32.0-wmf.20/extensions/OpenStackManager/special/SpecialNovaSudoer.php: T203885 (duration: 00m 50s) [21:25:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:26] bawolff_: The canary won't detect any changes that are specific to the wikitech code [21:25:46] Ok, well I'll double check carefully I didn't break anything [21:26:08] (its a simple patch, I'm not worried, I've just never deployed anything wikitech specific before) [21:32:23] * bawolff_ all done with his deploy [21:34:02] (03CR) 10Smalyshev: "> It 
might make sense to have a separate config for all the 'misc' dumps that are run out of cron" [puppet] - 10https://gerrit.wikimedia.org/r/456439 (owner: 10Smalyshev) [21:37:28] 10Operations, 10ops-eqiad: rename/reimage labnodepool1002.eqiad.wmnet as cloudservices1003.wikimedia.org - https://phabricator.wikimedia.org/T201439 (10Andrew) This is all done except for the physical label changes in eqiad. [21:38:38] (03PS3) 10Dzahn: icinga/ripeatlas: add playbook link as notes_url in Icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/459626 (https://phabricator.wikimedia.org/T197873) [21:43:04] (03PS1) 10Dzahn: icinga: add notes_url parameter to NRPE monitor service [puppet] - 10https://gerrit.wikimedia.org/r/459641 (https://phabricator.wikimedia.org/T197873) [21:46:10] (03PS1) 10Dzahn: icinga/planet: add notes_url param to planet https check [puppet] - 10https://gerrit.wikimedia.org/r/459643 (https://phabricator.wikimedia.org/T197873) [21:49:18] (03CR) 10Dzahn: [C: 032] icinga/planet: add notes_url param to planet https check [puppet] - 10https://gerrit.wikimedia.org/r/459643 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [21:54:16] 10Puppet, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): wmcs-roots group not granted access to new eqiad1 region bare metal servers - https://phabricator.wikimedia.org/T203488 (10Andrew) 05Open>03Resolved >>! In T203488#4557091, @bd808 wrote: > Do we need to add documentation somewhe... 
[21:56:11] (03PS1) 10Dzahn: monitoring:: add action_url next to notes_url parameter [puppet] - 10https://gerrit.wikimedia.org/r/459645 [21:56:16] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [21:56:58] (03CR) 10jerkins-bot: [V: 04-1] monitoring:: add action_url next to notes_url parameter [puppet] - 10https://gerrit.wikimedia.org/r/459645 (owner: 10Dzahn) [21:58:17] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [22:20:55] (03PS6) 10Smalyshev: Create wikidata ntriples dump from ttl dump [puppet] - 10https://gerrit.wikimedia.org/r/447922 (https://phabricator.wikimedia.org/T144103) [22:27:44] (03PS1) 10Andrew Bogott: labvirt1019/1020: rename to cloudvirt and move to eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/459650 (https://phabricator.wikimedia.org/T204003) [22:40:37] (03PS1) 10Andrew Bogott: Rename labvirt1019/1020 to cloudvirt1019/1020 [dns] - 10https://gerrit.wikimedia.org/r/459651 (https://phabricator.wikimedia.org/T204004) [22:40:39] (03PS1) 10Andrew Bogott: Clean up old labvirt1019/1020 entries [dns] - 10https://gerrit.wikimedia.org/r/459652 (https://phabricator.wikimedia.org/T204004) [22:48:11] (03PS2) 10Andrew Bogott: labvirt1019/1020: rename to cloudvirt and move to eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/459650 (https://phabricator.wikimedia.org/T204003) [22:49:45] (03PS3) 10Andrew Bogott: labvirt1019/1020: rename to cloudvirt and move to eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/459650 (https://phabricator.wikimedia.org/T204003) [22:50:38] (03CR) 10Andrew Bogott: [C: 032] labvirt1019/1020: rename to cloudvirt and move to eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/459650 
(https://phabricator.wikimedia.org/T204003) (owner: 10Andrew Bogott) [22:52:06] (03CR) 10Andrew Bogott: [C: 032] Rename labvirt1019/1020 to cloudvirt1019/1020 [dns] - 10https://gerrit.wikimedia.org/r/459651 (https://phabricator.wikimedia.org/T204004) (owner: 10Andrew Bogott) [22:56:19] !log rebooting/reimaging labvirt1019 and 1020 for T204003 [22:56:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:56:25] T204003: Move labvirt1019 and 1020 to eqiad1 - https://phabricator.wikimedia.org/T204003 [23:00:05] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Your horoscope predicts another unfortunate Evening SWAT (Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180910T2300). [23:00:05] No GERRIT patches in the queue for this window AFAICS. [23:04:38] RECOVERY - High lag on wdqs1003 is OK: (C)3600 ge (W)1200 ge 1133 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [23:34:06] 10Operations, 10monitoring, 10Patch-For-Review: come up with a suggestion how to structure wiki pages for Icinga reaction play books - https://phabricator.wikimedia.org/T197873 (10Dzahn) This is how it looks in the web UI when a service has the "notes_url" parameter set. An icion with "addiitonal notes" get... 
[23:37:02] (03CR) 10Dzahn: [C: 032] "https://phabricator.wikimedia.org/T197873#4572835" [puppet] - 10https://gerrit.wikimedia.org/r/459643 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [23:41:25] (03CR) 10Dzahn: "this is how this looks in web ui: https://phabricator.wikimedia.org/T197873#4572835" [puppet] - 10https://gerrit.wikimedia.org/r/459626 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [23:45:53] (03CR) 10Smalyshev: Create wikidata ntriples dump from ttl dump (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/447922 (https://phabricator.wikimedia.org/T144103) (owner: 10Smalyshev) [23:49:23] 10Operations: Add which ldap groups can login on netbox login form - https://phabricator.wikimedia.org/T203840 (10Dzahn) >>! In T203840#4571522, @MoritzMuehlenhoff wrote: > That would require code changes in Netbox and doesn't seem to warrant the overhead. Fair enough, i said that assuming it was a puppet chang...