[00:00:07] <wikibugs_>	 (03CR) 10Filippo Giunchedi: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/448779 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn)
[00:03:03] <wikibugs_>	 (03PS2) 10Aaron Schulz: Only do cache writes to mcrouter for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449605 (https://phabricator.wikimedia.org/T198239)
[00:05:58] <wikibugs_>	 (03CR) 10Aaron Schulz: [C: 032] Only do cache writes to mcrouter for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449605 (https://phabricator.wikimedia.org/T198239) (owner: 10Aaron Schulz)
[00:07:17] <wikibugs_>	 (03Merged) 10jenkins-bot: Only do cache writes to mcrouter for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449605 (https://phabricator.wikimedia.org/T198239) (owner: 10Aaron Schulz)
[00:09:04] <wikibugs_>	 (03CR) 10jenkins-bot: Only do cache writes to mcrouter for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449605 (https://phabricator.wikimedia.org/T198239) (owner: 10Aaron Schulz)
[00:09:31] <logmsgbot>	 !log aaron@deploy1001 Synchronized wmf-config/mc.php: Only do cache writes to mcrouter for all wikis (duration: 00m 52s)
[00:09:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:12:11] <wikibugs_>	 (03CR) 10Dzahn: [C: 032] graphite::carbon_c_relay: base::service_unit -> systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/448779 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn)
[00:13:20] <wikibugs_>	 (03PS2) 10Aaron Schulz: Allow broadcasted mcrouter cache operations for purges [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449606 (https://phabricator.wikimedia.org/T198239)
[00:15:28] <mutante>	 !log graphite2002 - stopping carbon-local-relay, running puppet to start it again to confirm no issues with gerrit:448779
[00:15:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:17:07] <wikibugs_>	 (03CR) 10Dzahn: [C: 032] "puppet run is no-op. stopped carbon-local-relay on graphite2002, then ran puppet to have it started again. then the same with carbon-front" [puppet] - 10https://gerrit.wikimedia.org/r/448779 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn)
[00:17:20] <godog>	 mutante: thanks! I'm off but let me know if you run into issues
[00:17:36] <mutante>	 godog: i tested it on codfw but did not touch eqiad. no problem
[00:27:53] <wikibugs_>	 (03PS3) 10Aaron Schulz: Enable broadcasted mcrouter cache operations for testwikis and mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449606 (https://phabricator.wikimedia.org/T198239)
[00:29:12] <wikibugs_>	 (03CR) 10Aaron Schulz: [C: 032] Enable broadcasted mcrouter cache operations for testwikis and mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449606 (https://phabricator.wikimedia.org/T198239) (owner: 10Aaron Schulz)
[00:29:28] <wikibugs_>	 (03PS4) 10Aaron Schulz: Enable broadcasted mcrouter cache operations for test wikis and mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449606 (https://phabricator.wikimedia.org/T198239)
[00:31:42] <wikibugs_>	 (03CR) 10Aaron Schulz: [C: 032] Enable broadcasted mcrouter cache operations for test wikis and mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449606 (https://phabricator.wikimedia.org/T198239) (owner: 10Aaron Schulz)
[00:33:00] <wikibugs_>	 (03Merged) 10jenkins-bot: Enable broadcasted mcrouter cache operations for test wikis and mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449606 (https://phabricator.wikimedia.org/T198239) (owner: 10Aaron Schulz)
[00:35:25] <logmsgbot>	 !log aaron@deploy1001 Synchronized wmf-config/mc.php: Enable broadcasted mcrouter cache operations for test wikis and mw.org (duration: 00m 49s)
[00:35:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:38:06] <wikibugs_>	 (03PS1) 10Aaron Schulz: Enable broadcasted mcrouter operations for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452592 (https://phabricator.wikimedia.org/T198239)
[00:38:52] <wikibugs_>	 (03CR) 10jenkins-bot: Enable broadcasted mcrouter cache operations for test wikis and mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449606 (https://phabricator.wikimedia.org/T198239) (owner: 10Aaron Schulz)
[00:55:20] <wikibugs_>	 (03CR) 10Legoktm: "What you have so far looks nice, but I think the main thing that's missing is details of the arguments. If you look at <https://manpages.d" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/437054 (https://phabricator.wikimedia.org/T95097) (owner: 10Nehajha)
[00:56:15] <wikibugs_>	 10Operations, 10Packaging, 10Toolforge, 10Patch-For-Review: Please add php-imagick and php-redis packages to apt.wikimedia.org thirdparty/php72 - https://phabricator.wikimedia.org/T200666 (10Legoktm)
[00:56:44] <wikibugs_>	 (03CR) 10Krinkle: [C: 031] Enable broadcasted mcrouter operations for all wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452592 (https://phabricator.wikimedia.org/T198239) (owner: 10Aaron Schulz)
[01:41:27] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 22 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[01:46:27] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[01:58:37] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 22 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[02:02:57] <wikibugs_>	 10Operations, 10TechCom-RFC, 10Traffic, 10Patch-For-Review, and 3 others: Harmonise the identification of requests across our stack - https://phabricator.wikimedia.org/T201409 (10CCicalese_WMF)
[02:08:37] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[02:15:38] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 22 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[02:22:26] <wikibugs_>	 10Operations, 10Performance-Team, 10Traffic: Significant increase in Time To First Byte on 2018-08-08, between 16:00 and 20:00 UTC - https://phabricator.wikimedia.org/T201769 (10Imarlier) 05Open>03Resolved Confirmed that WPT agents are resolving to the codfw edge.  Given that this means that they're goin...
[02:23:48] <icinga-wm>	 PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 20 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[02:28:48] <icinga-wm>	 RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 18 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[02:31:57] <wikibugs_>	 10Operations, 10Wikimedia-General-or-Unknown, 10Wikimedia-log-errors: Jobrunners generate mediawiki exceptions upon calling Closure$RecentChange::save - https://phabricator.wikimedia.org/T169884 (10Krinkle) 05Open>03declined Not seen in Logstash for at least 7 days (searching mediawiki-errors for `Recent...
[02:35:21] <logmsgbot>	 !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.16) (duration: 13m 57s)
[02:35:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:45:47] <logmsgbot>	 !log l10nupdate@deploy1001 ResourceLoader cache refresh completed at Tue Aug 14 02:45:46 UTC 2018 (duration 10m 25s)
[02:45:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:50:48] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[02:57:18] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1290 is CRITICAL: CRITICAL - load average: 44.57, 42.28, 40.13
[02:57:57] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 21 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[03:01:18] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1290 is CRITICAL: CRITICAL - load average: 41.14, 41.04, 40.10
[03:06:48] <icinga-wm>	 PROBLEM - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is CRITICAL: CRITICAL: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is alerting: 70% GET drop in 30min alert.
[03:07:57] <icinga-wm>	 RECOVERY - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is OK: OK: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is not alerting.
[03:07:58] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[03:14:58] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 27 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[03:26:47] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 778.63 seconds
[03:44:17] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 261.99 seconds
[03:51:18] <icinga-wm>	 PROBLEM - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is CRITICAL: CRITICAL: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is alerting: 70% GET drop in 30min alert.
[03:55:08] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[03:59:37] <icinga-wm>	 RECOVERY - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is OK: OK: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is not alerting.
[04:02:08] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 21 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[04:06:47] <icinga-wm>	 PROBLEM - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is CRITICAL: CRITICAL: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is alerting: 70% GET drop in 30min alert.
[04:07:47] <icinga-wm>	 RECOVERY - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is OK: OK: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is not alerting.
[04:12:08] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[04:19:27] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 21 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[04:27:36] <wikibugs_>	 (03PS1) 10Andrew Bogott: openstack glance: move active service for eqiad1 and main to cloudcontrol1003 [puppet] - 10https://gerrit.wikimedia.org/r/452595 (https://phabricator.wikimedia.org/T191791)
[04:27:38] <wikibugs_>	 (03PS1) 10Andrew Bogott: Openstack glance: remove glance service from labcontrol1001 [puppet] - 10https://gerrit.wikimedia.org/r/452596 (https://phabricator.wikimedia.org/T191791)
[04:36:37] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 35.97, 33.90, 32.07
[04:39:27] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[04:44:03] <wikibugs_>	 10Operations, 10TechCom-RFC, 10Traffic, 10Patch-For-Review, and 3 others: Harmonise the identification of requests across our stack - https://phabricator.wikimedia.org/T201409 (10Joe) We also need internal requests to be traced, so I would assume we need all services to generate a request Id whenever they...
[04:44:47] <icinga-wm>	 PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 20 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[04:46:37] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 21 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[04:48:51] <wikibugs_>	 (03PS1) 10Tim Starling: Update ~tstarling/.bashrc [puppet] - 10https://gerrit.wikimedia.org/r/452597
[04:49:48] <icinga-wm>	 RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 19 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[04:53:08] <wikibugs_>	 (03CR) 10Tim Starling: [C: 032] Update ~tstarling/.bashrc [puppet] - 10https://gerrit.wikimedia.org/r/452597 (owner: 10Tim Starling)
[04:53:28] <wikibugs_>	 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to restricted production access and analytics-privatedata-users for Karen Brown - https://phabricator.wikimedia.org/T201668 (10JanWMF) Thank you, Daniel, I appreciate you checking and working on this :)
[04:56:37] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[05:03:47] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 22 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[05:13:21] <TimStarling>	 !log killed populateContentTables.php for s2 at jcrespo's request
[05:13:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:28:57] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[05:35:58] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 21 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[05:45:59] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[05:53:08] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 21 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[05:58:08] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[06:05:18] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 23 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[06:11:38] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 41.08, 35.61, 32.20
[06:13:46] <wikibugs_>	 10Operations, 10SRE-Access-Requests: Subscribe user mepps to security@wikimedia.org - https://phabricator.wikimedia.org/T201856 (10Zoranzoki21)
[06:20:18] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[06:27:28] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 52 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[06:28:47] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1232 is CRITICAL: CRITICAL - load average: 39.73, 35.54, 31.74
[06:32:48] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1232 is CRITICAL: CRITICAL - load average: 37.02, 35.10, 32.36
[06:36:06] <_joe_>	 oh nice
[06:36:11] <_joe_>	 a good way to start the day
[06:37:41] <_joe_>	 !log depooling mw1233 from live traffic for debugging
[06:37:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:43:48] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 36.97, 35.17, 32.20
[06:45:23] <_joe_>	 !log rolling restart of hhvm in eqiad api, high load
[06:45:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:51:58] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 37.60, 33.55, 32.28
[06:52:38] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[06:53:00] <wikibugs_>	 (03PS2) 10Jcrespo: Revert "mariadb: Depool db1101 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452324
[06:53:23] <wikibugs_>	 (03PS1) 10Volans: OpenStack: add custom parameters for the client [software/cumin] - 10https://gerrit.wikimedia.org/r/452608 (https://phabricator.wikimedia.org/T201881)
[06:54:07] <wikibugs_>	 (03PS1) 10Volans: cumin: add region_name to the WMCS openstack config [puppet] - 10https://gerrit.wikimedia.org/r/452609 (https://phabricator.wikimedia.org/T201881)
[06:57:27] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1235 is CRITICAL: CRITICAL - load average: 36.21, 33.87, 32.19
[06:57:38] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1232 is OK: OK - load average: 1.21, 12.28, 23.90
[06:59:07] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1233 is OK: OK - load average: 19.53, 19.68, 23.85
[06:59:48] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 22 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[07:01:11] <wikibugs_>	 (03CR) 10Volans: "Compiler results available here:" [puppet] - 10https://gerrit.wikimedia.org/r/452609 (https://phabricator.wikimedia.org/T201881) (owner: 10Volans)
[07:01:50] <wikibugs_>	 (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool db1101 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452324 (owner: 10Jcrespo)
[07:03:12] <wikibugs_>	 (03Merged) 10jenkins-bot: Revert "mariadb: Depool db1101 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452324 (owner: 10Jcrespo)
[07:03:52] <wikibugs_>	 (03CR) 10jenkins-bot: Revert "mariadb: Depool db1101 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452324 (owner: 10Jcrespo)
[07:05:34] <logmsgbot>	 !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1101:s7 and db1101:s8 (duration: 00m 52s)
[07:05:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:05:52] <wikibugs_>	 (03PS2) 10Muehlenhoff: Add php-imagick and php-redis to thirdparty/php72 [puppet] - 10https://gerrit.wikimedia.org/r/452274 (https://phabricator.wikimedia.org/T200666) (owner: 10Legoktm)
[07:09:11] <wikibugs_>	 (03CR) 10Muehlenhoff: [C: 032] Add php-imagick and php-redis to thirdparty/php72 [puppet] - 10https://gerrit.wikimedia.org/r/452274 (https://phabricator.wikimedia.org/T200666) (owner: 10Legoktm)
[07:12:58] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 36.68, 33.19, 32.07
[07:14:24] <wikibugs_>	 (03PS6) 10Ema: ATS: add Lua scripting support [puppet] - 10https://gerrit.wikimedia.org/r/451838 (https://phabricator.wikimedia.org/T199720)
[07:14:26] <wikibugs_>	 (03PS1) 10Ema: contint: add support for testing ts-lua plugins [puppet] - 10https://gerrit.wikimedia.org/r/452612 (https://phabricator.wikimedia.org/T199720)
[07:15:58] <wikibugs_>	 (03CR) 10jerkins-bot: [V: 04-1] contint: add support for testing ts-lua plugins [puppet] - 10https://gerrit.wikimedia.org/r/452612 (https://phabricator.wikimedia.org/T199720) (owner: 10Ema)
[07:18:27] <wikibugs_>	 (03PS2) 10Ema: contint: add support for testing ts-lua plugins [puppet] - 10https://gerrit.wikimedia.org/r/452612 (https://phabricator.wikimedia.org/T199720)
[07:20:18] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1235 is CRITICAL: CRITICAL - load average: 34.36, 32.33, 32.03
[07:27:38] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1235 is CRITICAL: CRITICAL - load average: 35.92, 32.79, 32.24
[07:29:58] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[07:32:31] <wikibugs_>	 (03CR) 10Legoktm: [C: 04-1] "If you want to use this for operations/puppet, then adding the packages to the contint manifests won't work. You'll have to update https:/" [puppet] - 10https://gerrit.wikimedia.org/r/452612 (https://phabricator.wikimedia.org/T199720) (owner: 10Ema)
[07:34:57] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1235 is CRITICAL: CRITICAL - load average: 33.75, 32.46, 32.06
[07:37:17] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 21 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[07:41:31] <wikibugs_>	 (03PS1) 10Jcrespo: mariadb: Set s2 in read only mode due to maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452620 (https://phabricator.wikimedia.org/T201694)
[07:42:17] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[07:42:56] <wikibugs_>	 (03PS1) 10Muehlenhoff: Revert "Add php-imagick and php-redis to thirdparty/php72" [puppet] - 10https://gerrit.wikimedia.org/r/452621
[07:47:03] <wikibugs_>	 (03CR) 10Muehlenhoff: [C: 032] Revert "Add php-imagick and php-redis to thirdparty/php72" [puppet] - 10https://gerrit.wikimedia.org/r/452621 (owner: 10Muehlenhoff)
[07:54:17] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1226 is OK: OK - load average: 5.61, 9.11, 23.32
[07:54:28] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 24 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[07:56:17] <moritzm>	 !log installing ghostscript security updates
[07:56:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:59:28] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[08:00:12] <icinga-wm>	 PROBLEM - HHVM rendering on mw1281 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:01:11] <icinga-wm>	 RECOVERY - HHVM rendering on mw1281 is OK: HTTP OK: HTTP/1.1 200 OK - 75412 bytes in 3.445 second response time
[08:01:12] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1290 is OK: OK - load average: 11.16, 13.04, 29.63
[08:03:09] <wikibugs_>	 (03CR) 10Gehel: Add cookbook entry point script (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/450937 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans)
[08:04:02] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1231 is OK: OK - load average: 8.41, 11.66, 23.76
[08:06:07] <wikibugs_>	 (03CR) 10Gehel: [C: 031] "LGTM, trivial enough" [software/cumin] - 10https://gerrit.wikimedia.org/r/452608 (https://phabricator.wikimedia.org/T201881) (owner: 10Volans)
[08:06:32] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 21 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[08:07:05] <wikibugs_>	 (03CR) 10Volans: Add cookbook entry point script (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/450937 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans)
[08:07:51] <wikibugs_>	 (03CR) 10Volans: [C: 032] OpenStack: add custom parameters for the client [software/cumin] - 10https://gerrit.wikimedia.org/r/452608 (https://phabricator.wikimedia.org/T201881) (owner: 10Volans)
[08:08:11] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1235 is OK: OK - load average: 7.63, 13.84, 23.44
[08:10:47] <wikibugs_>	 (03Merged) 10jenkins-bot: OpenStack: add custom parameters for the client [software/cumin] - 10https://gerrit.wikimedia.org/r/452608 (https://phabricator.wikimedia.org/T201881) (owner: 10Volans)
[08:12:01] <wikibugs_>	 (03CR) 10jenkins-bot: OpenStack: add custom parameters for the client [software/cumin] - 10https://gerrit.wikimedia.org/r/452608 (https://phabricator.wikimedia.org/T201881) (owner: 10Volans)
[08:13:42] <wikibugs_>	 (03CR) 10Vgutierrez: [C: 032] varnish: get rid of AES128-SHA redirection to /sec-warning [puppet] - 10https://gerrit.wikimedia.org/r/450020 (https://phabricator.wikimedia.org/T192555) (owner: 10Vgutierrez)
[08:13:50] <wikibugs_>	 (03PS2) 10Vgutierrez: varnish: get rid of AES128-SHA redirection to /sec-warning [puppet] - 10https://gerrit.wikimedia.org/r/450020 (https://phabricator.wikimedia.org/T192555)
[08:16:32] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[08:19:39] <moritzm>	 !log upgrading wikidiff to 1.7.2 on mw1266-mw1275 (HHVM bytecode cache is pruned during update)
[08:19:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:52] <icinga-wm>	 PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 20 probes of 314 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[08:22:17] <wikibugs_>	 (03PS1) 10Jcrespo: mariadb: Set s2 as read-write and promote db1122 as the new s2 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452632 (https://phabricator.wikimedia.org/T201694)
[08:23:42] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 21 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[08:25:52] <icinga-wm>	 RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 16 probes of 314 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[08:31:42] <wikibugs_>	 (03PS3) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: move the other private wikis to the define [puppet] - 10https://gerrit.wikimedia.org/r/451255 (https://phabricator.wikimedia.org/T196968)
[08:31:44] <wikibugs_>	 (03PS3) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: make includes explicit in more wikis [puppet] - 10https://gerrit.wikimedia.org/r/451257 (https://phabricator.wikimedia.org/T196968)
[08:31:46] <wikibugs_>	 (03PS3) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert loginwiki, chapterwiki [puppet] - 10https://gerrit.wikimedia.org/r/451258 (https://phabricator.wikimedia.org/T196968)
[08:31:48] <wikibugs_>	 (03PS3) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: expand include everywhere in remnant.conf [puppet] - 10https://gerrit.wikimedia.org/r/451259 (https://phabricator.wikimedia.org/T196968)
[08:31:50] <wikibugs_>	 (03PS3) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: expand the includes in sites in main.conf (1/2) [puppet] - 10https://gerrit.wikimedia.org/r/451260 (https://phabricator.wikimedia.org/T196968)
[08:31:52] <wikibugs_>	 (03PS2) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: expand the includes in sites in main.conf (2/2) [puppet] - 10https://gerrit.wikimedia.org/r/452322 (https://phabricator.wikimedia.org/T196968)
[08:31:54] <wikibugs_>	 (03PS3) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert simple wikis in remnant.conf [puppet] - 10https://gerrit.wikimedia.org/r/452323 (https://phabricator.wikimedia.org/T196968)
[08:31:56] <wikibugs_>	 (03PS2) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: enable HHVM on some sites(!!!) [puppet] - 10https://gerrit.wikimedia.org/r/452325
[08:31:58] <wikibugs_>	 (03PS1) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert usability wiki [puppet] - 10https://gerrit.wikimedia.org/r/452635 (https://phabricator.wikimedia.org/T196968)
[08:32:00] <wikibugs_>	 (03PS1) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: migrate wikispecies [puppet] - 10https://gerrit.wikimedia.org/r/452636 (https://phabricator.wikimedia.org/T196968)
[08:32:02] <wikibugs_>	 (03PS1) 10Jcrespo: mariadb: Failover db1066 (eqiad s2 master) to db1122 [puppet] - 10https://gerrit.wikimedia.org/r/452637 (https://phabricator.wikimedia.org/T197073)
[08:33:13] <wikibugs_>	 (03PS7) 10Ema: ATS: add Lua scripting support [puppet] - 10https://gerrit.wikimedia.org/r/451838 (https://phabricator.wikimedia.org/T199720)
[08:33:15] <wikibugs_>	 (03PS3) 10Ema: tox: add ts-lua tests for trafficserver [puppet] - 10https://gerrit.wikimedia.org/r/452612 (https://phabricator.wikimedia.org/T199720)
[08:37:42] <wikibugs_>	 (03CR) 10Gehel: [C: 031] "very minor comments inline, otherwise LGTM" (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/450937 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans)
[08:38:00] <icinga-wm>	 PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 20 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[08:38:27] <wikibugs_>	 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1049.eqiad.wmnet', 'elastic1046...
[08:40:03] <wikibugs_>	 (03CR) 10jerkins-bot: [V: 04-1] tox: add ts-lua tests for trafficserver [puppet] - 10https://gerrit.wikimedia.org/r/452612 (https://phabricator.wikimedia.org/T199720) (owner: 10Ema)
[08:40:39] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[08:43:00] <icinga-wm>	 RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 19 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[08:43:20] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: instance=kubernetes1003.eqiad.wmnet operation_type={create_container,start_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[08:44:20] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes1003 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[08:46:16] <wikibugs_>	 (03CR) 10Gehel: [C: 031] "LGTM, minor comment inline." (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/452378 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans)
[08:47:11] <icinga-wm>	 PROBLEM - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is CRITICAL: CRITICAL: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is alerting: 70% GET drop in 30min alert.
[08:49:10] <icinga-wm>	 RECOVERY - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is OK: OK: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is not alerting.
[08:49:42] <vgutierrez>	 uh ema ^^
[08:50:03] <wikibugs_>	 (03CR) 10Gehel: [C: 031] "Nice cleanup!" (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/452379 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans)
[08:50:19] <ema>	 vgutierrez: T200673
[08:50:21] <stashbot>	 T200673: varnish-http-requests false positives when a DC is depooled  - https://phabricator.wikimedia.org/T200673
[08:50:48] <vgutierrez>	 ema: <3
[08:51:51] <addshore>	 !log addshore@labweb1001:~$ mwscript extensions/OATHAuth/maintenance/disableOATHAuthForUser.php --wiki=labswiki GoranSMilovanovic # T201122
[08:51:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:59] <stashbot>	 T201122: Cannot login to Wikitech w. my LDAP account - https://phabricator.wikimedia.org/T201122
[08:57:00] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[08:59:19] <addshore>	 Is there a way to automatically !log things from a machine while running a command?
[09:00:45] <_joe_>	 addshore: heh, not really
[09:01:01] <_joe_>	 but it's not so hard to add a simple script that does that for you
[09:01:30] <_joe_>	 it's what scap and/or conftool do, after all
[09:01:38] <wikibugs_>	 (03PS1) 10Jcrespo: mariadb: Point s2-master CNAME to db1122 [dns] - 10https://gerrit.wikimedia.org/r/452642 (https://phabricator.wikimedia.org/T201694)
[09:01:48] <addshore>	 it would be nice to just be able to run "sal some_command_here" for example
[09:01:53] <_joe_>	 addshore: are you aware all your !log messages end up on twitter, right?
[09:03:09] <addshore>	 yup, that bot is the thing that probably says my username the most on twitter
[09:03:27] <_joe_>	 eheheh indeed
[09:03:45] <_joe_>	 I remember before knowing it to have done some very snarky !logs
[09:04:38] <wikibugs_>	 (03CR) 10Gehel: "LGTM, very minor comment inline." (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/451254 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans)
[09:05:04] <wikibugs_>	 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1039.eqiad.wmnet', 'elastic1046.eqiad.wmnet', 'elastic1049.eqiad.wmnet'] ```  an...
[09:07:01] <addshore>	 !log for i in {1..1000}; do echo Lexeme:L$i; done | mwscript purgePage.php --wiki wikidatawiki --skip-exists-check # T198301
[09:07:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:08] <stashbot>	 T198301: Poke existing lexemes to be reflected on SpecialPage - https://phabricator.wikimedia.org/T198301
[09:08:54] <addshore>	 !log for i in {1000..3000}; do echo Lexeme:L$i; done | mwscript purgePage.php --wiki wikidatawiki --skip-exists-check # T198301
[09:08:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:14] <addshore>	 !log for i in {3000..6000}; do echo Lexeme:L$i; done | mwscript purgePage.php --wiki wikidatawiki --skip-exists-check # T198301
[09:13:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:20] <stashbot>	 T198301: Poke existing lexemes to be reflected on SpecialPage - https://phabricator.wikimedia.org/T198301
[09:15:56] <wikibugs_>	 (03CR) 10Gehel: [C: 031] "LGTM, trivial enough" [software/spicerack] - 10https://gerrit.wikimedia.org/r/451537 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans)
[09:16:33] <wikibugs_>	 10Operations, 10Wikidata, 10monitoring, 10Patch-For-Review, 10User-Addshore: Add Addshore & possibly other WMDE devs/deployers to the wikidata icinga contact list - https://phabricator.wikimedia.org/T195289 (10Addshore) Quite some time has passed now, any update here?
[09:16:51] <wikibugs_>	 10Operations, 10Electron-PDFs, 10Proton, 10Readers-Web-Backlog, and 4 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 (10phuedx) Further to the above, AIUI both {T181623} and {T177765} block Proton's deployment. @pmiazga has submitted two changes for the form...
[09:17:14] <addshore>	 !log for i in {6000..10000}; do echo Lexeme:L$i; done | mwscript purgePage.php --wiki wikidatawiki --skip-exists-check # T198301
[09:17:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:20:53] <moritzm>	 !log upgrading wikidiff to 1.7.2 on mw1238-mw1258 (HHVM bytecode cache is pruned during update)
[09:20:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:23:01] <wikibugs_>	 (03PS1) 10Jcrespo: mariadb: Depool db1102 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452644 (https://phabricator.wikimedia.org/T201694)
[09:23:29] <addshore>	 !log for i in {10000..12500}; do echo Lexeme:L$i; done | mwscript purgePage.php --wiki wikidatawiki --skip-exists-check # T198301
[09:23:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:23:35] <stashbot>	 T198301: Poke existing lexemes to be reflected on SpecialPage - https://phabricator.wikimedia.org/T198301
[09:24:24] <wikibugs_>	 (03CR) 10Jcrespo: [C: 032] mariadb: Depool db1102 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452644 (https://phabricator.wikimedia.org/T201694) (owner: 10Jcrespo)
[09:25:05] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[09:25:40] <wikibugs_>	 (03Merged) 10jenkins-bot: mariadb: Depool db1102 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452644 (https://phabricator.wikimedia.org/T201694) (owner: 10Jcrespo)
[09:27:15] <logmsgbot>	 !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1102 (duration: 00m 52s)
[09:27:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:53] <wikibugs_>	 (03PS1) 10Jcrespo: Revert "mariadb: Depool db1102 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452645
[09:32:13] <jynus>	 !log stop and restart db1122 for maintenance
[09:32:14] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 21 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[09:32:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:08] <wikibugs_>	 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1044.eqiad.wmnet', 'elastic1045...
[09:33:23] <icinga-wm>	 PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[09:33:26] <wikibugs_>	 (03CR) 10Gehel: [C: 04-1] Add remote module to interact with Cumin (034 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/451538 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans)
[09:33:33] <wikibugs_>	 10Operations, 10Wikidata, 10monitoring, 10Patch-For-Review, 10User-Addshore: Add Addshore & possibly other WMDE devs/deployers to the wikidata icinga contact list - https://phabricator.wikimedia.org/T195289 (10ArielGlenn) @Addshore and @Ladsgroup should be on the contact list (patch merged at the end of...
[09:35:49] <wikibugs_>	 (03PS1) 10Jcrespo: mariadb: Switch db1122 binlog format to ROW [puppet] - 10https://gerrit.wikimedia.org/r/452648 (https://phabricator.wikimedia.org/T201694)
[09:36:29] <wikibugs_>	 (03CR) 10Jcrespo: [C: 032] mariadb: Switch db1122 binlog format to ROW [puppet] - 10https://gerrit.wikimedia.org/r/452648 (https://phabricator.wikimedia.org/T201694) (owner: 10Jcrespo)
[09:37:05] <wikibugs_>	 (03CR) 10jenkins-bot: mariadb: Depool db1102 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452644 (https://phabricator.wikimedia.org/T201694) (owner: 10Jcrespo)
[09:38:58] <wikibugs_>	 (03PS1) 10Jcrespo: mariadb: Switch db1122 binlog format to STATEMENT [puppet] - 10https://gerrit.wikimedia.org/r/452649 (https://phabricator.wikimedia.org/T201694)
[09:39:46] <wikibugs_>	 (03CR) 10Jcrespo: [C: 032] mariadb: Switch db1122 binlog format to STATEMENT [puppet] - 10https://gerrit.wikimedia.org/r/452649 (https://phabricator.wikimedia.org/T201694) (owner: 10Jcrespo)
[09:42:14] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[09:42:20] <wikibugs_>	 (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool db1102 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452645 (owner: 10Jcrespo)
[09:43:38] <wikibugs_>	 (03Merged) 10jenkins-bot: Revert "mariadb: Depool db1102 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452645 (owner: 10Jcrespo)
[09:46:26] <wikibugs_>	 (03CR) 10Gehel: "Mostly minor comments inline" (035 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/451814 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans)
[09:46:45] <wikibugs_>	 (03PS2) 10Jcrespo: mariadb: Failover db1066 (eqiad s2 master) to db1122 [puppet] - 10https://gerrit.wikimedia.org/r/452637 (https://phabricator.wikimedia.org/T197073)
[09:47:23] <wikibugs_>	 (03PS2) 10Jcrespo: mariadb: Set s2 in read only mode due to maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452620 (https://phabricator.wikimedia.org/T201694)
[09:49:23] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[09:50:17] <logmsgbot>	 !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1102 with low load (duration: 00m 50s)
[09:50:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:40] <wikibugs_>	 (03CR) 10jenkins-bot: Revert "mariadb: Depool db1102 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452645 (owner: 10Jcrespo)
[09:54:24] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[09:57:45] <_joe_>	 addshore: /win 22
[09:57:47] <_joe_>	 argh
[09:57:51] <_joe_>	 sorry
[09:58:01] <_joe_>	 keyboard shortcut error
[09:58:04] <icinga-wm>	 RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[09:59:46] <wikibugs_>	 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1044.eqiad.wmnet', 'elastic1045.eqiad.wmnet', 'elastic1048.eqiad.wmnet'] ```  an...
[10:01:30] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[10:02:00] <icinga-wm>	 PROBLEM - HHVM jobrunner on mw1299 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.003 second response time
[10:02:41] <wikibugs_>	 (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "I believe you should use 'profile::openstack::main::region' instead of 'profile::openstack::base::region'." [puppet] - 10https://gerrit.wikimedia.org/r/452609 (https://phabricator.wikimedia.org/T201881) (owner: 10Volans)
[10:03:00] <icinga-wm>	 RECOVERY - HHVM jobrunner on mw1299 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.003 second response time
[10:04:12] <wikibugs_>	 10Operations, 10MediaWiki-extensions-Translate, 10Language-2018-July-September, 10MW-1.32-release-notes (WMF-deploy-2018-08-14 (1.32.0-wmf.17)), and 4 others: 503 error attempting to open multiple projects (Wikipedia and meta wiki are loading very slowly) - https://phabricator.wikimedia.org/T195293 (10Petar...
[10:04:40] <wikibugs_>	 (03CR) 10Volans: "> Patch Set 1: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/452609 (https://phabricator.wikimedia.org/T201881) (owner: 10Volans)
[10:10:27] <wikibugs_>	 10Operations, 10LDAP-Access-Requests, 10User-Addshore: Give access to graphite and grafana-admin to Aleksey Bekh-Ivanov (WMDE) - https://phabricator.wikimedia.org/T199233 (10WMDE-leszek) 05declined>03Open As Aleksey's manager I hereby sign off this request. He is an WMDE engineer and needs the requested...
[10:12:32] <volans>	 !log uploaded cumin_3.0.2-2_amd64.deb to apt.wikimedia.org jessie-wikimedia
[10:12:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:13:22] <wikibugs_>	 (03CR) 10Arturo Borrero Gonzalez: [C: 031] openstack glance: move active service for eqiad1 and main to cloudcontrol1003 [puppet] - 10https://gerrit.wikimedia.org/r/452595 (https://phabricator.wikimedia.org/T191791) (owner: 10Andrew Bogott)
[10:14:39] <volans>	 arturo: if you have a minute https://gerrit.wikimedia.org/r/c/operations/puppet/+/452609 (see my reply)
[10:14:45] <wikibugs_>	 (03CR) 10Arturo Borrero Gonzalez: [C: 031] "Could we also drop the `modules/profile/manifests/openstack/main/glance.pp` file if it isn't referenced anymore?" [puppet] - 10https://gerrit.wikimedia.org/r/452596 (https://phabricator.wikimedia.org/T191791) (owner: 10Andrew Bogott)
[10:15:03] <arturo>	 ack
[10:15:46] <wikibugs_>	 (03CR) 10Volans: [C: 031] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452620 (https://phabricator.wikimedia.org/T201694) (owner: 10Jcrespo)
[10:15:52] <volans>	 thx
[10:16:41] <moritzm>	 !log upgrading wikidiff to 1.7.2 on mw1221-mw1235 (HHVM bytecode cache is pruned during update)
[10:16:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:17:07] <wikibugs_>	 (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/452637 (https://phabricator.wikimedia.org/T197073) (owner: 10Jcrespo)
[10:17:36] <wikibugs_>	 (03CR) 10Volans: [C: 031] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/452642 (https://phabricator.wikimedia.org/T201694) (owner: 10Jcrespo)
[10:18:00] <wikibugs_>	 (03CR) 10Tim Starling: [C: 031] mariadb: Set s2 as read-write and promote db1122 as the new s2 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452632 (https://phabricator.wikimedia.org/T201694) (owner: 10Jcrespo)
[10:19:11] <wikibugs_>	 (03CR) 10Volans: [C: 031] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452632 (https://phabricator.wikimedia.org/T201694) (owner: 10Jcrespo)
[10:22:07] <wikibugs_>	 (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "> > Patch Set 1: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/452609 (https://phabricator.wikimedia.org/T201881) (owner: 10Volans)
[10:23:22] <volans>	 arturo: thanks for the clarification, at this point I have additional questions :)
[10:23:40] <arturo>	 go ahead :-)
[10:23:54] <volans>	 the cluster that was until yesterday without region, and is the one used by WMCS and queried by cumin on labpuppetmaster*, which one is it? main?
[10:24:06] <arturo>	 yes, main
[10:24:14] <arturo>	 which region is `eqiad`
[10:24:20] <volans>	 ok and the eqiad1 what is it?
[10:24:29] <volans>	 will you need to query that too soon?
[10:24:58] <volans>	 will it replace main or in addition to it?
[10:26:42] <wikibugs_>	 (03PS2) 10Volans: cumin: add region_name to the WMCS openstack config [puppet] - 10https://gerrit.wikimedia.org/r/452609 (https://phabricator.wikimedia.org/T201881)
[10:26:43] <volans>	 in the meanwhile, code updated :) ^^^
[10:27:23] <wikibugs_>	 (03CR) 10Volans: "Ack, done." [puppet] - 10https://gerrit.wikimedia.org/r/452609 (https://phabricator.wikimedia.org/T201881) (owner: 10Volans)
[10:31:51] <wikibugs_>	 (03PS1) 10Volans: Rebuild for Django security upgrade [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/452656
[10:32:46] <arturo>	 volans: `eqiad1` is another deployment
[10:33:11] <arturo>	 the one that will eventually replace `main` 
[10:33:22] <arturo>	 `main` -> nova-network based openstack deployment
[10:33:29] <volans>	 ok, so I guess for the interim period we'll use 2 different cumin configs
[10:33:40] <arturo>	 `eqiad1` -> neutron based openstack deployment
[10:33:47] <volans>	 with one default and the other to be specified with -c /etc/cumin/eqiad1.yaml
[10:34:14] <volans>	 for which we could add an alias/wrapper :)
[10:35:04] <arturo>	 apart of the many deployments, we introduced the concepts of `regions', and since yesterday the main and eqiad1 deployment share some components by means of this regions mechanism
[10:35:34] <arturo>	 the main deployment region is called `eqiad` and the eqiad1 deployment region is called `eqiad1-r`
[10:35:42] <arturo>	 naming is hard.... :-P
[10:36:05] <volans>	 ehehe
[10:36:10] <volans>	 is the patch ok now to merge?
[10:36:23] <volans>	 at least to unblock cumin for the current main deployment
[10:36:25] <arturo>	 checking
[10:36:33] <volans>	 thanks!
[10:37:00] <wikibugs_>	 (03CR) 10Arturo Borrero Gonzalez: [C: 031] cumin: add region_name to the WMCS openstack config [puppet] - 10https://gerrit.wikimedia.org/r/452609 (https://phabricator.wikimedia.org/T201881) (owner: 10Volans)
[10:37:06] <arturo>	 +1
[10:37:28] <wikibugs_>	 (03CR) 10Volans: [C: 032] cumin: add region_name to the WMCS openstack config [puppet] - 10https://gerrit.wikimedia.org/r/452609 (https://phabricator.wikimedia.org/T201881) (owner: 10Volans)
[10:37:29] <volans>	 great
[10:39:11] <volans>	 working!
[10:40:08] <icinga-wm>	 PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[10:41:26] <volans>	 !log upgraded cumin on labpuppetmaster* to fix cumin with the new openstack region - T201881
[10:41:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:41:33] <stashbot>	 T201881: Cumin's OpenStack backend appears to be broken after labs keystone region merge - https://phabricator.wikimedia.org/T201881
[10:44:08] <icinga-wm>	 RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[10:45:16] <moritzm>	 !log repooled mw1227 (was probably overlooked to repool after previous debugging)
[10:45:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:51:39] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[11:00:04] <jouncebot>	 addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor I � Unicode. All rise for European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180814T1100).
[11:00:04] <jouncebot>	 Zoranzoki21: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:03:10] <Zoranzoki21>	 I am here
[11:03:12] <Zoranzoki21>	 Who swating?
[11:03:31] <zeljkof>	 I can SWAT today
[11:03:48] <Zoranzoki21>	 Ok. Can you?
[11:03:50] <zeljkof>	 Zoranzoki21: I'll ping you when the first patch is at mwdebug1002 for testing
[11:04:03] <Zoranzoki21>	 zeljkof: Testing is not needed
[11:04:12] <zeljkof>	 Zoranzoki21: for both patches?
[11:04:17] <Zoranzoki21>	 zeljkof: No
[11:04:31] <Zoranzoki21>	 zeljkof: This two patches only fix typo in comment
[11:05:13] <wikibugs_>	 (03CR) 10Volans: [C: 031] "Diff looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/451255 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto)
[11:10:10] <zeljkof>	 Zoranzoki21: ok, both patches look good to me, will merge and deploy, I guess there is nothing to test :)
[11:10:34] <Zoranzoki21>	 zeljkof: Yes
[11:11:27] <wikibugs_>	 (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452050 (https://phabricator.wikimedia.org/T201491) (owner: 10Zoranzoki21)
[11:11:47] <wikibugs_>	 (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452051 (https://phabricator.wikimedia.org/T201491) (owner: 10Zoranzoki21)
[11:12:58] <wikibugs_>	 (03Merged) 10jenkins-bot: Fix 'the the' typo in wmf-config/CirrusSearch-common.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452050 (https://phabricator.wikimedia.org/T201491) (owner: 10Zoranzoki21)
[11:13:17] <wikibugs_>	 (03Merged) 10jenkins-bot: Fix 'the the' typo in vendor/perftools/xhgui-collector/external/header.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452051 (https://phabricator.wikimedia.org/T201491) (owner: 10Zoranzoki21)
[11:13:35] <zeljkof>	 Zoranzoki21: both patches merged, no rebase was needed
[11:13:46] <Zoranzoki21>	 zeljkof: Excellent than
[11:14:11] <zeljkof>	 well, except the rebase that gerrit does automatically
[11:14:50] <moritzm>	 !log repooled mw1233 (debugging completed)
[11:14:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:14:59] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 21 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[11:17:20] <logmsgbot>	 !log zfilipin@deploy1001 Synchronized wmf-config/CirrusSearch-common.php: SWAT: [[gerrit:452050|Fix the the typo in wmf-config/CirrusSearch-common.php (T201491)]] (duration: 00m 51s)
[11:17:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:17:27] <stashbot>	 T201491: Fix common typos in code - https://phabricator.wikimedia.org/T201491
[11:19:16] <logmsgbot>	 !log zfilipin@deploy1001 Synchronized vendor/perftools/xhgui-collector/external/header.php: SWAT: [[gerrit:452051|Fix "the the" typo in vendor/perftools/xhgui-collector/external/header.php (T201491)]] (duration: 00m 49s)
[11:19:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:19:53] <moritzm>	 !log upgrading wikidiff to 1.7.2 on mw1299-mw1306 (HHVM bytecode cache is pruned during update)
[11:19:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:20:14] <zeljkof>	 Zoranzoki21, Zoranzoki21_: both patches deployed, thanks for deploying with #releng! :D
[11:20:26] <Zoranzoki21_>	 Your welcome
[11:20:50] <zeljkof>	 !log EU SWAT finished
[11:20:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:25:01] <wikibugs_>	 (03CR) 10jenkins-bot: Fix 'the the' typo in wmf-config/CirrusSearch-common.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452050 (https://phabricator.wikimedia.org/T201491) (owner: 10Zoranzoki21)
[11:25:04] <wikibugs_>	 (03CR) 10jenkins-bot: Fix 'the the' typo in vendor/perftools/xhgui-collector/external/header.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452051 (https://phabricator.wikimedia.org/T201491) (owner: 10Zoranzoki21)
[11:35:29] <icinga-wm>	 PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 21 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[11:37:26] <DanielK_WMDE>	 what's the procedure for setting up a new ssh key for production shell access? I forgot my passphrase, since I never use it >_<
[11:37:29] <DanielK_WMDE>	 don't judge ;)
[11:38:19] <moritzm>	 DanielK_WMDE: please create a Phab task and tag is SRE-Access-Requests, then it'll be processed by clinic duty
[11:39:21] <DanielK_WMDE>	 moritzm: can do that, but where does the new private key go?
[11:39:26] <DanielK_WMDE>	 err, public :)
[11:39:31] <DanielK_WMDE>	 the private key doesn't go anywhere :)
[11:40:29] <icinga-wm>	 RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 16 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[11:40:41] <TimStarling>	 !log restarted populateContentTables.php on s2 (T183488)
[11:40:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:40:47] <stashbot>	 T183488: MCR schema migration stage 2: populate new fields - https://phabricator.wikimedia.org/T183488
[11:40:55] <moritzm>	 DanielK_WMDE: simply paste it in the Phab task, the person on clinic duty will take care of merging/deploying it via puppet
[11:40:59] <icinga-wm>	 PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[11:41:18] <DanielK_WMDE>	 moritzm: ok
[11:43:20] <wikibugs_>	 10Operations, 10SRE-Access-Requests: new ssh key for daniel - https://phabricator.wikimedia.org/T201913 (10daniel)
[11:44:59] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[11:52:09] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 21 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[11:58:50] <moritzm>	 !log upgrading wikidiff to 1.7.2 on mw1276-mw1283 (HHVM bytecode cache is pruned during update)
[11:58:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:03:59] <icinga-wm>	 RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[12:09:36] <wikibugs_>	 (03PS6) 10Jcrespo: db backup statistics: Initial implementation of the backup stats [puppet] - 10https://gerrit.wikimedia.org/r/449681 (https://phabricator.wikimedia.org/T198987)
[12:10:22] <wikibugs_>	 (03CR) 10jerkins-bot: [V: 04-1] db backup statistics: Initial implementation of the backup stats [puppet] - 10https://gerrit.wikimedia.org/r/449681 (https://phabricator.wikimedia.org/T198987) (owner: 10Jcrespo)
[12:12:24] <wikibugs_>	 (03CR) 10Jcrespo: "I have implemented the quick fixes. The with, logger and dict changes require more work (specially logger on the whole file), and will do " (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/449681 (https://phabricator.wikimedia.org/T198987) (owner: 10Jcrespo)
[12:14:16] <wikibugs_>	 10Operations, 10TCB-Team, 10wikidiff2, 10WMDE-QWERTY-Sprint-2018-07-17, and 2 others: Update wikidiff2 library on the WMF production cluster to v1.7.2 - https://phabricator.wikimedia.org/T199801 (10WMDE-Fisch)
[12:15:02] <wikibugs_>	 (03PS7) 10Jcrespo: db backup statistics: Initial implementation of the backup stats [puppet] - 10https://gerrit.wikimedia.org/r/449681 (https://phabricator.wikimedia.org/T198987)
[12:17:19] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[12:18:12] <wikibugs_>	 10Operations, 10ops-eqiad, 10Discovery, 10Discovery-Search, 10Elasticsearch: check elastic1022 power supply redundancy - https://phabricator.wikimedia.org/T177631 (10Gehel) 05Resolved>03Open Re-opening as Icinga is still alerting on this. I can confirm that `ipmi-sensors --output-sensor-state --ignor...
[12:18:26] <wikibugs_>	 10Operations, 10ops-eqiad, 10Discovery, 10Discovery-Search, 10Elasticsearch: check elastic1022 power supply redundancy - https://phabricator.wikimedia.org/T177631 (10Gehel) p:05Triage>03High
[12:24:29] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[12:31:33] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - elasticsearch http://10.2.2.30:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.2.2.30, port=9200): Read timed out. (read timeout=4)
[12:31:56] <paravoid>	 uh?
[12:31:58] <_joe_>	 uh
[12:32:01] <gehel>	 ^ master re-election slower than expected, should be back up in a second
[12:32:03] <_joe_>	 that looks bad
[12:32:10] <_joe_>	 gehel: is that serving traffic?
[12:32:14] <paravoid>	 were you doing something gehel?
[12:32:18] <gehel>	 should be, checking
[12:32:27] <gehel>	 yep, reimaging of elastic nodes
[12:32:33] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on search.svc.eqiad.wmnet is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 32, unassigned_shards: 141, number_of_pending_tasks: 589, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3139, task_max_waiting_in_queue_millis: 65118, cluster_name: production-search-eqiad, relocating_shards: 61, active_shards_percent_as_nu
[12:32:33] <icinga-wm>	 , active_shards: 9276, initializing_shards: 10, number_of_data_nodes: 32, delayed_unassigned_shards: 141
[12:32:41] <paravoid>	 please !log :)
[12:32:49] <moritzm>	 search wfm
[12:32:51] <gehel>	 ok, let's make that check not paging...
[12:33:22] <gehel>	 paravoid: there should be a log from the reimage script
[12:34:24] <wikibugs_>	 (03PS2) 10Gehel: elasticsearch: shards check should not page. [puppet] - 10https://gerrit.wikimedia.org/r/451583
[12:34:30] <paravoid>	 gehel: I only see one from yesterday
[12:35:06] <gehel>	 yeah, scrolling back, I can't see it either
[12:35:31] <wikibugs_>	 (03CR) 10Gehel: [C: 032] elasticsearch: shards check should not page. [puppet] - 10https://gerrit.wikimedia.org/r/451583 (owner: 10Gehel)
[12:36:48] <icinga-wm>	 PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [600.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&panelId=9&fullscreen
[12:37:59] <icinga-wm>	 PROBLEM - tilerator on maps1004 is CRITICAL: connect to address 10.64.48.154 and port 6534: Connection refused
[12:39:33] <gehel>	 tilerator issue seems transient, a npm worker was killed and automatically restarted
[12:39:39] <wikibugs_>	 (03PS1) 10Giuseppe Lavagetto: [WIP] PHP: create module for modern Debian-based distributions [puppet] - 10https://gerrit.wikimedia.org/r/452664 (https://phabricator.wikimedia.org/T201140)
[12:40:22] <gehel>	 cirrus failures should be going down in a minute, the trend on the graph are not amazingly clear though
[12:41:11] <Nikerabbit>	 https://horizon.wikimedia.org/ is not loading for me, btw
[12:42:04] <_joe_>	 wfm
[12:42:59] <icinga-wm>	 RECOVERY - Number of backend failures per minute from CirrusSearch on graphite1001 is OK: OK: Less than 20.00% above the threshold [300.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&panelId=9&fullscreen
[12:43:18] <icinga-wm>	 PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 43 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[12:43:35] <volans>	 Nikerabbit: wfm too, I also logged in
[12:43:39] <Nikerabbit>	 works in incognito... something messed up with session state I suppose
[12:44:04] <volans>	 yeah, try clearing the cookies and session
[12:44:38] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[12:45:43] <Nikerabbit>	 curiously, https://horizon.wikimedia.org redirectors to http://horizon.wikimedia.org/project/ (not https!), that redirects to same url in https that never loads
[12:46:30] <wikibugs_>	 10Operations: onboarding Effie Mouzeli - https://phabricator.wikimedia.org/T201855 (10Joe)
[12:46:32] <Nikerabbit>	 since it doesn't load at all, I don't have easy access to delete cookies for that domain... annoying browsers
[12:46:32] <wikibugs_>	 10Operations: Onboarding Effie Mouzeli - https://phabricator.wikimedia.org/T201816 (10Joe)
[12:47:15] <volans>	 Nikerabbit: try https://horizon.wikimedia.org/auth/
[12:47:22] <wikibugs_>	 10Operations, 10Electron-PDFs, 10Proton, 10Readers-Web-Backlog, and 4 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 (10debt) awesome, thanks for the updates, @Pchelolo and @phuedx :)
[12:47:30] <volans>	 it gives you an error, but should load and allow you to clear them ;)
[12:48:03] <Nikerabbit>	 nope, same behavior :(
[12:48:17] <volans>	 that's weird, I get a 404 The page you were looking for doesn't exist
[12:48:43] <volans>	 then try the https://horizon.wikimedia.org/auth/logout/ logout page
[12:49:07] <moritzm>	 !log upgrading wikidiff to 1.7.2 on snapshot hosts
[12:49:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:49:31] <Nikerabbit>	 volans: I found a way to delete cookies for a specific site from Chrome's settings, but thanks for help anyway
[12:49:39] <volans>	 ack
[12:49:44] <volans>	 no problem :)
[12:52:10] <icinga-wm>	 PROBLEM - tilerator on maps1003 is CRITICAL: connect to address 10.64.32.117 and port 6534: Connection refused
[12:52:49] <icinga-wm>	 PROBLEM - tilerator on maps1001 is CRITICAL: connect to address 10.64.0.79 and port 6534: Connection refused
[12:53:00] <icinga-wm>	 PROBLEM - tilerator on maps1002 is CRITICAL: connect to address 10.64.16.42 and port 6534: Connection refused
[12:53:09] <wikibugs_>	 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1042.eqiad.wmnet', 'elastic1040.eqiad.wmnet', 'elastic1041.eqiad.wmnet'] ```  an...
[12:53:19] <icinga-wm>	 RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 17 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[12:53:36] <gehel>	 tilerator is suspicious, looking
[12:54:32] <Trey314159>	 !log reindexing Indonesian wikis on elastic@eqiad and elastic@codfw complete (T200204)
[12:54:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:54:39] <stashbot>	 T200204: Re-index Malay and Indonesian Wikis to use new unpacked analysis chain - https://phabricator.wikimedia.org/T200204
[12:57:04] <gehel>	 !log restarting tilerator on maps eqiad
[12:57:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:57:19] <icinga-wm>	 RECOVERY - tilerator on maps1003 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.035 second response time
[12:57:40] <icinga-wm>	 RECOVERY - tilerator on maps1004 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.030 second response time
[12:57:50] <icinga-wm>	 RECOVERY - tilerator on maps1001 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.007 second response time
[12:58:09] <icinga-wm>	 RECOVERY - tilerator on maps1002 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.031 second response time
[12:59:24] <wikibugs_>	 (03PS8) 10Jcrespo: db backup statistics: Initial implementation of the backup stats [puppet] - 10https://gerrit.wikimedia.org/r/449681 (https://phabricator.wikimedia.org/T198987)
[13:02:04] <wikibugs_>	 (03CR) 10Jcrespo: "This is still untested, but can give you an idea of the suggestions of the review being implemented (even if I am not 100% some are actual" [puppet] - 10https://gerrit.wikimedia.org/r/449681 (https://phabricator.wikimedia.org/T198987) (owner: 10Jcrespo)
[13:05:30] <wikibugs_>	 (03CR) 10Jcrespo: "It would be nice to know your high level opinion on arch decisions- for example, I chose to connect to mysql directly and not implement an" [puppet] - 10https://gerrit.wikimedia.org/r/449681 (https://phabricator.wikimedia.org/T198987) (owner: 10Jcrespo)
[13:11:14] <moritzm>	 !log upgrading wikidiff to 1.7.2 on mw1308-mw1311/mw1293-mw1296 (HHVM bytecode cache is pruned during update)
[13:11:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:11:41] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[13:13:50] <icinga-wm>	 PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[13:15:46] <wikibugs_>	 10Operations, 10Wikimedia-Mailing-lists: wikimedia-us-mn administration password reset - https://phabricator.wikimedia.org/T201920 (10MarkTraceur)
[13:16:41] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[13:21:49] <gehel>	 !log restarting elasticsearch on elastic1043 (overloaded)
[13:21:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:23:17] <wikibugs_>	 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on neodymium.eqiad.wmnet for hosts: ``` cp5005.eqsin.wmnet ``` The log can be found in `/var/log/wmf-auto-reimage/201808...
[13:23:50] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[13:26:51] <wikibugs_>	 (03PS1) 10Gehel: elasticsearch: storage device is md1 after reimage to stretch [puppet] - 10https://gerrit.wikimedia.org/r/452669
[13:28:51] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[13:29:20] <icinga-wm>	 RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[13:30:30] <wikibugs_>	 (03CR) 10Ottomata: Remove geowiki cron jobs and make puppet delete related files/dirs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/450040 (https://phabricator.wikimedia.org/T190059) (owner: 10Fdans)
[13:35:53] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[13:37:33] <icinga-wm>	 PROBLEM - Disk space on elastic1030 is CRITICAL: DISK CRITICAL - free space: /srv 52146 MB (10% inode=99%)
[13:40:33] <icinga-wm>	 RECOVERY - Disk space on elastic1030 is OK: DISK OK
[13:45:42] <wikibugs_>	 (03PS14) 10Giuseppe Lavagetto: webperf: Split Redis from the rest of the arclamp profile [puppet] - 10https://gerrit.wikimedia.org/r/444331 (https://phabricator.wikimedia.org/T195312) (owner: 10Krinkle)
[13:45:52] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[13:46:27] <_joe_>	 Krinkle: puppet patches don't get merged by gerrit
[13:46:31] <_joe_>	 I need to merge them myself
[13:46:50] <Krinkle>	 right
[13:46:50] <_joe_>	 that's why I gave +2 previously but didn't merge it
[13:47:08] <gehel>	 !log restarting elasticsearch on elastic1051 (overloaded)
[13:47:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:48:00] <_joe_>	 Krinkle: I am running puppet manually on mwlog1001
[13:48:32] <_joe_>	 noop as expected
[13:49:10] <Krinkle>	 cool.
[13:50:06] <wikibugs_>	 (03PS12) 10Giuseppe Lavagetto: webperf: Add arclamp profile to webperf::profiling_tools role [puppet] - 10https://gerrit.wikimedia.org/r/445066 (https://phabricator.wikimedia.org/T195312) (owner: 10Krinkle)
[13:50:15] <_joe_>	 ok the next one is the first that should do something
[13:50:40] <wikibugs_>	 (03CR) 10Giuseppe Lavagetto: [C: 032] webperf: Add arclamp profile to webperf::profiling_tools role [puppet] - 10https://gerrit.wikimedia.org/r/445066 (https://phabricator.wikimedia.org/T195312) (owner: 10Krinkle)
[13:51:57] <Krinkle>	 Yeah, I'm logged-in on webperf2002/1002 and expect user[xenon] and the xenon-log service to start showing up there
[13:52:28] <Krinkle>	 Which just reminded me, I do not know whether or not reading Redis from mwlog1001 will work just as-is or whether that needs a firewall rule. Completely forgot about that.
[13:52:53] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[13:52:58] <_joe_>	 Notice: /Stage[main]/Httpd/Service[apache2]: Triggered 'refresh' from 4 events
[13:53:09] <_joe_>	 Krinkle: let's see
[13:53:33] <_joe_>	 it works
[13:53:53] <wikibugs_>	 10Operations, 10TCB-Team, 10wikidiff2, 10WMDE-QWERTY-Sprint-2018-07-31, 10WMDE-QWERTY-Sprint-2018-08-14: Update wikidiff2 library on the WMF production cluster to v1.7.2 - https://phabricator.wikimedia.org/T199801 (10WMDE-Fisch)
[13:53:55] <_joe_>	 as in I can connect to redis on mwlog1001
[13:54:07] <Krinkle>	 from a webperf ?
[13:54:08] <Krinkle>	 cool
[13:54:13] <_joe_>	 yes
[13:54:22] <_joe_>	 I think I checked when I reviewed the patch
[13:54:30] <_joe_>	 but you know, reality can be tricky
[13:54:42] <_joe_>	 so I guess you want to verify something
[13:55:13] <_joe_>	 uhm I see a problem
[13:55:25] <_joe_>	 xenon.conf has no ServerName directive
[13:55:34] <_joe_>	 which means it will never serve requests
[13:55:52] <_joe_>	 oh it's the only vhost though
[13:55:53] <_joe_>	 ok
[13:56:14] <Krinkle>	 Yeah, it's an old issue. I've got a todo to fix that.
[13:56:47] <wikibugs_>	 (03PS1) 10Mforns: Add salt file path to EventLoggingSanitization cron job [puppet] - 10https://gerrit.wikimedia.org/r/452674 (https://phabricator.wikimedia.org/T199902)
[13:56:47] <_joe_>	 https://gerrit.wikimedia.org/r/c/operations/puppet/+/451107/ is beta only
[13:56:51] <Krinkle>	 I realised it when things just "worked" in Beta Cluster when routing performance-beta.wmflabs.org to a host serving performance.wikimedia.org, and it worked because it didn't identify as that.
[13:56:54] <_joe_>	 I assume it's safe to merge?
[13:56:57] <Krinkle>	 Yeah, already picked there.
[13:57:06] <wikibugs_>	 (03CR) 10Giuseppe Lavagetto: [C: 032] webperf: Switch arclamp_host in Beta from mwlog host to webperf12 [puppet] - 10https://gerrit.wikimedia.org/r/451107 (https://phabricator.wikimedia.org/T195312) (owner: 10Krinkle)
[13:57:08] <icinga-wm>	 PROBLEM - Varnish backend child restarted on cp1087 is CRITICAL: 4 gt 3 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp1087&var-datasource=eqiad+prometheus/ops
[13:57:17] <wikibugs_>	 (03PS6) 10Giuseppe Lavagetto: webperf: Switch arclamp_host in Beta from mwlog host to webperf12 [puppet] - 10https://gerrit.wikimedia.org/r/451107 (https://phabricator.wikimedia.org/T195312) (owner: 10Krinkle)
[13:57:38] <Krinkle>	 The one after that for prod should impact data served from https://performance.wikimedia.org/xenon/svgs/daily/ to be from webperfX002 instead of mwlog1001
[13:57:51] <Krinkle>	 Which at first will be visible by there being almost no data (I'll backfil later)
[13:58:18] <_joe_>	 Krinkle: do you want to backfill first, switch afterwards?
[13:58:25] <_joe_>	 either choice is ok
[13:58:36] <Krinkle>	 No, that's alright. It's just for human use. Nothing depends on this programmatically.
[13:58:54] <Krinkle>	 I can see it's working on the new host, /srv/xenon is being populated already from Redis
[13:59:28] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[13:59:44] <_joe_>	 uh?
[13:59:54] <_joe_>	 can someone check what's up with cache_text?
[14:00:20] <ema>	 looking
[14:02:09] <ema>	 brief spike likely due to a backend child crash on cp1087
[14:02:18] <_joe_>	 thanks for looking <3
[14:02:28] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[14:02:58] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[14:03:02] <ema>	 interestingly only two hosts in eqiad have been affected by the crashes, it's gonna be a fun issue to debug
[14:05:33] <ema>	 !log restart varnish-be on cp108[79], fetch failures after child crashes
[14:05:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:06:04] <wikibugs_>	 (03CR) 10Giuseppe Lavagetto: [C: 032] webperf: Switch webperf::site to use arclamp from webperf-2 [puppet] - 10https://gerrit.wikimedia.org/r/452449 (https://phabricator.wikimedia.org/T195312) (owner: 10Krinkle)
[14:06:14] <wikibugs_>	 (03PS2) 10Giuseppe Lavagetto: webperf: Switch webperf::site to use arclamp from webperf-2 [puppet] - 10https://gerrit.wikimedia.org/r/452449 (https://phabricator.wikimedia.org/T195312) (owner: 10Krinkle)
[14:07:12] <_joe_>	 https://memegenerator.net/img/instances/65289046/waiting-for-jenkins-to-finish-build.jpg me right now
[14:07:29] <icinga-wm>	 RECOVERY - Varnish backend child restarted on cp1089 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp1089&var-datasource=eqiad+prometheus/ops
[14:09:18] <icinga-wm>	 RECOVERY - Varnish backend child restarted on cp1087 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp1087&var-datasource=eqiad+prometheus/ops
[14:09:59] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 21 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[14:10:18] <_joe_>	 Krinkle: uhm I ran puppet on webperf1001 but I see more files than i expected
[14:10:36] <_joe_>	 ah ok, varnish caches these
[14:10:39] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[14:10:43] <wikibugs_>	 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp5005.eqsin.wmnet'] ```  and were **ALL** successful.
[14:10:48] <_joe_>	 not sure it's what we wanted
[14:10:58] <_joe_>	 (varnish caching such files)
[14:11:14] <Krinkle>	 files appeared on webperf1001?
[14:12:15] <_joe_>	 no, on https://performance.wikimedia.org/xenon/logs/daily
[14:12:35] <_joe_>	 but if you bust the cache with any bogus query parameter, you can seelthe actual shortlist
[14:12:47] <Krinkle>	 right
[14:13:05] <Krinkle>	 Yeah, we may need to revise the caching of that.
[14:13:16] <Krinkle>	 It's also multi-dc.
[14:13:17] <_joe_>	 I would assume no caching is what we want
[14:13:51] <Krinkle>	 I've been refreshing that page every few seconds and I do see the timestamps an order change constantly until now
[14:13:58] <Krinkle>	 so I think it's already not caching
[14:14:05] <Krinkle>	 but maybe it was routing to codfw randomly as well?
[14:14:08] <icinga-wm>	 PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[14:15:08] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[14:15:49] <_joe_>	 Krinkle: uhm
[14:15:49] <Krinkle>	 _joe_: does a typical misc/eqiad+codfw director route round-robin to both? I'd assume it uses eqiad for eqiad and codfw for codfw.
[14:15:58] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[14:15:58] <_joe_>	 misc is no more
[14:16:09] <icinga-wm>	 RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[14:16:20] <_joe_>	 in theory, requests going to ulsfo/eqsin/codfw should go to codfw
[14:16:29] <_joe_>	 and requests going to eqiad and esams should go to eqiad
[14:16:34] <_joe_>	 for active/active things
[14:16:58] <Krinkle>	 right
[14:17:29] <_joe_>	   performance:                                                                                                                                                                                                     
[14:17:30] <Krinkle>	 but refreshing this url still gives me sometimes webperfX001->mwlog1001 (old) and sometimes webperfX001->webperfX002 (new)
[14:17:33] <_joe_>	     backends:                                                                                                                                                                                                      
[14:17:37] <_joe_>	       eqiad: 'webperf1001.eqiad.wmnet'                                                                                                                                                                             
[14:17:41] <_joe_>	       codfw: 'webperf2001.codfw.wmnet'   
[14:17:50] <_joe_>	 that's baffling, yeah
[14:18:04] <_joe_>	 can you look at the caching headers for both cases?
[14:18:12] <_joe_>	 X-cache should tell us what's going on
[14:18:30] <Krinkle>	 (old) x-cache: cp1089 pass, cp3041 hit/5, cp3033 pass; x-cache-status: hit-local ;;
[14:18:31] <Krinkle>	  (new)  x-cache: cp1089 pass, cp3033 hit/4, cp3033 pass; x-cache-status: hit-local
[14:18:50] <_joe_>	 so both cached, but with different values?
[14:18:52] <_joe_>	 wow
[14:18:57] <_joe_>	 ema, bblack any idea?
[14:18:58] <icinga-wm>	 PROBLEM - Varnish backend child restarted on cp1089 is CRITICAL: 4 gt 3 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp1089&var-datasource=eqiad+prometheus/ops
[14:19:17] <_joe_>	 well it's cp1089, which is having some issues AFAICS
[14:19:32] <Krinkle>	 getting HTTP 200 and age:0 on both
[14:19:37] <wikibugs_>	 10Operations, 10Wikimedia-Mailing-lists: Mailing list for Wikimedians of Tamazight User Group - https://phabricator.wikimedia.org/T201929 (10Vikoula5)
[14:20:03] <_joe_>	 ok I got it
[14:20:14] <_joe_>	 we have two frontends with different versions of that page cached
[14:20:29] <_joe_>	 and that can happen ofc
[14:20:34] <Krinkle>	 Right
[14:20:42] <_joe_>	 now we should purge that url, or we wait
[14:20:45] <_joe_>	 I vote we wait
[14:20:49] <Krinkle>	 Yeah, no need to purge.
[14:20:58] <Krinkle>	 but I'm confused as to how/why it caches 
[14:21:07] <Krinkle>	 Should it not give a non-zero age in that case?
[14:21:46] <_joe_>	 let me see
[14:21:53] <logmsgbot>	 !log ema@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp1089.eqiad.wmnet,service=varnish-be
[14:21:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:22:57] <_joe_>	 the server sends no caching headers whatsoever
[14:23:11] <_joe_>	 I just tried the xenon/ directory
[14:23:34] <Krinkle>	 yeah, it's pretty much default static files over apache
[14:23:47] <Krinkle>	 I thought age would be computed in varnish though
[14:26:51] <logmsgbot>	 !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1102 with full weight (duration: 00m 52s)
[14:26:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:30:52] <wikibugs_>	 10Operations, 10netops, 10Patch-For-Review: Move servers off asw2-a-eqiad - https://phabricator.wikimedia.org/T201694 (10Cmjohnson) @ayounsi I added sfp-t's to asw2-a5-eqiad for the new server in that rack. For the remainder of the 10G servers in rack's 2/4/6 do you want me to run cross connects to asw2-a5?...
[14:32:32] <apergos>	 I'm off until Sunday evening. See folks then!
[14:33:29] <Krinkle>	 _joe_: Is it applied to codfw as well?
[14:33:38] <Krinkle>	 webperf2001.
[14:35:55] <Krinkle>	 !log Copying xenon/logs/daily/2018-*{all,load,index,api,RunSingleJob}.log from mwlog1001 to webperfX002 hosts
[14:35:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:36:19] <_joe_>	 Krinkle: maybe not
[14:36:26] <_joe_>	 let me see if puppet has run there
[14:37:05] * Krinkle makes a note to figure out how to add a "Server:" header to these so that it's easier to see where stuff came from
[14:38:07] <_joe_>	 Krinkle: now it's applied everywhere
[14:38:12] <Krinkle>	 perfect
[14:39:10] <Krinkle>	 So the cache_misc , it's gone completely? I got hte impression it was in progress because I found the performance_director in both text and misc.yaml
[14:40:01] <wikibugs_>	 (03PS4) 10Herron: WIP: logstash: add ids to filter configs [puppet] - 10https://gerrit.wikimedia.org/r/452461
[14:40:21] <_joe_>	 Krinkle: it's pending cleanup
[14:40:25] <logmsgbot>	 !log bstorm@deploy1001 Started deploy [striker/deploy@13da520]: (no justification provided)
[14:40:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:40:57] <Krinkle>	 OK. I'd love to figure out why there is no Age header on these, but it also seems relatively unimportant right now, so I'll get back to stuff now.
[14:40:58] <Krinkle>	 Thanks !
[14:41:38] <logmsgbot>	 !log bstorm@deploy1001 Finished deploy [striker/deploy@13da520]: (no justification provided) (duration: 01m 13s)
[14:41:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:41:52] <wikibugs_>	 (03PS1) 10Ema: cache_text: do not limit transient memory usage [puppet] - 10https://gerrit.wikimedia.org/r/452680
[14:42:27] <wikibugs_>	 (03PS2) 10Andrew Bogott: openstack glance: move active service for eqiad1 and main to cloudcontrol1003 [puppet] - 10https://gerrit.wikimedia.org/r/452595 (https://phabricator.wikimedia.org/T191791)
[14:42:29] <wikibugs_>	 (03PS2) 10Andrew Bogott: Openstack glance: remove glance service from labcontrol1001 [puppet] - 10https://gerrit.wikimedia.org/r/452596 (https://phabricator.wikimedia.org/T191791)
[14:42:31] <wikibugs_>	 (03PS1) 10Andrew Bogott: Designate: use $keystone_host for keystone rather than $nova_controller [puppet] - 10https://gerrit.wikimedia.org/r/452682
[14:42:44] <wikibugs_>	 (03CR) 10BBlack: [C: 031] cache_text: do not limit transient memory usage [puppet] - 10https://gerrit.wikimedia.org/r/452680 (owner: 10Ema)
[14:43:07] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[14:43:12] <wikibugs_>	 (03CR) 10Ema: [C: 032] cache_text: do not limit transient memory usage [puppet] - 10https://gerrit.wikimedia.org/r/452680 (owner: 10Ema)
[14:43:28] <icinga-wm>	 PROBLEM - puppet last run on webperf2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:44:41] <wikibugs_>	 (03PS2) 10Andrew Bogott: Designate: use $keystone_host for keystone rather than $nova_controller [puppet] - 10https://gerrit.wikimedia.org/r/452682
[14:45:44] <wikibugs_>	 (03CR) 10Andrew Bogott: [C: 032] Designate: use $keystone_host for keystone rather than $nova_controller [puppet] - 10https://gerrit.wikimedia.org/r/452682 (owner: 10Andrew Bogott)
[14:48:07] <icinga-wm>	 PROBLEM - puppet last run on cp1089 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:48:10] <wikibugs_>	 (03PS1) 10Ema: cache_text: set be_transient_gb: 0 [puppet] - 10https://gerrit.wikimedia.org/r/452684
[14:48:21] <ema>	 puppetfails on cache are my fault, fixing 
[14:48:37] <icinga-wm>	 RECOVERY - puppet last run on webperf2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[14:48:58] <icinga-wm>	 PROBLEM - puppet last run on cp5009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:49:16] <wikibugs_>	 (03CR) 10Ema: [C: 032] cache_text: set be_transient_gb: 0 [puppet] - 10https://gerrit.wikimedia.org/r/452684 (owner: 10Ema)
[14:50:42] <gehel>	 !log reimage of elastic103[678]
[14:50:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:51:00] <wikibugs_>	 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1036.eqiad.wmnet', 'elastic1037...
[14:53:07] <icinga-wm>	 RECOVERY - puppet last run on cp1089 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[14:53:08] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[14:53:09] <moritzm>	 !log rebooting cloudelastic* for kernel update
[14:53:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:53:58] <icinga-wm>	 RECOVERY - puppet last run on cp5009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[14:54:13] <wikibugs_>	 (03CR) 10Ayounsi: "Not sure how I can review this." [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/452656 (owner: 10Volans)
[14:54:37] <icinga-wm>	 PROBLEM - puppet last run on cp4030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:55:50] <sjoerddebruin>	 Elastic issues again?
[14:56:23] <wikibugs_>	 (03CR) 10Krinkle: [C: 031] mediawiki::web::prod_sites: enable HHVM on some sites(!!!) [puppet] - 10https://gerrit.wikimedia.org/r/452325 (owner: 10Giuseppe Lavagetto)
[14:57:08] <gehel>	 sjoerddebruin: reimaging in progress, I see a rise in response times, but if experience is any predictor it should be back to normal in < 1 minute
[14:57:13] <gehel>	 sjoerddebruin: or do you see something else
[14:57:48] <sjoerddebruin>	 On Wikidata, the suggester based on elastic sometimes takes quite some time or shows no results.
[14:58:25] <gehel>	 sjoerddebruin: on newly created entries? Or in general?
[14:58:32] <sjoerddebruin>	 In general.
[14:58:39] <wikibugs_>	 (03PS1) 10Jijiki: admin: added user jiji to the list of users Bug: T201816 [puppet] - 10https://gerrit.wikimedia.org/r/452685 (https://phabricator.wikimedia.org/T201816)
[14:58:41] <wikibugs_>	 (03CR) 10Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! :) To learn how to get your code changes reviewed faster and more likely to get" [puppet] - 10https://gerrit.wikimedia.org/r/452685 (https://phabricator.wikimedia.org/T201816) (owner: 10Jijiki)
[14:58:54] <gehel>	 sjoerddebruin: and do you have a timeline for that issue? Is it just in the last 2 or 3 minutes? Or has it been going for logner?
[14:59:02] <gehel>	 s/logner/longer/
[14:59:10] <sjoerddebruin>	 20 minutes I guess?
[14:59:15] <wikibugs_>	 (03CR) 10jerkins-bot: [V: 04-1] admin: added user jiji to the list of users Bug: T201816 [puppet] - 10https://gerrit.wikimedia.org/r/452685 (https://phabricator.wikimedia.org/T201816) (owner: 10Jijiki)
[14:59:38] <icinga-wm>	 RECOVERY - puppet last run on cp4030 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[14:59:48] <sjoerddebruin>	 It's responding nicely now, just ups and downs.
[14:59:59] <ema>	 !log cache_text eqiad: restart varnish-be without transient storage caps 
[15:00:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:00:06] <gehel>	 sjoerddebruin: that's interesting... I see a peak, but fairly short one: https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?orgId=1&from=now-1h&to=now&panelId=50&fullscreen&refresh=1m
[15:00:24] <vgutierrez>	 jijiki: please meet our lovely commit message validator /o\
[15:00:46] <jijiki>	 lol tx :p
[15:01:24] <volans>	 second line should be empty ;)
[15:01:49] <moritzm>	 !log rebooting meitnerium/archiva.wikimedia.org for kernel security update
[15:01:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:02:00] <gehel>	 sjoerddebruin: please let me know if you see the issue again! We might have a hole in our monitoring
[15:02:25] <sjoerddebruin>	 I can see the spikes for yesterday as well (had the same thing then), and will do. :)
[15:04:37] <wikibugs_>	 10Operations, 10SRE-Access-Requests: Requesting access to restricted production access and analytics-privatedata-users for Patrick Earley - https://phabricator.wikimedia.org/T201667 (10Jalexander) >>! In T201667#4500192, @Dzahn wrote: > Hi @PEarleyWMF @Jalexander Could you please create a user on Wikitech/LDAP...
[15:08:21] <wikibugs_>	 (03PS2) 10Jijiki: admin: added user jiji to the list of users [puppet] - 10https://gerrit.wikimedia.org/r/452685 (https://phabricator.wikimedia.org/T201816)
[15:12:57] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[15:13:38] <moritzm>	 jijiki: you don't need to remove reviewers :-)  if Gerrit adds people as reviewers to a patch set, that happens because people are subscribed to patches matching a certain pattern
[15:14:08] <jijiki>	 I actually wanted to save you having one more patch in your list :p
[15:14:23] <ema>	 he likes patches
[15:14:26] <jijiki>	 I'll spam away then no worries :p
[15:14:38] <moritzm>	 it's entirely my own fault, I subscribed to that pattern :-)
[15:14:57] <icinga-wm>	 RECOVERY - Varnish backend child restarted on cp1089 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp1089&var-datasource=eqiad+prometheus/ops
[15:15:08] <icinga-wm>	 PROBLEM - Host elastic1036 is DOWN: PING CRITICAL - Packet loss = 100%
[15:15:36] <gehel>	 ^downtime failed
[15:16:07] <icinga-wm>	 RECOVERY - Host elastic1036 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms
[15:16:42] <wikibugs_>	 (03CR) 10Ottomata: [C: 031] "Ok to merge?" [puppet] - 10https://gerrit.wikimedia.org/r/452674 (https://phabricator.wikimedia.org/T199902) (owner: 10Mforns)
[15:17:38] <wikibugs_>	 (03CR) 10Giuseppe Lavagetto: [C: 04-1] "The uid for the user is wrong, please change it with the correct one you can find in the comments" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/452685 (https://phabricator.wikimedia.org/T201816) (owner: 10Jijiki)
[15:17:57] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[15:19:27] <wikibugs_>	 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1037.eqiad.wmnet', 'elastic1036.eqiad.wmnet', 'elastic1038.eqiad.wmnet'] ```  an...
[15:24:16] <wikibugs_>	 (03PS9) 10Jcrespo: db backup statistics: Initial implementation of the backup stats [puppet] - 10https://gerrit.wikimedia.org/r/449681 (https://phabricator.wikimedia.org/T198987)
[15:24:50] <wikibugs_>	 (03PS1) 10Volans: LDAP: allow to specify multiple search strings [software/debmonitor] - 10https://gerrit.wikimedia.org/r/452686
[15:24:58] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 21 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[15:25:33] <wikibugs_>	 (03CR) 10Jcrespo: "This should be working after the fixes." [puppet] - 10https://gerrit.wikimedia.org/r/449681 (https://phabricator.wikimedia.org/T198987) (owner: 10Jcrespo)
[15:25:35] <wikibugs_>	 (03CR) 10jerkins-bot: [V: 04-1] LDAP: allow to specify multiple search strings [software/debmonitor] - 10https://gerrit.wikimedia.org/r/452686 (owner: 10Volans)
[15:26:06] <volans>	 yeah I know, CI is "broken" :D
[15:26:49] <wikibugs_>	 (03CR) 10Krinkle: [C: 031] "It would make some unused legacy rewrites available and also make https://usability.wikimedia.org/api/ work, which seems fine and actually" [puppet] - 10https://gerrit.wikimedia.org/r/452635 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto)
[15:27:18] <wikibugs_>	 (03CR) 10Volans: "Tests fail on CI for a series of reason, mainly python 3.4 only. Result of tests locally:" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/452686 (owner: 10Volans)
[15:29:46] <wikibugs_>	 10Operations, 10ops-codfw, 10cloud-services-team, 10decommission: Decommission labtestnet2001.codfw.wmnet - https://phabricator.wikimedia.org/T201440 (10Papaul)
[15:29:58] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[15:35:36] <wikibugs_>	 (03CR) 10Volans: [V: 032 C: 032] "ACK" [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/452656 (owner: 10Volans)
[15:36:12] <wikibugs_>	 (03PS3) 10Jijiki: admin: added user jiji to the list of users [puppet] - 10https://gerrit.wikimedia.org/r/452685 (https://phabricator.wikimedia.org/T201816)
[15:37:00] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[15:38:24] <logmsgbot>	 !log volans@deploy1001 Started deploy [netbox/deploy@792d4d5]: Security upgrade of dependency
[15:38:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:38:30] <wikibugs_>	 (03PS1) 10Krinkle: apache: Remove unused apache::static_site type [puppet] - 10https://gerrit.wikimedia.org/r/452687
[15:39:28] <logmsgbot>	 !log volans@deploy1001 Finished deploy [netbox/deploy@792d4d5]: Security upgrade of dependency (duration: 01m 03s)
[15:39:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:40:09] <icinga-wm>	 RECOVERY - Check systemd state on notebook1003 is OK: OK - running: The system is fully operational
[15:41:28] <wikibugs_>	 (03CR) 10EBernhardson: [C: 031] elasticsearch: storage device is md1 after reimage to stretch [puppet] - 10https://gerrit.wikimedia.org/r/452669 (owner: 10Gehel)
[15:41:58] <wikibugs_>	 (03PS2) 10Gehel: elasticsearch: storage device is md1 after reimage to stretch [puppet] - 10https://gerrit.wikimedia.org/r/452669
[15:42:00] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[15:42:39] <wikibugs_>	 (03CR) 10Gehel: [C: 032] elasticsearch: storage device is md1 after reimage to stretch [puppet] - 10https://gerrit.wikimedia.org/r/452669 (owner: 10Gehel)
[15:42:47] <wikibugs_>	 (03CR) 10Jijiki: "done" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/452685 (https://phabricator.wikimedia.org/T201816) (owner: 10Jijiki)
[15:43:53] <wikibugs_>	 (03CR) 10Giuseppe Lavagetto: [C: 031] admin: added user jiji to the list of users [puppet] - 10https://gerrit.wikimedia.org/r/452685 (https://phabricator.wikimedia.org/T201816) (owner: 10Jijiki)
[15:45:10] <wikibugs_>	 (03PS7) 10Ottomata: Remove all geowiki puppetization except for the geowiki site [puppet] - 10https://gerrit.wikimedia.org/r/450040 (https://phabricator.wikimedia.org/T190059) (owner: 10Fdans)
[15:48:01] <wikibugs_>	 (03PS1) 10Krinkle: webperf: Add 'Server: <fqdn>' header to performance.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/452689 (https://phabricator.wikimedia.org/T158837)
[15:48:47] <wikibugs_>	 (03PS8) 10Ottomata: Remove all geowiki puppetization except for the geowiki site [puppet] - 10https://gerrit.wikimedia.org/r/450040 (https://phabricator.wikimedia.org/T190059) (owner: 10Fdans)
[15:48:53] <wikibugs_>	 (03CR) 10Ottomata: [V: 032 C: 032] Remove all geowiki puppetization except for the geowiki site [puppet] - 10https://gerrit.wikimedia.org/r/450040 (https://phabricator.wikimedia.org/T190059) (owner: 10Fdans)
[15:49:52] <wikibugs_>	 (03PS4) 10Dzahn: admin: added user jiji to the list of users [puppet] - 10https://gerrit.wikimedia.org/r/452685 (https://phabricator.wikimedia.org/T201816) (owner: 10Jijiki)
[15:51:07] <wikibugs_>	 (03CR) 10Dzahn: [C: 032] admin: added user jiji to the list of users [puppet] - 10https://gerrit.wikimedia.org/r/452685 (https://phabricator.wikimedia.org/T201816) (owner: 10Jijiki)
[15:51:50] <icinga-wm>	 PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 21 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[15:51:50] <wikibugs_>	 10Operations, 10Performance-Team, 10monitoring, 10Patch-For-Review: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837 (10Krinkle)
[15:53:44] <wikibugs_>	 (03Abandoned) 10Fdans: Remove all geowiki references from puppet [puppet] - 10https://gerrit.wikimedia.org/r/450025 (https://phabricator.wikimedia.org/T190059) (owner: 10Fdans)
[15:55:35] <wikibugs_>	 10Operations, 10Performance-Team, 10monitoring, 10Patch-For-Review: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837 (10Krinkle)
[15:55:46] <wikibugs_>	 10Operations, 10Performance-Team, 10monitoring, 10Patch-For-Review: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837 (10Krinkle)
[15:57:53] <wikibugs_>	 10Operations, 10netops, 10Patch-For-Review: Move servers off asw2-a-eqiad - https://phabricator.wikimedia.org/T201694 (10Cmjohnson) @ayounsi I pre-cabled everything. The lvs cross connects only need to move racks to the new switch. We probably need to do those 1 at a time, because downtime may be close to 1m...
[15:58:46] <wikibugs_>	 (03CR) 10Filippo Giunchedi: "> Patch Set 8:" [puppet] - 10https://gerrit.wikimedia.org/r/449681 (https://phabricator.wikimedia.org/T198987) (owner: 10Jcrespo)
[15:59:46] <logmsgbot>	 !log volans@deploy1001 Started deploy [netbox/deploy@792d4d5]: Security upgrade of dependency
[15:59:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:00:04] <jouncebot>	 godog, moritzm, and _joe_: (Dis)respected human, time to deploy Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180814T1600). Please do the needful.
[16:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[16:01:09] <logmsgbot>	 !log volans@deploy1001 Finished deploy [netbox/deploy@792d4d5]: Security upgrade of dependency (duration: 01m 23s)
[16:01:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:01:20] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 25 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[16:01:52] <wikibugs_>	 (03CR) 10Filippo Giunchedi: "Thanks for implementing the fixes!" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/449681 (https://phabricator.wikimedia.org/T198987) (owner: 10Jcrespo)
[16:02:05] <godog>	 puppet swat at 9am in the morning is weird
[16:03:01] <wikibugs_>	 (03PS6) 10Vgutierrez: [WIP] Refactor certcentral.certificate_management() [software/certcentral] - 10https://gerrit.wikimedia.org/r/451867
[16:03:51] <wikibugs_>	 (03CR) 10jerkins-bot: [V: 04-1] [WIP] Refactor certcentral.certificate_management() [software/certcentral] - 10https://gerrit.wikimedia.org/r/451867 (owner: 10Vgutierrez)
[16:04:00] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1022 is OK: OK - running: The system is fully operational
[16:05:09] <vgutierrez>	 quit
[16:05:14] <vgutierrez>	 arg :)
[16:06:20] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[16:06:50] <icinga-wm>	 RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 18 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[16:13:29] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[16:15:08] <wikibugs_>	 (03PS1) 10Volans: Rebuild wheels for Django security upgrade (2) [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/452699
[16:16:25] <wikibugs_>	 (03Abandoned) 10Bstorm: nfs-exportd: correcting typo [puppet] - 10https://gerrit.wikimedia.org/r/452428 (owner: 10Bstorm)
[16:16:37] <wikibugs_>	 (03CR) 10Volans: [V: 032 C: 032] Rebuild wheels for Django security upgrade (2) [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/452699 (owner: 10Volans)
[16:16:54] <wikibugs_>	 10Operations, 10Traffic, 10Patch-For-Review: Deploy initial ATS test clusters in core DCs - https://phabricator.wikimedia.org/T199720 (10Reedy) ``` 2018-08-14 16:01:08,650 [docker-pkg-build] INFO - Generated dockerfile for docker-registry.discovery.wmnet/releng/operations-puppet:0.3.3: FROM docker-registry.d...
[16:17:57] <logmsgbot>	 !log volans@deploy1001 Started deploy [netbox/deploy@e2fd41d]: Security upgrade of dependency
[16:18:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:18:28] <wikibugs_>	 10Operations, 10ops-codfw, 10cloud-services-team, 10decommission: Decommission labtestnet2001.codfw.wmnet - https://phabricator.wikimedia.org/T201440 (10Papaul) ```  show | compare     [edit interfaces interface-range vlan-cloud-hosts1-b-codfw] -    member ge-5/0/21; [edit interfaces interface-range cloud-...
[16:18:31] <logmsgbot>	 !log volans@deploy1001 Finished deploy [netbox/deploy@e2fd41d]: Security upgrade of dependency (duration: 00m 34s)
[16:18:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:19:02] <wikibugs_>	 10Operations, 10ops-codfw, 10cloud-services-team, 10decommission: Decommission labtestnet2001.codfw.wmnet - https://phabricator.wikimedia.org/T201440 (10Papaul)
[16:24:14] <wikibugs_>	 10Operations, 10Patch-For-Review: Onboarding Effie Mouzeli - https://phabricator.wikimedia.org/T201816 (10Dzahn)
[16:29:00] <wikibugs_>	 10Operations, 10Patch-For-Review: Onboarding Effie Mouzeli - https://phabricator.wikimedia.org/T201816 (10Dzahn)
[16:31:53] <wikibugs_>	 (03PS1) 10Papaul: DNS: Remove mgmt DNS for labtestnet2001 [dns] - 10https://gerrit.wikimedia.org/r/452706
[16:33:17] <wikibugs_>	 10Operations, 10Patch-For-Review: Onboarding Effie Mouzeli - https://phabricator.wikimedia.org/T201816 (10Dzahn)
[16:33:39] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[16:35:44] <wikibugs_>	 10Operations, 10hardware-requests: Request for swift ms-be expansion - https://phabricator.wikimedia.org/T201937 (10fgiunchedi)
[16:37:38] <wikibugs_>	 (03CR) 10Dzahn: [C: 032] DNS: Remove mgmt DNS for labtestnet2001 [dns] - 10https://gerrit.wikimedia.org/r/452706 (owner: 10Papaul)
[16:40:02] <wikibugs_>	 10Operations, 10ops-codfw, 10cloud-services-team, 10decommission, 10Patch-For-Review: Decommission labtestnet2001.codfw.wmnet - https://phabricator.wikimedia.org/T201440 (10Papaul)
[16:40:40] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 21 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[16:41:21] <wikibugs_>	 10Operations, 10hardware-requests: Request for swift ms-be refresh - https://phabricator.wikimedia.org/T201938 (10fgiunchedi)
[16:41:41] <wikibugs_>	 10Operations, 10ops-codfw, 10cloud-services-team, 10decommission, 10Patch-For-Review: Decommission labtestnet2001.codfw.wmnet - https://phabricator.wikimedia.org/T201440 (10Papaul) 05Open>03Resolved Complete
[16:42:56] <wikibugs_>	 10Operations, 10Analytics: rack/setup/install 2 new hadoop master/standby systems in eqiad - https://phabricator.wikimedia.org/T201939 (10RobH) p:05Triage>03Normal
[16:45:40] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[16:46:08] <wikibugs_>	 10Operations, 10Analytics: rack/setup/install 2 new hadoop master/standby systems in eqiad - https://phabricator.wikimedia.org/T201939 (10RobH) Assigned to @elukey for hostname feedback, but as they are on vacation perhaps someone else in #analytics would be able to provide feedback on hostname?
[16:46:20] <wikibugs_>	 10Operations, 10Analytics: rack/setup/install 2 new hadoop master/standby systems in eqiad - https://phabricator.wikimedia.org/T201939 (10RobH) a:05elukey>03Ottomata
[16:47:29] <wikibugs_>	 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): jessie support for QLogic FastLinQ 41112 Dual Port 10Gb SFP+ Adapter - https://phabricator.wikimedia.org/T201942 (10aborrero) p:05Triage>03Normal
[16:48:29] <wikibugs_>	 (03CR) 10Filippo Giunchedi: "Thanks for this! What about hieradata? Also please run PCC" [puppet] - 10https://gerrit.wikimedia.org/r/449763 (owner: 10Dzahn)
[16:50:01] <wikibugs_>	 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): jessie support for QLogic FastLinQ 41112 Dual Port 10Gb SFP+ Adapter - https://phabricator.wikimedia.org/T201942 (10RobH) a:05RobH>03None
[16:52:50] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[16:56:53] <wikibugs_>	 10Operations, 10Analytics: rack/setup/install 2 new hadoop master/standby systems in eqiad - https://phabricator.wikimedia.org/T201939 (10Ottomata) Hm, tough question!  I'd be ok with analytics-master1001 and analytics-master1002. Let's do it!
[16:57:14] <gehel>	 !log reimage of elastic103[345]
[16:57:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:57:29] <wikibugs_>	 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1033.eqiad.wmnet', 'elastic1034...
[16:57:42] <wikibugs_>	 10Operations, 10Traffic, 10Patch-For-Review: Deploy initial ATS test clusters in core DCs - https://phabricator.wikimedia.org/T199720 (10ema) Thanks @Reedy! The `luarocks` part fails with:  ``` Warning: Failed searching manifest: Failed extracting manifest file Installing https://raw.githubusercontent.com/ro...
[16:58:34] <wikibugs_>	 10Operations, 10Analytics: rack/setup/install analytics-master100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T201939 (10RobH) a:05Ottomata>03Cmjohnson
[16:59:35] <wikibugs_>	 10Operations, 10Traffic, 10Patch-For-Review: Deploy initial ATS test clusters in core DCs - https://phabricator.wikimedia.org/T199720 (10Reedy) Yay, dependancies.  Feel free to bump the package again and add unzip and I can try again
[17:00:04] <jouncebot>	 cscott, arlolra, subbu, halfak, and Amir1: It is that lovely time of the day again! You are hereby commanded to deploy Services – Graphoid / Parsoid / Citoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180814T1700).
[17:00:27] <awight>	 Nothing for ORES today!
[17:00:50] <wikibugs_>	 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to restricted production access and analytics-privatedata-users for Karen Brown - https://phabricator.wikimedia.org/T201668 (10Jalexander) >>! In T201668#4500160, @Dzahn wrote: > Note that "kbrown" is a username already taken in LDAP...
[17:01:46] <wikibugs_>	 10Operations, 10Traffic, 10Patch-For-Review: Deploy initial ATS test clusters in core DCs - https://phabricator.wikimedia.org/T199720 (10ema) >>! In T199720#4502223, @Reedy wrote: > Yay, dependancies.  Yeah. Note that the version of `luarocks` in stretch does depend on `unzip`, it's the jessie version that d...
[17:02:59] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[17:04:37] <wikibugs_>	 (03CR) 10Muehlenhoff: [C: 031] "Looks good to me" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/452686 (owner: 10Volans)
[17:05:28] <wikibugs_>	 10Operations, 10Traffic, 10Patch-For-Review: Deploy initial ATS test clusters in core DCs - https://phabricator.wikimedia.org/T199720 (10Reedy) Least `unzip` isn't a heavyweight dependancy :)
[17:05:50] <icinga-wm>	 PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [600.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&panelId=9&fullscreen
[17:12:10] <icinga-wm>	 RECOVERY - Number of backend failures per minute from CirrusSearch on graphite1001 is OK: OK: Less than 20.00% above the threshold [300.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&panelId=9&fullscreen
[17:17:22] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 22 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[17:19:08] <gehel>	 !log restarting elasticsearch on elastic1050 (high load)
[17:19:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:20:56] <wikibugs_>	 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): jessie support for QLogic FastLinQ 41112 Dual Port 10Gb SFP+ Adapter - https://phabricator.wikimedia.org/T201942 (10aborrero) a:03aborrero There is a Debian non-free package with firmware for QLogic NICs, hope we didn't buy hardware...
[17:21:41] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s7 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 303.74 seconds
[17:22:17] <wikibugs_>	 10Operations, 10ops-eqiad: Degraded RAID on labvirt1019 - https://phabricator.wikimedia.org/T196507 (10Cmjohnson) @Bstorm I have the new battery on-site...when is a good time for you to replace?
[17:22:22] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[17:22:52] <XioNoX>	 !log configuring eqiad A switch ports for T201694
[17:22:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:22:59] <stashbot>	 T201694: Move servers off asw2-a-eqiad - https://phabricator.wikimedia.org/T201694
[17:23:05] <wikibugs_>	 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1033.eqiad.wmnet'] ```  Of which those **FAILED**: ``` ['elastic1033.eqiad.wmnet...
[17:23:55] <wikibugs_>	 10Operations, 10ops-eqiad: Degraded RAID on labvirt1019 - https://phabricator.wikimedia.org/T196507 (10Bstorm) I can stop the VMs on labvirt1019 and 1020, silence alerts and shut them down whenever you like :) @Cmjohnson
[17:24:43] <cmjohnson1>	 bstorm_ let's go ahead and do the battery now 
[17:25:02] <bstorm_>	 Sure, I already started shutting off instances.  I'll downtime the  labvirts
[17:25:09] <cmjohnson1>	 it's just the one
[17:25:23] <cmjohnson1>	 1019.....I want to see the results of this before doing 1020
[17:26:30] <wikibugs_>	 (03CR) 10Ema: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/452612 (https://phabricator.wikimedia.org/T199720) (owner: 10Ema)
[17:28:44] <wikibugs_>	 (03CR) 10jerkins-bot: [V: 04-1] tox: add ts-lua tests for trafficserver [puppet] - 10https://gerrit.wikimedia.org/r/452612 (https://phabricator.wikimedia.org/T199720) (owner: 10Ema)
[17:29:03] <ema>	 too early! :P
[17:30:57] <wikibugs_>	 (03CR) 10Ema: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/452612 (https://phabricator.wikimedia.org/T199720) (owner: 10Ema)
[17:31:25] <wikibugs_>	 (03CR) 10jerkins-bot: [V: 04-1] tox: add ts-lua tests for trafficserver [puppet] - 10https://gerrit.wikimedia.org/r/452612 (https://phabricator.wikimedia.org/T199720) (owner: 10Ema)
[17:31:31] <wikibugs_>	 (03CR) 10Jcrespo: [C: 04-1] db backup statistics: Initial implementation of the backup stats [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/449469 (https://phabricator.wikimedia.org/T198987) (owner: 10Jcrespo)
[17:31:50] <wikibugs_>	 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service: WDQS diskspace is low - https://phabricator.wikimedia.org/T196485 (10Cmjohnson) I have the 4 ssds on-site.
[17:33:44] <wikibugs_>	 (03CR) 10Jcrespo: "> I don't like adding more code to puppet that could live outside" [puppet] - 10https://gerrit.wikimedia.org/r/449681 (https://phabricator.wikimedia.org/T198987) (owner: 10Jcrespo)
[17:34:18] <wikibugs_>	 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1033.eqiad.wmnet'] ``` The log...
[17:36:16] <icinga-wm>	 PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 20 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[17:36:48] <wikibugs_>	 10Operations, 10SRE-Access-Requests: Subscribe user mepps to security@wikimedia.org - https://phabricator.wikimedia.org/T201856 (10mark) @Dzahn please get her added to this list. Thanks!
[17:38:21] <wikibugs_>	 (03PS1) 10Ladsgroup: etherpad: Add article to the placeholder text [puppet] - 10https://gerrit.wikimedia.org/r/452716
[17:38:38] <wikibugs_>	 (03PS2) 10Gehel: Changing day of the cron for testing [puppet] - 10https://gerrit.wikimedia.org/r/452467 (https://phabricator.wikimedia.org/T194787) (owner: 10MSantos)
[17:38:45] <wikibugs_>	 (03CR) 10jerkins-bot: [V: 04-1] etherpad: Add article to the placeholder text [puppet] - 10https://gerrit.wikimedia.org/r/452716 (owner: 10Ladsgroup)
[17:39:03] <wikibugs_>	 (03CR) 10jerkins-bot: [V: 04-1] Changing day of the cron for testing [puppet] - 10https://gerrit.wikimedia.org/r/452467 (https://phabricator.wikimedia.org/T194787) (owner: 10MSantos)
[17:39:27] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[17:39:36] <wikibugs_>	 (03CR) 10Gehel: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/452467 (https://phabricator.wikimedia.org/T194787) (owner: 10MSantos)
[17:39:54] <wikibugs_>	 (03CR) 10jerkins-bot: [V: 04-1] Changing day of the cron for testing [puppet] - 10https://gerrit.wikimedia.org/r/452467 (https://phabricator.wikimedia.org/T194787) (owner: 10MSantos)
[17:40:03] <wikibugs_>	 (03CR) 10Ladsgroup: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/452716 (owner: 10Ladsgroup)
[17:40:27] <wikibugs_>	 (03CR) 10jerkins-bot: [V: 04-1] etherpad: Add article to the placeholder text [puppet] - 10https://gerrit.wikimedia.org/r/452716 (owner: 10Ladsgroup)
[17:41:16] <icinga-wm>	 RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 19 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[17:41:27] <wikibugs_>	 (03PS2) 10Ottomata: Add salt file path to EventLoggingSanitization cron job [puppet] - 10https://gerrit.wikimedia.org/r/452674 (https://phabricator.wikimedia.org/T199902) (owner: 10Mforns)
[17:41:31] <wikibugs_>	 (03CR) 10Ottomata: [V: 032 C: 032] Add salt file path to EventLoggingSanitization cron job [puppet] - 10https://gerrit.wikimedia.org/r/452674 (https://phabricator.wikimedia.org/T199902) (owner: 10Mforns)
[17:41:32] <gehel>	 404 on the docker registry for the jenkins puppet jobs, seems to be something I've heard before
[17:42:48] <Amir1>	 https://gerrit.wikimedia.org/r/c/operations/puppet/+/452716
[17:42:52] <ema>	 gehel: that happened over the weekend because of the cache_misc -> cache_text transition, but it was fixed with https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/452182/
[17:42:58] <Amir1>	 Is the master broken?
[17:43:09] <Amir1>	 why it can't find the docker image
[17:43:12] <gehel>	 ema: yeah, I was looking at that change. So something else this time
[17:43:17] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s3 on db1095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 643.37 seconds
[17:43:29] <gehel>	 Amir1: I'm hitting the same issue
[17:44:20] <Amir1>	 :/
[17:44:27] <ema>	 gehel: and over the weekend we had 403s, not 404s
[17:44:27] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[17:44:29] <wikibugs_>	 (03PS1) 10Cmjohnson: Adding mgmt dns for newly racked servers [dns] - 10https://gerrit.wikimedia.org/r/452718 (https://phabricator.wikimedia.org/T201343)
[17:44:30] <wikibugs_>	 10Operations, 10SRE-Access-Requests: Subscribe user mepps to security@wikimedia.org - https://phabricator.wikimedia.org/T201856 (10Dzahn) 05Open>03Resolved a:03Dzahn Done.  @Mepps You have been added to security@wikimedia.org now.
[17:45:13] <gehel>	 ema: the logs say "Error response from daemon: manifest for docker-registry.wikimedia.org/releng/operations-puppet:0.3.4 not found", so I suspect a 404, but it might actually be something entirely different
[17:45:16] <wikibugs_>	 (03CR) 10Cmjohnson: [C: 032] Adding mgmt dns for newly racked servers [dns] - 10https://gerrit.wikimedia.org/r/452718 (https://phabricator.wikimedia.org/T201343) (owner: 10Cmjohnson)
[17:45:35] <Reedy>	 https://phabricator.wikimedia.org/T200722
[17:45:40] <wikibugs_>	 10Operations, 10SRE-Access-Requests: Request production global root access for Effie Mouzeli - https://phabricator.wikimedia.org/T201849 (10Dzahn) @jijiki created her own user with https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/452685/  The next step will be to upload a second change to add her user to...
[17:45:53] <wikibugs_>	 (03PS2) 10Cmjohnson: Adding mgmt dns for newly racked servers [dns] - 10https://gerrit.wikimedia.org/r/452718 (https://phabricator.wikimedia.org/T201343)
[17:45:59] <wikibugs_>	 (03CR) 10Cmjohnson: [V: 032 C: 032] Adding mgmt dns for newly racked servers [dns] - 10https://gerrit.wikimedia.org/r/452718 (https://phabricator.wikimedia.org/T201343) (owner: 10Cmjohnson)
[17:46:47] <gehel>	 sorry, need to take a family break, Amir1 I hope that issue will fix itself in the meantime :)
[17:47:41] <wikibugs_>	 10Operations, 10ops-eqiad, 10Analytics: rack/setup/install analytics-master100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T201939 (10Cmjohnson)
[17:47:58] <wikibugs_>	 10Operations, 10ops-eqiad, 10Analytics: rack/setup/install analytics-master100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T201939 (10Cmjohnson)
[17:48:04] <wikibugs_>	 (03CR) 10Awight: [C: 031] "Better grammar is more good!" [puppet] - 10https://gerrit.wikimedia.org/r/452716 (owner: 10Ladsgroup)
[17:49:09] <wikibugs_>	 10Operations, 10ops-eqiad, 10Analytics: rack/setup/install analytics-master100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T201939 (10Cmjohnson) @ottomata the name is entirely too long for labels and tracking. can we shorten it a bit?
[17:50:32] <wikibugs_>	 10Operations, 10SRE-Access-Requests: Request production global root access for Effie Mouzeli - https://phabricator.wikimedia.org/T201849 (10Dzahn) a:05Joe>03jijiki We have tested and confirmed access to bast1002 works.  Next ssh to `rutherfordium.eqiad.wmnet` (people.wikimedia.org)  can be used to test SSH...
[17:51:27] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[17:54:05] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s6 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 331.40 seconds
[17:54:19] <wikibugs_>	 10Operations, 10Performance-Team, 10Traffic, 10Wikimedia-General-or-Unknown, and 2 others: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (10Imarlier) Yesterday we had just under 20,000 requests for the copyright prot...
[17:54:32] <wikibugs_>	 10Operations, 10Patch-For-Review: Onboarding Effie Mouzeli - https://phabricator.wikimedia.org/T201816 (10Dzahn) - added to https://phabricator.wikimedia.org/project/members/974/  and then https://phabricator.wikimedia.org/project/members/61/ for access to "WMF-NDA" Phabricator tickets  - subscribed to ops mai...
[17:54:44] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1010 is OK: OK - running: The system is fully operational
[17:55:48] <wikibugs_>	 (03PS1) 10Volans: Force django-filter==1.1.0 [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/452723
[17:56:29] <wikibugs_>	 10Operations, 10netops, 10Patch-For-Review: Move servers off asw2-a-eqiad - https://phabricator.wikimedia.org/T201694 (10ayounsi)
[17:56:45] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 21 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[17:57:00] <wikibugs_>	 (03CR) 10Volans: [V: 032 C: 032] Force django-filter==1.1.0 [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/452723 (owner: 10Volans)
[17:57:17] <wikibugs_>	 10Operations, 10netops, 10Patch-For-Review: Move servers off asw2-a-eqiad - https://phabricator.wikimedia.org/T201694 (10ayounsi) Task description updated with Chris's info so we have everything in 1 place. Switch ports configured accordingly.
[17:57:45] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[17:58:30] <wikibugs_>	 (03PS8) 10Ema: ATS: add Lua scripting support [puppet] - 10https://gerrit.wikimedia.org/r/451838 (https://phabricator.wikimedia.org/T199720)
[17:58:40] <wikibugs_>	 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1033.eqiad.wmnet'] ```  and were **ALL** successful.
[17:58:53] <wikibugs_>	 (03CR) 10jerkins-bot: [V: 04-1] ATS: add Lua scripting support [puppet] - 10https://gerrit.wikimedia.org/r/451838 (https://phabricator.wikimedia.org/T199720) (owner: 10Ema)
[17:59:29] <logmsgbot>	 !log volans@deploy1001 Started deploy [netbox/deploy@eae2c9d]: Fix broken dependency
[17:59:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:59:46] <Trey314159>	 !log reindexing Polish wikis on elastic@eqiad and elastic@codfw (T200037)
[17:59:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:59:57] <stashbot>	 T200037: Re-index Polish Wikis to patch Stempel stems - https://phabricator.wikimedia.org/T200037
[18:00:02] <logmsgbot>	 !log volans@deploy1001 Finished deploy [netbox/deploy@eae2c9d]: Fix broken dependency (duration: 00m 33s)
[18:00:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:00:17] <wikibugs_>	 (03CR) 10Reedy: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/452612 (https://phabricator.wikimedia.org/T199720) (owner: 10Ema)
[18:00:49] <wikibugs_>	 (03CR) 10jerkins-bot: [V: 04-1] tox: add ts-lua tests for trafficserver [puppet] - 10https://gerrit.wikimedia.org/r/452612 (https://phabricator.wikimedia.org/T199720) (owner: 10Ema)
[18:01:42] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[18:02:50] <wikibugs_>	 (03CR) 10Reedy: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/452612 (https://phabricator.wikimedia.org/T199720) (owner: 10Ema)
[18:03:12] <ema>	 that patch just does not want to please jenkins
[18:03:19] <wikibugs_>	 (03CR) 10jerkins-bot: [V: 04-1] tox: add ts-lua tests for trafficserver [puppet] - 10https://gerrit.wikimedia.org/r/452612 (https://phabricator.wikimedia.org/T199720) (owner: 10Ema)
[18:03:32] <wikibugs_>	 10Operations, 10ops-eqiad, 10Analytics: rack/setup/install analytics-master100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T201939 (10RobH) >>! In T201939#4502422, @Cmjohnson wrote: > @ottomata the name is entirely too long for labels and tracking. can we shorten it a bit?   This was discussed in IRC...
[18:03:56] <godog>	 jenkins' hard to please alright
[18:04:29] <mutante>	 18:03:16 docker: Error response from daemon: manifest for docker-registry.wikimedia.org/releng/operations-puppet:0.3.4 not found.
[18:05:44] <logmsgbot>	 !log mforns@deploy1001 Started deploy [analytics/refinery@cb57843]: corresponding to refinery-source v0.0.70
[18:05:44] <mutante>	 not related to the actual content of the change it seems?
[18:05:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:06:59] <ema>	 mutante: nope, related to my attempts to add `lua-busted` to puppet's CI though
[18:07:32] <XioNoX>	 !log update NTP servers on pfw
[18:07:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:08:34] <ema>	 mutante: see https://gerrit.wikimedia.org/r/#/c/integration/config/+/452634/ https://gerrit.wikimedia.org/r/#/c/integration/config/+/452714/ https://gerrit.wikimedia.org/r/#/c/integration/config/+/452692/
[18:09:29] <ema>	 Reedy did some magic, the image seemed to have been built correctly, then apparently https://phabricator.wikimedia.org/T200722
[18:10:20] <ema>	 I wasn't planning on changing the world BTW, I initially just wanted to add one package to the image :)
[18:10:42] <XioNoX>	 !log re-activate peer 13285 on cr2-ulsfo
[18:10:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:11:05] <mutante>	 ema: all i had was to say that the image referenced in https://gerrit.wikimedia.org/r/#/c/integration/config/+/452692/2/jjb/operations-puppet.yaml  has to pushed to the docker-registry.. and i was told Reedy can do it ...
[18:11:18] <mutante>	 but if he already did magic.. hmm
[18:11:39] <Reedy>	 [contint1001.wikimedia.org] out: adding_tag latest
[18:11:39] <Reedy>	 [contint1001.wikimedia.org] out: Call: docker-registry.discovery.wmnet/releng/operations-puppet:0.3.4 docker-registry.discovery.wmnet/releng/operations-puppet latest
[18:11:39] <Reedy>	 [contint1001.wikimedia.org] out: Successfully published image docker-registry.discovery.wmnet/releng/operations-puppet
[18:11:41] <Reedy>	 Something is odd
[18:12:56] <mutante>	 pushed to docker-registry.discovery.wmnet   but the image: references docker-registry.wikimedia.org    normal?
[18:13:45] <Reedy>	 Not sure
[18:13:56] <XioNoX>	 !log bump PCCW max accepted prefixes on cr2-esams
[18:14:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:14:50] <mutante>	 yea, that is both darmstadtium as backend, looks normal
[18:14:53] <icinga-wm>	 PROBLEM - Check systemd state on cp5005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[18:14:54] <Reedy>	 https://docker-registry.wikimedia.org/v2/releng/operations-puppet/tags/list
[18:14:58] <Reedy>	 {"name":"releng/operations-puppet","tags":["0.1.0","0.2.1","0.3.0","0.3.1","0.3.2","0.3.4","latest"]}
[18:15:15] <Reedy>	 https://docker-registry.wikimedia.org/v2/releng/operations-puppet/manifests/0.3.4
[18:15:31] <Reedy>	 Is it some caching?
[18:15:46] <Reedy>	 Because the slaves hit before it was there?
[18:15:49] <XioNoX>	 !log bump PCCW max accepted prefixes on cr2-eqiad
[18:15:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:16:59] <wikibugs_>	 (03CR) 10Reedy: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/452612 (https://phabricator.wikimedia.org/T199720) (owner: 10Ema)
[18:17:27] <wikibugs_>	 (03CR) 10jerkins-bot: [V: 04-1] tox: add ts-lua tests for trafficserver [puppet] - 10https://gerrit.wikimedia.org/r/452612 (https://phabricator.wikimedia.org/T199720) (owner: 10Ema)
[18:17:59] <ema>	 Reedy: you're right
[18:18:57] <Reedy>	 mutante: Is there anything in the logs on the backend?
[18:19:08] <ema>	 Reedy: the 404 is indeed cached
[18:19:30] <Reedy>	 Aha
[18:19:38] <Reedy>	 That's kinda sill
[18:19:39] <Reedy>	 y
[18:19:52] <icinga-wm>	 PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb={GET,LIST,PATCH} https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[18:19:53] <icinga-wm>	 PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb=LIST https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[18:20:02] <icinga-wm>	 PROBLEM - etcd request latencies on neon is CRITICAL: instance=10.64.0.40:6443 operation={get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[18:20:12] <logmsgbot>	 !log mforns@deploy1001 Finished deploy [analytics/refinery@cb57843]: corresponding to refinery-source v0.0.70 (duration: 14m 27s)
[18:20:13] <icinga-wm>	 PROBLEM - etcd request latencies on argon is CRITICAL: instance=10.64.32.133:6443 operation=list https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[18:20:13] <icinga-wm>	 PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb=PUT https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[18:20:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:20:33] <icinga-wm>	 PROBLEM - etcd request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 operation={get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[18:21:06] <Reedy>	 ema: The simple answer is maybe to just to rebuild it again... Make sure I wait fully for it all to finish, then do jjb again
[18:23:02] <icinga-wm>	 RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[18:23:03] <icinga-wm>	 RECOVERY - Request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[18:23:12] <ema>	 Reedy: but yeah I get a 404 from darmstadtium too, so something is still wrong with the registry itself
[18:23:12] <icinga-wm>	 RECOVERY - etcd request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[18:23:22] <icinga-wm>	 RECOVERY - etcd request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[18:23:22] <icinga-wm>	 RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[18:23:42] <icinga-wm>	 RECOVERY - etcd request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[18:27:23] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s7 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 35.06 seconds
[18:27:52] <XioNoX>	 !log renumber v4 IP of 8560 on cr1-eqord
[18:27:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:28:23] <Reedy>	 ema: Revert hte jjb change to make things work, and see if the registry sorts itself out later?
[18:29:16] <ema>	 Reedy: let's try
[18:32:40] <wikibugs_>	 (03CR) 10Reedy: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/452612 (https://phabricator.wikimedia.org/T199720) (owner: 10Ema)
[18:33:10] <wikibugs_>	 (03CR) 10jerkins-bot: [V: 04-1] tox: add ts-lua tests for trafficserver [puppet] - 10https://gerrit.wikimedia.org/r/452612 (https://phabricator.wikimedia.org/T199720) (owner: 10Ema)
[18:33:14] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[18:33:34] <ema>	 Reedy: well that patch specifically is supposed to fail with 0.3.2 :)
[18:33:46] <Reedy>	 yeah, I haven't jjb'd yet
[18:35:27] <wikibugs_>	 (03CR) 10Reedy: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/452612 (https://phabricator.wikimedia.org/T199720) (owner: 10Ema)
[18:36:17] <Reedy>	 18:35:54 + exec docker run --rm --env-file /dev/fd/63 --volume /srv/jenkins-workspace/workspace/operations-puppet-tests-docker/log:/srv/workspace/log docker-registry.wikimedia.org/releng/operations-puppet:0.3.2
[18:36:45] <wikibugs_>	 (03CR) 10jerkins-bot: [V: 04-1] tox: add ts-lua tests for trafficserver [puppet] - 10https://gerrit.wikimedia.org/r/452612 (https://phabricator.wikimedia.org/T199720) (owner: 10Ema)
[18:38:14] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[18:38:48] <Reedy>	 ema: Back to old broken at least
[18:41:55] * Krinkle staging on mwdebug1002/deploy1001
[18:42:03] <wikibugs_>	 (03CR) 10Ema: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/451838 (https://phabricator.wikimedia.org/T199720) (owner: 10Ema)
[18:43:10] <ema>	 Reedy: yeah, confirmed :)
[18:43:23] <Reedy>	 Filed a bug for the published/not published thing
[18:44:42] <XioNoX>	 !log delete peer 4589 on cr2-esams (no more direct peering)
[18:44:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:45:15] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 21 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[18:47:49] <ema>	 Reedy: thanks!
[18:49:36] <XioNoX>	 Fyi, those IPv6 RIPE atlas alerts seem to be due to Hurricane Electric
[18:50:15] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[18:58:11] <logmsgbot>	 !log mforns@deploy1001 Started deploy [analytics/refinery@a4d1d99]: adding hashing to EL whitelist
[18:58:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:01:35] <wikibugs_>	 (03PS1) 10Milimetric: Add reference to Wikitech docs [puppet] - 10https://gerrit.wikimedia.org/r/452738 (https://phabricator.wikimedia.org/T201653)
[19:02:25] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 21 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[19:06:48] <logmsgbot>	 !log filippo@neodymium conftool action : set/pooled=no; selector: name=logstash1008,service=gelf
[19:06:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:08:21] <XioNoX>	 jynus: not sure if expected but db1095 is almost maxing out its interface: https://librenms.wikimedia.org/device/device=162/tab=port/port=14702/
[19:10:04] <icinga-wm>	 PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 20 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[19:11:15] <logmsgbot>	 !log mforns@deploy1001 Finished deploy [analytics/refinery@a4d1d99]: adding hashing to EL whitelist (duration: 13m 04s)
[19:11:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:12:04] <icinga-wm>	 PROBLEM - etcd request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 operation={get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[19:12:24] <icinga-wm>	 PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb={LIST,PATCH} https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[19:12:34] <icinga-wm>	 PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb=LIST https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[19:12:35] <icinga-wm>	 PROBLEM - etcd request latencies on neon is CRITICAL: instance=10.64.0.40:6443 operation=list https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[19:12:45] <icinga-wm>	 PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb=PUT https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[19:13:34] <icinga-wm>	 RECOVERY - Request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[19:13:45] <icinga-wm>	 RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[19:14:05] <icinga-wm>	 RECOVERY - etcd request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[19:14:44] <icinga-wm>	 RECOVERY - etcd request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[19:15:05] <icinga-wm>	 RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 17 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[19:15:34] <icinga-wm>	 RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[19:22:25] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[19:33:38] * Krinkle staging on mwdebug1002/deploy1001
[19:34:04] <icinga-wm>	 PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 22 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[19:34:34] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 25 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[19:35:27] <logmsgbot>	 !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.16/includes/cache/MessageCache.php: I6093113a / T201893 (duration: 00m 52s)
[19:35:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:35:36] <stashbot>	 T201893: MessageCache access throw UnexpectedValueException "The value of 'en' is not an array." from MapCacheLRU - https://phabricator.wikimedia.org/T201893
[19:37:44] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s8 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 446.44 seconds
[19:39:04] <icinga-wm>	 RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 16 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[19:39:35] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[19:41:16] * Krinkle done with deployment
[19:41:59] <wikibugs_>	 10Operations, 10Operations-Software-Development: confctl: log to SAL even if the selection doesn't match any host - https://phabricator.wikimedia.org/T155705 (10fgiunchedi) Also there's nothing logged on stdout on non-existent host and conftool exits 0. Ditto for a non-existant service
[19:47:12] <wikibugs_>	 10Operations, 10ops-eqiad: Degraded RAID on labvirt1019 - https://phabricator.wikimedia.org/T201957 (10ops-monitoring-bot)
[19:49:40] <wikibugs_>	 (03PS1) 10Krinkle: icinga: Define 'notify-by-email-per-service' command [puppet] - 10https://gerrit.wikimedia.org/r/452744
[19:50:54] <wikibugs_>	 (03CR) 10Krinkle: "Per Filippo, the association between alert types/receives is not in this repository, but in the private repository where contacts are defi" [puppet] - 10https://gerrit.wikimedia.org/r/452744 (owner: 10Krinkle)
[19:51:39] <wikibugs_>	 10Operations, 10Wikimedia-Logstash, 10monitoring, 10Patch-For-Review, 10User-herron: Send logstash service metrics to prometheus - https://phabricator.wikimedia.org/T200362 (10fgiunchedi) >>! In T200362#4487498, @gerritbot wrote: > Change 451018 **merged** by Filippo Giunchedi: > [operations/puppet@produ...
[19:51:45] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[19:55:50] <wikibugs_>	 (03CR) 10Krinkle: [C: 04-1] "This caused a rebase conflict for Beta Cluster's puppetmaster about 10 hours ago. I've tried to resolve it, but please double check and ma" [puppet] - 10https://gerrit.wikimedia.org/r/437052 (owner: 10Alex Monk)
[19:56:54] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[20:02:57] <wikibugs_>	 10Operations, 10Performance-Team, 10Traffic, 10Wikimedia-General-or-Unknown, and 2 others: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (10matmarex) For reference, according to this thread, Polish Wikipedia was affe...
[20:03:55] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 25 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[20:06:15] <icinga-wm>	 PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 20 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[20:08:54] <wikibugs_>	 (03PS1) 10Filippo Giunchedi: logstash: use /etc/default/logstash to add jmx_exporter [puppet] - 10https://gerrit.wikimedia.org/r/452747 (https://phabricator.wikimedia.org/T200362)
[20:08:55] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[20:09:34] <icinga-wm>	 PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 20 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[20:14:34] <icinga-wm>	 RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 19 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[20:16:04] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 25 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[20:17:41] <wikibugs_>	 (03CR) 10Alex Monk: "fixed, thanks" [puppet] - 10https://gerrit.wikimedia.org/r/437052 (owner: 10Alex Monk)
[20:19:01] <Krenair>	 gehel, hi
[20:21:04] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s3 on db1095 is OK: OK slave_sql_lag Replication lag: 0.27 seconds
[20:22:47] <Krenair>	 meh idle 02:33
[20:23:05] <wikibugs_>	 (03PS3) 10Andrew Bogott: openstack glance: move active service for eqiad1 and main to cloudcontrol1003 [puppet] - 10https://gerrit.wikimedia.org/r/452595 (https://phabricator.wikimedia.org/T191791)
[20:23:07] <wikibugs_>	 (03PS3) 10Andrew Bogott: Openstack glance: remove glance service from labcontrol1001 [puppet] - 10https://gerrit.wikimedia.org/r/452596 (https://phabricator.wikimedia.org/T191791)
[20:23:09] <wikibugs_>	 (03PS1) 10Andrew Bogott: mwopenstackclients: use region from ENV if present [puppet] - 10https://gerrit.wikimedia.org/r/452751
[20:25:24] <wikibugs_>	 (03CR) 10Andrew Bogott: [C: 032] mwopenstackclients: use region from ENV if present [puppet] - 10https://gerrit.wikimedia.org/r/452751 (owner: 10Andrew Bogott)
[20:26:05] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[20:26:24] <icinga-wm>	 RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 18 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[20:28:04] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s6 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 15.92 seconds
[20:31:50] <wikibugs_>	 (03PS1) 10Legoktm: php72: Install php7.2-mbstring [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/452755 (https://phabricator.wikimedia.org/T188318)
[20:33:15] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 21 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[20:34:18] <Krenair>	 Krinkle, please remove your CR-1
[20:35:37] <wikibugs_>	 (03PS4) 10Andrew Bogott: openstack glance: move active service for eqiad1 and main to cloudcontrol1003 [puppet] - 10https://gerrit.wikimedia.org/r/452595 (https://phabricator.wikimedia.org/T191791)
[20:35:39] <wikibugs_>	 (03PS4) 10Andrew Bogott: Openstack glance: remove glance service from labcontrol1001 [puppet] - 10https://gerrit.wikimedia.org/r/452596 (https://phabricator.wikimedia.org/T191791)
[20:35:42] <wikibugs_>	 (03PS1) 10Andrew Bogott: mwopenstackclient: glance client takes 'region_name' arg instead of 'region' [puppet] - 10https://gerrit.wikimedia.org/r/452793
[20:38:07] <wikibugs_>	 (03CR) 10Andrew Bogott: [C: 032] mwopenstackclient: glance client takes 'region_name' arg instead of 'region' [puppet] - 10https://gerrit.wikimedia.org/r/452793 (owner: 10Andrew Bogott)
[20:38:15] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[20:41:32] <Krinkle>	 Krenair: the patch in Gerrit is out of date though, that was the -1 reason.
[20:41:42] <Krinkle>	 The area of conflict adds 1 line in beta but 2 lines in gerrit.
[20:41:47] <Krenair>	 oh right
[20:41:49] <Krinkle>	 not sure which about it further, just noticed it
[20:44:42] <wikibugs_>	 (03PS16) 10Alex Monk: cumin: Allow Puppet DB backend to be used within Labs projects that use it [puppet] - 10https://gerrit.wikimedia.org/r/437052
[20:45:24] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 23 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[20:48:34] <icinga-wm>	 PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 22 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[20:53:35] <icinga-wm>	 RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 18 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[20:55:25] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[20:56:15] <icinga-wm>	 PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received
[20:56:37] <wikibugs_>	 10Operations, 10MediaWiki-extensions-Translate, 10Language-2018-July-September, 10MW-1.32-release-notes (WMF-deploy-2018-08-21 (1.32.0-wmf.18)), and 4 others: 503 error attempting to open multiple projects (Wikipedia and meta wiki are loading very slowly) - https://phabricator.wikimedia.org/T195293 (10Jdfor...
[20:57:59] <wikibugs_>	 (03CR) 10Legoktm: [C: 032] php72: Install php7.2-mbstring [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/452755 (https://phabricator.wikimedia.org/T188318) (owner: 10Legoktm)
[20:58:14] <wikibugs_>	 (03Merged) 10jenkins-bot: php72: Install php7.2-mbstring [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/452755 (https://phabricator.wikimedia.org/T188318) (owner: 10Legoktm)
[20:59:14] <icinga-wm>	 RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy
[21:00:44] <icinga-wm>	 PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 20 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[21:02:34] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 22 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[21:05:45] <icinga-wm>	 RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 19 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[21:06:14] <icinga-wm>	 PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 22 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[21:06:59] <wikibugs_>	 (03CR) 10Krinkle: [C: 04-1] "At least for beta this was a no-op. Probably mod_security needs to be applied  higher up for it to work." [puppet] - 10https://gerrit.wikimedia.org/r/452689 (https://phabricator.wikimedia.org/T158837) (owner: 10Krinkle)
[21:10:47] <Hauskatze>	 good night
[21:11:14] <icinga-wm>	 RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 17 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[21:15:15] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s8 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 55.25 seconds
[21:27:45] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[21:34:28] <wikibugs_>	 (03CR) 10Dzahn: [V: 031 C: 031] "thanks! confirmed this doesn't appear to be used anymore" [puppet] - 10https://gerrit.wikimedia.org/r/452687 (owner: 10Krinkle)
[21:34:54] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[21:36:55] <wikibugs_>	 (03PS1) 10BBlack: Revert "Update alexa image block to just 1500px URIs" [puppet] - 10https://gerrit.wikimedia.org/r/452835
[21:36:57] <wikibugs_>	 (03PS1) 10BBlack: Revert "block alexawikibot for now" [puppet] - 10https://gerrit.wikimedia.org/r/452836
[21:37:53] <wikibugs_>	 (03CR) 10BBlack: [C: 032] Revert "Update alexa image block to just 1500px URIs" [puppet] - 10https://gerrit.wikimedia.org/r/452835 (owner: 10BBlack)
[21:37:56] <wikibugs_>	 (03CR) 10BBlack: [C: 032] Revert "block alexawikibot for now" [puppet] - 10https://gerrit.wikimedia.org/r/452836 (owner: 10BBlack)
[21:42:55] <icinga-wm>	 PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received
[21:42:55] <icinga-wm>	 PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 25 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[21:43:15] <icinga-wm>	 PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 20 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[21:43:32] <godog>	 XioNoX: anything we can do for those alerts? ^
[21:43:46] <wikibugs_>	 (03PS2) 10Filippo Giunchedi: logstash: use /etc/default/logstash to add jmx_exporter [puppet] - 10https://gerrit.wikimedia.org/r/452747 (https://phabricator.wikimedia.org/T200362)
[21:43:54] <icinga-wm>	 RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy
[21:45:08] <XioNoX>	 godog: downtime for a bit, I'll email HE's noc
[21:46:43] <godog>	 sounds good -- thanks!
[21:47:55] <icinga-wm>	 RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 18 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[21:48:24] <icinga-wm>	 RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 16 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[21:51:29] <wikibugs_>	 (03CR) 10Filippo Giunchedi: [C: 032] "PCC https://puppet-compiler.wmflabs.org/compiler02/12090/logstash1007.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/452747 (https://phabricator.wikimedia.org/T200362) (owner: 10Filippo Giunchedi)
[21:51:42] <wikibugs_>	 (03PS3) 10Filippo Giunchedi: logstash: use /etc/default/logstash to add jmx_exporter [puppet] - 10https://gerrit.wikimedia.org/r/452747 (https://phabricator.wikimedia.org/T200362)
[21:54:55] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[21:57:48] <wikibugs_>	 (03PS1) 10Dzahn: admins: Revoke SSH key for Daniel Kinzler [puppet] - 10https://gerrit.wikimedia.org/r/452841 (https://phabricator.wikimedia.org/T201913)
[21:59:02] <wikibugs_>	 (03CR) 10Dzahn: [C: 032] admins: Revoke SSH key for Daniel Kinzler [puppet] - 10https://gerrit.wikimedia.org/r/452841 (https://phabricator.wikimedia.org/T201913) (owner: 10Dzahn)
[22:01:22] <wikibugs_>	 (03PS1) 10Jforrester: Disable wgLegacyJavaScriptGlobals on all group0 wikis, not just test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452843 (https://phabricator.wikimedia.org/T35837)
[22:01:36] <wikibugs_>	 (03PS1) 10Dzahn: admins: add new SSH key for Daniel Kinzler [puppet] - 10https://gerrit.wikimedia.org/r/452844 (https://phabricator.wikimedia.org/T201913)
[22:02:04] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[22:04:36] <icinga-wm>	 ACKNOWLEDGEMENT - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map daniel_zahn XioNoX is mailing Hurricane Electric
[22:04:39] <wikibugs_>	 (03PS4) 10Filippo Giunchedi: logstash: use /etc/default/logstash to add jmx_exporter [puppet] - 10https://gerrit.wikimedia.org/r/452747 (https://phabricator.wikimedia.org/T200362)
[22:07:04] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[22:07:39] <wikibugs_>	 (03CR) 10Dzahn: [C: 04-1] "needs a verification that it's not a compromised phab account. could be a GPG signature, a 10 second hangout, a selfie with the key or som" [puppet] - 10https://gerrit.wikimedia.org/r/452844 (https://phabricator.wikimedia.org/T201913) (owner: 10Dzahn)
[22:08:39] <wikibugs_>	 (03CR) 10Dzahn: [C: 04-1] "or identified IRC nick i guess. but you were offline" [puppet] - 10https://gerrit.wikimedia.org/r/452844 (https://phabricator.wikimedia.org/T201913) (owner: 10Dzahn)
[22:10:30] <wikibugs_>	 (03CR) 10Dzahn: [C: 04-1] "or maybe somebody in EU timezone can do that kind of thing with you and add a +1 or merge" [puppet] - 10https://gerrit.wikimedia.org/r/452844 (https://phabricator.wikimedia.org/T201913) (owner: 10Dzahn)
[22:11:35] <icinga-wm>	 PROBLEM - logstash syslog TCP port on logstash1008 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused
[22:11:44] <icinga-wm>	 PROBLEM - logstash JSON linesTCP port on logstash1008 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused
[22:11:44] <icinga-wm>	 PROBLEM - Check systemd state on logstash1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[22:11:45] <icinga-wm>	 PROBLEM - logstash process on logstash1008 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 499 (logstash), command name java, args logstash
[22:11:56] <godog>	 that's me ^ fix incoing
[22:12:04] <icinga-wm>	 PROBLEM - logstash log4j TCP port on logstash1008 is CRITICAL: connect to address 127.0.0.1 and port 4560: Connection refused
[22:12:31] <mutante>	 thanks
[22:12:35] <icinga-wm>	 RECOVERY - logstash syslog TCP port on logstash1008 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514
[22:12:45] <icinga-wm>	 RECOVERY - logstash JSON linesTCP port on logstash1008 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514
[22:12:45] <icinga-wm>	 RECOVERY - Check systemd state on logstash1008 is OK: OK - running: The system is fully operational
[22:12:45] <icinga-wm>	 RECOVERY - logstash process on logstash1008 is OK: PROCS OK: 1 process with UID = 499 (logstash), command name java, args logstash
[22:13:04] <icinga-wm>	 RECOVERY - logstash log4j TCP port on logstash1008 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 4560
[22:14:14] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[22:14:29] <wikibugs_>	 (03PS1) 10Filippo Giunchedi: logstash: fix /etc/default/logstash [puppet] - 10https://gerrit.wikimedia.org/r/452845 (https://phabricator.wikimedia.org/T200362)
[22:14:53] <wikibugs_>	 (03CR) 10Filippo Giunchedi: [C: 032] logstash: fix /etc/default/logstash [puppet] - 10https://gerrit.wikimedia.org/r/452845 (https://phabricator.wikimedia.org/T200362) (owner: 10Filippo Giunchedi)
[22:15:35] <wikibugs_>	 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to restricted production access and analytics-privatedata-users for Karen Brown - https://phabricator.wikimedia.org/T201668 (10Dzahn) This is a bit strange, i can't find a user "karen" nor any user with email kbrown_at_wikimedia.org...
[22:15:45] <XioNoX>	 godog: the issue is that I think our alerting threshold is at 20, and it keeps flapping between 18 and 20...
[22:15:52] <XioNoX>	 well, at 19
[22:17:05] <godog>	 indeed, bad coincidence and sad_trombone.mkv
[22:17:45] <mutante>	 when did the .wav become an .mkv
[22:18:06] <James_F>	 Accept it, it's the 2010s now. :-)
[22:18:12] <mutante>	 haha
[22:18:25] <godog>	 heheh things change
[22:18:33] <godog>	 next thing you know it'll be .mov
[22:19:14] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[22:20:10] <mutante>	 ewww. wmv
[22:20:36] <XioNoX>	 that's why we should probably downtime them for like 24h
[22:21:10] <mutante>	 ok, doing 
[22:24:05] <wikibugs_>	 (03CR) 10Krinkle: [C: 031] "Effectively, this disables the legacy globals on mediawiki.org, test.wikipedia.org and closed wikis. This should have relatively small imp" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452843 (https://phabricator.wikimedia.org/T35837) (owner: 10Jforrester)
[22:24:30] <mutante>	 downtimed for 24h. 3 checks. IPv6-only and eqiad/codfw/ulsfo.  eqsin not affected
[22:24:48] <mutante>	 and esams has no ripe-atlas
[22:25:23] <XioNoX>	 thx
[22:37:04] <mutante>	 it's weird that i can see how James created staff users on wikitech but i cant see a trace of them in LDAP from mwmaint1001
[22:37:20] <mutante>	 searched by email, *@wikimedia.org etc
[22:38:06] <mutante>	  almost as if the "create user for somebody else" method doesn't create an LDAP but a local user
[22:48:56] <wikibugs_>	 (03PS1) 10AndyRussG: CentralNotice: EventLogging data stream at a low level (0.01 sample rate) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452859
[22:50:02] <wikibugs_>	 (03PS2) 10AndyRussG: CentralNotice: EventLogging data at a low level (0.01 sample rate) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452859
[22:50:44] <wikibugs_>	 (03CR) 10Filippo Giunchedi: icinga: Define 'notify-by-email-per-service' command (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/452744 (owner: 10Krinkle)
[22:52:55] <wikibugs_>	 (03PS2) 10Krinkle: icinga: Define 'notify-by-email-per-service' command [puppet] - 10https://gerrit.wikimedia.org/r/452744
[22:53:07] <wikibugs_>	 (03PS3) 10Krinkle: icinga: Define 'notify-by-email-per-service' command [puppet] - 10https://gerrit.wikimedia.org/r/452744
[22:53:53] <wikibugs_>	 (03CR) 10Krinkle: "Changed in the other direction instead by removing $hostalias. For the purpose of Grafana alerts, this wasn't useful anyway, and I imagine" [puppet] - 10https://gerrit.wikimedia.org/r/452744 (owner: 10Krinkle)
[22:57:03] <wikibugs_>	 (03CR) 10Filippo Giunchedi: [C: 032] "Even nicer with just the service" [puppet] - 10https://gerrit.wikimedia.org/r/452744 (owner: 10Krinkle)
[22:57:11] <wikibugs_>	 (03PS4) 10Filippo Giunchedi: icinga: Define 'notify-by-email-per-service' command [puppet] - 10https://gerrit.wikimedia.org/r/452744 (owner: 10Krinkle)
[23:00:04] <jouncebot>	 addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor I � Unicode. All rise for Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180814T2300).
[23:00:04] <jouncebot>	 James_F and RoanKattouw: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[23:00:59] <James_F>	 Heya.
[23:01:03] <James_F>	 RoanKattouw: You SWATing?
[23:01:06] <wikibugs_>	 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to restricted production access and analytics-privatedata-users for Karen Brown - https://phabricator.wikimedia.org/T201668 (10Krenair) Looks like the account was created by a logged-in user instead of anonymously, no idea if that eve...
[23:01:08] <wikibugs_>	 (03PS3) 10Jforrester: Remove obsolete $wgPopupsBetaFeature, Part I: CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450906 (owner: 10Prtksxna)
[23:01:10] <wikibugs_>	 (03PS6) 10Jforrester: Remove obsolete $wgPopupsBetaFeature, Part III: InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444574 (owner: 10Prtksxna)
[23:01:12] <wikibugs_>	 (03PS1) 10Jforrester: Remove obsolete $wgPopupsBetaFeature, Part II: InitialiseSettings-labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452863
[23:01:39] <RoanKattouw>	 Yes
[23:01:42] <RoanKattouw>	 Missed the ping somehow
[23:02:29] <AndyRussG>	 RoanKattouw: hi!
[23:02:34] <wikibugs_>	 (03CR) 10Catrope: [C: 032] Disable wgLegacyJavaScriptGlobals on all group0 wikis, not just test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452843 (https://phabricator.wikimedia.org/T35837) (owner: 10Jforrester)
[23:02:37] <RoanKattouw>	 Hey AndyRussG 
[23:02:41] <RoanKattouw>	 You'll go second after James_F 
[23:02:52] <AndyRussG>	 okok no rush :) thx much!
[23:03:26] <wikibugs_>	 (03CR) 10Ejegg: [C: 031] CentralNotice: EventLogging data at a low level (0.01 sample rate) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452859 (owner: 10AndyRussG)
[23:03:55] <wikibugs_>	 (03Merged) 10jenkins-bot: Disable wgLegacyJavaScriptGlobals on all group0 wikis, not just test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452843 (https://phabricator.wikimedia.org/T35837) (owner: 10Jforrester)
[23:04:36] <wikibugs_>	 10Operations, 10Wikimedia-Logstash, 10Goal, 10User-fgiunchedi, 10User-herron: Shorten logstash retention - https://phabricator.wikimedia.org/T201971 (10fgiunchedi) p:05Triage>03Normal
[23:05:46] <RoanKattouw>	 James_F: Your change is on mwdebug1002, please test
[23:06:08] <wikibugs_>	 (03PS3) 10Catrope: CentralNotice: EventLogging data at a low level (0.01 sample rate) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452859 (owner: 10AndyRussG)
[23:06:10] <wikibugs_>	 (03CR) 10Jforrester: "PS3: Split the commit into touching only one file, per the SWAT rule. LGTM." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450906 (owner: 10Prtksxna)
[23:06:11] <James_F>	 Kk.
[23:06:16] <wikibugs_>	 (03CR) 10Catrope: [C: 032] CentralNotice: EventLogging data at a low level (0.01 sample rate) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452859 (owner: 10AndyRussG)
[23:07:37] <wikibugs_>	 (03Merged) 10jenkins-bot: CentralNotice: EventLogging data at a low level (0.01 sample rate) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452859 (owner: 10AndyRussG)
[23:11:34] <icinga-wm>	 PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production).
[23:11:36] <logmsgbot>	 !log fdans@deploy1001 Started deploy [analytics/refinery@21e07ae]: Deploying revert to prevent partition dropping jobs from failing
[23:11:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:11:40] <wikibugs_>	 10Operations, 10Maps-Sprint, 10Maps (Tilerator): Increase frequency of OSM replication - https://phabricator.wikimedia.org/T137939 (10Mholloway) p:05Normal>03High
[23:13:50] <logmsgbot>	 !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Disable wgLegacyJavaScriptGlobals on all group0 wikis (T35837) (duration: 00m 54s)
[23:13:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:13:57] <stashbot>	 T35837: Set $wgLegacyJavaScriptGlobals = false by default - https://phabricator.wikimedia.org/T35837
[23:15:08] <RoanKattouw>	 AndyRussG: Your patch is on mwdebug1002, please test (to the extent feasible)
[23:15:08] <wikibugs_>	 10Operations, 10Wikimedia-Logstash, 10Goal, 10User-fgiunchedi, 10User-herron: Shorten logstash retention - https://phabricator.wikimedia.org/T201971 (10Bawolff) Could we maybe dump by channel type? api-feature-usage is by far the majority of logstash events, but is much less likely useful to reatain for...
[23:15:40] <RoanKattouw>	 Hmm the gate-and-submit-swat queue doesn't seem to be behaving in a prioritized manner exactly
[23:16:54] <icinga-wm>	 PROBLEM - etcd request latencies on neon is CRITICAL: instance=10.64.0.40:6443 operation={get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[23:16:59] <AndyRussG>	 RoanKattouw: by chance do you know how long it currently takes for the config change to bubble up to JS in this case?
[23:17:14] <AndyRussG>	 I guess the normal RL module rollover delay?
[23:18:18] <AndyRussG>	 Oh wait, wrong debug instance
[23:18:44] <AndyRussG>	 RoanKattouw: all good! :)
[23:20:04] <logmsgbot>	 !log fdans@deploy1001 Finished deploy [analytics/refinery@21e07ae]: Deploying revert to prevent partition dropping jobs from failing (duration: 08m 27s)
[23:20:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:20:19] <logmsgbot>	 !log catrope@deploy1001 Synchronized wmf-config/CommonSettings.php: Enable CentralNotice EventLogging at a low sample rate (0.01) (duration: 00m 50s)
[23:20:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:20:54] <icinga-wm>	 PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=LIST https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[23:21:26] <wikibugs_>	 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to restricted production access and analytics-privatedata-users for Karen Brown - https://phabricator.wikimedia.org/T201668 (10Dzahn) @Jalexander I was able to go directly to the wikitech wiki database and look in the user table and i...
[23:21:55] <icinga-wm>	 RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[23:23:05] <icinga-wm>	 RECOVERY - etcd request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[23:24:28] <AndyRussG>	 RoanKattouw: thx!!!!
[23:24:30] <wikibugs_>	 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to restricted production access and analytics-privatedata-users for Karen Brown - https://phabricator.wikimedia.org/T201668 (10Dzahn) @Krenair this would confirm my suspicion. thank you. it looks like that might not work with LDAP int...
[23:29:48] <wikibugs_>	 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to restricted production access and analytics-privatedata-users for Karen Brown - https://phabricator.wikimedia.org/T201668 (10Dzahn) on DB level i can see these differences:  For my own user the fields "user_real_name" and "user_pass...
[23:31:57] <wikibugs_>	 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to restricted production access and analytics-privatedata-users for Karen Brown - https://phabricator.wikimedia.org/T201668 (10Legoktm) Did she log into wikitech and set a real password instead of the temporary one? That would populat...
[23:35:51] <wikibugs_>	 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to restricted production access and analytics-privatedata-users for Karen Brown - https://phabricator.wikimedia.org/T201668 (10Dzahn) @Jalexander Can we just let her create a normal user (as anon)  and not worry about the "(WMF)" part...
[23:37:01] <wikibugs_>	 10Operations, 10Wikimedia-Logstash, 10Goal, 10User-fgiunchedi, 10User-herron: Shorten logstash retention - https://phabricator.wikimedia.org/T201971 (10fgiunchedi) We can't delete inside indices easily, no. Dropping old indices is cheap compared to actually looking inside and delete only specific data. I...
[23:38:07] <wikibugs_>	 10Operations, 10Wikimedia-Logstash, 10Goal, 10User-fgiunchedi, 10User-herron: Shorten logstash retention temporarily - https://phabricator.wikimedia.org/T201971 (10fgiunchedi)
[23:38:59] <tzatziki>	 !log Change password for User:Textorus
[23:39:03] <tzatziki>	 uh, email
[23:39:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:39:11] <tzatziki>	 !log Correction: Change email for User:Textorus
[23:39:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:39:41] <wikibugs_>	 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to restricted production access and analytics-privatedata-users for Karen Brown - https://phabricator.wikimedia.org/T201668 (10Dzahn) Ok, so let's first have the 2 users (also see T201667) confirm they set their intial password.  Mayb...
[23:39:51] <wikibugs_>	 (03CR) 10jenkins-bot: Disable wgLegacyJavaScriptGlobals on all group0 wikis, not just test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452843 (https://phabricator.wikimedia.org/T35837) (owner: 10Jforrester)
[23:39:53] <wikibugs_>	 (03CR) 10jenkins-bot: CentralNotice: EventLogging data at a low level (0.01 sample rate) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452859 (owner: 10AndyRussG)
[23:41:22] <wikibugs_>	 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to restricted production access and analytics-privatedata-users for Karen Brown - https://phabricator.wikimedia.org/T201668 (10Dzahn) 05Open>03stalled
[23:43:56] <wikibugs_>	 10Operations, 10LDAP-Access-Requests, 10User-Addshore: Give access to graphite and grafana-admin to Aleksey Bekh-Ivanov (WMDE) - https://phabricator.wikimedia.org/T199233 (10Dzahn) Hi @RStallman-legalteam here's another WMDE engineer who needs an NDA signed.
[23:46:59] <wikibugs_>	 10Operations, 10Wikimedia-Logstash, 10Goal, 10User-fgiunchedi, 10User-herron: Shorten logstash retention temporarily - https://phabricator.wikimedia.org/T201971 (10Legoktm) api-feature-usage is exposed via Special:ApiFeatureUsage, which queries the log entries from elasticsearch, I'm not sure if that's d...
[23:47:49] <wikibugs_>	 10Operations, 10SRE-Access-Requests: Requesting access to restricted production access and analytics-privatedata-users for Patrick Earley - https://phabricator.wikimedia.org/T201667 (10Dzahn) The comments from T201668#4503338  and following also apply to this ticket.  The user_password field is not populated i...
[23:48:02] <wikibugs_>	 10Operations, 10SRE-Access-Requests: Requesting access to restricted production access and analytics-privatedata-users for Patrick Earley - https://phabricator.wikimedia.org/T201667 (10Dzahn) 05Open>03stalled
[23:48:28] <wikibugs_>	 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: new ssh key for daniel - https://phabricator.wikimedia.org/T201913 (10Dzahn) p:05Triage>03High
[23:48:34] <tzatziki>	 !log deleting three images for legal compliance
[23:48:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:55:29] <TimStarling>	 !log restarted populateContentTables.php on s2
[23:55:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:56:08] <wikibugs_>	 10Operations, 10Patch-For-Review: reinstall rdb100[56] with RAID - https://phabricator.wikimedia.org/T140442 (10Dzahn) I see that T190327 is closed meanwhile. Did it actually become easy now? :)
[23:59:59] <wikibugs_>	 10Operations, 10Wikimedia-Logstash, 10Goal, 10User-fgiunchedi, 10User-herron: Shorten logstash retention temporarily - https://phabricator.wikimedia.org/T201971 (10Krinkle) >>! In T201971#4503528, @Bawolff wrote: > Could we maybe dump by channel type? api-feature-usage is by far the majority of logstash...