[00:00:07] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/448779 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [00:03:03] (03PS2) 10Aaron Schulz: Only do cache writes to mcrouter for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449605 (https://phabricator.wikimedia.org/T198239) [00:05:58] (03CR) 10Aaron Schulz: [C: 032] Only do cache writes to mcrouter for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449605 (https://phabricator.wikimedia.org/T198239) (owner: 10Aaron Schulz) [00:07:17] (03Merged) 10jenkins-bot: Only do cache writes to mcrouter for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449605 (https://phabricator.wikimedia.org/T198239) (owner: 10Aaron Schulz) [00:09:04] (03CR) 10jenkins-bot: Only do cache writes to mcrouter for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449605 (https://phabricator.wikimedia.org/T198239) (owner: 10Aaron Schulz) [00:09:31] !log aaron@deploy1001 Synchronized wmf-config/mc.php: Only do cache writes to mcrouter for all wikis (duration: 00m 52s) [00:09:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:12:11] (03CR) 10Dzahn: [C: 032] graphite::carbon_c_relay: base::service_unit -> systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/448779 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [00:13:20] (03PS2) 10Aaron Schulz: Allow broadcasted mcrouter cache operations for purges [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449606 (https://phabricator.wikimedia.org/T198239) [00:15:28] !log graphite2002 - stopping carbon-local-relay, running puppet to start it again to confirm no issues with gerrit:448779 [00:15:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:17:07] (03CR) 10Dzahn: [C: 032] "puppet run is no-op. stopped carbon-local-relay on graphite2002, then ran puppet to have it started again. then the same with carbon-front" [puppet] - 10https://gerrit.wikimedia.org/r/448779 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [00:17:20] mutante: thanks! I'm off but let me know if you run into issues [00:17:36] godog: i tested it on codfw but did not touch eqiad. 
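The no-op validation Dzahn describes just above — stop the migrated unit, then let puppet bring it back — boils down to something like the following commands on the host. This is a sketch only; the host and unit names are taken from the log, and the exact invocation used in production may differ:

```
# on graphite2002: confirm puppet re-creates and starts the migrated unit
systemctl stop carbon-local-relay
puppet agent --test          # the run should report starting carbon-local-relay again
systemctl is-active carbon-local-relay
```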
no problem [00:27:53] (03PS3) 10Aaron Schulz: Enable broadcasted mcrouter cache operations for testwikis and mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449606 (https://phabricator.wikimedia.org/T198239) [00:29:12] (03CR) 10Aaron Schulz: [C: 032] Enable broadcasted mcrouter cache operations for testwikis and mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449606 (https://phabricator.wikimedia.org/T198239) (owner: 10Aaron Schulz) [00:29:28] (03PS4) 10Aaron Schulz: Enable broadcasted mcrouter cache operations for test wikis and mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449606 (https://phabricator.wikimedia.org/T198239) [00:31:42] (03CR) 10Aaron Schulz: [C: 032] Enable broadcasted mcrouter cache operations for test wikis and mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449606 (https://phabricator.wikimedia.org/T198239) (owner: 10Aaron Schulz) [00:33:00] (03Merged) 10jenkins-bot: Enable broadcasted mcrouter cache operations for test wikis and mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449606 (https://phabricator.wikimedia.org/T198239) (owner: 10Aaron Schulz) [00:35:25] !log aaron@deploy1001 Synchronized wmf-config/mc.php: Enable broadcasted mcrouter cache operations for test wikis and mw.org (duration: 00m 49s) [00:35:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:38:06] (03PS1) 10Aaron Schulz: Enable broadcasted mcrouter operations for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452592 (https://phabricator.wikimedia.org/T198239) [00:38:52] (03CR) 10jenkins-bot: Enable broadcasted mcrouter cache operations for test wikis and mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449606 (https://phabricator.wikimedia.org/T198239) (owner: 10Aaron Schulz) [00:55:20] (03CR) 10Legoktm: "What you have so far looks nice, but I think the main thing that's missing is details of the arguments. 
If you look at 10Operations, 10Packaging, 10Toolforge, 10Patch-For-Review: Please add php-imagick and php-redis packages to apt.wikimedia.org thirdparty/php72 - https://phabricator.wikimedia.org/T200666 (10Legoktm) [00:56:44] (03CR) 10Krinkle: [C: 031] Enable broadcasted mcrouter operations for all wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452592 (https://phabricator.wikimedia.org/T198239) (owner: 10Aaron Schulz) [01:41:27] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 22 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [01:46:27] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [01:58:37] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 22 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [02:02:57] 10Operations, 10TechCom-RFC, 10Traffic, 10Patch-For-Review, and 3 others: Harmonise the identification of requests across our stack - https://phabricator.wikimedia.org/T201409 (10CCicalese_WMF) [02:08:37] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [02:15:38] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 22 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [02:22:26] 10Operations, 10Performance-Team, 10Traffic: Significant increase in Time To First Byte on 2018-08-08, between 16:00 and 20:00 UTC - https://phabricator.wikimedia.org/T201769 (10Imarlier) 05Open>03Resolved Confirmed that WPT agents are resolving to the codfw edge. Given that this means that they're goin... [02:23:48] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 20 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [02:28:48] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 18 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [02:31:57] 10Operations, 10Wikimedia-General-or-Unknown, 10Wikimedia-log-errors: Jobrunners generate mediawiki exceptions upon calling Closure$RecentChange::save - https://phabricator.wikimedia.org/T169884 (10Krinkle) 05Open>03declined Not seen in Logstash for at least 7 days (searching mediawiki-errors for `Recent... 
[02:35:21] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.16) (duration: 13m 57s) [02:35:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:45:47] !log l10nupdate@deploy1001 ResourceLoader cache refresh completed at Tue Aug 14 02:45:46 UTC 2018 (duration 10m 25s) [02:45:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:50:48] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [02:57:18] PROBLEM - High CPU load on API appserver on mw1290 is CRITICAL: CRITICAL - load average: 44.57, 42.28, 40.13 [02:57:57] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 21 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [03:01:18] PROBLEM - High CPU load on API appserver on mw1290 is CRITICAL: CRITICAL - load average: 41.14, 41.04, 40.10 [03:06:48] PROBLEM - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is CRITICAL: CRITICAL: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is alerting: 70% GET drop in 30min alert. [03:07:57] RECOVERY - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is OK: OK: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is not alerting. [03:07:58] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [03:14:58] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 27 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [03:26:47] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 778.63 seconds [03:44:17] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 261.99 seconds [03:51:18] PROBLEM - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is CRITICAL: CRITICAL: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is alerting: 70% GET drop in 30min alert. [03:55:08] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [03:59:37] RECOVERY - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is OK: OK: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is not alerting. [04:02:08] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 21 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [04:06:47] PROBLEM - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is CRITICAL: CRITICAL: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is alerting: 70% GET drop in 30min alert. [04:07:47] RECOVERY - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is OK: OK: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is not alerting. 
[04:12:08] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [04:19:27] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 21 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [04:27:36] (03PS1) 10Andrew Bogott: openstack glance: move active service for eqiad1 and main to cloudcontrol1003 [puppet] - 10https://gerrit.wikimedia.org/r/452595 (https://phabricator.wikimedia.org/T191791) [04:27:38] (03PS1) 10Andrew Bogott: Openstack glance: remove glance service from labcontrol1001 [puppet] - 10https://gerrit.wikimedia.org/r/452596 (https://phabricator.wikimedia.org/T191791) [04:36:37] PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 35.97, 33.90, 32.07 [04:39:27] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [04:44:03] 10Operations, 10TechCom-RFC, 10Traffic, 10Patch-For-Review, and 3 others: Harmonise the identification of requests across our stack - https://phabricator.wikimedia.org/T201409 (10Joe) We also need internal requests to be traced, so I would assume we need all services to generate a request Id whenever they... [04:44:47] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 20 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [04:46:37] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 21 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [04:48:51] (03PS1) 10Tim Starling: Update ~tstarling/.bashrc [puppet] - 10https://gerrit.wikimedia.org/r/452597 [04:49:48] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 19 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [04:53:08] (03CR) 10Tim Starling: [C: 032] Update ~tstarling/.bashrc [puppet] - 10https://gerrit.wikimedia.org/r/452597 (owner: 10Tim Starling) [04:53:28] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to restricted production access and analytics-privatedata-users for Karen Brown - https://phabricator.wikimedia.org/T201668 (10JanWMF) Thank you, Daniel, I appreciate you checking and working on this :) [04:56:37] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [05:03:47] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 22 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [05:13:21] !log killed populateContentTables.php for s2 at jcrespo's request [05:13:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:28:57] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [05:35:58] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 21 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [05:45:59] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [05:53:08] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - 
failed 21 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [05:58:08] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [06:05:18] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 23 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [06:11:38] PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 41.08, 35.61, 32.20 [06:13:46] 10Operations, 10SRE-Access-Requests: Subscribe user mepps to security@wikimedia.org - https://phabricator.wikimedia.org/T201856 (10Zoranzoki21) [06:20:18] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [06:27:28] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 52 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [06:28:47] PROBLEM - High CPU load on API appserver on mw1232 is CRITICAL: CRITICAL - load average: 39.73, 35.54, 31.74 [06:32:48] PROBLEM - High CPU load on API appserver on mw1232 is CRITICAL: CRITICAL - load average: 37.02, 35.10, 32.36 [06:36:06] <_joe_> oh nice [06:36:11] <_joe_> a good way to start the day [06:37:41] <_joe_> !log depooling mw1233 from live traffic for debugging [06:37:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:43:48] PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 36.97, 35.17, 32.20 [06:45:23] <_joe_> !log rolling restart of hhvm in eqiad api, high load [06:45:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:51:58] PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 37.60, 33.55, 32.28 [06:52:38] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [06:53:00] (03PS2) 10Jcrespo: Revert "mariadb: Depool db1101 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452324 [06:53:23] (03PS1) 10Volans: OpenStack: add custom parameters for the client [software/cumin] - 10https://gerrit.wikimedia.org/r/452608 (https://phabricator.wikimedia.org/T201881) [06:54:07] (03PS1) 10Volans: cumin: add region_name to the WMCS openstack config [puppet] - 10https://gerrit.wikimedia.org/r/452609 (https://phabricator.wikimedia.org/T201881) [06:57:27] PROBLEM - High CPU load on API appserver on mw1235 is CRITICAL: CRITICAL - load average: 36.21, 33.87, 32.19 [06:57:38] RECOVERY - High CPU load on API appserver on mw1232 is OK: OK - load average: 1.21, 12.28, 23.90 [06:59:07] RECOVERY - High CPU load on API appserver on mw1233 is OK: OK - load average: 19.53, 19.68, 23.85 [06:59:48] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 22 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [07:01:11] (03CR) 10Volans: "Compiler results available here:" [puppet] - 10https://gerrit.wikimedia.org/r/452609 (https://phabricator.wikimedia.org/T201881) (owner: 10Volans) [07:01:50] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool db1101 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452324 (owner: 10Jcrespo) [07:03:12] (03Merged) 10jenkins-bot: Revert "mariadb: Depool db1101 for maintenance" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/452324 (owner: 10Jcrespo) [07:03:52] (03CR) 10jenkins-bot: Revert "mariadb: Depool db1101 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452324 (owner: 10Jcrespo) [07:05:34] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1101:s7 and db1101:s8 (duration: 00m 52s) [07:05:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:05:52] (03PS2) 10Muehlenhoff: Add php-imagick and php-redis to thirdparty/php72 [puppet] - 10https://gerrit.wikimedia.org/r/452274 (https://phabricator.wikimedia.org/T200666) (owner: 10Legoktm) [07:09:11] (03CR) 10Muehlenhoff: [C: 032] Add php-imagick and php-redis to thirdparty/php72 [puppet] - 10https://gerrit.wikimedia.org/r/452274 (https://phabricator.wikimedia.org/T200666) (owner: 10Legoktm) [07:12:58] PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 36.68, 33.19, 32.07 [07:14:24] (03PS6) 10Ema: ATS: add Lua scripting support [puppet] - 10https://gerrit.wikimedia.org/r/451838 (https://phabricator.wikimedia.org/T199720) [07:14:26] (03PS1) 10Ema: contint: add support for testing ts-lua plugins [puppet] - 10https://gerrit.wikimedia.org/r/452612 (https://phabricator.wikimedia.org/T199720) [07:15:58] (03CR) 10jerkins-bot: [V: 04-1] contint: add support for testing ts-lua plugins [puppet] - 10https://gerrit.wikimedia.org/r/452612 (https://phabricator.wikimedia.org/T199720) (owner: 10Ema) [07:18:27] (03PS2) 10Ema: contint: add support for testing ts-lua plugins [puppet] - 10https://gerrit.wikimedia.org/r/452612 (https://phabricator.wikimedia.org/T199720) [07:20:18] PROBLEM - High CPU load on API appserver on mw1235 is CRITICAL: CRITICAL - load average: 34.36, 32.33, 32.03 [07:27:38] PROBLEM - High CPU load on API appserver on mw1235 is CRITICAL: CRITICAL - load average: 35.92, 32.79, 32.24 [07:29:58] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [07:32:31] (03CR) 10Legoktm: [C: 04-1] "If you want to use this for operations/puppet, then adding the packages to the contint manifests won't work. 
You'll have to update https:/" [puppet] - 10https://gerrit.wikimedia.org/r/452612 (https://phabricator.wikimedia.org/T199720) (owner: 10Ema) [07:34:57] PROBLEM - High CPU load on API appserver on mw1235 is CRITICAL: CRITICAL - load average: 33.75, 32.46, 32.06 [07:37:17] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 21 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [07:41:31] (03PS1) 10Jcrespo: mariadb: Set s2 in read only mode due to maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452620 (https://phabricator.wikimedia.org/T201694) [07:42:17] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [07:42:56] (03PS1) 10Muehlenhoff: Revert "Add php-imagick and php-redis to thirdparty/php72" [puppet] - 10https://gerrit.wikimedia.org/r/452621 [07:47:03] (03CR) 10Muehlenhoff: [C: 032] Revert "Add php-imagick and php-redis to thirdparty/php72" [puppet] - 10https://gerrit.wikimedia.org/r/452621 (owner: 10Muehlenhoff) [07:54:17] RECOVERY - High CPU load on API appserver on mw1226 is OK: OK - load average: 5.61, 9.11, 23.32 [07:54:28] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 24 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [07:56:17] !log installing ghostscript security updates [07:56:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:59:28] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [08:00:12] PROBLEM - HHVM rendering on mw1281 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:01:11] RECOVERY - HHVM rendering on mw1281 is OK: HTTP OK: HTTP/1.1 200 OK - 75412 bytes in 3.445 second response time [08:01:12] RECOVERY - High CPU load on API appserver on mw1290 is OK: OK - load average: 11.16, 13.04, 29.63 [08:03:09] (03CR) 10Gehel: Add cookbook entry point script (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/450937 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [08:04:02] RECOVERY - High CPU load on API appserver on mw1231 is OK: OK - load average: 8.41, 11.66, 23.76 [08:06:07] (03CR) 10Gehel: [C: 031] "LGTM, trivial enough" [software/cumin] - 10https://gerrit.wikimedia.org/r/452608 (https://phabricator.wikimedia.org/T201881) (owner: 10Volans) [08:06:32] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 21 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [08:07:05] (03CR) 10Volans: Add cookbook entry point script (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/450937 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [08:07:51] (03CR) 10Volans: [C: 032] OpenStack: add custom parameters for the client [software/cumin] - 10https://gerrit.wikimedia.org/r/452608 (https://phabricator.wikimedia.org/T201881) (owner: 10Volans) [08:08:11] RECOVERY - High CPU load on API appserver on mw1235 is OK: OK - load average: 7.63, 13.84, 23.44 [08:10:47] (03Merged) 10jenkins-bot: OpenStack: add custom parameters for the client [software/cumin] - 10https://gerrit.wikimedia.org/r/452608 (https://phabricator.wikimedia.org/T201881) (owner: 10Volans) [08:12:01] (03CR) 10jenkins-bot: OpenStack: add custom parameters for the client [software/cumin] - 
10https://gerrit.wikimedia.org/r/452608 (https://phabricator.wikimedia.org/T201881) (owner: 10Volans) [08:13:42] (03CR) 10Vgutierrez: [C: 032] varnish: get rid of AES128-SHA redirection to /sec-warning [puppet] - 10https://gerrit.wikimedia.org/r/450020 (https://phabricator.wikimedia.org/T192555) (owner: 10Vgutierrez) [08:13:50] (03PS2) 10Vgutierrez: varnish: get rid of AES128-SHA redirection to /sec-warning [puppet] - 10https://gerrit.wikimedia.org/r/450020 (https://phabricator.wikimedia.org/T192555) [08:16:32] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [08:19:39] !log upgrading wikidiff to 1.7.2 on mw1266-mw1275 (HHVM bytecode cache is pruned during update) [08:19:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:52] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 20 probes of 314 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [08:22:17] (03PS1) 10Jcrespo: mariadb: Set s2 as read-write and promote db1122 as the new s2 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452632 (https://phabricator.wikimedia.org/T201694) [08:23:42] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 21 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [08:25:52] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 16 probes of 314 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [08:31:42] (03PS3) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: move the other private wikis to the define [puppet] - 10https://gerrit.wikimedia.org/r/451255 (https://phabricator.wikimedia.org/T196968) [08:31:44] (03PS3) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: make includes explicit in more wikis [puppet] - 10https://gerrit.wikimedia.org/r/451257 (https://phabricator.wikimedia.org/T196968) [08:31:46] (03PS3) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert loginwiki, chapterwiki [puppet] - 10https://gerrit.wikimedia.org/r/451258 (https://phabricator.wikimedia.org/T196968) [08:31:48] (03PS3) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: expand include everywhere in remnant.conf [puppet] - 10https://gerrit.wikimedia.org/r/451259 (https://phabricator.wikimedia.org/T196968) [08:31:50] (03PS3) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: expand the includes in sites in main.conf (1/2) [puppet] - 10https://gerrit.wikimedia.org/r/451260 (https://phabricator.wikimedia.org/T196968) [08:31:52] (03PS2) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: expand the includes in sites in main.conf (2/2) [puppet] - 10https://gerrit.wikimedia.org/r/452322 (https://phabricator.wikimedia.org/T196968) [08:31:54] (03PS3) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert simple wikis in remnant.conf [puppet] - 10https://gerrit.wikimedia.org/r/452323 (https://phabricator.wikimedia.org/T196968) [08:31:56] (03PS2) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: enable HHVM on some sites(!!!) 
[puppet] - 10https://gerrit.wikimedia.org/r/452325 [08:31:58] (03PS1) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert usability wiki [puppet] - 10https://gerrit.wikimedia.org/r/452635 (https://phabricator.wikimedia.org/T196968) [08:32:00] (03PS1) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: migrate wikispecies [puppet] - 10https://gerrit.wikimedia.org/r/452636 (https://phabricator.wikimedia.org/T196968) [08:32:02] (03PS1) 10Jcrespo: mariadb: Failover db1066 (eqiad s2 master) to db1122 [puppet] - 10https://gerrit.wikimedia.org/r/452637 (https://phabricator.wikimedia.org/T197073) [08:33:13] (03PS7) 10Ema: ATS: add Lua scripting support [puppet] - 10https://gerrit.wikimedia.org/r/451838 (https://phabricator.wikimedia.org/T199720) [08:33:15] (03PS3) 10Ema: tox: add ts-lua tests for trafficserver [puppet] - 10https://gerrit.wikimedia.org/r/452612 (https://phabricator.wikimedia.org/T199720) [08:37:42] (03CR) 10Gehel: [C: 031] "very minor comments inline, otherwise LGTM" (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/450937 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [08:38:00] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 20 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [08:38:27] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1049.eqiad.wmnet', 'elastic1046... [08:40:03] (03CR) 10jerkins-bot: [V: 04-1] tox: add ts-lua tests for trafficserver [puppet] - 10https://gerrit.wikimedia.org/r/452612 (https://phabricator.wikimedia.org/T199720) (owner: 10Ema) [08:40:39] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [08:43:00] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 19 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [08:43:20] PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: instance=kubernetes1003.eqiad.wmnet operation_type={create_container,start_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [08:44:20] RECOVERY - kubelet operational latencies on kubernetes1003 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [08:46:16] (03CR) 10Gehel: [C: 031] "LGTM, minor comment inline." (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/452378 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [08:47:11] PROBLEM - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is CRITICAL: CRITICAL: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is alerting: 70% GET drop in 30min alert. [08:49:10] RECOVERY - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is OK: OK: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is not alerting. [08:49:42] uh ema ^^ [08:50:03] (03CR) 10Gehel: [C: 031] "Nice cleanup!" 
(032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/452379 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [08:50:19] vgutierrez: T200673 [08:50:21] T200673: varnish-http-requests false positives when a DC is depooled - https://phabricator.wikimedia.org/T200673 [08:50:48] ema: <3 [08:51:51] !log addshore@labweb1001:~$ mwscript extensions/OATHAuth/maintenance/disableOATHAuthForUser.php --wiki=labswiki GoranSMilovanovic # T201122 [08:51:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:59] T201122: Cannot login to Wikitech w. my LDAP account - https://phabricator.wikimedia.org/T201122 [08:57:00] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [08:59:19] Is there a way to automatically !log things from a machine while running a command? [09:00:45] <_joe_> addshore: heh, not really [09:01:01] <_joe_> but it's not so hard to add a simple script that does that for you [09:01:30] <_joe_> it's what scap and/or conftool do, after all [09:01:38] (03PS1) 10Jcrespo: mariadb: Point s2-master CNAME to db1122 [dns] - 10https://gerrit.wikimedia.org/r/452642 (https://phabricator.wikimedia.org/T201694) [09:01:48] it would be nice to just be able to run "sal some_command_here" for example [09:01:53] <_joe_> addshore: are you aware all your !log messages end up on twitter, right? [09:03:09] yup, that bot is the thing that probably says my username the most on twitter [09:03:27] <_joe_> eheheh indeed [09:03:45] <_joe_> I remember before knowing it to have done some very snarky !logs [09:04:38] (03CR) 10Gehel: "LGTM, very minor comment inline." (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/451254 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [09:05:04] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1039.eqiad.wmnet', 'elastic1046.eqiad.wmnet', 'elastic1049.eqiad.wmnet'] ``` an... 
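A `sal some_command_here` helper of the kind addshore wishes for above could be as small as the following sketch. It is hypothetical — no such script exists in this log, and actually delivering the message to IRC/SAL (what !log does) is left out; it only prints a ready-to-paste line:

```
#!/bin/bash
# sal: run a command, then print a ready-to-paste !log line for it.
start=$SECONDS
"$@"                 # run the wrapped command unchanged
rc=$?
echo "!log $* (exit=$rc, duration: $((SECONDS - start))s)"
exit $rc             # preserve the wrapped command's exit code
```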
[09:07:01] !log for i in {1..1000}; do echo Lexeme:L$i; done | mwscript purgePage.php --wiki wikidatawiki --skip-exists-check # T198301 [09:07:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:08] T198301: Poke existing lexemes to be reflected on SpecialPage - https://phabricator.wikimedia.org/T198301 [09:08:54] !log for i in {1000..3000}; do echo Lexeme:L$i; done | mwscript purgePage.php --wiki wikidatawiki --skip-exists-check # T198301 [09:08:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:14] !log for i in {3000..6000}; do echo Lexeme:L$i; done | mwscript purgePage.php --wiki wikidatawiki --skip-exists-check # T198301 [09:13:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:20] T198301: Poke existing lexemes to be reflected on SpecialPage - https://phabricator.wikimedia.org/T198301 [09:15:56] (03CR) 10Gehel: [C: 031] "LGTM, trivial enough" [software/spicerack] - 10https://gerrit.wikimedia.org/r/451537 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [09:16:33] 10Operations, 10Wikidata, 10monitoring, 10Patch-For-Review, 10User-Addshore: Add Addshore & possibly other WMDE devs/deployers to the wikidata icinga contact list - https://phabricator.wikimedia.org/T195289 (10Addshore) Quite some time has passed now, any update here? [09:16:51] 10Operations, 10Electron-PDFs, 10Proton, 10Readers-Web-Backlog, and 4 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 (10phuedx) Further to the above, AIUI both {T181623} and {T177765} block Proton's deployment. @pmiazga has submitted two changes for the form... [09:17:14] !log for i in {6000..10000}; do echo Lexeme:L$i; done | mwscript purgePage.php --wiki wikidatawiki --skip-exists-check # T198301 [09:17:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:53] !log upgrading wikidiff to 1.7.2 on mw1238-mw1258 (HHVM bytecode cache is pruned during update) [09:20:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:01] (03PS1) 10Jcrespo: mariadb: Depool db1102 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452644 (https://phabricator.wikimedia.org/T201694) [09:23:29] !log for i in {10000..12500}; do echo Lexeme:L$i; done | mwscript purgePage.php --wiki wikidatawiki --skip-exists-check # T198301 [09:23:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:35] T198301: Poke existing lexemes to be reflected on SpecialPage - https://phabricator.wikimedia.org/T198301 [09:24:24] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db1102 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452644 (https://phabricator.wikimedia.org/T201694) (owner: 10Jcrespo) [09:25:05] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [09:25:40] (03Merged) 10jenkins-bot: mariadb: Depool db1102 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452644 (https://phabricator.wikimedia.org/T201694) (owner: 10Jcrespo) [09:27:15] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1102 (duration: 00m 52s) [09:27:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:53] (03PS1) 10Jcrespo: Revert "mariadb: Depool db1102 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452645 [09:32:13] !log stop and 
restart db1122 for maintenance [09:32:14] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 21 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [09:32:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:08] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1044.eqiad.wmnet', 'elastic1045... [09:33:23] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [09:33:26] (03CR) 10Gehel: [C: 04-1] Add remote module to interact with Cumin (034 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/451538 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [09:33:33] 10Operations, 10Wikidata, 10monitoring, 10Patch-For-Review, 10User-Addshore: Add Addshore & possibly other WMDE devs/deployers to the wikidata icinga contact list - https://phabricator.wikimedia.org/T195289 (10ArielGlenn) @Addshore and @Ladsgroup should be on the contact list (patch merged at the end of... [09:35:49] (03PS1) 10Jcrespo: mariadb: Switch db1122 binlog format to ROW [puppet] - 10https://gerrit.wikimedia.org/r/452648 (https://phabricator.wikimedia.org/T201694) [09:36:29] (03CR) 10Jcrespo: [C: 032] mariadb: Switch db1122 binlog format to ROW [puppet] - 10https://gerrit.wikimedia.org/r/452648 (https://phabricator.wikimedia.org/T201694) (owner: 10Jcrespo) [09:37:05] (03CR) 10jenkins-bot: mariadb: Depool db1102 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452644 (https://phabricator.wikimedia.org/T201694) (owner: 10Jcrespo) [09:38:58] (03PS1) 10Jcrespo: mariadb: Switch db1122 binlog format to STATEMENT [puppet] - 10https://gerrit.wikimedia.org/r/452649 (https://phabricator.wikimedia.org/T201694) [09:39:46] (03CR) 10Jcrespo: [C: 032] mariadb: Switch db1122 binlog format to STATEMENT [puppet] - 10https://gerrit.wikimedia.org/r/452649 (https://phabricator.wikimedia.org/T201694) (owner: 10Jcrespo) [09:42:14] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [09:42:20] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool db1102 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452645 (owner: 10Jcrespo) [09:43:38] (03Merged) 10jenkins-bot: Revert "mariadb: Depool db1102 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452645 (owner: 10Jcrespo) [09:46:26] (03CR) 10Gehel: "Mostly minor comments inline" (035 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/451814 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [09:46:45] (03PS2) 10Jcrespo: mariadb: Failover db1066 (eqiad s2 master) to db1122 [puppet] - 10https://gerrit.wikimedia.org/r/452637 (https://phabricator.wikimedia.org/T197073) [09:47:23] (03PS2) 10Jcrespo: mariadb: Set s2 in read only mode due to maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452620 (https://phabricator.wikimedia.org/T201694) [09:49:23] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes 
of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [09:50:17] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1102 with low load (duration: 00m 50s) [09:50:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:40] (03CR) 10jenkins-bot: Revert "mariadb: Depool db1102 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452645 (owner: 10Jcrespo) [09:54:24] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [09:57:45] <_joe_> addshore: /win 22 [09:57:47] <_joe_> argh [09:57:51] <_joe_> sorry [09:58:01] <_joe_> keyboard shortcut error [09:58:04] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [09:59:46] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1044.eqiad.wmnet', 'elastic1045.eqiad.wmnet', 'elastic1048.eqiad.wmnet'] ``` an... [10:01:30] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [10:02:00] PROBLEM - HHVM jobrunner on mw1299 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.003 second response time [10:02:41] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "I believe you should use 'profile::openstack::main::region' instead of 'profile::openstack::base::region'." [puppet] - 10https://gerrit.wikimedia.org/r/452609 (https://phabricator.wikimedia.org/T201881) (owner: 10Volans) [10:03:00] RECOVERY - HHVM jobrunner on mw1299 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.003 second response time [10:04:12] 10Operations, 10MediaWiki-extensions-Translate, 10Language-2018-July-September, 10MW-1.32-release-notes (WMF-deploy-2018-08-14 (1.32.0-wmf.17)), and 4 others: 503 error attempting to open multiple projects (Wikipedia and meta wiki are loading very slowly) - https://phabricator.wikimedia.org/T195293 (10Petar... [10:04:40] (03CR) 10Volans: "> Patch Set 1: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/452609 (https://phabricator.wikimedia.org/T201881) (owner: 10Volans) [10:10:27] 10Operations, 10LDAP-Access-Requests, 10User-Addshore: Give access to graphite and grafana-admin to Aleksey Bekh-Ivanov (WMDE) - https://phabricator.wikimedia.org/T199233 (10WMDE-leszek) 05declined>03Open As Aleksey's manager I hereby sign off this request. He is an WMDE engineer and needs the requested... 
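The two puppet patches logged earlier (gerrit 452648 and 452649) flip db1122's binlog format around its maintenance restart. A quick client-side check of what the server actually ended up with could look like this — illustrative only, assuming suitable grants; the host name is from the log:

```
# confirm the effective replication settings on db1122 after the restart
mysql -h db1122.eqiad.wmnet -e "SELECT @@GLOBAL.binlog_format, @@GLOBAL.read_only;"
```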
[10:12:32] !log uploaded cumin_3.0.2-2_amd64.deb to apt.wikimedia.org jessie-wikimedia [10:12:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:22] (03CR) 10Arturo Borrero Gonzalez: [C: 031] openstack glance: move active service for eqiad1 and main to cloudcontrol1003 [puppet] - 10https://gerrit.wikimedia.org/r/452595 (https://phabricator.wikimedia.org/T191791) (owner: 10Andrew Bogott) [10:14:39] arturo: if you have a minute https://gerrit.wikimedia.org/r/c/operations/puppet/+/452609 (see my reply) [10:14:45] (03CR) 10Arturo Borrero Gonzalez: [C: 031] "Could we also drop the `modules/profile/manifests/openstack/main/glance.pp` file if it isn't referenced anymore?" [puppet] - 10https://gerrit.wikimedia.org/r/452596 (https://phabricator.wikimedia.org/T191791) (owner: 10Andrew Bogott) [10:15:03] ack [10:15:46] (03CR) 10Volans: [C: 031] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452620 (https://phabricator.wikimedia.org/T201694) (owner: 10Jcrespo) [10:15:52] thx [10:16:41] !log upgrading wikidiff to 1.7.2 on mw1221-mw1235 (HHVM bytecode cache is pruned during update) [10:16:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:07] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/452637 (https://phabricator.wikimedia.org/T197073) (owner: 10Jcrespo) [10:17:36] (03CR) 10Volans: [C: 031] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/452642 (https://phabricator.wikimedia.org/T201694) (owner: 10Jcrespo) [10:18:00] (03CR) 10Tim Starling: [C: 031] mariadb: Set s2 as read-write and promote db1122 as the new s2 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452632 (https://phabricator.wikimedia.org/T201694) (owner: 10Jcrespo) [10:19:11] (03CR) 10Volans: [C: 031] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452632 (https://phabricator.wikimedia.org/T201694) (owner: 10Jcrespo) [10:22:07] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "> > Patch Set 1: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/452609 (https://phabricator.wikimedia.org/T201881) (owner: 10Volans) [10:23:22] arturo: thanks for the clarification, at this point I have additional questions :) [10:23:40] go ahead :-) [10:23:54] the cluster that was until yesterday without region, and is the one used by WMCS and queried by cumin on labpuppetmaster*, which one is it? main? [10:24:06] yes, main [10:24:14] which region is `eqiad` [10:24:20] ok and the eqiad1 what is it? [10:24:29] will you need to query that too soon? [10:24:58] will it replace main or in addition to it? [10:26:42] (03PS2) 10Volans: cumin: add region_name to the WMCS openstack config [puppet] - 10https://gerrit.wikimedia.org/r/452609 (https://phabricator.wikimedia.org/T201881) [10:26:43] in the meanwhile, code updated :) ^^^ [10:27:23] (03CR) 10Volans: "Ack, done." 
[puppet] - 10https://gerrit.wikimedia.org/r/452609 (https://phabricator.wikimedia.org/T201881) (owner: 10Volans) [10:31:51] (03PS1) 10Volans: Rebuild for Django security upgrade [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/452656 [10:32:46] volans: `eqiad1` is another deployment [10:33:11] the one that will eventually replace `main` [10:33:22] `main` -> nova-network based openstack deployment [10:33:29] ok, so I guess for the interim period we'll use 2 different cumin configs [10:33:40] `eqiad1` -> neutron based openstack deployment [10:33:47] with one default and the other to be specified with -c /etc/cumin/eqiad1.yaml [10:34:14] for which we could add an alias/wrapper :) [10:35:04] apart from the many deployments, we introduced the concept of `regions`, and since yesterday the main and eqiad1 deployments share some components by means of this region mechanism [10:35:34] the main deployment region is called `eqiad` and the eqiad1 deployment region is called `eqiad1-r` [10:35:42] naming is hard.... :-P [10:36:05] ehehe [10:36:10] is the patch ok now to merge? [10:36:23] at least to unblock cumin for the current main deployment [10:36:25] checking [10:36:33] thanks! [10:37:00] (03CR) 10Arturo Borrero Gonzalez: [C: 031] cumin: add region_name to the WMCS openstack config [puppet] - 10https://gerrit.wikimedia.org/r/452609 (https://phabricator.wikimedia.org/T201881) (owner: 10Volans) [10:37:06] +1 [10:37:28] (03CR) 10Volans: [C: 032] cumin: add region_name to the WMCS openstack config [puppet] - 10https://gerrit.wikimedia.org/r/452609 (https://phabricator.wikimedia.org/T201881) (owner: 10Volans) [10:37:29] great [10:39:11] working! [10:40:08] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [10:41:26] !log upgraded cumin on labpuppetmaster* to fix cumin with the new openstack region - T201881 [10:41:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:33] T201881: Cumin's OpenStack backend appears to be broken after labs keystone region merge - https://phabricator.wikimedia.org/T201881 [10:44:08] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [10:45:16] !log repooled mw1227 (was probably overlooked to repool after previous debugging) [10:45:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:39] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [11:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor I ♥ Unicode. All rise for European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180814T1100). [11:00:04] Zoranzoki21: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:03:10] I am here [11:03:12] Who is SWATing? [11:03:31] I can SWAT today [11:03:48] Ok. Can you?
[11:03:50] Zoranzoki21: I'll ping you when the first patch is at mwdebug1002 for testing [11:04:03] zeljkof: Testing is not needed [11:04:12] Zoranzoki21: for both patches? [11:04:17] zeljkof: No [11:04:31] zeljkof: These two patches only fix a typo in a comment [11:05:13] (03CR) 10Volans: [C: 031] "Diff looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/451255 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [11:10:10] Zoranzoki21: ok, both patches look good to me, will merge and deploy, I guess there is nothing to test :) [11:10:34] zeljkof: Yes [11:11:27] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452050 (https://phabricator.wikimedia.org/T201491) (owner: 10Zoranzoki21) [11:11:47] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452051 (https://phabricator.wikimedia.org/T201491) (owner: 10Zoranzoki21) [11:12:58] (03Merged) 10jenkins-bot: Fix 'the the' typo in wmf-config/CirrusSearch-common.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452050 (https://phabricator.wikimedia.org/T201491) (owner: 10Zoranzoki21) [11:13:17] (03Merged) 10jenkins-bot: Fix 'the the' typo in vendor/perftools/xhgui-collector/external/header.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452051 (https://phabricator.wikimedia.org/T201491) (owner: 10Zoranzoki21) [11:13:35] Zoranzoki21: both patches merged, no rebase was needed [11:13:46] zeljkof: Excellent then [11:14:11] well, except the rebase that gerrit does automatically [11:14:50] !log repooled mw1233 (debugging completed) [11:14:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:59] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 21 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [11:17:20] !log zfilipin@deploy1001 Synchronized wmf-config/CirrusSearch-common.php: SWAT: [[gerrit:452050|Fix the the typo in wmf-config/CirrusSearch-common.php (T201491)]] (duration: 00m 51s) [11:17:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:27] T201491: Fix common typos in code - https://phabricator.wikimedia.org/T201491 [11:19:16] !log zfilipin@deploy1001 Synchronized vendor/perftools/xhgui-collector/external/header.php: SWAT: [[gerrit:452051|Fix "the the" typo in vendor/perftools/xhgui-collector/external/header.php (T201491)]] (duration: 00m 49s) [11:19:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:53] !log upgrading wikidiff to 1.7.2 on mw1299-mw1306 (HHVM bytecode cache is pruned during update) [11:19:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:14] Zoranzoki21, Zoranzoki21_: both patches deployed, thanks for deploying with #releng!
:D [11:20:26] You're welcome [11:20:50] !log EU SWAT finished [11:20:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:01] (03CR) 10jenkins-bot: Fix 'the the' typo in wmf-config/CirrusSearch-common.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452050 (https://phabricator.wikimedia.org/T201491) (owner: 10Zoranzoki21) [11:25:04] (03CR) 10jenkins-bot: Fix 'the the' typo in vendor/perftools/xhgui-collector/external/header.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452051 (https://phabricator.wikimedia.org/T201491) (owner: 10Zoranzoki21) [11:35:29] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 21 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [11:37:26] what's the procedure for setting up a new ssh key for production shell access? I forgot my passphrase, since I never use it >_< [11:37:29] don't judge ;) [11:38:19] DanielK_WMDE: please create a Phab task and tag it SRE-Access-Requests, then it'll be processed by clinic duty [11:39:21] moritzm: can do that, but where does the new private key go? [11:39:26] err, public :) [11:39:31] the private key doesn't go anywhere :) [11:40:29] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 16 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [11:40:41] !log restarted populateContentTables.php on s2 (T183488) [11:40:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:47] T183488: MCR schema migration stage 2: populate new fields - https://phabricator.wikimedia.org/T183488 [11:40:55] DanielK_WMDE: simply paste it in the Phab task, the person on clinic duty will take care of merging/deploying it via puppet [11:40:59] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [11:41:18] moritzm: ok [11:43:20] 10Operations, 10SRE-Access-Requests: new ssh key for daniel - https://phabricator.wikimedia.org/T201913 (10daniel) [11:44:59] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [11:52:09] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 21 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [11:58:50] !log upgrading wikidiff to 1.7.2 on mw1276-mw1283 (HHVM bytecode cache is pruned during update) [11:58:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:59] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [12:09:36] (03PS6) 10Jcrespo: db backup statistics: Initial implementation of the backup stats [puppet] - 10https://gerrit.wikimedia.org/r/449681 (https://phabricator.wikimedia.org/T198987) [12:10:22] (03CR) 10jerkins-bot: [V: 04-1] db backup statistics: Initial implementation of the backup stats [puppet] - 10https://gerrit.wikimedia.org/r/449681 (https://phabricator.wikimedia.org/T198987) (owner: 10Jcrespo) [12:12:24] (03CR) 10Jcrespo: "I have implemented the quick fixes.
The with, logger and dict changes require more work (specially logger on the whole file), and will do " (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/449681 (https://phabricator.wikimedia.org/T198987) (owner: 10Jcrespo) [12:14:16] 10Operations, 10TCB-Team, 10wikidiff2, 10WMDE-QWERTY-Sprint-2018-07-17, and 2 others: Update wikidiff2 library on the WMF production cluster to v1.7.2 - https://phabricator.wikimedia.org/T199801 (10WMDE-Fisch) [12:15:02] (03PS7) 10Jcrespo: db backup statistics: Initial implementation of the backup stats [puppet] - 10https://gerrit.wikimedia.org/r/449681 (https://phabricator.wikimedia.org/T198987) [12:17:19] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [12:18:12] 10Operations, 10ops-eqiad, 10Discovery, 10Discovery-Search, 10Elasticsearch: check elastic1022 power supply redundancy - https://phabricator.wikimedia.org/T177631 (10Gehel) 05Resolved>03Open Re-opening as Icinga is still alerting on this. I can confirm that `ipmi-sensors --output-sensor-state --ignor... [12:18:26] 10Operations, 10ops-eqiad, 10Discovery, 10Discovery-Search, 10Elasticsearch: check elastic1022 power supply redundancy - https://phabricator.wikimedia.org/T177631 (10Gehel) p:05Triage>03High [12:24:29] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [12:31:33] PROBLEM - ElasticSearch health check for shards on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - elasticsearch http://10.2.2.30:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.2.2.30, port=9200): Read timed out. (read timeout=4) [12:31:56] uh? [12:31:58] <_joe_> uh [12:32:01] ^ master re-election slower than expected, should be back up in a second [12:32:03] <_joe_> that looks bad [12:32:10] <_joe_> gehel: is that serving traffic? [12:32:14] were you doing something gehel? [12:32:18] should be, checking [12:32:27] yep, reimaging of elastic nodes [12:32:33] RECOVERY - ElasticSearch health check for shards on search.svc.eqiad.wmnet is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 32, unassigned_shards: 141, number_of_pending_tasks: 589, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3139, task_max_waiting_in_queue_millis: 65118, cluster_name: production-search-eqiad, relocating_shards: 61, active_shards_percent_as_nu [12:32:33] , active_shards: 9276, initializing_shards: 10, number_of_data_nodes: 32, delayed_unassigned_shards: 141 [12:32:41] please !log :) [12:32:49] search wfm [12:32:51] ok, let's make that check not paging... [12:33:22] paravoid: there should be a log from the reimage script [12:34:24] (03PS2) 10Gehel: elasticsearch: shards check should not page. [puppet] - 10https://gerrit.wikimedia.org/r/451583 [12:34:30] gehel: I only see one from yesterday [12:35:06] yeah, scrolling back, I can't see it either [12:35:31] (03CR) 10Gehel: [C: 032] elasticsearch: shards check should not page. 
[puppet] - 10https://gerrit.wikimedia.org/r/451583 (owner: 10Gehel) [12:36:48] PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [600.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&panelId=9&fullscreen [12:37:59] PROBLEM - tilerator on maps1004 is CRITICAL: connect to address 10.64.48.154 and port 6534: Connection refused [12:39:33] tilerator issue seems transient, an npm worker was killed and automatically restarted [12:39:39] (03PS1) 10Giuseppe Lavagetto: [WIP] PHP: create module for modern Debian-based distributions [puppet] - 10https://gerrit.wikimedia.org/r/452664 (https://phabricator.wikimedia.org/T201140) [12:40:22] cirrus failures should be going down in a minute, the trend on the graph is not amazingly clear though [12:41:11] https://horizon.wikimedia.org/ is not loading for me, btw [12:42:04] <_joe_> wfm [12:42:59] RECOVERY - Number of backend failures per minute from CirrusSearch on graphite1001 is OK: OK: Less than 20.00% above the threshold [300.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&panelId=9&fullscreen [12:43:18] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 43 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [12:43:35] Nikerabbit: wfm too, I also logged in [12:43:39] works in incognito... something messed up with session state I suppose [12:44:04] yeah, try clearing the cookies and session [12:44:38] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [12:45:43] curiously, https://horizon.wikimedia.org redirects to http://horizon.wikimedia.org/project/ (not https!), which redirects to the same url in https that never loads [12:46:30] 10Operations: onboarding Effie Mouzeli - https://phabricator.wikimedia.org/T201855 (10Joe) [12:46:32] since it doesn't load at all, I don't have easy access to delete cookies for that domain... 
annoying browsers [12:46:32] 10Operations: Onboarding Effie Mouzeli - https://phabricator.wikimedia.org/T201816 (10Joe) [12:47:15] Nikerabbit: try https://horizon.wikimedia.org/auth/ [12:47:22] 10Operations, 10Electron-PDFs, 10Proton, 10Readers-Web-Backlog, and 4 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 (10debt) awesome, thanks for the updates, @Pchelolo and @phuedx :) [12:47:30] it gives you an error, but should load and allow you to clear them ;) [12:48:03] nope, same behavior :( [12:48:17] that's weird, I get a 404 "The page you were looking for doesn't exist" [12:48:43] then try the https://horizon.wikimedia.org/auth/logout/ logout page [12:49:07] !log upgrading wikidiff2 to 1.7.2 on snapshot hosts [12:49:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:31] volans: I found a way to delete cookies for a specific site from Chrome's settings, but thanks for the help anyway [12:49:39] ack [12:49:44] no problem :) [12:52:10] PROBLEM - tilerator on maps1003 is CRITICAL: connect to address 10.64.32.117 and port 6534: Connection refused [12:52:49] PROBLEM - tilerator on maps1001 is CRITICAL: connect to address 10.64.0.79 and port 6534: Connection refused [12:53:00] PROBLEM - tilerator on maps1002 is CRITICAL: connect to address 10.64.16.42 and port 6534: Connection refused [12:53:09] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1042.eqiad.wmnet', 'elastic1040.eqiad.wmnet', 'elastic1041.eqiad.wmnet'] ``` an... [12:53:19] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 17 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [12:53:36] tilerator is suspicious, looking [12:54:32] !log reindexing Indonesian wikis on elastic@eqiad and elastic@codfw complete (T200204) [12:54:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:39] T200204: Re-index Malay and Indonesian Wikis to use new unpacked analysis chain - https://phabricator.wikimedia.org/T200204 [12:57:04] !log restarting tilerator on maps eqiad [12:57:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:19] RECOVERY - tilerator on maps1003 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.035 second response time [12:57:40] RECOVERY - tilerator on maps1004 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.030 second response time [12:57:50] RECOVERY - tilerator on maps1001 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.007 second response time [12:58:09] RECOVERY - tilerator on maps1002 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.031 second response time [12:59:24] (03PS8) 10Jcrespo: db backup statistics: Initial implementation of the backup stats [puppet] - 10https://gerrit.wikimedia.org/r/449681 (https://phabricator.wikimedia.org/T198987) [13:02:04] (03CR) 10Jcrespo: "This is still untested, but can give you an idea of the suggestions of the review being implemented (even if I am not 100% some are actual" [puppet] - 10https://gerrit.wikimedia.org/r/449681 (https://phabricator.wikimedia.org/T198987) (owner: 10Jcrespo) [13:05:30] (03CR) 10Jcrespo: "It would be nice to know your high level opinion on arch decisions- for example, I chose to connect to mysql directly and not implement an" [puppet] - 10https://gerrit.wikimedia.org/r/449681 
(https://phabricator.wikimedia.org/T198987) (owner: 10Jcrespo) [13:11:14] !log upgrading wikidiff2 to 1.7.2 on mw1308-mw1311/mw1293-mw1296 (HHVM bytecode cache is pruned during update) [13:11:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:41] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [13:13:50] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [13:15:46] 10Operations, 10Wikimedia-Mailing-lists: wikimedia-us-mn administration password reset - https://phabricator.wikimedia.org/T201920 (10MarkTraceur) [13:16:41] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [13:21:49] !log restarting elasticsearch on elastic1043 (overloaded) [13:21:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:17] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on neodymium.eqiad.wmnet for hosts: ``` cp5005.eqsin.wmnet ``` The log can be found in `/var/log/wmf-auto-reimage/201808... [13:23:50] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [13:26:51] (03PS1) 10Gehel: elasticsearch: storage device is md1 after reimage to stretch [puppet] - 10https://gerrit.wikimedia.org/r/452669 [13:28:51] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [13:29:20] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [13:30:30] (03CR) 10Ottomata: Remove geowiki cron jobs and make puppet delete related files/dirs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/450040 (https://phabricator.wikimedia.org/T190059) (owner: 10Fdans) [13:35:53] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [13:37:33] PROBLEM - Disk space on elastic1030 is CRITICAL: DISK CRITICAL - free space: /srv 52146 MB (10% inode=99%) [13:40:33] RECOVERY - Disk space on elastic1030 is OK: DISK OK [13:45:42] (03PS14) 10Giuseppe Lavagetto: webperf: Split Redis from the rest of the arclamp profile [puppet] - 10https://gerrit.wikimedia.org/r/444331 (https://phabricator.wikimedia.org/T195312) (owner: 10Krinkle) [13:45:52] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [13:46:27] <_joe_> Krinkle: puppet patches don't get merged by gerrit [13:46:31] <_joe_> I need to merge them myself [13:46:50] right [13:46:50] <_joe_> that's why I gave +2 previously but didn't merge it [13:47:08] !log restarting elasticsearch on elastic1051 (overloaded) 
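The shards check that fired at 12:31 polls the Elasticsearch cluster health API on the LVS endpoint, and the "overloaded" node restarts above are usually diagnosed against the same data. A minimal sketch of querying it by hand, using the endpoint quoted in the alert; `?pretty` and `?level=indices` are standard Elasticsearch query options:
```
# Same cluster health endpoint the Icinga check fetches (search.svc.eqiad.wmnet):
curl -s 'http://10.2.2.30:9200/_cluster/health?pretty'
# Per-index breakdown helps spot which indices hold the unassigned/initializing shards:
curl -s 'http://10.2.2.30:9200/_cluster/health?level=indices&pretty'
```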
[13:47:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:00] <_joe_> Krinkle: I am running puppet manually on mwlog1001 [13:48:32] <_joe_> noop as expected [13:49:10] cool. [13:50:06] (03PS12) 10Giuseppe Lavagetto: webperf: Add arclamp profile to webperf::profiling_tools role [puppet] - 10https://gerrit.wikimedia.org/r/445066 (https://phabricator.wikimedia.org/T195312) (owner: 10Krinkle) [13:50:15] <_joe_> ok the next one is the first that should do something [13:50:40] (03CR) 10Giuseppe Lavagetto: [C: 032] webperf: Add arclamp profile to webperf::profiling_tools role [puppet] - 10https://gerrit.wikimedia.org/r/445066 (https://phabricator.wikimedia.org/T195312) (owner: 10Krinkle) [13:51:57] Yeah, I'm logged-in on webperf2002/1002 and expect user[xenon] and the xenon-log service to start showing up there [13:52:28] Which just reminded me, I do not know whether or not reading Redis from mwlog1001 will work just as-is or whether that needs a firewall rule. Completely forgot about that. [13:52:53] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [13:52:58] <_joe_> Notice: /Stage[main]/Httpd/Service[apache2]: Triggered 'refresh' from 4 events [13:53:09] <_joe_> Krinkle: let's see [13:53:33] <_joe_> it works [13:53:53] 10Operations, 10TCB-Team, 10wikidiff2, 10WMDE-QWERTY-Sprint-2018-07-31, 10WMDE-QWERTY-Sprint-2018-08-14: Update wikidiff2 library on the WMF production cluster to v1.7.2 - https://phabricator.wikimedia.org/T199801 (10WMDE-Fisch) [13:53:55] <_joe_> as in I can connect to redis on mwlog1001 [13:54:07] from a webperf ? [13:54:08] cool [13:54:13] <_joe_> yes [13:54:22] <_joe_> I think I checked when I reviewed the patch [13:54:30] <_joe_> but you know, reality can be tricky [13:54:42] <_joe_> so I guess you want to verify something [13:55:13] <_joe_> uhm I see a problem [13:55:25] <_joe_> xenon.conf has no ServerName directive [13:55:34] <_joe_> which means it will never serve requests [13:55:52] <_joe_> oh it's the only vhost though [13:55:53] <_joe_> ok [13:56:14] Yeah, it's an old issue. I've got a todo to fix that. [13:56:47] (03PS1) 10Mforns: Add salt file path to EventLoggingSanitization cron job [puppet] - 10https://gerrit.wikimedia.org/r/452674 (https://phabricator.wikimedia.org/T199902) [13:56:47] <_joe_> https://gerrit.wikimedia.org/r/c/operations/puppet/+/451107/ is beta only [13:56:51] I realised it when things just "worked" in Beta Cluster when routing performance-beta.wmflabs.org to a host serving performance.wikimedia.org, and it worked because it didn't identify as that. [13:56:54] <_joe_> I assume it's safe to merge? [13:56:57] Yeah, already picked there. 
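_joe_'s reachability test above (13:53) can be reproduced from a webperf host with redis-cli; a minimal sketch, assuming the default Redis port 6379, since the actual port is not stated in the log:
```
# Hedged sketch: confirm a webperf host can reach Redis on mwlog1001.
# 6379 is the Redis default and an assumption here.
redis-cli -h mwlog1001.eqiad.wmnet -p 6379 ping    # expect: PONG
```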
[13:57:06] (03CR) 10Giuseppe Lavagetto: [C: 032] webperf: Switch arclamp_host in Beta from mwlog host to webperf12 [puppet] - 10https://gerrit.wikimedia.org/r/451107 (https://phabricator.wikimedia.org/T195312) (owner: 10Krinkle) [13:57:08] PROBLEM - Varnish backend child restarted on cp1087 is CRITICAL: 4 gt 3 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp1087&var-datasource=eqiad+prometheus/ops [13:57:17] (03PS6) 10Giuseppe Lavagetto: webperf: Switch arclamp_host in Beta from mwlog host to webperf12 [puppet] - 10https://gerrit.wikimedia.org/r/451107 (https://phabricator.wikimedia.org/T195312) (owner: 10Krinkle) [13:57:38] The one after that for prod should cause data served from https://performance.wikimedia.org/xenon/svgs/daily/ to come from webperfX002 instead of mwlog1001 [13:57:51] Which at first will be visible by there being almost no data (I'll backfill later) [13:58:18] <_joe_> Krinkle: do you want to backfill first, switch afterwards? [13:58:25] <_joe_> either choice is ok [13:58:36] No, that's alright. It's just for human use. Nothing depends on this programmatically. [13:58:54] I can see it's working on the new host, /srv/xenon is being populated already from Redis [13:59:28] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:59:44] <_joe_> uh? [13:59:54] <_joe_> can someone check what's up with cache_text? [14:00:20] looking [14:02:09] brief spike likely due to a backend child crash on cp1087 [14:02:18] <_joe_> thanks for looking <3 [14:02:28] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [14:02:58] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [14:03:02] interestingly only two hosts in eqiad have been affected by the crashes, it's gonna be a fun issue to debug [14:05:33] !log restart varnish-be on cp108[79], fetch failures after child crashes [14:05:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:04] (03CR) 10Giuseppe Lavagetto: [C: 032] webperf: Switch webperf::site to use arclamp from webperf-2 [puppet] - 10https://gerrit.wikimedia.org/r/452449 (https://phabricator.wikimedia.org/T195312) (owner: 10Krinkle) [14:06:14] (03PS2) 10Giuseppe Lavagetto: webperf: Switch webperf::site to use arclamp from webperf-2 [puppet] - 10https://gerrit.wikimedia.org/r/452449 (https://phabricator.wikimedia.org/T195312) (owner: 10Krinkle) [14:07:12] <_joe_> https://memegenerator.net/img/instances/65289046/waiting-for-jenkins-to-finish-build.jpg me right now [14:07:29] RECOVERY - Varnish backend child restarted on cp1089 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp1089&var-datasource=eqiad+prometheus/ops [14:09:18] RECOVERY - Varnish backend child restarted on cp1087 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp1087&var-datasource=eqiad+prometheus/ops [14:09:59] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 21 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [14:10:18] <_joe_> Krinkle: uhm I ran puppet on webperf1001 but I see more files than I expected [14:10:36] <_joe_> ah ok, varnish caches these [14:10:39] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [14:10:43] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp5005.eqsin.wmnet'] ``` and were **ALL** successful. [14:10:48] <_joe_> not sure it's what we wanted [14:10:58] <_joe_> (varnish caching such files) [14:11:14] files appeared on webperf1001? [14:12:15] <_joe_> no, on https://performance.wikimedia.org/xenon/logs/daily [14:12:35] <_joe_> but if you bust the cache with any bogus query parameter, you can see the actual shortlist [14:12:47] right [14:13:05] Yeah, we may need to revise the caching of that. [14:13:16] It's also multi-dc. [14:13:17] <_joe_> I would assume no caching is what we want [14:13:51] I've been refreshing that page every few seconds and I do see the timestamps and order change constantly until now [14:13:58] so I think it's already not caching [14:14:05] but maybe it was routing to codfw randomly as well? 
[14:14:08] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [14:15:08] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [14:15:49] <_joe_> Krinkle: uhm [14:15:49] _joe_: does a typical misc/eqiad+codfw director route round-robin to both? I'd assume it uses eqiad for eqiad and codfw for codfw. [14:15:58] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [14:15:58] <_joe_> misc is no more [14:16:09] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [14:16:20] <_joe_> in theory, requests going to ulsfo/eqsin/codfw should go to codfw [14:16:29] <_joe_> and requests going to eqiad and esams should go to eqiad [14:16:34] <_joe_> for active/active things [14:16:58] right [14:17:29] <_joe_> performance: [14:17:30] but refreshing this url still gives me sometimes webperfX001->mwlog1001 (old) and sometimes webperfX001->webperfX002 (new) [14:17:33] <_joe_> backends: [14:17:37] <_joe_> eqiad: 'webperf1001.eqiad.wmnet' [14:17:41] <_joe_> codfw: 'webperf2001.codfw.wmnet' [14:17:50] <_joe_> that's baffling, yeah [14:18:04] <_joe_> can you look at the caching headers for both cases? [14:18:12] <_joe_> X-cache should tell us what's going on [14:18:30] (old) x-cache: cp1089 pass, cp3041 hit/5, cp3033 pass; x-cache-status: hit-local ;; [14:18:31] (new) x-cache: cp1089 pass, cp3033 hit/4, cp3033 pass; x-cache-status: hit-local [14:18:50] <_joe_> so both cached, but with different values? [14:18:52] <_joe_> wow [14:18:57] <_joe_> ema, bblack any idea? [14:18:58] PROBLEM - Varnish backend child restarted on cp1089 is CRITICAL: 4 gt 3 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp1089&var-datasource=eqiad+prometheus/ops [14:19:17] <_joe_> well it's cp1089, which is having some issues AFAICS [14:19:32] getting HTTP 200 and age:0 on both [14:19:37] 10Operations, 10Wikimedia-Mailing-lists: Mailing list for Wikimedians of Tamazight User Group - https://phabricator.wikimedia.org/T201929 (10Vikoula5) [14:20:03] <_joe_> ok I got it [14:20:14] <_joe_> we have two frontends with different versions of that page cached [14:20:29] <_joe_> and that can happen ofc [14:20:34] Right [14:20:42] <_joe_> now we should purge that url, or we wait [14:20:45] <_joe_> I vote we wait [14:20:49] Yeah, no need to purge. [14:20:58] but I'm confused as to how/why it caches [14:21:07] Should it not give a non-zero age in that case? 
[14:21:46] <_joe_> let me see [14:21:53] !log ema@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp1089.eqiad.wmnet,service=varnish-be [14:21:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:57] <_joe_> the server sends no caching headers whatsoever [14:23:11] <_joe_> I just tried the xenon/ directory [14:23:34] yeah, it's pretty much default static files over apache [14:23:47] I thought age would be computed in varnish though [14:26:51] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1102 with full weight (duration: 00m 52s) [14:26:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:52] 10Operations, 10netops, 10Patch-For-Review: Move servers off asw2-a-eqiad - https://phabricator.wikimedia.org/T201694 (10Cmjohnson) @ayounsi I added sfp-t's to asw2-a5-eqiad for the new server in that rack. For the remainder of the 10G servers in racks 2/4/6 do you want me to run cross connects to asw2-a5?... [14:32:32] I'm off until Sunday evening. See folks then! [14:33:29] _joe_: Is it applied to codfw as well? [14:33:38] webperf2001. [14:35:55] !log Copying xenon/logs/daily/2018-*{all,load,index,api,RunSingleJob}.log from mwlog1001 to webperfX002 hosts [14:35:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:19] <_joe_> Krinkle: maybe not [14:36:26] <_joe_> let me see if puppet has run there [14:37:05] * Krinkle makes a note to figure out how to add a "Server:" header to these so that it's easier to see where stuff came from [14:38:07] <_joe_> Krinkle: now it's applied everywhere [14:38:12] perfect [14:39:10] So the cache_misc, it's gone completely? I got the impression it was in progress because I found the performance_director in both text and misc.yaml [14:40:01] (03PS4) 10Herron: WIP: logstash: add ids to filter configs [puppet] - 10https://gerrit.wikimedia.org/r/452461 [14:40:21] <_joe_> Krinkle: it's pending cleanup [14:40:25] !log bstorm@deploy1001 Started deploy [striker/deploy@13da520]: (no justification provided) [14:40:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:57] OK. I'd love to figure out why there is no Age header on these, but it also seems relatively unimportant right now, so I'll get back to stuff now. [14:40:58] Thanks! 
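The header comparison above (14:18) boils down to fetching only the response headers and reading the caching-related ones; a minimal sketch with curl, where the URL is the directory listing under discussion and the throwaway query parameter is the cache-busting trick _joe_ mentions:
```
# Inspect the caching headers for the xenon directory listing:
curl -sI 'https://performance.wikimedia.org/xenon/logs/daily/' \
  | grep -iE '^(x-cache|x-cache-status|age|cache-control):'
# Bust the frontend cache with a bogus query parameter to see the uncached copy:
curl -sI 'https://performance.wikimedia.org/xenon/logs/daily/?nocache=1' \
  | grep -iE '^(x-cache|x-cache-status|age|cache-control):'
```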
[14:41:38] !log bstorm@deploy1001 Finished deploy [striker/deploy@13da520]: (no justification provided) (duration: 01m 13s) [14:41:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:52] (03PS1) 10Ema: cache_text: do not limit transient memory usage [puppet] - 10https://gerrit.wikimedia.org/r/452680 [14:42:27] (03PS2) 10Andrew Bogott: openstack glance: move active service for eqiad1 and main to cloudcontrol1003 [puppet] - 10https://gerrit.wikimedia.org/r/452595 (https://phabricator.wikimedia.org/T191791) [14:42:29] (03PS2) 10Andrew Bogott: Openstack glance: remove glance service from labcontrol1001 [puppet] - 10https://gerrit.wikimedia.org/r/452596 (https://phabricator.wikimedia.org/T191791) [14:42:31] (03PS1) 10Andrew Bogott: Designate: use $keystone_host for keystone rather than $nova_controller [puppet] - 10https://gerrit.wikimedia.org/r/452682 [14:42:44] (03CR) 10BBlack: [C: 031] cache_text: do not limit transient memory usage [puppet] - 10https://gerrit.wikimedia.org/r/452680 (owner: 10Ema) [14:43:07] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [14:43:12] (03CR) 10Ema: [C: 032] cache_text: do not limit transient memory usage [puppet] - 10https://gerrit.wikimedia.org/r/452680 (owner: 10Ema) [14:43:28] PROBLEM - puppet last run on webperf2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:44:41] (03PS2) 10Andrew Bogott: Designate: use $keystone_host for keystone rather than $nova_controller [puppet] - 10https://gerrit.wikimedia.org/r/452682 [14:45:44] (03CR) 10Andrew Bogott: [C: 032] Designate: use $keystone_host for keystone rather than $nova_controller [puppet] - 10https://gerrit.wikimedia.org/r/452682 (owner: 10Andrew Bogott) [14:48:07] PROBLEM - puppet last run on cp1089 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:48:10] (03PS1) 10Ema: cache_text: set be_transient_gb: 0 [puppet] - 10https://gerrit.wikimedia.org/r/452684 [14:48:21] puppetfails on cache are my fault, fixing [14:48:37] RECOVERY - puppet last run on webperf2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:48:58] PROBLEM - puppet last run on cp5009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:49:16] (03CR) 10Ema: [C: 032] cache_text: set be_transient_gb: 0 [puppet] - 10https://gerrit.wikimedia.org/r/452684 (owner: 10Ema) [14:50:42] !log reimage of elastic103[678] [14:50:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:00] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1036.eqiad.wmnet', 'elastic1037... 
[14:53:07] RECOVERY - puppet last run on cp1089 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:53:08] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [14:53:09] !log rebooting cloudelastic* for kernel update [14:53:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:58] RECOVERY - puppet last run on cp5009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:54:13] (03CR) 10Ayounsi: "Not sure how I can review this." [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/452656 (owner: 10Volans) [14:54:37] PROBLEM - puppet last run on cp4030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:55:50] Elastic issues again? [14:56:23] (03CR) 10Krinkle: [C: 031] "mediawiki::web::prod_sites: enable HHVM on some sites(!!!)" [puppet] - 10https://gerrit.wikimedia.org/r/452325 (owner: 10Giuseppe Lavagetto) [14:57:08] sjoerddebruin: reimaging in progress, I see a rise in response times, but if experience is any predictor it should be back to normal in < 1 minute [14:57:13] sjoerddebruin: or do you see something else? [14:57:48] On Wikidata, the suggester based on elastic sometimes takes quite some time or shows no results. [14:58:25] sjoerddebruin: on newly created entries? Or in general? [14:58:32] In general. [14:58:39] (03PS1) 10Jijiki: admin: added user jiji to the list of users Bug: T201816 [puppet] - 10https://gerrit.wikimedia.org/r/452685 (https://phabricator.wikimedia.org/T201816) [14:58:41] (03CR) 10Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! :) To learn how to get your code changes reviewed faster and more likely to get" [puppet] - 10https://gerrit.wikimedia.org/r/452685 (https://phabricator.wikimedia.org/T201816) (owner: 10Jijiki) [14:58:54] sjoerddebruin: and do you have a timeline for that issue? Is it just in the last 2 or 3 minutes? Or has it been going for longer? [14:59:10] 20 minutes I guess? [14:59:15] (03CR) 10jerkins-bot: [V: 04-1] admin: added user jiji to the list of users Bug: T201816 [puppet] - 10https://gerrit.wikimedia.org/r/452685 (https://phabricator.wikimedia.org/T201816) (owner: 10Jijiki) [14:59:38] RECOVERY - puppet last run on cp4030 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [14:59:48] It's responding nicely now, just ups and downs. [14:59:59] !log cache_text eqiad: restart varnish-be without transient storage caps [15:00:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:06] sjoerddebruin: that's interesting... I see a peak, but fairly short one: https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?orgId=1&from=now-1h&to=now&panelId=50&fullscreen&refresh=1m [15:00:24] jijiki: please meet our lovely commit message validator /o\ [15:00:46] lol tx :p [15:01:24] second line should be empty ;) [15:01:49] !log rebooting meitnerium/archiva.wikimedia.org for kernel security update [15:01:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:00] sjoerddebruin: please let me know if you see the issue again! We might have a hole in our monitoring [15:02:25] I can see the spikes for yesterday as well (had the same thing then), and will do. 
:) [15:04:37] 10Operations, 10SRE-Access-Requests: Requesting access to restricted production access and analytics-privatedata-users for Patrick Earley - https://phabricator.wikimedia.org/T201667 (10Jalexander) >>! In T201667#4500192, @Dzahn wrote: > Hi @PEarleyWMF @Jalexander Could you please create a user on Wikitech/LDAP... [15:08:21] (03PS2) 10Jijiki: admin: added user jiji to the list of users [puppet] - 10https://gerrit.wikimedia.org/r/452685 (https://phabricator.wikimedia.org/T201816) [15:12:57] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [15:13:38] jijiki: you don't need to remove reviewers :-) if Gerrit adds people as reviewers to a patch set, that happens because people are subscribed to patches matching a certain pattern [15:14:08] I actually wanted to save you having one more patch in your list :p [15:14:23] he likes patches [15:14:26] I'll spam away then no worries :p [15:14:38] it's entirely my own fault, I subscribed to that pattern :-) [15:14:57] RECOVERY - Varnish backend child restarted on cp1089 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp1089&var-datasource=eqiad+prometheus/ops [15:15:08] PROBLEM - Host elastic1036 is DOWN: PING CRITICAL - Packet loss = 100% [15:15:36] ^downtime failed [15:16:07] RECOVERY - Host elastic1036 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [15:16:42] (03CR) 10Ottomata: [C: 031] "Ok to merge?" [puppet] - 10https://gerrit.wikimedia.org/r/452674 (https://phabricator.wikimedia.org/T199902) (owner: 10Mforns) [15:17:38] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "The uid for the user is wrong, please change it with the correct one you can find in the comments" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/452685 (https://phabricator.wikimedia.org/T201816) (owner: 10Jijiki) [15:17:57] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [15:19:27] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1037.eqiad.wmnet', 'elastic1036.eqiad.wmnet', 'elastic1038.eqiad.wmnet'] ``` an... [15:24:16] (03PS9) 10Jcrespo: db backup statistics: Initial implementation of the backup stats [puppet] - 10https://gerrit.wikimedia.org/r/449681 (https://phabricator.wikimedia.org/T198987) [15:24:50] (03PS1) 10Volans: LDAP: allow to specify multiple search strings [software/debmonitor] - 10https://gerrit.wikimedia.org/r/452686 [15:24:58] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 21 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [15:25:33] (03CR) 10Jcrespo: "This should be working after the fixes." 
[puppet] - 10https://gerrit.wikimedia.org/r/449681 (https://phabricator.wikimedia.org/T198987) (owner: 10Jcrespo) [15:25:35] (03CR) 10jerkins-bot: [V: 04-1] LDAP: allow to specify multiple search strings [software/debmonitor] - 10https://gerrit.wikimedia.org/r/452686 (owner: 10Volans) [15:26:06] yeah I know, CI is "broken" :D [15:26:49] (03CR) 10Krinkle: [C: 031] "It would make some unused legacy rewrites available and also make https://usability.wikimedia.org/api/ work, which seems fine and actually" [puppet] - 10https://gerrit.wikimedia.org/r/452635 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [15:27:18] (03CR) 10Volans: "Tests fail on CI for a series of reasons, mainly python 3.4 only. Result of tests locally:" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/452686 (owner: 10Volans) [15:29:46] 10Operations, 10ops-codfw, 10cloud-services-team, 10decommission: Decommission labtestnet2001.codfw.wmnet - https://phabricator.wikimedia.org/T201440 (10Papaul) [15:29:58] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [15:35:36] (03CR) 10Volans: [V: 032 C: 032] "ACK" [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/452656 (owner: 10Volans) [15:36:12] (03PS3) 10Jijiki: admin: added user jiji to the list of users [puppet] - 10https://gerrit.wikimedia.org/r/452685 (https://phabricator.wikimedia.org/T201816) [15:37:00] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [15:38:24] !log volans@deploy1001 Started deploy [netbox/deploy@792d4d5]: Security upgrade of dependency [15:38:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:30] (03PS1) 10Krinkle: apache: Remove unused apache::static_site type [puppet] - 10https://gerrit.wikimedia.org/r/452687 [15:39:28] !log volans@deploy1001 Finished deploy [netbox/deploy@792d4d5]: Security upgrade of dependency (duration: 01m 03s) [15:39:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:09] RECOVERY - Check systemd state on notebook1003 is OK: OK - running: The system is fully operational [15:41:28] (03CR) 10EBernhardson: [C: 031] elasticsearch: storage device is md1 after reimage to stretch [puppet] - 10https://gerrit.wikimedia.org/r/452669 (owner: 10Gehel) [15:41:58] (03PS2) 10Gehel: elasticsearch: storage device is md1 after reimage to stretch [puppet] - 10https://gerrit.wikimedia.org/r/452669 [15:42:00] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [15:42:39] (03CR) 10Gehel: [C: 032] elasticsearch: storage device is md1 after reimage to stretch [puppet] - 10https://gerrit.wikimedia.org/r/452669 (owner: 10Gehel) [15:42:47] (03CR) 10Jijiki: "done" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/452685 (https://phabricator.wikimedia.org/T201816) (owner: 10Jijiki) [15:43:53] (03CR) 10Giuseppe Lavagetto: [C: 031] admin: added user jiji to the list of users [puppet] - 10https://gerrit.wikimedia.org/r/452685 (https://phabricator.wikimedia.org/T201816) (owner: 10Jijiki) [15:45:10] (03PS7) 10Ottomata: Remove all geowiki puppetization except for the geowiki site [puppet] - 10https://gerrit.wikimedia.org/r/450040 (https://phabricator.wikimedia.org/T190059) (owner: 10Fdans) [15:48:01] (03PS1) 10Krinkle: 
webperf: Add 'Server: ' header to performance.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/452689 (https://phabricator.wikimedia.org/T158837) [15:48:47] (03PS8) 10Ottomata: Remove all geowiki puppetization except for the geowiki site [puppet] - 10https://gerrit.wikimedia.org/r/450040 (https://phabricator.wikimedia.org/T190059) (owner: 10Fdans) [15:48:53] (03CR) 10Ottomata: [V: 032 C: 032] Remove all geowiki puppetization except for the geowiki site [puppet] - 10https://gerrit.wikimedia.org/r/450040 (https://phabricator.wikimedia.org/T190059) (owner: 10Fdans) [15:49:52] (03PS4) 10Dzahn: admin: added user jiji to the list of users [puppet] - 10https://gerrit.wikimedia.org/r/452685 (https://phabricator.wikimedia.org/T201816) (owner: 10Jijiki) [15:51:07] (03CR) 10Dzahn: [C: 032] admin: added user jiji to the list of users [puppet] - 10https://gerrit.wikimedia.org/r/452685 (https://phabricator.wikimedia.org/T201816) (owner: 10Jijiki) [15:51:50] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 21 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [15:51:50] 10Operations, 10Performance-Team, 10monitoring, 10Patch-For-Review: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837 (10Krinkle) [15:53:44] (03Abandoned) 10Fdans: Remove all geowiki references from puppet [puppet] - 10https://gerrit.wikimedia.org/r/450025 (https://phabricator.wikimedia.org/T190059) (owner: 10Fdans) [15:55:35] 10Operations, 10Performance-Team, 10monitoring, 10Patch-For-Review: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837 (10Krinkle) [15:55:46] 10Operations, 10Performance-Team, 10monitoring, 10Patch-For-Review: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837 (10Krinkle) [15:57:53] 10Operations, 10netops, 10Patch-For-Review: Move servers off asw2-a-eqiad - https://phabricator.wikimedia.org/T201694 (10Cmjohnson) @ayounsi I pre-cabled everything. The lvs cross connects only need to move racks to the new switch. We probably need to do those 1 at a time, because downtime may be close to 1m... [15:58:46] (03CR) 10Filippo Giunchedi: "> Patch Set 8:" [puppet] - 10https://gerrit.wikimedia.org/r/449681 (https://phabricator.wikimedia.org/T198987) (owner: 10Jcrespo) [15:59:46] !log volans@deploy1001 Started deploy [netbox/deploy@792d4d5]: Security upgrade of dependency [15:59:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:04] godog, moritzm, and _joe_: (Dis)respected human, time to deploy Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180814T1600). Please do the needful. [16:00:04] No GERRIT patches in the queue for this window AFAICS. [16:01:09] !log volans@deploy1001 Finished deploy [netbox/deploy@792d4d5]: Security upgrade of dependency (duration: 01m 23s) [16:01:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:20] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 25 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [16:01:52] (03CR) 10Filippo Giunchedi: "Thanks for implementing the fixes!" 
(033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/449681 (https://phabricator.wikimedia.org/T198987) (owner: 10Jcrespo) [16:02:05] puppet swat at 9am in the morning is weird [16:03:01] (03PS6) 10Vgutierrez: [WIP] Refactor certcentral.certificate_management() [software/certcentral] - 10https://gerrit.wikimedia.org/r/451867 [16:03:51] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Refactor certcentral.certificate_management() [software/certcentral] - 10https://gerrit.wikimedia.org/r/451867 (owner: 10Vgutierrez) [16:04:00] RECOVERY - Check systemd state on ms-be1022 is OK: OK - running: The system is fully operational [16:05:09] quit [16:05:14] arg :) [16:06:20] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [16:06:50] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 18 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [16:13:29] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [16:15:08] (03PS1) 10Volans: Rebuild wheels for Django security upgrade (2) [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/452699 [16:16:25] (03Abandoned) 10Bstorm: nfs-exportd: correcting typo [puppet] - 10https://gerrit.wikimedia.org/r/452428 (owner: 10Bstorm) [16:16:37] (03CR) 10Volans: [V: 032 C: 032] Rebuild wheels for Django security upgrade (2) [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/452699 (owner: 10Volans) [16:16:54] 10Operations, 10Traffic, 10Patch-For-Review: Deploy initial ATS test clusters in core DCs - https://phabricator.wikimedia.org/T199720 (10Reedy) ``` 2018-08-14 16:01:08,650 [docker-pkg-build] INFO - Generated dockerfile for docker-registry.discovery.wmnet/releng/operations-puppet:0.3.3: FROM docker-registry.d... [16:17:57] !log volans@deploy1001 Started deploy [netbox/deploy@e2fd41d]: Security upgrade of dependency [16:18:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:28] 10Operations, 10ops-codfw, 10cloud-services-team, 10decommission: Decommission labtestnet2001.codfw.wmnet - https://phabricator.wikimedia.org/T201440 (10Papaul) ``` show | compare [edit interfaces interface-range vlan-cloud-hosts1-b-codfw] - member ge-5/0/21; [edit interfaces interface-range cloud-... 
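The "Rebuild wheels for Django security upgrade" change above and the netbox deploys around it ship prebuilt Python wheels; a minimal sketch of how such a rebuild is usually done after bumping a pinned version (file and directory names are illustrative placeholders, not the actual netbox-deploy layout):
```
# Illustrative only: rebuild wheels for the pinned dependencies after a bump.
# 'frozen-requirements.txt' and 'artifacts/' are placeholder names.
pip wheel --no-deps -r frozen-requirements.txt -w artifacts/
```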
[16:18:31] !log volans@deploy1001 Finished deploy [netbox/deploy@e2fd41d]: Security upgrade of dependency (duration: 00m 34s) [16:18:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:02] 10Operations, 10ops-codfw, 10cloud-services-team, 10decommission: Decommission labtestnet2001.codfw.wmnet - https://phabricator.wikimedia.org/T201440 (10Papaul) [16:24:14] 10Operations, 10Patch-For-Review: Onboarding Effie Mouzeli - https://phabricator.wikimedia.org/T201816 (10Dzahn) [16:29:00] 10Operations, 10Patch-For-Review: Onboarding Effie Mouzeli - https://phabricator.wikimedia.org/T201816 (10Dzahn) [16:31:53] (03PS1) 10Papaul: DNS: Remove mgmt DNS for labtestnet2001 [dns] - 10https://gerrit.wikimedia.org/r/452706 [16:33:17] 10Operations, 10Patch-For-Review: Onboarding Effie Mouzeli - https://phabricator.wikimedia.org/T201816 (10Dzahn) [16:33:39] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [16:35:44] 10Operations, 10hardware-requests: Request for swift ms-be expansion - https://phabricator.wikimedia.org/T201937 (10fgiunchedi) [16:37:38] (03CR) 10Dzahn: [C: 032] DNS: Remove mgmt DNS for labtestnet2001 [dns] - 10https://gerrit.wikimedia.org/r/452706 (owner: 10Papaul) [16:40:02] 10Operations, 10ops-codfw, 10cloud-services-team, 10decommission, 10Patch-For-Review: Decommission labtestnet2001.codfw.wmnet - https://phabricator.wikimedia.org/T201440 (10Papaul) [16:40:40] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 21 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [16:41:21] 10Operations, 10hardware-requests: Request for swift ms-be refresh - https://phabricator.wikimedia.org/T201938 (10fgiunchedi) [16:41:41] 10Operations, 10ops-codfw, 10cloud-services-team, 10decommission, 10Patch-For-Review: Decommission labtestnet2001.codfw.wmnet - https://phabricator.wikimedia.org/T201440 (10Papaul) 05Open>03Resolved Complete [16:42:56] 10Operations, 10Analytics: rack/setup/install 2 new hadoop master/standby systems in eqiad - https://phabricator.wikimedia.org/T201939 (10RobH) p:05Triage>03Normal [16:45:40] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [16:46:08] 10Operations, 10Analytics: rack/setup/install 2 new hadoop master/standby systems in eqiad - https://phabricator.wikimedia.org/T201939 (10RobH) Assigned to @elukey for hostname feedback, but as they are on vacation perhaps someone else in #analytics would be able to provide feedback on hostname? [16:46:20] 10Operations, 10Analytics: rack/setup/install 2 new hadoop master/standby systems in eqiad - https://phabricator.wikimedia.org/T201939 (10RobH) a:05elukey>03Ottomata [16:47:29] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): jessie support for QLogic FastLinQ 41112 Dual Port 10Gb SFP+ Adapter - https://phabricator.wikimedia.org/T201942 (10aborrero) p:05Triage>03Normal [16:48:29] (03CR) 10Filippo Giunchedi: "Thanks for this! What about hieradata? 
Also please run PCC" [puppet] - 10https://gerrit.wikimedia.org/r/449763 (owner: 10Dzahn) [16:50:01] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): jessie support for QLogic FastLinQ 41112 Dual Port 10Gb SFP+ Adapter - https://phabricator.wikimedia.org/T201942 (10RobH) a:05RobH>03None [16:52:50] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [16:56:53] 10Operations, 10Analytics: rack/setup/install 2 new hadoop master/standby systems in eqiad - https://phabricator.wikimedia.org/T201939 (10Ottomata) Hm, tough question! I'd be ok with analytics-master1001 and analytics-master1002. Let's do it! [16:57:14] !log reimage of elastic103[345] [16:57:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:29] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1033.eqiad.wmnet', 'elastic1034... [16:57:42] 10Operations, 10Traffic, 10Patch-For-Review: Deploy initial ATS test clusters in core DCs - https://phabricator.wikimedia.org/T199720 (10ema) Thanks @Reedy! The `luarocks` part fails with: ``` Warning: Failed searching manifest: Failed extracting manifest file Installing https://raw.githubusercontent.com/ro... [16:58:34] 10Operations, 10Analytics: rack/setup/install analytics-master100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T201939 (10RobH) a:05Ottomata>03Cmjohnson [16:59:35] 10Operations, 10Traffic, 10Patch-For-Review: Deploy initial ATS test clusters in core DCs - https://phabricator.wikimedia.org/T199720 (10Reedy) Yay, dependencies. Feel free to bump the package again and add unzip and I can try again [17:00:04] cscott, arlolra, subbu, halfak, and Amir1: It is that lovely time of the day again! You are hereby commanded to deploy Services – Graphoid / Parsoid / Citoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180814T1700). [17:00:27] Nothing for ORES today! [17:00:50] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to restricted production access and analytics-privatedata-users for Karen Brown - https://phabricator.wikimedia.org/T201668 (10Jalexander) >>! In T201668#4500160, @Dzahn wrote: > Note that "kbrown" is a username already taken in LDAP... [17:01:46] 10Operations, 10Traffic, 10Patch-For-Review: Deploy initial ATS test clusters in core DCs - https://phabricator.wikimedia.org/T199720 (10ema) >>! 
[17:02:59] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [17:04:37] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/452686 (owner: 10Volans) [17:05:28] 10Operations, 10Traffic, 10Patch-For-Review: Deploy initial ATS test clusters in core DCs - https://phabricator.wikimedia.org/T199720 (10Reedy) Least `unzip` isn't a heavyweight dependancy :) [17:05:50] PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [600.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&panelId=9&fullscreen [17:12:10] RECOVERY - Number of backend failures per minute from CirrusSearch on graphite1001 is OK: OK: Less than 20.00% above the threshold [300.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&panelId=9&fullscreen [17:17:22] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 22 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [17:19:08] !log restarting elasticsearch on elastic1050 (high load) [17:19:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:56] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): jessie support for QLogic FastLinQ 41112 Dual Port 10Gb SFP+ Adapter - https://phabricator.wikimedia.org/T201942 (10aborrero) a:03aborrero There is a Debian non-free package with firmware for QLogic NICs, hope we didn't buy hardware... [17:21:41] PROBLEM - MariaDB Slave Lag: s7 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 303.74 seconds [17:22:17] 10Operations, 10ops-eqiad: Degraded RAID on labvirt1019 - https://phabricator.wikimedia.org/T196507 (10Cmjohnson) @Bstorm I have the new battery on-site...when is a good time for you to replace? [17:22:22] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [17:22:52] !log configuring eqiad A switch ports for T201694 [17:22:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:59] T201694: Move servers off asw2-a-eqiad - https://phabricator.wikimedia.org/T201694 [17:23:05] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1033.eqiad.wmnet'] ``` Of which those **FAILED**: ``` ['elastic1033.eqiad.wmnet... [17:23:55] 10Operations, 10ops-eqiad: Degraded RAID on labvirt1019 - https://phabricator.wikimedia.org/T196507 (10Bstorm) I can stop the VMs on labvirt1019 and 1020, silence alerts and shut them down whenever you like :) @Cmjohnson [17:24:43] bstorm_ let's go ahead and do the battery now [17:25:02] Sure, I already started shutting off instances. 
I'll downtime the labvirts [17:25:09] it's just the one [17:25:23] 1019.....I want to see the results of this before doing 1020 [17:26:30] (03CR) 10Ema: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/452612 (https://phabricator.wikimedia.org/T199720) (owner: 10Ema) [17:28:44] (03CR) 10jerkins-bot: [V: 04-1] tox: add ts-lua tests for trafficserver [puppet] - 10https://gerrit.wikimedia.org/r/452612 (https://phabricator.wikimedia.org/T199720) (owner: 10Ema) [17:29:03] too early! :P [17:30:57] (03CR) 10Ema: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/452612 (https://phabricator.wikimedia.org/T199720) (owner: 10Ema) [17:31:25] (03CR) 10jerkins-bot: [V: 04-1] tox: add ts-lua tests for trafficserver [puppet] - 10https://gerrit.wikimedia.org/r/452612 (https://phabricator.wikimedia.org/T199720) (owner: 10Ema) [17:31:31] (03CR) 10Jcrespo: [C: 04-1] db backup statistics: Initial implementation of the backup stats [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/449469 (https://phabricator.wikimedia.org/T198987) (owner: 10Jcrespo) [17:31:50] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service: WDQS diskspace is low - https://phabricator.wikimedia.org/T196485 (10Cmjohnson) I have the 4 ssds on-site. [17:33:44] (03CR) 10Jcrespo: "> I don't like adding more code to puppet that could live outside" [puppet] - 10https://gerrit.wikimedia.org/r/449681 (https://phabricator.wikimedia.org/T198987) (owner: 10Jcrespo) [17:34:18] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1033.eqiad.wmnet'] ``` The log... [17:36:16] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 20 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [17:36:48] 10Operations, 10SRE-Access-Requests: Subscribe user mepps to security@wikimedia.org - https://phabricator.wikimedia.org/T201856 (10mark) @Dzahn please get her added to this list. Thanks! 
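The commit-message validator jijiki met earlier (15:00, "second line should be empty") enforces the usual Gerrit convention: a short one-line subject, an empty second line, and footers such as Bug: on their own lines at the end. Roughly the shape that eventually passed for the admin patch above; the exact body text here is illustrative:
```
admin: add user jiji to the list of users

Bug: T201816
```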
[17:38:21] (03PS1) 10Ladsgroup: etherpad: Add article to the placeholder text [puppet] - 10https://gerrit.wikimedia.org/r/452716 [17:38:38] (03PS2) 10Gehel: Changing day of the cron for testing [puppet] - 10https://gerrit.wikimedia.org/r/452467 (https://phabricator.wikimedia.org/T194787) (owner: 10MSantos) [17:38:45] (03CR) 10jerkins-bot: [V: 04-1] etherpad: Add article to the placeholder text [puppet] - 10https://gerrit.wikimedia.org/r/452716 (owner: 10Ladsgroup) [17:39:03] (03CR) 10jerkins-bot: [V: 04-1] Changing day of the cron for testing [puppet] - 10https://gerrit.wikimedia.org/r/452467 (https://phabricator.wikimedia.org/T194787) (owner: 10MSantos) [17:39:27] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [17:39:36] (03CR) 10Gehel: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/452467 (https://phabricator.wikimedia.org/T194787) (owner: 10MSantos) [17:39:54] (03CR) 10jerkins-bot: [V: 04-1] Changing day of the cron for testing [puppet] - 10https://gerrit.wikimedia.org/r/452467 (https://phabricator.wikimedia.org/T194787) (owner: 10MSantos) [17:40:03] (03CR) 10Ladsgroup: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/452716 (owner: 10Ladsgroup) [17:40:27] (03CR) 10jerkins-bot: [V: 04-1] etherpad: Add article to the placeholder text [puppet] - 10https://gerrit.wikimedia.org/r/452716 (owner: 10Ladsgroup) [17:41:16] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 19 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [17:41:27] (03PS2) 10Ottomata: Add salt file path to EventLoggingSanitization cron job [puppet] - 10https://gerrit.wikimedia.org/r/452674 (https://phabricator.wikimedia.org/T199902) (owner: 10Mforns) [17:41:31] (03CR) 10Ottomata: [V: 032 C: 032] Add salt file path to EventLoggingSanitization cron job [puppet] - 10https://gerrit.wikimedia.org/r/452674 (https://phabricator.wikimedia.org/T199902) (owner: 10Mforns) [17:41:32] 404 on the docker registry for the jenkins puppet jobs, seems to be something I've heard before [17:42:48] https://gerrit.wikimedia.org/r/c/operations/puppet/+/452716 [17:42:52] gehel: that happened over the weekend because of the cache_misc -> cache_text transition, but it was fixed with https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/452182/ [17:42:58] Is the master broken? [17:43:09] why it can't find the docker image [17:43:12] ema: yeah, I was looking at that change. So something else this time [17:43:17] PROBLEM - MariaDB Slave Lag: s3 on db1095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 643.37 seconds [17:43:29] Amir1: I'm hitting the same issue [17:44:20] :/ [17:44:27] gehel: and over the weekend we had 403s, not 404s [17:44:27] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [17:44:29] (03PS1) 10Cmjohnson: Adding mgmt dns for newly racked servers [dns] - 10https://gerrit.wikimedia.org/r/452718 (https://phabricator.wikimedia.org/T201343) [17:44:30] 10Operations, 10SRE-Access-Requests: Subscribe user mepps to security@wikimedia.org - https://phabricator.wikimedia.org/T201856 (10Dzahn) 05Open>03Resolved a:03Dzahn Done. @Mepps You have been added to security@wikimedia.org now. 
[17:45:13] ema: the logs say "Error response from daemon: manifest for docker-registry.wikimedia.org/releng/operations-puppet:0.3.4 not found", so I suspect a 404, but it might actually be something entirely different [17:45:16] (03CR) 10Cmjohnson: [C: 032] Adding mgmt dns for newly racked servers [dns] - 10https://gerrit.wikimedia.org/r/452718 (https://phabricator.wikimedia.org/T201343) (owner: 10Cmjohnson) [17:45:35] https://phabricator.wikimedia.org/T200722 [17:45:40] 10Operations, 10SRE-Access-Requests: Request production global root access for Effie Mouzeli - https://phabricator.wikimedia.org/T201849 (10Dzahn) @jijiki created her own user with https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/452685/ The next step will be to upload a second change to add her user to... [17:45:53] (03PS2) 10Cmjohnson: Adding mgmt dns for newly racked servers [dns] - 10https://gerrit.wikimedia.org/r/452718 (https://phabricator.wikimedia.org/T201343) [17:45:59] (03CR) 10Cmjohnson: [V: 032 C: 032] Adding mgmt dns for newly racked servers [dns] - 10https://gerrit.wikimedia.org/r/452718 (https://phabricator.wikimedia.org/T201343) (owner: 10Cmjohnson) [17:46:47] sorry, need to take a family break, Amir1 I hope that issue will fix itself in the meantime :) [17:47:41] 10Operations, 10ops-eqiad, 10Analytics: rack/setup/install analytics-master100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T201939 (10Cmjohnson) [17:47:58] 10Operations, 10ops-eqiad, 10Analytics: rack/setup/install analytics-master100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T201939 (10Cmjohnson) [17:48:04] (03CR) 10Awight: [C: 031] "Better grammar is more good!" [puppet] - 10https://gerrit.wikimedia.org/r/452716 (owner: 10Ladsgroup) [17:49:09] 10Operations, 10ops-eqiad, 10Analytics: rack/setup/install analytics-master100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T201939 (10Cmjohnson) @ottomata the name is entirely too long for labels and tracking. can we shorten it a bit? [17:50:32] 10Operations, 10SRE-Access-Requests: Request production global root access for Effie Mouzeli - https://phabricator.wikimedia.org/T201849 (10Dzahn) a:05Joe>03jijiki We have tested and confirmed access to bast1002 works. Next ssh to `rutherfordium.eqiad.wmnet` (people.wikimedia.org) can be used to test SSH... [17:51:27] PROBLEM - Check systemd state on wdqs1010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:54:05] PROBLEM - MariaDB Slave Lag: s6 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 331.40 seconds [17:54:19] 10Operations, 10Performance-Team, 10Traffic, 10Wikimedia-General-or-Unknown, and 2 others: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (10Imarlier) Yesterday we had just under 20,000 requests for the copyright prot... [17:54:32] 10Operations, 10Patch-For-Review: Onboarding Effie Mouzeli - https://phabricator.wikimedia.org/T201816 (10Dzahn) - added to https://phabricator.wikimedia.org/project/members/974/ and then https://phabricator.wikimedia.org/project/members/61/ for access to "WMF-NDA" Phabricator tickets - subscribed to ops mai... 
[17:54:44] RECOVERY - Check systemd state on wdqs1010 is OK: OK - running: The system is fully operational [17:55:48] (03PS1) 10Volans: Force django-filter==1.1.0 [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/452723 [17:56:29] 10Operations, 10netops, 10Patch-For-Review: Move servers off asw2-a-eqiad - https://phabricator.wikimedia.org/T201694 (10ayounsi) [17:56:45] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 21 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [17:57:00] (03CR) 10Volans: [V: 032 C: 032] Force django-filter==1.1.0 [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/452723 (owner: 10Volans) [17:57:17] 10Operations, 10netops, 10Patch-For-Review: Move servers off asw2-a-eqiad - https://phabricator.wikimedia.org/T201694 (10ayounsi) Task description updated with Chris's info so we have everything in 1 place. Switch ports configured accordingly. [17:57:45] PROBLEM - Check systemd state on wdqs1010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:58:30] (03PS8) 10Ema: ATS: add Lua scripting support [puppet] - 10https://gerrit.wikimedia.org/r/451838 (https://phabricator.wikimedia.org/T199720) [17:58:40] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1033.eqiad.wmnet'] ``` and were **ALL** successful. [17:58:53] (03CR) 10jerkins-bot: [V: 04-1] ATS: add Lua scripting support [puppet] - 10https://gerrit.wikimedia.org/r/451838 (https://phabricator.wikimedia.org/T199720) (owner: 10Ema) [17:59:29] !log volans@deploy1001 Started deploy [netbox/deploy@eae2c9d]: Fix broken dependency [17:59:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:46] !log reindexing Polish wikis on elastic@eqiad and elastic@codfw (T200037) [17:59:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:57] T200037: Re-index Polish Wikis to patch Stempel stems - https://phabricator.wikimedia.org/T200037 [18:00:02] !log volans@deploy1001 Finished deploy [netbox/deploy@eae2c9d]: Fix broken dependency (duration: 00m 33s) [18:00:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:17] (03CR) 10Reedy: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/452612 (https://phabricator.wikimedia.org/T199720) (owner: 10Ema) [18:00:49] (03CR) 10jerkins-bot: [V: 04-1] tox: add ts-lua tests for trafficserver [puppet] - 10https://gerrit.wikimedia.org/r/452612 (https://phabricator.wikimedia.org/T199720) (owner: 10Ema) [18:01:42] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [18:02:50] (03CR) 10Reedy: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/452612 (https://phabricator.wikimedia.org/T199720) (owner: 10Ema) [18:03:12] that patch just does not want to please jenkins [18:03:19] (03CR) 10jerkins-bot: [V: 04-1] tox: add ts-lua tests for trafficserver [puppet] - 10https://gerrit.wikimedia.org/r/452612 (https://phabricator.wikimedia.org/T199720) (owner: 10Ema) [18:03:32] 10Operations, 10ops-eqiad, 10Analytics: rack/setup/install analytics-master100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T201939 (10RobH) >>! 
In T201939#4502422, @Cmjohnson wrote: > @ottomata the name is entirely too long for labels and tracking. can we shorten it a bit? This was discussed in IRC... [18:03:56] jenkins' hard to please alright [18:04:29] 18:03:16 docker: Error response from daemon: manifest for docker-registry.wikimedia.org/releng/operations-puppet:0.3.4 not found. [18:05:44] !log mforns@deploy1001 Started deploy [analytics/refinery@cb57843]: corresponding to refinery-source v0.0.70 [18:05:44] not related to the actual content of the change it seems? [18:05:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:59] mutante: nope, related to my attempts to add `lua-busted` to puppet's CI though [18:07:32] !log update NTP servers on pfw [18:07:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:34] mutante: see https://gerrit.wikimedia.org/r/#/c/integration/config/+/452634/ https://gerrit.wikimedia.org/r/#/c/integration/config/+/452714/ https://gerrit.wikimedia.org/r/#/c/integration/config/+/452692/ [18:09:29] Reedy did some magic, the image seemed to have been built correctly, then apparently https://phabricator.wikimedia.org/T200722 [18:10:20] I wasn't planning on changing the world BTW, I initially just wanted to add one package to the image :) [18:10:42] !log re-activate peer 13285 on cr2-ulsfo [18:10:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:05] ema: all i had was to say that the image referenced in https://gerrit.wikimedia.org/r/#/c/integration/config/+/452692/2/jjb/operations-puppet.yaml has to be pushed to the docker-registry.. and i was told Reedy can do it ... [18:11:18] but if he already did magic.. hmm [18:11:39] [contint1001.wikimedia.org] out: adding_tag latest [18:11:39] [contint1001.wikimedia.org] out: Call: docker-registry.discovery.wmnet/releng/operations-puppet:0.3.4 docker-registry.discovery.wmnet/releng/operations-puppet latest [18:11:39] [contint1001.wikimedia.org] out: Successfully published image docker-registry.discovery.wmnet/releng/operations-puppet [18:11:41] Something is odd [18:12:56] pushed to docker-registry.discovery.wmnet but the image: references docker-registry.wikimedia.org normal? [18:13:45] Not sure [18:13:56] !log bump PCCW max accepted prefixes on cr2-esams [18:14:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:50] yea, that is both darmstadtium as backend, looks normal [18:14:53] PROBLEM - Check systemd state on cp5005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:14:54] https://docker-registry.wikimedia.org/v2/releng/operations-puppet/tags/list [18:14:58] {"name":"releng/operations-puppet","tags":["0.1.0","0.2.1","0.3.0","0.3.1","0.3.2","0.3.4","latest"]} [18:15:15] https://docker-registry.wikimedia.org/v2/releng/operations-puppet/manifests/0.3.4 [18:15:31] Is it some caching? [18:15:46] Because the slaves hit before it was there?
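(The checks just above go through the registry's v2 HTTP API; since docker-registry.discovery.wmnet and docker-registry.wikimedia.org both front the same backend (darmstadtium, per the discussion), the two names should serve identical data. A sketch of comparing them by hand; the wmnet name is only reachable from production hosts:)

  $ curl -s https://docker-registry.wikimedia.org/v2/releng/operations-puppet/tags/list
  $ curl -s https://docker-registry.discovery.wmnet/v2/releng/operations-puppet/tags/list
  # Same idea for the manifest itself, printing only the HTTP status code:
  $ curl -s -o /dev/null -w '%{http_code}\n' \
      https://docker-registry.wikimedia.org/v2/releng/operations-puppet/manifests/0.3.4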
[18:15:49] !log bump PCCW max accepted prefixes on cr2-eqiad [18:15:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:59] (03CR) 10Reedy: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/452612 (https://phabricator.wikimedia.org/T199720) (owner: 10Ema) [18:17:27] (03CR) 10jerkins-bot: [V: 04-1] tox: add ts-lua tests for trafficserver [puppet] - 10https://gerrit.wikimedia.org/r/452612 (https://phabricator.wikimedia.org/T199720) (owner: 10Ema) [18:17:59] Reedy: you're right [18:18:57] mutante: Is there anything in the logs on the backend? [18:19:08] Reedy: the 404 is indeed cached [18:19:30] Aha [18:19:38] That's kinda sill [18:19:39] y [18:19:52] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb={GET,LIST,PATCH} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:19:53] PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb=LIST https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:20:02] PROBLEM - etcd request latencies on neon is CRITICAL: instance=10.64.0.40:6443 operation={get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:20:12] !log mforns@deploy1001 Finished deploy [analytics/refinery@cb57843]: corresponding to refinery-source v0.0.70 (duration: 14m 27s) [18:20:13] PROBLEM - etcd request latencies on argon is CRITICAL: instance=10.64.32.133:6443 operation=list https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:20:13] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb=PUT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:20:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:33] PROBLEM - etcd request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 operation={get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:21:06] ema: The simple answer is maybe just to rebuild it again... Make sure I wait fully for it all to finish, then do jjb again [18:23:02] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:23:03] RECOVERY - Request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:23:12] Reedy: but yeah I get a 404 from darmstadtium too, so something is still wrong with the registry itself [18:23:12] RECOVERY - etcd request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:23:22] RECOVERY - etcd request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:23:22] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:23:42] RECOVERY - etcd request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:27:23] RECOVERY - MariaDB Slave Lag: s7 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 35.06 seconds [18:27:52] !log renumber v4 IP of 8560 on cr1-eqord [18:27:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:23] ema: Revert the jjb change to make things work, and see if the registry sorts itself out later?
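(One hedged way to confirm the cached-404 theory from this exchange is to look at the response headers on the manifest URL; assuming the cache layer sets the usual Age/X-Cache headers here:)

  $ curl -sI https://docker-registry.wikimedia.org/v2/releng/operations-puppet/manifests/0.3.4 \
      | grep -iE '^(HTTP|Age|X-Cache)'
  # A 404 status with a non-zero Age means the negative response is being
  # replayed from cache rather than re-fetched from the registry backend.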
[18:29:16] Reedy: let's try [18:32:40] (03CR) 10Reedy: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/452612 (https://phabricator.wikimedia.org/T199720) (owner: 10Ema) [18:33:10] (03CR) 10jerkins-bot: [V: 04-1] tox: add ts-lua tests for trafficserver [puppet] - 10https://gerrit.wikimedia.org/r/452612 (https://phabricator.wikimedia.org/T199720) (owner: 10Ema) [18:33:14] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [18:33:34] Reedy: well that patch specifically is supposed to fail with 0.3.2 :) [18:33:46] yeah, I haven't jjb'd yet [18:35:27] (03CR) 10Reedy: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/452612 (https://phabricator.wikimedia.org/T199720) (owner: 10Ema) [18:36:17] 18:35:54 + exec docker run --rm --env-file /dev/fd/63 --volume /srv/jenkins-workspace/workspace/operations-puppet-tests-docker/log:/srv/workspace/log docker-registry.wikimedia.org/releng/operations-puppet:0.3.2 [18:36:45] (03CR) 10jerkins-bot: [V: 04-1] tox: add ts-lua tests for trafficserver [puppet] - 10https://gerrit.wikimedia.org/r/452612 (https://phabricator.wikimedia.org/T199720) (owner: 10Ema) [18:38:14] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [18:38:48] ema: Back to old broken at least [18:41:55] * Krinkle staging on mwdebug1002/deploy1001 [18:42:03] (03CR) 10Ema: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/451838 (https://phabricator.wikimedia.org/T199720) (owner: 10Ema) [18:43:10] Reedy: yeah, confirmed :) [18:43:23] Filed a bug for the published/not published thing [18:44:42] !log delete peer 4589 on cr2-esams (no more direct peering) [18:44:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:15] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 21 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [18:47:49] Reedy: thanks! 
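(The revert/re-apply cycle above goes through Jenkins Job Builder. A sketch of how such a job change is pushed out, assuming a checkout of integration/config and a configured jenkins_jobs.ini; the paths are illustrative:)

  $ cd integration/config
  $ jenkins-jobs --conf jenkins_jobs.ini update jjb/ operations-puppet-tests-docker
  # After reverting the yaml, the same command points the job back at the
  # previous image tag, matching the 'docker run ... :0.3.2' output above.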
[18:49:36] Fyi, those IPv6 RIPE atlas alerts seem to be due to Hurricane Electric [18:50:15] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [18:58:11] !log mforns@deploy1001 Started deploy [analytics/refinery@a4d1d99]: adding hashing to EL whitelist [18:58:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:35] (03PS1) 10Milimetric: Add reference to Wikitech docs [puppet] - 10https://gerrit.wikimedia.org/r/452738 (https://phabricator.wikimedia.org/T201653) [19:02:25] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 21 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [19:06:48] !log filippo@neodymium conftool action : set/pooled=no; selector: name=logstash1008,service=gelf [19:06:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:21] jynus: not sure if expected but db1095 is almost maxing out its interface: https://librenms.wikimedia.org/device/device=162/tab=port/port=14702/ [19:10:04] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 20 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [19:11:15] !log mforns@deploy1001 Finished deploy [analytics/refinery@a4d1d99]: adding hashing to EL whitelist (duration: 13m 04s) [19:11:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:04] PROBLEM - etcd request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 operation={get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:12:24] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb={LIST,PATCH} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:12:34] PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb=LIST https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:12:35] PROBLEM - etcd request latencies on neon is CRITICAL: instance=10.64.0.40:6443 operation=list https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:12:45] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb=PUT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:13:34] RECOVERY - Request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:13:45] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:14:05] RECOVERY - etcd request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:14:44] RECOVERY - etcd request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:15:05] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 17 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [19:15:34] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:22:25] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [19:33:38] * Krinkle staging on mwdebug1002/deploy1001 [19:34:04] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 22 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [19:34:34] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 25 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [19:35:27] !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.16/includes/cache/MessageCache.php: I6093113a / T201893 (duration: 00m 52s) [19:35:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:36] T201893: MessageCache access throw UnexpectedValueException "The value of 'en' is not an array." from MapCacheLRU - https://phabricator.wikimedia.org/T201893 [19:37:44] PROBLEM - MariaDB Slave Lag: s8 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 446.44 seconds [19:39:04] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 16 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [19:39:35] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [19:41:16] * Krinkle done with deployment [19:41:59] 10Operations, 10Operations-Software-Development: confctl: log to SAL even if the selection doesn't match any host - https://phabricator.wikimedia.org/T155705 (10fgiunchedi) Also there's nothing logged on stdout on non-existent host and conftool exits 0. Ditto for a non-existant service [19:47:12] 10Operations, 10ops-eqiad: Degraded RAID on labvirt1019 - https://phabricator.wikimedia.org/T201957 (10ops-monitoring-bot) [19:49:40] (03PS1) 10Krinkle: icinga: Define 'notify-by-email-per-service' command [puppet] - 10https://gerrit.wikimedia.org/r/452744 [19:50:54] (03CR) 10Krinkle: "Per Filippo, the association between alert types/receives is not in this repository, but in the private repository where contacts are defi" [puppet] - 10https://gerrit.wikimedia.org/r/452744 (owner: 10Krinkle) [19:51:39] 10Operations, 10Wikimedia-Logstash, 10monitoring, 10Patch-For-Review, 10User-herron: Send logstash service metrics to prometheus - https://phabricator.wikimedia.org/T200362 (10fgiunchedi) >>! In T200362#4487498, @gerritbot wrote: > Change 451018 **merged** by Filippo Giunchedi: > [operations/puppet@produ... [19:51:45] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [19:55:50] (03CR) 10Krinkle: [C: 04-1] "This caused a rebase conflict for Beta Cluster's puppetmaster about 10 hours ago. 
I've tried to resolve it, but please double check and ma" [puppet] - 10https://gerrit.wikimedia.org/r/437052 (owner: 10Alex Monk) [19:56:54] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [20:02:57] 10Operations, 10Performance-Team, 10Traffic, 10Wikimedia-General-or-Unknown, and 2 others: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (10matmarex) For reference, according to this thread, Polish Wikipedia was affe... [20:03:55] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 25 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [20:06:15] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 20 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [20:08:54] (03PS1) 10Filippo Giunchedi: logstash: use /etc/default/logstash to add jmx_exporter [puppet] - 10https://gerrit.wikimedia.org/r/452747 (https://phabricator.wikimedia.org/T200362) [20:08:55] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [20:09:34] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 20 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [20:14:34] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 19 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [20:16:04] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 25 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [20:17:41] (03CR) 10Alex Monk: "fixed, thanks" [puppet] - 10https://gerrit.wikimedia.org/r/437052 (owner: 10Alex Monk) [20:19:01] gehel, hi [20:21:04] RECOVERY - MariaDB Slave Lag: s3 on db1095 is OK: OK slave_sql_lag Replication lag: 0.27 seconds [20:22:47] meh idle 02:33 [20:23:05] (03PS3) 10Andrew Bogott: openstack glance: move active service for eqiad1 and main to cloudcontrol1003 [puppet] - 10https://gerrit.wikimedia.org/r/452595 (https://phabricator.wikimedia.org/T191791) [20:23:07] (03PS3) 10Andrew Bogott: Openstack glance: remove glance service from labcontrol1001 [puppet] - 10https://gerrit.wikimedia.org/r/452596 (https://phabricator.wikimedia.org/T191791) [20:23:09] (03PS1) 10Andrew Bogott: mwopenstackclients: use region from ENV if present [puppet] - 10https://gerrit.wikimedia.org/r/452751 [20:25:24] (03CR) 10Andrew Bogott: [C: 032] mwopenstackclients: use region from ENV if present [puppet] - 10https://gerrit.wikimedia.org/r/452751 (owner: 10Andrew Bogott) [20:26:05] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [20:26:24] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 18 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [20:28:04] RECOVERY - MariaDB Slave Lag: s6 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 15.92 seconds [20:31:50] (03PS1) 10Legoktm: php72: Install php7.2-mbstring [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/452755 (https://phabricator.wikimedia.org/T188318) [20:33:15] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is 
CRITICAL: CRITICAL - failed 21 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [20:34:18] Krinkle, please remove your CR-1 [20:35:37] (03PS4) 10Andrew Bogott: openstack glance: move active service for eqiad1 and main to cloudcontrol1003 [puppet] - 10https://gerrit.wikimedia.org/r/452595 (https://phabricator.wikimedia.org/T191791) [20:35:39] (03PS4) 10Andrew Bogott: Openstack glance: remove glance service from labcontrol1001 [puppet] - 10https://gerrit.wikimedia.org/r/452596 (https://phabricator.wikimedia.org/T191791) [20:35:42] (03PS1) 10Andrew Bogott: mwopenstackclient: glance client takes 'region_name' arg instead of 'region' [puppet] - 10https://gerrit.wikimedia.org/r/452793 [20:38:07] (03CR) 10Andrew Bogott: [C: 032] mwopenstackclient: glance client takes 'region_name' arg instead of 'region' [puppet] - 10https://gerrit.wikimedia.org/r/452793 (owner: 10Andrew Bogott) [20:38:15] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [20:41:32] Krenair: the patch in Gerrit is out of date though, that was the -1 reason. [20:41:42] The area of conflict adds 1 line in beta but 2 lines in gerrit. [20:41:47] oh right [20:41:49] not sure which about it further, just noticed it [20:44:42] (03PS16) 10Alex Monk: cumin: Allow Puppet DB backend to be used within Labs projects that use it [puppet] - 10https://gerrit.wikimedia.org/r/437052 [20:45:24] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 23 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [20:48:34] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 22 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [20:53:35] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 18 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [20:55:25] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [20:56:15] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [20:56:37] 10Operations, 10MediaWiki-extensions-Translate, 10Language-2018-July-September, 10MW-1.32-release-notes (WMF-deploy-2018-08-21 (1.32.0-wmf.18)), and 4 others: 503 error attempting to open multiple projects (Wikipedia and meta wiki are loading very slowly) - https://phabricator.wikimedia.org/T195293 (10Jdfor... 
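(Gerrit change 452747 above wires Logstash's JVM metrics into Prometheus by injecting the JMX exporter javaagent via /etc/default/logstash. A hedged sketch of what such an env file could contain; the jar path, port, and config path are assumptions, not the actual patch:)

  # /etc/default/logstash (sketch)
  LS_JAVA_OPTS="-javaagent:/usr/share/java/prometheus/jmx_prometheus_javaagent.jar=7800:/etc/prometheus/logstash_jmx_exporter.yaml"

(A malformed line in this file stops the service from starting at all, which is consistent with the brief logstash1008 outage and the follow-up "logstash: fix /etc/default/logstash" patch further down.)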
[20:57:59] (03CR) 10Legoktm: [C: 032] php72: Install php7.2-mbstring [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/452755 (https://phabricator.wikimedia.org/T188318) (owner: 10Legoktm) [20:58:14] (03Merged) 10jenkins-bot: php72: Install php7.2-mbstring [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/452755 (https://phabricator.wikimedia.org/T188318) (owner: 10Legoktm) [20:59:14] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy [21:00:44] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 20 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [21:02:34] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 22 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [21:05:45] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 19 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [21:06:14] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 22 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [21:06:59] (03CR) 10Krinkle: [C: 04-1] "At least for beta this was a no-op. Probably mod_security needs to be applied higher up for it to work." [puppet] - 10https://gerrit.wikimedia.org/r/452689 (https://phabricator.wikimedia.org/T158837) (owner: 10Krinkle) [21:10:47] good night [21:11:14] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 17 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [21:15:15] RECOVERY - MariaDB Slave Lag: s8 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 55.25 seconds [21:27:45] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [21:34:28] (03CR) 10Dzahn: [V: 031 C: 031] "thanks! confirmed this doesn't appear to be used anymore" [puppet] - 10https://gerrit.wikimedia.org/r/452687 (owner: 10Krinkle) [21:34:54] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [21:36:55] (03PS1) 10BBlack: Revert "Update alexa image block to just 1500px URIs" [puppet] - 10https://gerrit.wikimedia.org/r/452835 [21:36:57] (03PS1) 10BBlack: Revert "block alexawikibot for now" [puppet] - 10https://gerrit.wikimedia.org/r/452836 [21:37:53] (03CR) 10BBlack: [C: 032] Revert "Update alexa image block to just 1500px URIs" [puppet] - 10https://gerrit.wikimedia.org/r/452835 (owner: 10BBlack) [21:37:56] (03CR) 10BBlack: [C: 032] Revert "block alexawikibot for now" [puppet] - 10https://gerrit.wikimedia.org/r/452836 (owner: 10BBlack) [21:42:55] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [21:42:55] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 25 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [21:43:15] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 20 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [21:43:32] XioNoX: anything we can do for those alerts? 
^ [21:43:46] (03PS2) 10Filippo Giunchedi: logstash: use /etc/default/logstash to add jmx_exporter [puppet] - 10https://gerrit.wikimedia.org/r/452747 (https://phabricator.wikimedia.org/T200362) [21:43:54] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy [21:45:08] godog: downtime for a bit, I'll email HE's noc [21:46:43] sounds good -- thanks! [21:47:55] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 18 probes of 315 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [21:48:24] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 16 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [21:51:29] (03CR) 10Filippo Giunchedi: [C: 032] "PCC https://puppet-compiler.wmflabs.org/compiler02/12090/logstash1007.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/452747 (https://phabricator.wikimedia.org/T200362) (owner: 10Filippo Giunchedi) [21:51:42] (03PS3) 10Filippo Giunchedi: logstash: use /etc/default/logstash to add jmx_exporter [puppet] - 10https://gerrit.wikimedia.org/r/452747 (https://phabricator.wikimedia.org/T200362) [21:54:55] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [21:57:48] (03PS1) 10Dzahn: admins: Revoke SSH key for Daniel Kinzler [puppet] - 10https://gerrit.wikimedia.org/r/452841 (https://phabricator.wikimedia.org/T201913) [21:59:02] (03CR) 10Dzahn: [C: 032] admins: Revoke SSH key for Daniel Kinzler [puppet] - 10https://gerrit.wikimedia.org/r/452841 (https://phabricator.wikimedia.org/T201913) (owner: 10Dzahn) [22:01:22] (03PS1) 10Jforrester: Disable wgLegacyJavaScriptGlobals on all group0 wikis, not just test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452843 (https://phabricator.wikimedia.org/T35837) [22:01:36] (03PS1) 10Dzahn: admins: add new SSH key for Daniel Kinzler [puppet] - 10https://gerrit.wikimedia.org/r/452844 (https://phabricator.wikimedia.org/T201913) [22:02:04] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [22:04:36] ACKNOWLEDGEMENT - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map daniel_zahn XioNoX is mailing Hurricane Electric [22:04:39] (03PS4) 10Filippo Giunchedi: logstash: use /etc/default/logstash to add jmx_exporter [puppet] - 10https://gerrit.wikimedia.org/r/452747 (https://phabricator.wikimedia.org/T200362) [22:07:04] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [22:07:39] (03CR) 10Dzahn: [C: 04-1] "needs a verification that it's not a compromised phab account. could be a GPG signature, a 10 second hangout, a selfie with the key or som" [puppet] - 10https://gerrit.wikimedia.org/r/452844 (https://phabricator.wikimedia.org/T201913) (owner: 10Dzahn) [22:08:39] (03CR) 10Dzahn: [C: 04-1] "or identified IRC nick i guess. 
but you were offline" [puppet] - 10https://gerrit.wikimedia.org/r/452844 (https://phabricator.wikimedia.org/T201913) (owner: 10Dzahn) [22:10:30] (03CR) 10Dzahn: [C: 04-1] "or maybe somebody in EU timezone can do that kind of thing with you and add a +1 or merge" [puppet] - 10https://gerrit.wikimedia.org/r/452844 (https://phabricator.wikimedia.org/T201913) (owner: 10Dzahn) [22:11:35] PROBLEM - logstash syslog TCP port on logstash1008 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused [22:11:44] PROBLEM - logstash JSON linesTCP port on logstash1008 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused [22:11:44] PROBLEM - Check systemd state on logstash1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [22:11:45] PROBLEM - logstash process on logstash1008 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 499 (logstash), command name java, args logstash [22:11:56] that's me ^ fix incoming [22:12:04] PROBLEM - logstash log4j TCP port on logstash1008 is CRITICAL: connect to address 127.0.0.1 and port 4560: Connection refused [22:12:31] thanks [22:12:35] RECOVERY - logstash syslog TCP port on logstash1008 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 [22:12:45] RECOVERY - logstash JSON linesTCP port on logstash1008 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 [22:12:45] RECOVERY - Check systemd state on logstash1008 is OK: OK - running: The system is fully operational [22:12:45] RECOVERY - logstash process on logstash1008 is OK: PROCS OK: 1 process with UID = 499 (logstash), command name java, args logstash [22:13:04] RECOVERY - logstash log4j TCP port on logstash1008 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 4560 [22:14:14] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [22:14:29] (03PS1) 10Filippo Giunchedi: logstash: fix /etc/default/logstash [puppet] - 10https://gerrit.wikimedia.org/r/452845 (https://phabricator.wikimedia.org/T200362) [22:14:53] (03CR) 10Filippo Giunchedi: [C: 032] logstash: fix /etc/default/logstash [puppet] - 10https://gerrit.wikimedia.org/r/452845 (https://phabricator.wikimedia.org/T200362) (owner: 10Filippo Giunchedi) [22:15:35] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to restricted production access and analytics-privatedata-users for Karen Brown - https://phabricator.wikimedia.org/T201668 (10Dzahn) This is a bit strange, i can't find a user "karen" nor any user with email kbrown_at_wikimedia.org... [22:15:45] godog: the issue is that I think our alerting threshold is at 20, and it keeps flapping between 18 and 20... [22:15:52] well, at 19 [22:17:05] indeed, bad coincidence and sad_trombone.mkv [22:17:45] when did the .wav become an .mkv [22:18:06] Accept it, it's the 2010s now. :-) [22:18:12] haha [22:18:25] heheh things change [22:18:33] next thing you know it'll be .mov [22:19:14] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 316 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [22:20:10] ewww. wmv [22:20:36] that's why we should probably downtime them for like 24h [22:21:10] ok, doing [22:24:05] (03CR) 10Krinkle: [C: 031] "Effectively, this disables the legacy globals on mediawiki.org, test.wikipedia.org and closed wikis. 
This should have relatively small imp" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452843 (https://phabricator.wikimedia.org/T35837) (owner: 10Jforrester) [22:24:30] downtimed for 24h. 3 checks. IPv6-only and eqiad/codfw/ulsfo. eqsin not affected [22:24:48] and esams has no ripe-atlas [22:25:23] thx [22:37:04] it's weird that i can see how James created staff users on wikitech but i cant see a trace of them in LDAP from mwmaint1001 [22:37:20] searched by email, *@wikimedia.org etc [22:38:06] almost as if the "create user for somebody else" method doesn't create an LDAP but a local user [22:48:56] (03PS1) 10AndyRussG: CentralNotice: EventLogging data stream at a low level (0.01 sample rate) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452859 [22:50:02] (03PS2) 10AndyRussG: CentralNotice: EventLogging data at a low level (0.01 sample rate) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452859 [22:50:44] (03CR) 10Filippo Giunchedi: icinga: Define 'notify-by-email-per-service' command (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/452744 (owner: 10Krinkle) [22:52:55] (03PS2) 10Krinkle: icinga: Define 'notify-by-email-per-service' command [puppet] - 10https://gerrit.wikimedia.org/r/452744 [22:53:07] (03PS3) 10Krinkle: icinga: Define 'notify-by-email-per-service' command [puppet] - 10https://gerrit.wikimedia.org/r/452744 [22:53:53] (03CR) 10Krinkle: "Changed in the other direction instead by removing $hostalias. For the purpose of Grafana alerts, this wasn't useful anyway, and I imagine" [puppet] - 10https://gerrit.wikimedia.org/r/452744 (owner: 10Krinkle) [22:57:03] (03CR) 10Filippo Giunchedi: [C: 032] "Even nicer with just the service" [puppet] - 10https://gerrit.wikimedia.org/r/452744 (owner: 10Krinkle) [22:57:11] (03PS4) 10Filippo Giunchedi: icinga: Define 'notify-by-email-per-service' command [puppet] - 10https://gerrit.wikimedia.org/r/452744 (owner: 10Krinkle) [23:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor I � Unicode. All rise for Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180814T2300). [23:00:04] James_F and RoanKattouw: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:59] Heya. [23:01:03] RoanKattouw: You SWATing? [23:01:06] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to restricted production access and analytics-privatedata-users for Karen Brown - https://phabricator.wikimedia.org/T201668 (10Krenair) Looks like the account was created by a logged-in user instead of anonymously, no idea if that eve... [23:01:08] (03PS3) 10Jforrester: Remove obsolete $wgPopupsBetaFeature, Part I: CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450906 (owner: 10Prtksxna) [23:01:10] (03PS6) 10Jforrester: Remove obsolete $wgPopupsBetaFeature, Part III: InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444574 (owner: 10Prtksxna) [23:01:12] (03PS1) 10Jforrester: Remove obsolete $wgPopupsBetaFeature, Part II: InitialiseSettings-labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452863 [23:01:39] Yes [23:01:42] Missed the ping somehow [23:02:29] RoanKattouw: hi! 
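(The 24h downtime set here is, under the hood, an Icinga external command. A minimal sketch using the standard SCHEDULE_SVC_DOWNTIME syntax; the monitored hostname and the command-file path are assumptions about the local setup:)

  $ now=$(date +%s); end=$((now + 86400))
  $ printf '[%s] SCHEDULE_SVC_DOWNTIME;SOMEHOST;IPv6 ping to eqiad on ripe-atlas-eqiad IPv6;%s;%s;1;0;86400;daniel_zahn;XioNoX is mailing Hurricane Electric\n' \
      "$now" "$now" "$end" >> /var/lib/icinga/rw/icinga.cmd
  # Repeated for the codfw and ulsfo checks ("3 checks" above).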
[23:02:34] (03CR) 10Catrope: [C: 032] Disable wgLegacyJavaScriptGlobals on all group0 wikis, not just test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452843 (https://phabricator.wikimedia.org/T35837) (owner: 10Jforrester) [23:02:37] Hey AndyRussG [23:02:41] You'll go second after James_F [23:02:52] okok no rush :) thx much! [23:03:26] (03CR) 10Ejegg: [C: 031] CentralNotice: EventLogging data at a low level (0.01 sample rate) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452859 (owner: 10AndyRussG) [23:03:55] (03Merged) 10jenkins-bot: Disable wgLegacyJavaScriptGlobals on all group0 wikis, not just test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452843 (https://phabricator.wikimedia.org/T35837) (owner: 10Jforrester) [23:04:36] 10Operations, 10Wikimedia-Logstash, 10Goal, 10User-fgiunchedi, 10User-herron: Shorten logstash retention - https://phabricator.wikimedia.org/T201971 (10fgiunchedi) p:05Triage>03Normal [23:05:46] James_F: Your change is on mwdebug1002, please test [23:06:08] (03PS3) 10Catrope: CentralNotice: EventLogging data at a low level (0.01 sample rate) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452859 (owner: 10AndyRussG) [23:06:10] (03CR) 10Jforrester: "PS3: Split the commit into touching only one file, per the SWAT rule. LGTM." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450906 (owner: 10Prtksxna) [23:06:11] Kk. [23:06:16] (03CR) 10Catrope: [C: 032] CentralNotice: EventLogging data at a low level (0.01 sample rate) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452859 (owner: 10AndyRussG) [23:07:37] (03Merged) 10jenkins-bot: CentralNotice: EventLogging data at a low level (0.01 sample rate) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452859 (owner: 10AndyRussG) [23:11:34] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [23:11:36] !log fdans@deploy1001 Started deploy [analytics/refinery@21e07ae]: Deploying revert to prevent partition dropping jobs from failing [23:11:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:11:40] 10Operations, 10Maps-Sprint, 10Maps (Tilerator): Increase frequency of OSM replication - https://phabricator.wikimedia.org/T137939 (10Mholloway) p:05Normal>03High [23:13:50] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Disable wgLegacyJavaScriptGlobals on all group0 wikis (T35837) (duration: 00m 54s) [23:13:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:13:57] T35837: Set $wgLegacyJavaScriptGlobals = false by default - https://phabricator.wikimedia.org/T35837 [23:15:08] AndyRussG: Your patch is on mwdebug1002, please test (to the extent feasible) [23:15:08] 10Operations, 10Wikimedia-Logstash, 10Goal, 10User-fgiunchedi, 10User-herron: Shorten logstash retention - https://phabricator.wikimedia.org/T201971 (10Bawolff) Could we maybe dump by channel type? api-feature-usage is by far the majority of logstash events, but is much less likely useful to retain for... 
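(mutante's account hunt above, continued in the Karen Brown task comments, can be reproduced with a plain ldapsearch from mwmaint1001. A sketch; the server URI and base DN are assumptions about the local directory layout:)

  $ ldapsearch -x -H ldaps://ldap-ro.eqiad.wikimedia.org \
      -b 'ou=people,dc=wikimedia,dc=org' '(mail=kbrown@wikimedia.org)' uid cn mail
  # An empty result for an account that exists on wikitech supports the
  # "local wiki user, no LDAP entry" theory from the discussion above.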
[23:15:40] Hmm the gate-and-submit-swat queue doesn't seem to be behaving in a prioritized manner exactly [23:16:54] PROBLEM - etcd request latencies on neon is CRITICAL: instance=10.64.0.40:6443 operation={get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [23:16:59] RoanKattouw: by chance do you know how long it currently takes for the config change to bubble up to JS in this case? [23:17:14] I guess the normal RL module rollover delay? [23:18:18] Oh wait, wrong debug instance [23:18:44] RoanKattouw: all good! :) [23:20:04] !log fdans@deploy1001 Finished deploy [analytics/refinery@21e07ae]: Deploying revert to prevent partition dropping jobs from failing (duration: 08m 27s) [23:20:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:20:19] !log catrope@deploy1001 Synchronized wmf-config/CommonSettings.php: Enable CentralNotice EventLogging at a low sample rate (0.01) (duration: 00m 50s) [23:20:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:20:54] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=LIST https://grafana.wikimedia.org/dashboard/db/kubernetes-api [23:21:26] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to restricted production access and analytics-privatedata-users for Karen Brown - https://phabricator.wikimedia.org/T201668 (10Dzahn) @Jalexander I was able to go directly to the wikitech wiki database and look in the user table and i... [23:21:55] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [23:23:05] RECOVERY - etcd request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [23:24:28] RoanKattouw: thx!!!! [23:24:30] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to restricted production access and analytics-privatedata-users for Karen Brown - https://phabricator.wikimedia.org/T201668 (10Dzahn) @Krenair this would confirm my suspicion. thank you. it looks like that might not work with LDAP int... [23:29:48] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to restricted production access and analytics-privatedata-users for Karen Brown - https://phabricator.wikimedia.org/T201668 (10Dzahn) on DB level i can see these differences: For my own user the fields "user_real_name" and "user_pass... [23:31:57] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to restricted production access and analytics-privatedata-users for Karen Brown - https://phabricator.wikimedia.org/T201668 (10Legoktm) Did she log into wikitech and set a real password instead of the temporary one? That would populat... [23:35:51] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to restricted production access and analytics-privatedata-users for Karen Brown - https://phabricator.wikimedia.org/T201668 (10Dzahn) @Jalexander Can we just let her create a normal user (as anon) and not worry about the "(WMF)" part... [23:37:01] 10Operations, 10Wikimedia-Logstash, 10Goal, 10User-fgiunchedi, 10User-herron: Shorten logstash retention - https://phabricator.wikimedia.org/T201971 (10fgiunchedi) We can't delete inside indices easily, no. Dropping old indices is cheap compared to actually looking inside and delete only specific data. I... 
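(Index-level deletion, as fgiunchedi describes in the retention task above, is one cheap call against Elasticsearch, whereas deleting only some channels means looking inside indices. A sketch assuming the conventional logstash-YYYY.MM.DD daily index names and a locally reachable cluster:)

  $ curl -XDELETE 'http://localhost:9200/logstash-2018.07.15'
  # Drops a whole day of logs in one operation; there is no comparably
  # cheap way to remove only, say, api-feature-usage events from an index.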
[23:38:07] 10Operations, 10Wikimedia-Logstash, 10Goal, 10User-fgiunchedi, 10User-herron: Shorten logstash retention temporarily - https://phabricator.wikimedia.org/T201971 (10fgiunchedi) [23:38:59] !log Change password for User:Textorus [23:39:03] uh, email [23:39:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:39:11] !log Correction: Change email for User:Textorus [23:39:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:39:41] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to restricted production access and analytics-privatedata-users for Karen Brown - https://phabricator.wikimedia.org/T201668 (10Dzahn) Ok, so let's first have the 2 users (also see T201667) confirm they set their intial password. Mayb... [23:39:51] (03CR) 10jenkins-bot: Disable wgLegacyJavaScriptGlobals on all group0 wikis, not just test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452843 (https://phabricator.wikimedia.org/T35837) (owner: 10Jforrester) [23:39:53] (03CR) 10jenkins-bot: CentralNotice: EventLogging data at a low level (0.01 sample rate) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452859 (owner: 10AndyRussG) [23:41:22] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to restricted production access and analytics-privatedata-users for Karen Brown - https://phabricator.wikimedia.org/T201668 (10Dzahn) 05Open>03stalled [23:43:56] 10Operations, 10LDAP-Access-Requests, 10User-Addshore: Give access to graphite and grafana-admin to Aleksey Bekh-Ivanov (WMDE) - https://phabricator.wikimedia.org/T199233 (10Dzahn) Hi @RStallman-legalteam here's another WMDE engineer who needs an NDA signed. [23:46:59] 10Operations, 10Wikimedia-Logstash, 10Goal, 10User-fgiunchedi, 10User-herron: Shorten logstash retention temporarily - https://phabricator.wikimedia.org/T201971 (10Legoktm) api-feature-usage is exposed via Special:ApiFeatureUsage, which queries the log entries from elasticsearch, I'm not sure if that's d... [23:47:49] 10Operations, 10SRE-Access-Requests: Requesting access to restricted production access and analytics-privatedata-users for Patrick Earley - https://phabricator.wikimedia.org/T201667 (10Dzahn) The comments from T201668#4503338 and following also apply to this ticket. The user_password field is not populated i... [23:48:02] 10Operations, 10SRE-Access-Requests: Requesting access to restricted production access and analytics-privatedata-users for Patrick Earley - https://phabricator.wikimedia.org/T201667 (10Dzahn) 05Open>03stalled [23:48:28] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: new ssh key for daniel - https://phabricator.wikimedia.org/T201913 (10Dzahn) p:05Triage>03High [23:48:34] !log deleting three images for legal compliance [23:48:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:55:29] !log restarted populateContentTables.php on s2 [23:55:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:56:08] 10Operations, 10Patch-For-Review: reinstall rdb100[56] with RAID - https://phabricator.wikimedia.org/T140442 (10Dzahn) I see that T190327 is closed meanwhile. Did it actually become easy now? :) [23:59:59] 10Operations, 10Wikimedia-Logstash, 10Goal, 10User-fgiunchedi, 10User-herron: Shorten logstash retention temporarily - https://phabricator.wikimedia.org/T201971 (10Krinkle) >>! In T201971#4503528, @Bawolff wrote: > Could we maybe dump by channel type? 
api-feature-usage is by far the majority of logstash...