[00:05:38] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [00:27:12] (03PS1) 10Dzahn: icinga/check_ssl: add support for stretch, rename it to check_tls [puppet] - 10https://gerrit.wikimedia.org/r/465805 (https://phabricator.wikimedia.org/T202782) [00:27:58] (03CR) 10jerkins-bot: [V: 04-1] icinga/check_ssl: add support for stretch, rename it to check_tls [puppet] - 10https://gerrit.wikimedia.org/r/465805 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [00:30:57] (03PS2) 10Dzahn: icinga/check_ssl: add support for stretch, rename it to check_tls [puppet] - 10https://gerrit.wikimedia.org/r/465805 (https://phabricator.wikimedia.org/T202782) [00:31:46] (03CR) 10jerkins-bot: [V: 04-1] icinga/check_ssl: add support for stretch, rename it to check_tls [puppet] - 10https://gerrit.wikimedia.org/r/465805 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [01:02:48] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 30 probes of 320 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [01:05:17] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [01:09:58] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 46 probes of 320 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [01:20:07] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 32 probes of 320 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [01:27:18] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is 
CRITICAL: CRITICAL - failed 87 probes of 320 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [01:28:57] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 65 probes of 320 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [01:28:57] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 52 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [01:30:18] (03CR) 10Krinkle: [C: 032] wmfSetupEtcd: Correctly initialize the local cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464117 (https://phabricator.wikimedia.org/T176370) (owner: 10Giuseppe Lavagetto) [01:32:03] (03Merged) 10jenkins-bot: wmfSetupEtcd: Correctly initialize the local cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464117 (https://phabricator.wikimedia.org/T176370) (owner: 10Giuseppe Lavagetto) [01:33:58] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 16 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [01:33:58] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 20 probes of 320 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [01:38:36] !log krinkle@deploy1001 Synchronized wmf-config/etcd.php: T176370 - I5e7e5d167d517 (duration: 00m 55s) [01:38:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:38:39] T176370: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 [01:39:35] (03CR) 10jenkins-bot: wmfSetupEtcd: Correctly initialize the local cache 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/464117 (https://phabricator.wikimedia.org/T176370) (owner: 10Giuseppe Lavagetto) [01:41:18] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 65 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [01:51:27] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 16 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [01:58:47] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 101 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [02:00:47] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 102 probes of 320 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [02:07:14] !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.24/tests/phpunit/includes/page/: Ib211d98498f (duration: 00m 49s) [02:07:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:10:31] !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.24/includes/page/WikiPage.php: T203942 - Ib211d98498f (duration: 00m 49s) [02:10:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:10:34] T203942: Fatal exception when attempting to view wiki page that should redirect to Media-namespace ("NS_MEDIA is a virtual namespace; use NS_FILE") - https://phabricator.wikimedia.org/T203942 [02:21:13] (03PS2) 10Krinkle: Enforce no-session constraint in static.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464391 (https://phabricator.wikimedia.org/T127233) (owner: 10Anomie) [02:21:15] (03CR) 10Krinkle: [C: 032] 
Enforce no-session constraint in static.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464391 (https://phabricator.wikimedia.org/T127233) (owner: 10Anomie) [02:22:28] (03Merged) 10jenkins-bot: Enforce no-session constraint in static.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464391 (https://phabricator.wikimedia.org/T127233) (owner: 10Anomie) [02:25:09] !log krinkle@deploy1001 Synchronized w/static.php: T127233 - Ic6acb70 (duration: 00m 49s) [02:25:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:25:21] T127233: Endpoints which do not need to authenticate users should set MW_NO_SESSION - https://phabricator.wikimedia.org/T127233 [02:25:58] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 20 probes of 320 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [02:26:01] (03CR) 10jenkins-bot: Enforce no-session constraint in static.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464391 (https://phabricator.wikimedia.org/T127233) (owner: 10Anomie) [02:34:08] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 16 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [02:52:38] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [02:52:48] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [02:55:18] 
PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [02:55:35] Something is up [02:55:39] 6,000 exceptions [02:55:50] All in the form of: LoadBalancer.php: Transaction spent 10.059820175171 second(s) in writes, exceeding the limit of 3 [02:57:09] Write contention? [02:57:33] Something in api, with commons/search referers [03:03:57] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [03:06:36] Looks like a commonswiki bot editing via API and hitting some issue, possibly with editIncrement [03:10:07] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [03:10:08] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [03:10:48] Bleh, it's not findable in Hadoop's wmf_raw.apiaction. 
[03:28:47] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 887.64 seconds [03:39:17] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 39 probes of 320 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [03:44:18] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 18 probes of 320 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [03:49:07] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 260.17 seconds [04:20:23] (03CR) 10Dzahn: [C: 04-1] "already working on an alternative that just edits check_ssl itself instead" [puppet] - 10https://gerrit.wikimedia.org/r/465805 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [05:07:17] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1067 - https://phabricator.wikimedia.org/T206500 (10Marostegui) 05Open>03Resolved So after replacing the disk 3 times yesterday evening...we finally got this fixed! Thanks a lot Chris! ``` Number of Virtual Disks: 1 Virtual Drive: 0 (Target Id: 0) Na... 
[05:13:33] (03PS1) 10KartikMistry: lttoolbox: New upstream release [debs/contenttranslation/lttoolbox] - 10https://gerrit.wikimedia.org/r/465932 (https://phabricator.wikimedia.org/T206439) [05:18:39] (03CR) 10Marostegui: "this is not needed and can be abandoned" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465634 (owner: 10Banyek) [05:20:04] (03PS1) 10Marostegui: db-eqiad.php: Temporary depool db1109 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465933 (https://phabricator.wikimedia.org/T205865) [05:20:17] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 33 probes of 320 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [05:21:47] 10Operations, 10DBA, 10Epic, 10Patch-For-Review: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107 (10Marostegui) [05:21:52] 10Operations, 10DBA, 10cloud-services-team, 10wikitech.wikimedia.org, and 2 others: Move some wikis to s5 - https://phabricator.wikimedia.org/T184805 (10Marostegui) [05:21:55] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Temporary depool db1109 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465933 (https://phabricator.wikimedia.org/T205865) (owner: 10Marostegui) [05:22:59] 10Operations, 10Goal, 10Patch-For-Review: Perform a datacenter switchover (2018-19 Q1) - https://phabricator.wikimedia.org/T199073 (10Marostegui) [05:23:03] 10Operations, 10DBA, 10Epic, 10Patch-For-Review: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107 (10Marostegui) 05Open>03Resolved a:03Marostegui All the tasks we scheduled to do whilst eqiad was passive, were done!. We also included T184805 on a last minute task,... 
[05:23:09] (03Merged) 10jenkins-bot: db-eqiad.php: Temporary depool db1109 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465933 (https://phabricator.wikimedia.org/T205865) (owner: 10Marostegui) [05:23:43] wikibase appears to be failing its jenkins checkstyle jobs [05:23:44] (03PS1) 10Marostegui: Revert "db-eqiad.php: Temporary depool db1109" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465935 [05:24:10] (03PS5) 10Marostegui: db: Switch dns master alias to eqiad [dns] - 10https://gerrit.wikimedia.org/r/458790 (https://phabricator.wikimedia.org/T203777) (owner: 10Alexandros Kosiaris) [05:24:22] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1109 (duration: 00m 51s) [05:24:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:24:40] 03:14:59 Build step 'Execute shell' marked build as failure [05:24:40] 03:14:59 [CHECKSTYLE] Collecting checkstyle analysis files... [05:24:40] 03:14:59 [CHECKSTYLE] Searching for all files in /srv/jenkins-workspace/workspace/mwext-php70-phan-docker that match the pattern log/phan-issues [05:24:40] 03:14:59 [CHECKSTYLE] No files found. Configuration error? [05:25:00] eg https://integration.wikimedia.org/ci/job/mwext-php70-phan-docker/14812/console [05:25:26] not sure what's going on there, there doesn't seem to be any recent change to either wikibase or integration-config that obviously explains this [05:26:01] cscott: Maybe create a task for releng? 
[05:26:40] marostegui: sure, just figured i'd mention it here first in case someone was like, oh, i just totally did XYZ that would explain that [05:27:07] cscott: Yeah, from my side, I am not aware of changes there, but it doesn't mean there were not changes of course :-) [05:27:12] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Temporary depool db1109" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465935 (owner: 10Marostegui) [05:27:27] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 43 probes of 320 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [05:28:11] 10Operations, 10DBA, 10MediaWiki-General-or-Unknown, 10Wikidata, and 4 others: Investigate decrease in wikidata dispatch times due to eqiad -> codfw DC switch - https://phabricator.wikimedia.org/T205865 (10Marostegui) I have run an `analyze table` on both db1109 and db2083 and things are similar now: ```... 
[05:28:25] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Temporary depool db1109" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465935 (owner: 10Marostegui) [05:29:31] (03CR) 10Marostegui: [C: 032] db: Switch dns master alias to eqiad [dns] - 10https://gerrit.wikimedia.org/r/458790 (https://phabricator.wikimedia.org/T203777) (owner: 10Alexandros Kosiaris) [05:29:33] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1109 (duration: 00m 48s) [05:29:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:31:09] marostegui: filed T206738, although i'm not certain i used the right tags [05:31:10] T206738: Wikibase appears to be failing checkstyle on all builds now - https://phabricator.wikimedia.org/T206738 [05:32:06] (03CR) 10jenkins-bot: db-eqiad.php: Temporary depool db1109 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465933 (https://phabricator.wikimedia.org/T205865) (owner: 10Marostegui) [05:32:08] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Temporary depool db1109" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465935 (owner: 10Marostegui) [05:32:15] cscott: I changed one, but there are so many tags that it is not easy to find the right one :) [05:43:30] !log Purge binary logs on pc2005 due to disk space issues - T206740 [05:43:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:43:33] T206740: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 [05:46:05] (03PS2) 10Muehlenhoff: Remove now obsolete Diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/465596 (https://phabricator.wikimedia.org/T183454) [05:47:28] (03CR) 10Muehlenhoff: [C: 032] Remove now obsolete Diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/465596 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [05:53:48] (03PS1) 10Marostegui: db-eqiad.php: Increase weight for db1092 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/465959 (https://phabricator.wikimedia.org/T205514) [05:58:34] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase weight for db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465959 (https://phabricator.wikimedia.org/T205514) (owner: 10Marostegui) [05:59:49] (03Merged) 10jenkins-bot: db-eqiad.php: Increase weight for db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465959 (https://phabricator.wikimedia.org/T205514) (owner: 10Marostegui) [06:00:53] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Increase db1092 weight (duration: 00m 49s) [06:00:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:04:32] (03CR) 10jenkins-bot: db-eqiad.php: Increase weight for db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465959 (https://phabricator.wikimedia.org/T205514) (owner: 10Marostegui) [06:07:56] (03PS4) 10Muehlenhoff: Stop the diamond service when removing Diamond [puppet] - 10https://gerrit.wikimedia.org/r/465440 (https://phabricator.wikimedia.org/T183454) [06:09:26] (03PS4) 10Urbanecm: Initial configuration for yuewiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463482 (https://phabricator.wikimedia.org/T205546) [06:10:21] (03CR) 10Muehlenhoff: [C: 032] Stop the diamond service when removing Diamond [puppet] - 10https://gerrit.wikimedia.org/r/465440 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [06:28:08] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 30 probes of 320 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [06:29:38] PROBLEM - puppet last run on mw1323 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/var/lib/hphpd/hphpd.ini] [06:33:33] (03PS2) 10Muehlenhoff: Remove all absented Diamond collectors [puppet] - 10https://gerrit.wikimedia.org/r/465600 (https://phabricator.wikimedia.org/T183454) [06:35:27] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 63 probes of 320 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [06:39:47] (03CR) 10Muehlenhoff: [C: 032] Remove all absented Diamond collectors [puppet] - 10https://gerrit.wikimedia.org/r/465600 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [06:55:37] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 32 probes of 320 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [06:56:00] (03PS2) 10Muehlenhoff: Remove obsolete Diamond collectors [puppet] - 10https://gerrit.wikimedia.org/r/465601 (https://phabricator.wikimedia.org/T183454) [06:57:49] (03CR) 10Muehlenhoff: [C: 032] Remove obsolete Diamond collectors [puppet] - 10https://gerrit.wikimedia.org/r/465601 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [06:59:08] (03PS2) 10Muehlenhoff: Enable base::service_auto_restart for SSH [puppet] - 10https://gerrit.wikimedia.org/r/444230 (https://phabricator.wikimedia.org/T135991) [06:59:58] RECOVERY - puppet last run on mw1323 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [07:02:57] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 62 probes of 320 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [07:06:27] (03PS1) 10Muehlenhoff: Disable Diamond on test role [puppet] - 10https://gerrit.wikimedia.org/r/466007 (https://phabricator.wikimedia.org/T183454) [07:09:51] 
(03PS1) 10Joal: Bump AQS druid-datasource to 2018_09 [puppet] - 10https://gerrit.wikimedia.org/r/466018 [07:11:08] (03CR) 10Muehlenhoff: [C: 032] Disable Diamond on test role [puppet] - 10https://gerrit.wikimedia.org/r/466007 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [07:12:57] (03PS2) 10Elukey: Bump AQS druid-datasource to 2018_09 [puppet] - 10https://gerrit.wikimedia.org/r/466018 (owner: 10Joal) [07:13:48] (03CR) 10Elukey: [C: 032] Bump AQS druid-datasource to 2018_09 [puppet] - 10https://gerrit.wikimedia.org/r/466018 (owner: 10Joal) [07:17:00] (03PS3) 10Mathew.onipe: base::monitoring::host: added prometheus check for network drops [puppet] - 10https://gerrit.wikimedia.org/r/465450 (https://phabricator.wikimedia.org/T206114) [07:19:16] (03CR) 10Mathew.onipe: base::monitoring::host: added prometheus check for network drops (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/465450 (https://phabricator.wikimedia.org/T206114) (owner: 10Mathew.onipe) [07:26:13] \o all [07:26:16] jouncebot: now [07:26:16] No deployments scheduled for the next 3 hour(s) and 33 minute(s) [07:26:36] * addshore is going to sync a mw change adding some more tracking to statsd for wikidata dispatching [07:27:49] (03PS1) 10Elukey: Add stat1007 as role::spare::system [puppet] - 10https://gerrit.wikimedia.org/r/466030 (https://phabricator.wikimedia.org/T203852) [07:28:45] (03CR) 10Elukey: [C: 032] Add stat1007 as role::spare::system [puppet] - 10https://gerrit.wikimedia.org/r/466030 (https://phabricator.wikimedia.org/T203852) (owner: 10Elukey) [07:36:29] !log roll restart of aqs on aqs100[4-9] to pick up new Druid settings [07:36:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:38:08] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 29 probes of 320 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [07:42:55] (03CR) 
10Gehel: base::monitoring::host: added prometheus check for network drops (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/465450 (https://phabricator.wikimedia.org/T206114) (owner: 10Mathew.onipe) [07:43:40] !log deploy https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/Wikibase/+/466031 to mwmaint1002 only (increasing tracking of wikidata dispatching) T205865 [07:43:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:43:43] T205865: Investigate decrease in wikidata dispatch times due to eqiad -> codfw DC switch - https://phabricator.wikimedia.org/T205865 [07:45:12] (03PS1) 10Jcrespo: parsercache: Reduce retention time to 7 days due to running out of space [puppet] - 10https://gerrit.wikimedia.org/r/466036 [07:45:27] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 48 probes of 320 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [07:47:15] (03CR) 10Gilles: [C: 031] Remove obsolete mediawiki-firejail-rsvg-convert [puppet] - 10https://gerrit.wikimedia.org/r/465590 (owner: 10Muehlenhoff) [07:47:21] (03CR) 10Marostegui: [C: 031] parsercache: Reduce retention time to 7 days due to running out of space [puppet] - 10https://gerrit.wikimedia.org/r/466036 (owner: 10Jcrespo) [07:50:10] (03PS2) 10Gehel: wdqs: increase throttling limits for internal cluster [puppet] - 10https://gerrit.wikimedia.org/r/465652 (https://phabricator.wikimedia.org/T206648) [07:50:37] 10Operations, 10DBA, 10MediaWiki-Database: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Marostegui) [07:51:37] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [07:52:36] (03PS3) 10Gehel: wdqs: increase throttling limits for internal cluster 
[puppet] - 10https://gerrit.wikimedia.org/r/465652 (https://phabricator.wikimedia.org/T206648) [07:53:36] (03CR) 10Gehel: [C: 032] wdqs: increase throttling limits for internal cluster [puppet] - 10https://gerrit.wikimedia.org/r/465652 (https://phabricator.wikimedia.org/T206648) (owner: 10Gehel) [07:56:34] 10Operations, 10DBA, 10MediaWiki-General-or-Unknown, 10Wikidata, and 4 others: Investigate decrease in wikidata dispatch times due to eqiad -> codfw DC switch - https://phabricator.wikimedia.org/T205865 (10Addshore) Looking at a couple of minutes of extra timing data it looks like this is down to the selec... [07:57:31] !log rolling restart blazegraph on wdqs-internal for config change - T206648 [07:57:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:57:34] T206648: Increase throttling rates for wdqs internal cluster - https://phabricator.wikimedia.org/T206648 [07:57:57] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [07:59:07] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [07:59:15] (03CR) 10Gehel: [C: 032] wdqs: refactor use_git_deploy to include scap3 and autodeploy options [puppet] - 10https://gerrit.wikimedia.org/r/465566 (https://phabricator.wikimedia.org/T206597) (owner: 10Mathew.onipe) [07:59:23] (03PS4) 10Gehel: wdqs: refactor use_git_deploy to include scap3 and autodeploy options [puppet] - 10https://gerrit.wikimedia.org/r/465566 (https://phabricator.wikimedia.org/T206597) (owner: 10Mathew.onipe) [08:00:07] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [08:00:28] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 31 probes of 320 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [08:04:04] !log purging binary logs on pc1004 [08:04:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:06] !log running /usr/local/bin/mwscript purgeParserCache.php --wiki=aawiki --age=1900800 --msleep 0 [08:04:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:08] <_joe_> we had a small mediawiki outage there ^^ [08:04:29] !log purging binary logs on pc1005 [08:04:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:54] !log purging binary logs on pc1006 [08:04:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:06:18] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: rack/setup/install stat1007.eqiad.wmnet (stat1005 user replacement) - https://phabricator.wikimedia.org/T203852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts: ``` stat1007... 
[08:06:32] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Well application/x-www-form-urlencoded is the default for POST requests in almost every HTTP client." (032 comments) [software/service-checker] - 10https://gerrit.wikimedia.org/r/461457 (owner: 10Dduvall) [08:07:13] _joe_: what was it? [08:07:22] <_joe_> addshore: still no idea [08:07:28] *looks* [08:07:43] <_joe_> we had some more during the night, too [08:07:47] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 73 probes of 320 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [08:07:53] <_joe_> not huge total outages, but still [08:08:29] [{exception_id}] {exception_url} Wikimedia\Rdbms\DBQueryError from line 1484 of /srv/mediawiki/php-1.32.0-wmf.24/includes/libs/rdbms/database/Database.php: A database query error has occurred. Did you forget to run your application's database schema upd.... [08:08:43] interesting.... [08:09:28] looks like commons wiki transactions taking too long and throwing [08:09:35] <_joe_> yes [08:09:44] <_joe_> jynus, marostegui, banyek ^^ [08:09:51] addshore: can you give me a hostname? [08:09:52] <_joe_> it was a short burst though [08:10:05] <_joe_> a client marostegui? [08:10:08] <_joe_> or the db server? 
[08:10:14] this is an example doc in logstash https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2018.10.11/mediawiki/?id=AWZiHl3Ez9bxnQJdGV4D [08:10:15] the db server [08:10:23] addshore: thanks I will take it from there [08:10:45] It doesn't look like the db server is logged in that log line though [08:10:52] that is a known and reported mw issue [08:11:09] https://phabricator.wikimedia.org/T202715 [08:11:11] <_joe_> yeah it's not logged [08:11:29] Yeah, but the error is enough [08:11:34] As jynus said, it is not "new" [08:11:38] <_joe_> oh my, a counter on sql [08:11:44] * _joe_ feels sad [08:12:08] <_joe_> but it's per user, so I'm less sad [08:12:42] <_joe_> still it means every edit makes the cache on that table stale, right? [08:13:29] https://phabricator.wikimedia.org/T202715#4634924 [08:20:29] !log setting up replication from pc2004 -> pc1004 [08:20:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:47] _joe_: I don't think so, if it is a single transaction [08:20:55] everything will fail [08:21:10] but we have like 20 of those reports, many made by me [08:21:37] I guess it is low impact because it only impacts if you edit faster than innodb can handle, which is pretty fast [08:22:44] So I guess only bots are really affected? 
[08:23:34] sorry, but I don't know, I only reported the issue, mw core people are aware, cannot do more [08:23:58] this is the only thing I know: https://phabricator.wikimedia.org/T202715#4637265 [08:26:12] !log setting up replication from pc2005 -> pc1005 and from pc2006 -> pc2006 [08:26:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:31] !log setting up some automated binlog purge mechanism on pc1004,pc1005,pc1006 [08:30:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:16] (03CR) 10Volans: "I agree with the reasoning, it doesn't look too scary to me in our current production environment given the spread crontabs and the Icinga" [puppet] - 10https://gerrit.wikimedia.org/r/444230 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [08:49:59] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Refactor 'use_git_deploy' in wdqs puppet module to cater for scap3 and autodeployment modes - https://phabricator.wikimedia.org/T206597 (10Mathew.onipe) 05Open>03Resolved [08:51:10] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: rack/setup/install stat1007.eqiad.wmnet (stat1005 user replacement) - https://phabricator.wikimedia.org/T203852 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['stat1007.eqiad.wmnet'] ``` and were **ALL** successful. 
[08:51:55] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Refactor 'use_git_deploy' in wdqs puppet module to cater for scap3 and autodeployment modes - https://phabricator.wikimedia.org/T206597 (10Mathew.onipe) 05Resolved>03Open [08:53:25] (03PS1) 10Elukey: Add IPv6 PTR record for stat1007 [dns] - 10https://gerrit.wikimedia.org/r/466193 (https://phabricator.wikimedia.org/T203852) [08:53:46] (03CR) 10Elukey: [C: 032] Add IPv6 PTR record for stat1007 [dns] - 10https://gerrit.wikimedia.org/r/466193 (https://phabricator.wikimedia.org/T203852) (owner: 10Elukey) [08:54:56] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: rack/setup/install stat1007.eqiad.wmnet (stat1005 user replacement) - https://phabricator.wikimedia.org/T203852 (10elukey) a:05RobH>03elukey [08:55:22] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: rack/setup/install stat1007.eqiad.wmnet (stat1005 user replacement) - https://phabricator.wikimedia.org/T203852 (10elukey) 05Open>03Resolved Done! Will follow up in another task to replace stat1005 with this new host. [09:16:10] (03PS2) 10Mforns: Add druid_load jobs to analytics refinery [puppet] - 10https://gerrit.wikimedia.org/r/465692 (https://phabricator.wikimedia.org/T206342) [09:19:18] 10Operations, 10DBA, 10MediaWiki-Database: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Banyek) The binary logs were purged on pc1004. pc1005 pc1006. Also the binlog_max_size were set to 10M and the hosts now have this running in a screen: ``` while true; do echo "p... [09:20:10] (03CR) 10Mforns: "Thanks Andrew! Fixed a missing coma and Jenkins +2'd, I think this is ready. 
However, we still need to merge the refinery-source patch fir" [puppet] - 10https://gerrit.wikimedia.org/r/465692 (https://phabricator.wikimedia.org/T206342) (owner: 10Mforns) [09:26:39] 10Operations, 10DBA, 10MediaWiki-Database: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Banyek) [09:27:05] 10Operations, 10DBA, 10MediaWiki-Database: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Banyek) [09:27:29] 10Operations, 10DBA, 10MediaWiki-Database: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Banyek) [09:28:31] (03PS1) 10Mathew.onipe: wdqs: change user for autodeploy cron task [puppet] - 10https://gerrit.wikimedia.org/r/466209 (https://phabricator.wikimedia.org/T206597) [09:33:45] (03PS1) 10Muehlenhoff: Cleanup package status during Diamond removal [puppet] - 10https://gerrit.wikimedia.org/r/466217 (https://phabricator.wikimedia.org/T183454) [09:34:41] 10Operations, 10DBA, 10MediaWiki-Database: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10akosiaris) [09:34:57] 10Operations, 10monitoring, 10Patch-For-Review, 10User-fgiunchedi: Deprovision Diamond collectors no longer in use - https://phabricator.wikimedia.org/T183454 (10MoritzMuehlenhoff) There's a bug when removing Diamond via the diamond::remove option: The diamond.service remains in a failed state, as the serv... [09:39:30] (03PS2) 10Muehlenhoff: Cleanup systemd state on Diamond removal [puppet] - 10https://gerrit.wikimedia.org/r/466217 (https://phabricator.wikimedia.org/T183454) [09:43:55] 10Operations, 10DBA, 10MediaWiki-General-or-Unknown, 10Wikidata, and 4 others: Investigate decrease in wikidata dispatch times due to eqiad -> codfw DC switch - https://phabricator.wikimedia.org/T205865 (10hoo) Thanks @Marostegui but this sadly didn't help. Do you have any other ideas what could cause thes... 
[09:44:03] 10Operations, 10Traffic: Separate Traffic layer caches for PHP7/HHVM - https://phabricator.wikimedia.org/T206339 (10ema) p:05Triage>03Normal [09:45:36] 10Operations, 10Domains, 10Traffic: SOA serial numbers returned by authoritative nameservers differ - https://phabricator.wikimedia.org/T206688 (10ema) p:05Triage>03Normal [09:47:08] 10Operations, 10Traffic: certcentral: delay deployment of renewed certs to wait out skewed client clocks - https://phabricator.wikimedia.org/T204997 (10ema) p:05Triage>03Normal [09:47:35] 10Operations, 10Traffic: Puppetise OCSP stapling for all one-off HTTPS servers - https://phabricator.wikimedia.org/T204992 (10ema) p:05Triage>03Normal [09:47:56] 10Operations, 10Traffic: Consider adding Must-Staple header to enforce revocation checking - https://phabricator.wikimedia.org/T204987 (10ema) p:05Triage>03Normal [09:48:49] 10Operations, 10DBA, 10MediaWiki-General-or-Unknown, 10Wikidata, and 4 others: Investigate decrease in wikidata dispatch times due to eqiad -> codfw DC switch - https://phabricator.wikimedia.org/T205865 (10hoo) ```wikiadmin@db1109(wikidatawiki)> SELECT * FROM information_schema.tables WHERE table_name = 'w... 
[09:55:41] (03CR) 10Gehel: [C: 032] wdqs: change user for autodeploy cron task [puppet] - 10https://gerrit.wikimedia.org/r/466209 (https://phabricator.wikimedia.org/T206597) (owner: 10Mathew.onipe) [10:00:01] (03PS1) 10Ladsgroup: Set some small wikis to read new for change tag backend [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466271 (https://phabricator.wikimedia.org/T194164) [10:05:43] (03PS2) 10Muehlenhoff: Remove obsolete mediawiki-firejail-rsvg-convert [puppet] - 10https://gerrit.wikimedia.org/r/465590 [10:06:06] (03PS1) 10Alexandros Kosiaris: First draft of a zotero helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/466287 (https://phabricator.wikimedia.org/T201611) [10:07:57] (03CR) 10Muehlenhoff: [C: 032] Remove obsolete mediawiki-firejail-rsvg-convert [puppet] - 10https://gerrit.wikimedia.org/r/465590 (owner: 10Muehlenhoff) [10:08:17] (03PS4) 10Mathew.onipe: base::monitoring::host: added prometheus check for network drops [puppet] - 10https://gerrit.wikimedia.org/r/465450 (https://phabricator.wikimedia.org/T206114) [10:11:00] (03CR) 10Alexandros Kosiaris: [C: 04-1] "While well intended, there some technicalities that make the two forms not compatible. 
This will probably require some string mangling to " (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/465519 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [10:14:19] (03CR) 10Muehlenhoff: base/icinga: use MONITORING_HOSTS constant as NRPE allowed_hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/465519 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [10:15:54] (03CR) 10Mathew.onipe: base::monitoring::host: added prometheus check for network drops (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/465450 (https://phabricator.wikimedia.org/T206114) (owner: 10Mathew.onipe) [10:22:55] (03PS7) 10Alex Monk: Certcentral-authdns integration [puppet] - 10https://gerrit.wikimedia.org/r/459809 [10:23:42] (03CR) 10jerkins-bot: [V: 04-1] Certcentral-authdns integration [puppet] - 10https://gerrit.wikimedia.org/r/459809 (owner: 10Alex Monk) [10:26:40] (03PS1) 10ArielGlenn: use lbzip2 for recombining page content dumps, if available and configured [dumps] - 10https://gerrit.wikimedia.org/r/466344 (https://phabricator.wikimedia.org/T179059) [10:28:56] (03PS2) 10Volans: Tests: refactor puppetdb tests with parametrize [software/cumin] - 10https://gerrit.wikimedia.org/r/465611 [10:28:58] (03PS2) 10Volans: PuppetDB: fix regex matching [software/cumin] - 10https://gerrit.wikimedia.org/r/465612 [10:31:57] (03CR) 10Mforns: "When I use --deploy-mode with this job, it fails with:" [puppet] - 10https://gerrit.wikimedia.org/r/465692 (https://phabricator.wikimedia.org/T206342) (owner: 10Mforns) [10:32:20] (03CR) 10Mforns: "I meant --deploy-mode cluster" [puppet] - 10https://gerrit.wikimedia.org/r/465692 (https://phabricator.wikimedia.org/T206342) (owner: 10Mforns) [10:33:30] (03CR) 10Volans: "> Patch Set 1:" [software/cumin] - 10https://gerrit.wikimedia.org/r/465611 (owner: 10Volans) [10:55:37] (03PS8) 10Alex Monk: Certcentral-authdns integration [puppet] - 10https://gerrit.wikimedia.org/r/459809 [11:00:04] addshore, hashar, aude, 
MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for European Mid-day SWAT(Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181011T1100). [11:00:04] Amir1: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:23] o/ [11:00:39] I'm around but I guess Amir1 will deploy his changes [11:00:41] zeljkof: I have to deploy a patch, we are under an emergency [11:01:23] marostegui: swat on hold until you give us green light? [11:01:30] thanks should not take long [11:01:55] o/ [11:01:56] Amir1: please wait for marostegui to finish [11:02:05] sure [11:02:55] (03PS1) 10Marostegui: db-eqiad,db-codfw: Depool db2085:3318, db1099:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466431 [11:03:01] banyek: ^ [11:04:49] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 34 probes of 320 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [11:04:53] (03CR) 10Marostegui: [C: 032] db-eqiad,db-codfw: Depool db2085:3318, db1099:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466431 (owner: 10Marostegui) [11:05:35] (03PS9) 10Alex Monk: Certcentral-authdns integration [puppet] - 10https://gerrit.wikimedia.org/r/459809 [11:06:15] (03Merged) 10jenkins-bot: db-eqiad,db-codfw: Depool db2085:3318, db1099:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466431 (owner: 10Marostegui) [11:07:08] !log binlog expiration set to 60 days on db2045 [11:07:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:36] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db2085:3318 and db1099:3318 (duration: 00m 49s) [11:07:37] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:31] zeljkof: I am done - thanks! [11:08:32] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2085:3318 and db1099:3318 (duration: 00m 49s) [11:08:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:41] marostegui: thanks! [11:09:45] Amir1: swat is yours [11:09:49] !log Stop MYSQL on db2088:3318 and db1099:3318 T206743 [11:09:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:52] T206743: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 [11:10:00] on it [11:10:24] !log Stop MYSQL on db2085:3318 and db1099:3318 T206743 [11:10:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:13] (03CR) 10jenkins-bot: db-eqiad,db-codfw: Depool db2085:3318, db1099:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466431 (owner: 10Marostegui) [11:12:26] (03CR) 10Ladsgroup: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466271 (https://phabricator.wikimedia.org/T194164) (owner: 10Ladsgroup) [11:13:47] (03Merged) 10jenkins-bot: Set some small wikis to read new for change tag backend [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466271 (https://phabricator.wikimedia.org/T194164) (owner: 10Ladsgroup) [11:14:23] live on mwdebug1002 [11:16:39] I made a mistake, let me fix it ASAP [11:17:08] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 36 probes of 320 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [11:18:29] (03PS1) 10Ladsgroup: Fix constants [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466478 [11:18:51] (03CR) 10Ladsgroup: [C: 032] Fix constants [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466478 
(owner: 10Ladsgroup) [11:20:27] (03Merged) 10jenkins-bot: Fix constants [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466478 (owner: 10Ladsgroup) [11:23:17] zeljkof: ignore "Notice: Use of undefined constant MIGRATION_READ_NEW" in fatal monitor, it's fixed now [11:23:58] * zeljkof 's hair is on fire ;) [11:25:43] \o [11:25:45] * addshore has one to squeeze into swat if there will be time [11:26:07] it was mwdebug only :D [11:26:19] * addshore hands zeljkof a glass of water for his hair [11:27:13] :D [11:27:22] Amir1: if you have time https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/Wikibase/+/466031/ :D [11:27:24] (03CR) 10jenkins-bot: Set some small wikis to read new for change tag backend [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466271 (https://phabricator.wikimedia.org/T194164) (owner: 10Ladsgroup) [11:27:26] (03CR) 10jenkins-bot: Fix constants [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466478 (owner: 10Ladsgroup) [11:27:28] addshore: it's just you and Amir1 [11:27:32] oh, jenkins said no anyway... [11:27:44] why not :P [11:28:41] *fixed* [11:29:08] logs seems clean, moving forward [11:30:41] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:466271|Set some small wikis to read new for change tag backend (T194164)]] (duration: 00m 50s) [11:30:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:44] T194164: Start reading from change_tag_def in production - https://phabricator.wikimedia.org/T194164 [11:33:33] addshore: it's not merged on master yet [11:33:42] nope, you want to? 
:P [11:33:49] its already running on mwmaint1002 ;) [11:35:04] I have one nitpick for that :P [11:35:07] addshore: ^ [11:35:14] *looks* [11:35:42] Amir1: will do as a followup [11:35:47] and will do it in the other places then too [11:36:21] sounds good [11:38:08] I wait for jenkins and then merge [11:38:13] (The cerry-pick) [11:40:18] PROBLEM - DPKG on ores1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:41:40] Amir1: thanks! [11:42:28] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 33 probes of 320 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [11:44:45] bah, Amir1, just fixed 1 more phpcs issue [11:48:05] addshore: ping me when jenkins is happy :D [11:48:19] Amir1: will do [11:49:49] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 66 probes of 320 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [11:52:37] ^ akosiaris: the removal of ltrace triggered a broken dpkg state for some of the python dbg packages [11:54:28] (03PS1) 10Ema: ATS: define Icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/466483 (https://phabricator.wikimedia.org/T204209) [12:06:04] (03PS1) 10Ema: profile::cache::kafka: fix varnishkafka Icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/466516 [12:06:57] (03PS1) 10Elukey: Release version 0.4.1+git20181010.2fa99eb [debs/prometheus-memcached-exporter] - 10https://gerrit.wikimedia.org/r/466519 [12:07:34] (03CR) 10Elukey: [C: 032] Release version 0.4.1+git20181010.2fa99eb [debs/prometheus-memcached-exporter] - 10https://gerrit.wikimedia.org/r/466519 (owner: 10Elukey) [12:08:25] (03PS2) 10ArielGlenn: use lbzip2 for recombining page content dumps, if available and configured [dumps] - 10https://gerrit.wikimedia.org/r/466344 (https://phabricator.wikimedia.org/T179059) [12:08:32] 
(03CR) 10Elukey: [C: 031] "<3" [puppet] - 10https://gerrit.wikimedia.org/r/466516 (owner: 10Ema) [12:10:35] Amir1: not green yet, jenkins is taking an age [12:10:43] * addshore is stepping out for 15 mins [12:11:05] (03CR) 10Ema: [C: 032] profile::cache::kafka: fix varnishkafka Icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/466516 (owner: 10Ema) [12:12:27] (03PS3) 10ArielGlenn: use lbzip2 for recombining page content dumps, if available and configured [dumps] - 10https://gerrit.wikimedia.org/r/466344 (https://phabricator.wikimedia.org/T179059) [12:12:49] addshore: I close the SWAT, let's do it later [12:12:55] !log EU SWAT is done [12:12:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:55] (03PS1) 10BBlack: move role::authdns::data::nameservers to common hiera [puppet] - 10https://gerrit.wikimedia.org/r/466539 [12:14:22] !log upload prometheus-memcached-exporter_0.4.1+git20181010.2fa99eb-1 to (jessie|stretch)-wikimedia [12:14:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:38] (03CR) 10jerkins-bot: [V: 04-1] move role::authdns::data::nameservers to common hiera [puppet] - 10https://gerrit.wikimedia.org/r/466539 (owner: 10BBlack) [12:14:43] this has been already tested in deployment-prep, contains new metrics --^ [12:15:11] !log upgrade prometheus-memcached-exporter on mc2035 [12:15:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:48] (03PS2) 10Ema: ATS: define Icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/466483 (https://phabricator.wikimedia.org/T204209) [12:20:42] (03PS1) 10Mathew.onipe: wdqs: fixed autodeploy cron logging permission issue [puppet] - 10https://gerrit.wikimedia.org/r/466550 (https://phabricator.wikimedia.org/T206597) [12:25:50] (03PS1) 10ArielGlenn: dumps config settings to use lbzip2 for recombining page content files [puppet] - 10https://gerrit.wikimedia.org/r/466554 (https://phabricator.wikimedia.org/T179059) 
[12:26:27] (03CR) 10jerkins-bot: [V: 04-1] dumps config settings to use lbzip2 for recombining page content files [puppet] - 10https://gerrit.wikimedia.org/r/466554 (https://phabricator.wikimedia.org/T179059) (owner: 10ArielGlenn) [12:27:35] (03PS2) 10BBlack: move role::authdns::data::nameservers to common hiera [puppet] - 10https://gerrit.wikimedia.org/r/466539 [12:27:54] (03CR) 10Ema: [C: 032] ATS: define Icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/466483 (https://phabricator.wikimedia.org/T204209) (owner: 10Ema) [12:28:13] (03CR) 10jerkins-bot: [V: 04-1] move role::authdns::data::nameservers to common hiera [puppet] - 10https://gerrit.wikimedia.org/r/466539 (owner: 10BBlack) [12:29:25] (03CR) 10BBlack: [V: 032 C: 032] "Jenkins is complaining about style there's no reasonable fix for. This verifies as NO-OP on the authdns servers: https://puppet-compiler." [puppet] - 10https://gerrit.wikimedia.org/r/466539 (owner: 10BBlack) [12:29:43] (03PS3) 10BBlack: move role::authdns::data::nameservers to common hiera [puppet] - 10https://gerrit.wikimedia.org/r/466539 [12:29:53] (03CR) 10BBlack: [V: 032 C: 032] move role::authdns::data::nameservers to common hiera [puppet] - 10https://gerrit.wikimedia.org/r/466539 (owner: 10BBlack) [12:32:32] (03PS2) 10ArielGlenn: dumps config settings to use lbzip2 for recombining page content files [puppet] - 10https://gerrit.wikimedia.org/r/466554 (https://phabricator.wikimedia.org/T179059) [12:35:21] (03PS1) 10Mathew.onipe: profile::maps::osm_master: change osmupdater and osmimporter auth method to peer [puppet] - 10https://gerrit.wikimedia.org/r/466574 (https://phabricator.wikimedia.org/T206639) [12:38:40] !log upgrade prometheus-memcached-exporter on mc2* [12:38:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:34] (03CR) 10Gehel: "good start, did you check that all the clients are connecting in a way that is compatible?" 
[puppet] - 10https://gerrit.wikimedia.org/r/466574 (https://phabricator.wikimedia.org/T206639) (owner: 10Mathew.onipe) [12:43:26] !log upgrade prometheus-memcached-exporter on mc1* [12:43:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:19] (03CR) 10Gehel: [C: 04-1] "Minor comment inline. I'm pretty sure that puppet compiler will show you an issue." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/466550 (https://phabricator.wikimedia.org/T206597) (owner: 10Mathew.onipe) [12:46:11] (03CR) 10Gehel: [C: 031] "Readable enough, LGTM" [software/cumin] - 10https://gerrit.wikimedia.org/r/465611 (owner: 10Volans) [12:47:44] (03CR) 10Volans: [C: 032] Tests: refactor puppetdb tests with parametrize [software/cumin] - 10https://gerrit.wikimedia.org/r/465611 (owner: 10Volans) [12:48:39] PROBLEM - HHVM rendering on mw1285 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:49:38] RECOVERY - HHVM rendering on mw1285 is OK: HTTP OK: HTTP/1.1 200 OK - 74827 bytes in 0.127 second response time [12:51:16] (03Merged) 10jenkins-bot: Tests: refactor puppetdb tests with parametrize [software/cumin] - 10https://gerrit.wikimedia.org/r/465611 (owner: 10Volans) [12:52:53] (03CR) 10jenkins-bot: Tests: refactor puppetdb tests with parametrize [software/cumin] - 10https://gerrit.wikimedia.org/r/465611 (owner: 10Volans) [13:01:12] (03CR) 10Imarlier: "Ping @bblack @ema" [puppet] - 10https://gerrit.wikimedia.org/r/465538 (https://phabricator.wikimedia.org/T206496) (owner: 10Imarlier) [13:02:28] (03PS1) 10Anomie: Fix wgActorTableSchemaMigrationStage on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466590 (https://phabricator.wikimedia.org/T206732) [13:03:22] (03CR) 10Anomie: "Deploying Beta Cluster config change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466590 (https://phabricator.wikimedia.org/T206732) (owner: 10Anomie) [13:03:35] (03CR) 10Anomie: [C: 032] Fix wgActorTableSchemaMigrationStage on Beta Cluster 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/466590 (https://phabricator.wikimedia.org/T206732) (owner: 10Anomie) [13:05:20] (03Merged) 10jenkins-bot: Fix wgActorTableSchemaMigrationStage on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466590 (https://phabricator.wikimedia.org/T206732) (owner: 10Anomie) [13:06:20] (03PS1) 10Gehel: relforge: setup 2 instances to validate multi-instance configuration [puppet] - 10https://gerrit.wikimedia.org/r/466591 (https://phabricator.wikimedia.org/T198352) [13:09:21] (03CR) 10jenkins-bot: Fix wgActorTableSchemaMigrationStage on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466590 (https://phabricator.wikimedia.org/T206732) (owner: 10Anomie) [13:11:47] (03PS2) 10Gehel: relforge: setup 2 instances to validate multi-instance configuration [puppet] - 10https://gerrit.wikimedia.org/r/466591 (https://phabricator.wikimedia.org/T198352) [13:12:40] (03PS1) 10Marostegui: db-codfw.php: Depool db2083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466594 (https://phabricator.wikimedia.org/T206743) [13:14:09] banyek ^ [13:14:37] (03CR) 10Banyek: [C: 031] db-codfw.php: Depool db2083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466594 (https://phabricator.wikimedia.org/T206743) (owner: 10Marostegui) [13:15:27] PROBLEM - Device not healthy -SMART- on db2050 is CRITICAL: cluster=mysql device=cciss,6 instance=db2050:9100 job=node site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2050&var-datasource=codfw%2520prometheus%252Fops [13:17:48] banyek: can you just ack that alert? 
^ [13:18:07] don't even create a ticket for it, let's leave the disk fail by itself [13:18:46] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466594 (https://phabricator.wikimedia.org/T206743) (owner: 10Marostegui) [13:19:00] (03Abandoned) 10Mathew.onipe: wdqs: fixed autodeploy cron logging permission issue [puppet] - 10https://gerrit.wikimedia.org/r/466550 (https://phabricator.wikimedia.org/T206597) (owner: 10Mathew.onipe) [13:19:53] (03PS3) 10Gehel: relforge: setup 2 instances to validate multi-instance configuration [puppet] - 10https://gerrit.wikimedia.org/r/466591 (https://phabricator.wikimedia.org/T198352) [13:20:24] !log Stop MySQL on db1116:3318 to reclone it from db2083 - T206743 [13:20:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:28] T206743: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 [13:20:28] moritzm: yeah I know. I am gdbing [13:20:36] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466594 (https://phabricator.wikimedia.org/T206743) (owner: 10Marostegui) [13:20:44] unsuccessfully up to now [13:21:05] (03CR) 10jerkins-bot: [V: 04-1] relforge: setup 2 instances to validate multi-instance configuration [puppet] - 10https://gerrit.wikimedia.org/r/466591 (https://phabricator.wikimedia.org/T198352) (owner: 10Gehel) [13:21:22] seems like mwparserfromhell is not honoring what gdb python scripts are expecting and e.g. 
py-bt doesn't work [13:21:32] (03PS4) 10Gehel: relforge: setup 2 instances to validate multi-instance configuration [puppet] - 10https://gerrit.wikimedia.org/r/466591 (https://phabricator.wikimedia.org/T198352) [13:21:38] ok [13:21:42] that damn backtrace ranges from 220-307 frames [13:21:46] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2083 (duration: 00m 49s) [13:21:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:00] at least give what I 've seen up t now [13:22:27] ACKNOWLEDGEMENT - Device not healthy -SMART- on db2050 is CRITICAL: cluster=mysql device=cciss,6 instance=db2050:9100 job=node site=codfw Banyek ack https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2050&var-datasource=codfw%2520prometheus%252Fops [13:22:45] (03CR) 10jerkins-bot: [V: 04-1] relforge: setup 2 instances to validate multi-instance configuration [puppet] - 10https://gerrit.wikimedia.org/r/466591 (https://phabricator.wikimedia.org/T198352) (owner: 10Gehel) [13:23:44] !log Stop MySQL on db2083 to reclone db1116:3318 - T206743 [13:23:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:48] 10Operations, 10cloud-services-team: Create a jessie netboot image with the 4.9 Linux kernel - https://phabricator.wikimedia.org/T206761 (10MoritzMuehlenhoff) [13:25:01] (03CR) 10jenkins-bot: db-codfw.php: Depool db2083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466594 (https://phabricator.wikimedia.org/T206743) (owner: 10Marostegui) [13:26:33] !log filling in missing rows on dbstore1002 [13:26:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:04] (03PS1) 10Ema: varnish: add vtc test for sitemap rewrites [puppet] - 10https://gerrit.wikimedia.org/r/466602 (https://phabricator.wikimedia.org/T206496) [13:29:58] (03PS1) 10Mathew.onipe: wdqs: fixed autodeploy cron logging permission issue [puppet] - 10https://gerrit.wikimedia.org/r/466603 
(https://phabricator.wikimedia.org/T206597) [13:31:44] (03Abandoned) 10Mathew.onipe: wdqs: fixed autodeploy cron logging permission issue [puppet] - 10https://gerrit.wikimedia.org/r/466603 (https://phabricator.wikimedia.org/T206597) (owner: 10Mathew.onipe) [13:32:18] (03CR) 10Ema: [C: 032] varnish: add vtc test for sitemap rewrites [puppet] - 10https://gerrit.wikimedia.org/r/466602 (https://phabricator.wikimedia.org/T206496) (owner: 10Ema) [13:36:18] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 35 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [13:40:07] PROBLEM - puppet last run on ms-be1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:42:42] (03PS1) 10Mathew.onipe: wdqs: fixed autodeploy cron logging permission issue [puppet] - 10https://gerrit.wikimedia.org/r/466608 (https://phabricator.wikimedia.org/T206597) [13:43:38] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 68 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [13:48:28] (03CR) 10Mathew.onipe: "Puppet compiler:" [puppet] - 10https://gerrit.wikimedia.org/r/466608 (https://phabricator.wikimedia.org/T206597) (owner: 10Mathew.onipe) [13:53:35] (03PS1) 10Mathew.onipe: wdqs: add logging message to autodeploy cron task [puppet] - 10https://gerrit.wikimedia.org/r/466610 (https://phabricator.wikimedia.org/T206597) [13:55:33] !log recovering rows to db1092 [13:55:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:27] RECOVERY - puppet last run on ms-be1014 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [14:07:34] (03CR) 10Giuseppe Lavagetto: [C: 031] cache::upload: Move swift to active/active [puppet] - 
10https://gerrit.wikimedia.org/r/458796 (https://phabricator.wikimedia.org/T203777) (owner: 10Alexandros Kosiaris) [14:08:09] (03CR) 10Giuseppe Lavagetto: [C: 031] "+1 as long as we move to a/a ASAP after the switchback is done." [puppet] - 10https://gerrit.wikimedia.org/r/458797 (https://phabricator.wikimedia.org/T203777) (owner: 10Alexandros Kosiaris) [14:08:58] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 31 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [14:09:22] (03CR) 10Faidon Liambotis: [C: 04-1] smarthost: create mail smarthost role/profile (039 comments) [puppet] - 10https://gerrit.wikimedia.org/r/463143 (https://phabricator.wikimedia.org/T41785) (owner: 10Herron) [14:11:47] PROBLEM - Nginx local proxy to apache on mw1339 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.006 second response time [14:11:57] PROBLEM - HHVM rendering on mw1339 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time [14:12:57] RECOVERY - Nginx local proxy to apache on mw1339 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.069 second response time [14:13:07] RECOVERY - HHVM rendering on mw1339 is OK: HTTP OK: HTTP/1.1 200 OK - 74827 bytes in 0.151 second response time [14:13:22] !log install libxml2 security updates on jessie servers [14:13:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:51] !log applying row filling to (most) eqiad s8 dbs, including the mater [14:15:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:58] !log reboot eventlog1002 for kernel upgrades [14:15:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:08] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 64 probes of 319 (alerts on 35) - 
https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [14:22:27] (03PS1) 10Banyek: mariadb: depool db1087 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466618 [14:23:47] (03CR) 10jerkins-bot: [V: 04-1] mariadb: depool db1087 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466618 (owner: 10Banyek) [14:24:03] 10Operations, 10DBA, 10MediaWiki-Database: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Marostegui) p:05Unbreak!>03High This is no longer "unbreak now" ``` pc1004 Filesystem Type Size Used Avail Use% Mounted on /dev/mapper/pc1004--vg-srv xfs 2.2T... [14:24:42] FYI, T-6m for services switchover [14:25:38] (03PS2) 10Banyek: mariadb: depool db1087 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466618 (https://phabricator.wikimedia.org/T206743) [14:27:31] (03CR) 10Jcrespo: [C: 032] mariadb: depool db1087 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466618 (https://phabricator.wikimedia.org/T206743) (owner: 10Banyek) [14:27:47] (03CR) 10Banyek: [V: 032] mariadb: depool db1087 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466618 (https://phabricator.wikimedia.org/T206743) (owner: 10Banyek) [14:28:09] !log depooling db1087 [14:28:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:17] !log depooling db1087 (T206743) [14:28:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:20] T206743: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 [14:28:39] FYI, T-2m for services switchover [14:29:24] please cease all other operational activity for 10mins [14:30:04] Deploy window Datacenter Switchback - Services (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181011T1430) [14:30:11] starting [14:30:13] !log START - Cookbook 
sre.switchdc.services.00-reduce-ttl-and-sleep (volans@neodymium) [14:30:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:20] akosiaris: only 10 mins? :) [14:30:36] akosiaris: user stealer! [14:30:39] I am being optimistic [14:30:44] !log banyek@deploy1001 Synchronized wmf-config/db-eqiad.php: T206743: mariadb: Depool db1087 (duration: 00m 49s) [14:30:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:51] volans: indeed. I did not think of that [14:31:00] well anyway you get extra credits :P [14:31:00] <_joe_> what are you doing guys? [14:31:11] I'm doing nothing [14:31:18] alex is using my tmux from yesterday :D [14:31:21] with my $USER set [14:31:35] banyek: please pause for a while [14:31:39] <_joe_> akosiaris: you can just run the script I pasted you [14:31:52] better blame volans.jpg [14:31:57] lol [14:32:19] _joe_: I 'll do that after with argument codfw [14:32:28] but for now I am running the cookbook as normal [14:32:40] <_joe_> uhm ok, we need to remove aqs then [14:33:01] and restbase for now [14:33:12] and do it after swift is also switched back or tomorrow [14:33:27] <_joe_> ok, your call :) [14:34:24] something about aqs? :) [14:34:25] this five mins wait for the TTL ... [14:34:43] _joe_: why was aqs in the list in the first place anyway ? [14:35:20] <_joe_> I just didn't remove it, I noticed some minutes ago but thought it wouldn't count [14:35:30] !log END (PASS) - Cookbook sre.switchdc.services.00-reduce-ttl-and-sleep (exit_code=0) (volans@neodymium) [14:35:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:32] <_joe_> akosiaris: still blaming me? 
[14:35:36] <_joe_> :P [14:36:05] !log START - Cookbook sre.switchdc.services.01-switch-dc (volans@neodymium) [14:36:05] !log Switching services parsoid, restbase, restbase-async, mobileapps, apertium, citoid, cxserver, eventstreams, graphoid, mathoid, proton, pdfrender, recommendation-api, zotero, eventbus, ores, wdqs, wdqs-internal: codfw => eqiad (volans@neodymium) [14:36:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:24] !log END (PASS) - Cookbook sre.switchdc.services.01-switch-dc (exit_code=0) (volans@neodymium) [14:36:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:49] ok, checking before restoring TTL [14:37:12] 10Operations, 10Gadgets, 10MediaWiki-Cache, 10Performance-Team (Radar), and 2 others: Mcrouter periodically reports soft TKOs for mc[1,2]035 leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10elukey) Finally build and deployed the new prometheus-memcached-exporter on the mc*... [14:37:15] <_joe_> akosiaris: I think you missed kartotherian [14:37:25] <_joe_> but it's not used, so meh [14:37:31] <_joe_> the discovery endpoint I mean [14:38:03] yeah it's directly connected to from varnish [14:38:08] it was done yesterday [14:38:10] in upload [14:38:15] <_joe_> yup [14:38:47] ok looks good [14:40:57] <_joe_> akosiaris: do you want to repool codfw now or later? 
[14:41:03] <_joe_> everything is ok in etcd [14:41:08] _joe_: later, after swift is done [14:41:19] let's play it by the book [14:41:41] everything looks fine up to now [14:41:47] <_joe_> you can restore the ttl, btw, the cookbook checks by itself [14:41:50] (03CR) 10jenkins-bot: mariadb: depool db1087 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466618 (https://phabricator.wikimedia.org/T206743) (owner: 10Banyek) [14:41:54] I 'll do another round of checks and restore the TTL [14:42:05] <_joe_> rb timeouts in codfw [14:42:07] <_joe_> uhm [14:42:14] yup just noticed them [14:42:19] both eqiad+codfw [14:42:34] still at 1/3 checks [14:42:42] <_joe_> so most probably unrelated [14:42:47] <_joe_> but ofc it happens now [14:43:05] looking [14:43:19] alert is /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [14:43:33] 2/3 for some hosts, but they are fewer now [14:43:45] ah ok [14:43:48] that's mcs [14:44:18] <_joe_> yes [14:44:33] <_joe_> and btw I ran it on a couple servers and it completes successfully now [14:45:01] I 'll give it another minute just in case [14:45:12] yeah we really have to look into that one and why that is happening [14:45:36] tried on some hosts and looks to be back now [14:45:59] yeah it's flapping a bit [14:46:09] well.. 
icinga just now detects it [14:46:18] I am guessing things are fine though [14:46:25] we need something better than icinga soon [14:46:37] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 34 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [14:46:58] ok all alerts gone, I am restoring the TTL [14:47:03] !log START - Cookbook sre.switchdc.services.02-restore-ttl (volans@neodymium) [14:47:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:23] !log END (PASS) - Cookbook sre.switchdc.services.02-restore-ttl (exit_code=0) (volans@neodymium) [14:47:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:51] (03PS1) 10Marostegui: db-eqiad.php: Repool db1099:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466652 (https://phabricator.wikimedia.org/T206743) [14:51:21] T-10 for swift [14:51:41] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1099:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466652 (https://phabricator.wikimedia.org/T206743) (owner: 10Marostegui) [14:52:25] (03PS2) 10Alexandros Kosiaris: cache::upload: Move swift to active/active [puppet] - 10https://gerrit.wikimedia.org/r/458796 (https://phabricator.wikimedia.org/T203777) [14:52:35] !log deploying wikidata row fix to db1087 with replication enabled [14:52:36] akosiaris: was the swiftrepl totally stopped or has to be switched too manually? 
[14:52:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:41] I know is only for reconciliation now [14:52:54] (03PS2) 10Alexandros Kosiaris: cache::upload: Move swift to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/458797 (https://phabricator.wikimedia.org/T203777) [14:53:09] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1099:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466652 (https://phabricator.wikimedia.org/T206743) (owner: 10Marostegui) [14:53:48] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 51 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [14:53:54] volans: IIRC, no [14:54:49] (03PS1) 10Jcrespo: Revert "mariadb: depool db1087" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466653 [14:54:59] (03CR) 10jerkins-bot: [V: 04-1] Revert "mariadb: depool db1087" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466653 (owner: 10Jcrespo) [14:55:00] akosiaris: so we just stopped it, trust on the double write and alarm on the graph if the diff is too high [14:55:20] and use it to reconcile [14:56:56] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1099:3318 (duration: 00m 48s) [14:56:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:01] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1099:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466652 (https://phabricator.wikimedia.org/T206743) (owner: 10Marostegui) [14:59:05] T-2m for swift [15:00:04] Deploy window Datacenter Switchback - Media storage/Swift (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181011T1500) [15:00:09] starting swift switchback [15:00:38] (03CR) 10Alexandros Kosiaris: [C: 032] cache::upload: Move swift to active/active [puppet] - 10https://gerrit.wikimedia.org/r/458796 
(https://phabricator.wikimedia.org/T203777) (owner: 10Alexandros Kosiaris) [15:00:47] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 98 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [15:01:10] !log Media storage/Swift Swift set to active/active [15:01:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:27] (03PS1) 10Marostegui: db-eqiad.php: Depool db1101:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466658 (https://phabricator.wikimedia.org/T206743) [15:02:58] puppet runs done [15:03:11] proceeding with moving swift to eqiad (and then undo it later on :)) [15:03:17] (03CR) 10Alexandros Kosiaris: [C: 032] cache::upload: Move swift to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/458797 (https://phabricator.wikimedia.org/T203777) (owner: 10Alexandros Kosiaris) [15:03:35] akosiaris: export USER=akosiaris :-P [15:03:41] sorry [15:03:44] SUDO_USER [15:03:44] SUDO_USER :P [15:04:30] !log Media storage/Swift Swift set to active/passive [15:04:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:45] akosiaris: I will wait for your green light to deploy that db-eqiad.php to depool a DB [15:05:27] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 42 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [15:06:48] puppet done [15:07:07] graphs look normal [15:07:34] no alerts [15:07:57] marostegui: I think you can proceed [15:08:09] ta! 
[15:08:11] everything looks fine [15:08:12] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1101:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466658 (https://phabricator.wikimedia.org/T206743) (owner: 10Marostegui) [15:09:55] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1101:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466658 (https://phabricator.wikimedia.org/T206743) (owner: 10Marostegui) [15:11:09] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1101:3318 (duration: 00m 49s) [15:11:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:15] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1101:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466658 (https://phabricator.wikimedia.org/T206743) (owner: 10Marostegui) [15:12:19] 10Operations, 10Goal, 10Patch-For-Review: Migrate the hardware inventory from Racktables to Netbox - https://phabricator.wikimedia.org/T199083 (10Volans) 05Open>03Resolved [15:12:38] 10Operations, 10Tracking: Hardware Automation Workflow - Overall Tracking - https://phabricator.wikimedia.org/T116063 (10Volans) [15:12:45] 10Operations, 10netops, 10Patch-For-Review: Evaluate NetBox as a Racktables replacement & IPAM - https://phabricator.wikimedia.org/T170144 (10Volans) 05Open>03Resolved [15:12:50] !log Stop MySQL on db2085:3318 to reclone db1101:3318 - T206743 [15:12:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:54] T206743: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 [15:13:26] (03CR) 10Imarlier: [C: 031] Enable base::service_auto_restart for uwsgi-coal [puppet] - 10https://gerrit.wikimedia.org/r/465593 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [15:14:53] 10Operations, 10Release Pipeline (Blubber), 10Release-Engineering-Team (Watching / External): Update Debian package of 
Blubber (0.6.0-1) - https://phabricator.wikimedia.org/T206766 (10thcipriani) [15:14:59] what do folks think about the increased thumbor host latency? [15:15:17] <_joe_> apergos: can you explain better? [15:15:24] https://grafana.wikimedia.org/dashboard/db/thumbor?orgId=1&from=now-3h&to=now [15:16:02] <_joe_> apergos: we're back to reading on swift eqiad [15:16:14] <_joe_> which is "colder" I'd expect [15:17:06] I had some vague notion that, like file uploads, thumbs are synced; maybe that's wrong [15:17:19] <_joe_> also looks like codfw performs better [15:17:33] <_joe_> see https://grafana.wikimedia.org/dashboard/db/thumbor?orgId=1&from=1535375521844&to=1539270996453&panelId=5&fullscreen [15:17:53] <_joe_> after the switchover, perf in codfw is apparently much better [15:18:08] <_joe_> not sure what swift url is used by thumbor, though [15:18:21] that's interesting [15:18:47] ACKNOWLEDGEMENT - DPKG on ores1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages alexandros kosiaris gdbing celery to figure out why some workers are busy looping. [15:19:47] left to do is still restbase, is that right? anything else? [15:19:54] _joe_: en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 responds with malformed body (AttributeError: 'NoneType' object has no attribute 'get'): [15:20:00] that looks like a bug in service-checker [15:20:18] probably needs an if blah: somewhere [15:20:38] apergos: what's left to do is repool all the active/active services [15:20:47] <_joe_> akosiaris: no, it looks like we expected a json response and we got back an empty body [15:20:50] ah indeed [15:20:53] hm that means the body does not match the expected response [15:21:11] which host is that on? 
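The "if blah:" guard suggested above could look roughly like the sketch below. It is not taken from service-checker: the function name `body_problems`, the `expected` mapping, and the sample keys are all hypothetical, and it assumes only that the decoded response body may be None when the service returns an empty body.

```python
from typing import Any, Dict, List, Optional


def body_problems(body: Optional[Any], expected: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in a decoded JSON body (hypothetical helper).

    An empty list means the body matched the expected top-level keys and values.
    """
    # Report an empty or undecodable body explicitly, instead of letting
    # body.get(...) raise AttributeError on None.
    if body is None:
        return ["empty or non-JSON response body"]
    if not isinstance(body, dict):
        return ["expected a JSON object, got " + type(body).__name__]
    problems = []
    for key, wanted in expected.items():
        if key not in body:
            problems.append("missing key: " + key)
        elif body[key] != wanted:
            problems.append("unexpected value for key: " + key)
    return problems


# Hypothetical inputs: an empty body, then a body missing an expected key.
print(body_problems(None, {"title": "x"}))
print(body_problems({"other": 1}, {"title": "x"}))
```

With a guard of this shape, an empty body becomes a readable alert message rather than a Python traceback in the check output.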
[15:21:16] <_joe_> mobrovac: no, that the body is empty, typically [15:21:18] yeah but doing a get on NoneType is wrong [15:21:18] <_joe_> all of them [15:21:22] <_joe_> no sorry [15:21:29] <_joe_> on other hosts you get 504 [15:21:34] check if it's None and possibly alert on that [15:21:46] multiple restbase hosts [15:21:56] restbase1013, restbase1015, restbase2007 [15:22:00] <_joe_> it's now just a couple [15:22:02] huh k [15:22:11] <_joe_> but more of them were responding 504 to that same test [15:23:10] <_joe_> it's all gone now [15:23:23] _joe_: looking at the output, there is a return body, but apparently a part of the expected hash obj is missing so we get that obscure error [15:23:43] heh indeed, it's healthy now [15:23:46] <_joe_> mobrovac: ok can you write a bug? I might have to look at service-checker tomorrow [15:23:53] <_joe_> anyways [15:23:59] btw _joe_, 504's and python errors were for different checks [15:26:52] (03PS2) 10Jcrespo: Revert "mariadb: depool db1087" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466653 [15:28:36] there was a brief crit for thumbor.svc.eqiad but it went away before I could read it [15:29:53] (03PS3) 10Jcrespo: Revert "mariadb: depool db1087" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466653 [15:30:47] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 18 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [15:31:17] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 17 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [15:32:24] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: depool db1087" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466653 (owner: 10Jcrespo) [15:32:29] (03CR) 10Cwhite: "The reasoning appears sound to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/444230 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [15:33:21] 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic, and 2 others: Add Accept header to webrequest logs - https://phabricator.wikimedia.org/T170606 (10Ottomata) [15:33:49] (03Merged) 10jenkins-bot: Revert "mariadb: depool db1087" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466653 (owner: 10Jcrespo) [15:34:12] (03CR) 10Cwhite: [C: 031] Cleanup systemd state on Diamond removal [puppet] - 10https://gerrit.wikimedia.org/r/466217 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [15:38:34] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1087 (duration: 00m 50s) [15:38:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:44] 10Operations, 10Goal, 10Patch-For-Review: Perform a datacenter switchover (2018-19 Q1) - https://phabricator.wikimedia.org/T199073 (10akosiaris) [15:38:47] 10Operations, 10Goal, 10Patch-For-Review: Successfully switch backend traffic (MediaWiki, Swift, RESTBase, Parsoid and services) to be served from eqiad - https://phabricator.wikimedia.org/T203777 (10akosiaris) 05Open>03Resolved a:03akosiaris Mediawiki and traffic were successfully switched yesterday,... [15:39:08] 10Operations, 10DBA, 10MediaWiki-Database, 10Datacenter-Switchover-2018: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10jcrespo) Forgetting the codfw -> eqiad replication was the most likely cause of overload on the application servers (and on External storage hosts). 
[15:42:33] (03CR) 10jenkins-bot: Revert "mariadb: depool db1087" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466653 (owner: 10Jcrespo) [15:54:17] (03PS3) 10Cwhite: hiera: enable ntp collector on role::recursor [puppet] - 10https://gerrit.wikimedia.org/r/464905 (https://phabricator.wikimedia.org/T183454) [16:00:04] godog and _joe_: (Dis)respected human, time to deploy Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181011T1600). Please do the needful. [16:00:04] No GERRIT patches in the queue for this window AFAICS. [16:01:03] (03CR) 10Volans: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/444230 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [16:04:45] (03PS2) 10Gehel: wdqs: add logging message to autodeploy cron task [puppet] - 10https://gerrit.wikimedia.org/r/466610 (https://phabricator.wikimedia.org/T206597) (owner: 10Mathew.onipe) [16:06:24] (03PS2) 10Gehel: wdqs: fixed autodeploy cron logging permission issue [puppet] - 10https://gerrit.wikimedia.org/r/466608 (https://phabricator.wikimedia.org/T206597) (owner: 10Mathew.onipe) [16:06:58] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 33 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [16:08:49] (03CR) 10Gehel: [C: 032] wdqs: fixed autodeploy cron logging permission issue [puppet] - 10https://gerrit.wikimedia.org/r/466608 (https://phabricator.wikimedia.org/T206597) (owner: 10Mathew.onipe) [16:10:37] 10Operations, 10Goal, 10Patch-For-Review: Perform a datacenter switchover (2018-19 Q1) - https://phabricator.wikimedia.org/T199073 (10akosiaris) [16:10:38] PROBLEM - Nginx local proxy to apache on mw1348 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.006 second response time [16:10:38] PROBLEM - Apache HTTP on mw1348 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service 
Unavailable - 1308 bytes in 0.001 second response time [16:10:48] (03PS3) 10Gehel: wdqs: add logging message to autodeploy cron task [puppet] - 10https://gerrit.wikimedia.org/r/466610 (https://phabricator.wikimedia.org/T206597) (owner: 10Mathew.onipe) [16:11:05] (03CR) 10EBernhardson: "Overall looks good, not clear on why it was necessary to have some tlsproxy's not be a default_server" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/466591 (https://phabricator.wikimedia.org/T198352) (owner: 10Gehel) [16:11:15] (03CR) 10Gehel: [C: 032] wdqs: add logging message to autodeploy cron task [puppet] - 10https://gerrit.wikimedia.org/r/466610 (https://phabricator.wikimedia.org/T206597) (owner: 10Mathew.onipe) [16:11:23] (03CR) 10Cwhite: [C: 032] hiera: enable ntp collector on role::recursor [puppet] - 10https://gerrit.wikimedia.org/r/464905 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [16:11:25] (03PS4) 10Cwhite: hiera: enable ntp collector on role::recursor [puppet] - 10https://gerrit.wikimedia.org/r/464905 (https://phabricator.wikimedia.org/T183454) [16:11:38] RECOVERY - Nginx local proxy to apache on mw1348 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.029 second response time [16:11:42] 10Operations, 10Goal, 10Patch-For-Review: Perform a datacenter switchover (2018-19 Q1) - https://phabricator.wikimedia.org/T199073 (10akosiaris) 05Open>03Resolved a:03akosiaris Successfully switched (with some aftermath and actionables but successfully nevertheless) to codfw and back per the subtasks,... [16:11:47] RECOVERY - Apache HTTP on mw1348 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.031 second response time [16:13:18] (03CR) 10Gehel: "Note that this is still WIP and non working yet at all." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/466591 (https://phabricator.wikimedia.org/T198352) (owner: 10Gehel) [16:13:58] PROBLEM - Check systemd state on dns1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:14:09] ^ looking [16:14:17] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 63 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [16:15:04] oh hmm [16:15:08] cwhite: ?? dns1001 [16:15:36] wrong IRC name! [16:16:27] PROBLEM - High lag on wdqs1003 is CRITICAL: 3632 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:16:28] shdubsh: what's up on dns1001? [16:16:55] bblack: must be something with prometheus-node-exporter. having a look [16:17:28] PROBLEM - Check systemd state on dns4001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[16:18:02] ^^ dns4001 as well [16:18:46] (03PS1) 10Cwhite: Revert "hiera: enable ntp collector on role::recursor" [puppet] - 10https://gerrit.wikimedia.org/r/466686 [16:18:57] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test [16:18:57] read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 504 (expecting: 200) [16:19:06] yeah it seems like the ntp collector is faulty somehow [16:19:13] (03PS2) 10Cwhite: Revert "hiera: enable ntp collector on role::recursor" [puppet] - 10https://gerrit.wikimedia.org/r/466686 [16:19:21] shouldn't break dns/ntp though [16:19:24] (03CR) 10Cwhite: [V: 032 C: 032] Revert "hiera: enable ntp collector on role::recursor" [puppet] - 10https://gerrit.wikimedia.org/r/466686 (owner: 10Cwhite) [16:19:46] right [16:19:58] PROBLEM - Check systemd state on dns5001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:19:58] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [16:20:22] reverting now [16:20:28] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on einsteinium is CRITICAL: 48.27 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [16:20:57] PROBLEM - Check systemd state on dns2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:21:07] PROBLEM - Check systemd state on dns5002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[16:21:28] RECOVERY - Check systemd state on dns1001 is OK: OK - running: The system is fully operational [16:21:57] RECOVERY - Check systemd state on dns4001 is OK: OK - running: The system is fully operational [16:23:08] RECOVERY - Check systemd state on dns2001 is OK: OK - running: The system is fully operational [16:23:18] RECOVERY - Check systemd state on dns5002 is OK: OK - running: The system is fully operational [16:24:23] (03PS1) 10Gehel: logrotate: add type contraints to parameters [puppet] - 10https://gerrit.wikimedia.org/r/466687 [16:24:48] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on einsteinium is OK: (C)60 le (W)70 le 79.72 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [16:25:56] 10Operations, 10Domains, 10Traffic: SOA serial numbers returned by authoritative nameservers differ - https://phabricator.wikimedia.org/T206688 (10Dzahn) Thank you for the detailed explanation. I will get back to Legal and MarkMonitor about it. 
[16:26:03] 10Operations, 10Domains, 10Traffic: SOA serial numbers returned by authoritative nameservers differ - https://phabricator.wikimedia.org/T206688 (10Dzahn) a:03Dzahn [16:26:17] RECOVERY - High lag on wdqs1003 is OK: (C)3600 ge (W)1200 ge 441 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:26:38] RECOVERY - Check systemd state on dns5001 is OK: OK - running: The system is fully operational [16:27:48] (03CR) 10jerkins-bot: [V: 04-1] logrotate: add type contraints to parameters [puppet] - 10https://gerrit.wikimedia.org/r/466687 (owner: 10Gehel) [16:28:37] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 109 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [16:29:17] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 122 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [16:29:26] (03PS51) 10Alex Monk: Central certificates service [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) [16:29:45] (03PS2) 10Gehel: logrotate: add type contraints to parameters [puppet] - 10https://gerrit.wikimedia.org/r/466687 [16:30:34] (03CR) 10Gehel: "puppet compiler on a few nodes agrees this is a NOOP: https://puppet-compiler.wmflabs.org/compiler1002/12874/" [puppet] - 10https://gerrit.wikimedia.org/r/466687 (owner: 10Gehel) [16:30:36] (03PS10) 10Alex Monk: Certcentral-authdns integration [puppet] - 10https://gerrit.wikimedia.org/r/459809 [16:31:14] (03CR) 10jerkins-bot: [V: 04-1] Certcentral-authdns integration [puppet] - 10https://gerrit.wikimedia.org/r/459809 (owner: 10Alex Monk) [16:32:32] (03CR) 10Herron: [C: 04-1] "Seems safe on paper, but I'm -1 because it carries risk of ssh and cumin lockout across many systems 
at a time leaving only serial access " [puppet] - 10https://gerrit.wikimedia.org/r/444230 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [16:32:44] (03CR) 10jerkins-bot: [V: 04-1] logrotate: add type contraints to parameters [puppet] - 10https://gerrit.wikimedia.org/r/466687 (owner: 10Gehel) [16:32:47] PROBLEM - High lag on wdqs1003 is CRITICAL: 4397 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:33:38] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 20 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [16:34:18] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 18 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [16:38:48] (03PS52) 10Alex Monk: Central certificates service [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) [16:38:50] (03PS11) 10Alex Monk: Certcentral-authdns integration [puppet] - 10https://gerrit.wikimedia.org/r/459809 [16:38:52] (03CR) 10Mathew.onipe: [C: 031] logrotate: add type contraints to parameters [puppet] - 10https://gerrit.wikimedia.org/r/466687 (owner: 10Gehel) [16:39:38] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 32 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [16:44:01] (03PS3) 10Gehel: logrotate: add type contraints to parameters [puppet] - 10https://gerrit.wikimedia.org/r/466687 [16:44:52] (03PS53) 10Vgutierrez: Central certificates service [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) (owner: 10Alex Monk) [16:46:35] (03CR) 10jerkins-bot: [V: 04-1] Central certificates service 
[puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) (owner: 10Alex Monk) [16:46:57] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 70 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [16:47:07] XioNoX, bblack ^^ [16:47:08] what's that? [16:47:32] there's a bunch of them hitting the threshold [16:47:57] RECOVERY - High lag on wdqs1003 is OK: (C)3600 ge (W)1200 ge 515 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:48:04] good question! [16:48:10] yeah, and I increased the threshold in https://phabricator.wikimedia.org/T205829 [16:48:43] either our ipv6 is awful, or the internet's is, or ripe's ipv6 probe set is awful, one of the three? [16:48:46] I had a quick look and can't find anything related to our network, or any common middleman [16:48:49] hopefully not the first option [16:49:09] (03PS54) 10Vgutierrez: Central certificates service [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) (owner: 10Alex Monk) [16:49:26] maybe next step is to email the ripe atlas team? [16:49:33] (03PS1) 10Valerie: Add copyviobot group management to relevant wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466692 (https://phabricator.wikimedia.org/T206731) [16:49:48] have you checked the failing probes themselves? [16:50:01] maybe we can find out why for some subset of cases and see a pattern? 
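One way to look for a pattern among the failing probes, as asked above, is to tally them by a shared attribute. The sketch below is illustrative only: the record fields (`asn`, `country`), the helper name `common_attributes`, and the sample data are assumptions, not output from the RIPE Atlas API, and the ASNs come from the range reserved for documentation.

```python
from collections import Counter


def common_attributes(failing_probes, field, top=3):
    """Count how often each value of `field` occurs among failing probes."""
    counts = Counter(probe.get(field, "unknown") for probe in failing_probes)
    return counts.most_common(top)


# Hypothetical sample of failing probes (made-up ids, documentation ASNs).
sample = [
    {"id": 1, "asn": 64500, "country": "DE"},
    {"id": 2, "asn": 64500, "country": "NL"},
    {"id": 3, "asn": 64501, "country": "DE"},
    {"id": 4, "asn": 64500, "country": "DE"},
]

print(common_attributes(sample, "asn"))
print(common_attributes(sample, "country", top=1))
```

If one network or country dominates the tally, that points at a common upstream rather than at the monitored sites.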
[16:50:29] bblack: that's what I did for the 20ish failing on that comment https://phabricator.wikimedia.org/T205829#4652785 [16:51:29] (03PS1) 10Gehel: rsyslog: replace deprecated validate_numeric() with type contraints [puppet] - 10https://gerrit.wikimedia.org/r/466693 [16:52:13] I have a list of 53 now, from the most recent of https://atlas.ripe.net/measurements/1790947/#!probes [16:53:17] PROBLEM - High lag on wdqs1003 is CRITICAL: 5349 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:53:37] (03PS1) 10Kaldari: Add copyviobot group management to relevant wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466694 (https://phabricator.wikimedia.org/T206731) [16:54:17] !log depooling wdqs1003 to let it catch up on lag [16:54:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:39] (03Abandoned) 10Kaldari: Add copyviobot group management to relevant wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466692 (https://phabricator.wikimedia.org/T206731) (owner: 10Valerie) [16:57:17] also they are anchors to anchors probes, so they should be more reliable than probes [16:57:28] RECOVERY - High lag on wdqs1003 is OK: (C)3600 ge (W)1200 ge 178 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:58:36] (03PS55) 10Alex Monk: Central certificates service [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) [16:59:56] here is the new traceroute measurement https://atlas.ripe.net/measurements/1790947/#!probes [17:00:16] er, https://atlas.ripe.net/measurements/16459053/#!general [17:00:43] (03PS1) 10Cwhite: prometheus: add collector.ntp.server option and enable on recursor nodes [puppet] - 10https://gerrit.wikimedia.org/r/466696 (https://phabricator.wikimedia.org/T183454) [17:00:45] (03CR) 10Kaldari: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466694 
(https://phabricator.wikimedia.org/T206731) (owner: 10Kaldari) [17:04:28] taking a random example of the "unreachable" there [17:04:31] https://atlas.ripe.net/probes/6028/#!tab-builtins [17:04:44] ^ this one seems healthy on ipv4, but can't reach most (all?) of the DNS root servers over ipv6 [17:04:48] doesn't seem like our problem [17:05:08] PROBLEM - High lag on wdqs1003 is CRITICAL: 4427 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [17:05:16] I've only checked one or two of them though [17:05:55] but this is another one, where they mostly succeed on ipv6: https://atlas.ripe.net/probes/6085/#!tab-builtins [17:05:58] (but fail with us) [17:07:04] this one is unreachable to us but succeeds with all the built-ins: https://atlas.ripe.net/probes/6116/#!tab-builtins [17:07:33] bblack: the 6085, I can reach it from bast1002, so it will probably recover on the next run [17:08:09] 10Operations, 10Analytics, 10Analytics-Cluster, 10User-Elukey: Manage Hue via systemd unit - https://phabricator.wikimedia.org/T206484 (10fdans) p:05Triage>03Normal [17:08:18] PROBLEM - Blazegraph Port on wdqs1009 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused [17:08:24] !log automated binlog purging started on pc2004, pc2005, pc2006 [17:08:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:28] 6116 works too from us, but goes through some congested HE network [17:09:17] * gehel is looking at blazegraph on wdqs1009 (test server, not critical), cc onimisionipe [17:09:27] RECOVERY - Blazegraph Port on wdqs1009 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 [17:09:53] (03PS3) 10Cwhite: nutcracker: ensure absent nutcracker.py [puppet] - 10https://gerrit.wikimedia.org/r/464917 (https://phabricator.wikimedia.org/T183454) [17:10:23] onimisionipe: looks like autodeploy worked just fine, but took slightly longer than expected to restart and was detected by icinga 
[17:12:21] moritzm: about nutcracker.py - it seems that we are still using it, maybe worth keeping it for a bit? I don't remember if we have a prometheus exporter for nutcracker (probably not) [17:12:32] 10Operations, 10netops: Configure v6 OOB for ulsfo - https://phabricator.wikimedia.org/T206778 (10ayounsi) p:05Triage>03Low [17:12:49] Cc: shdubsh --^ [17:13:18] PROBLEM - WDQS HTTP Port on wdqs1009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.000 second response time [17:13:22] 10Operations, 10ops-eqiad, 10Datacenter-Switchover-2018, 10Patch-For-Review: rack/setup/install mwmaint1002.eqiad.wmnet - https://phabricator.wikimedia.org/T201343 (10akosiaris) @Dzahn anything left here ? [17:14:27] RECOVERY - WDQS HTTP Port on wdqs1009 is OK: HTTP OK: HTTP/1.1 200 OK - 434 bytes in 0.047 second response time [17:14:31] elukey: puppet has one here: modules/prometheus/manifests/nutcracker_exporter.pp is it not in use? [17:14:36] 10Operations, 10netops: Configure v6 OOB for ulsfo - https://phabricator.wikimedia.org/T206778 (10RobH) Your case # 00532252: Wikimedia Foundation, Inc._Existing Customer_San Francisco has been updated with the following: "IPv6 Network Information: Network: 2607:fb58:9000:7::/64 Gateway: 2607:fb58:900... [17:15:37] (03PS56) 10Alex Monk: Central certificates service [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) [17:16:34] yep! 
https://grafana.wikimedia.org/dashboard/db/nutcracker [17:16:37] nevermind then [17:16:40] just wanted to make sure [17:17:06] thanks for double-checking :) [17:17:19] (03CR) 10Cwhite: [C: 032] nutcracker: ensure absent nutcracker.py [puppet] - 10https://gerrit.wikimedia.org/r/464917 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [17:18:02] (03PS1) 10Gehel: base::service_unit: add type constraints on parameters [puppet] - 10https://gerrit.wikimedia.org/r/466697 [17:20:49] (03PS1) 10Cwhite: nutcracker: remove diamond collector resource [puppet] - 10https://gerrit.wikimedia.org/r/466698 (https://phabricator.wikimedia.org/T183454) [17:21:17] PROBLEM - Blazegraph Port on wdqs1009 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused [17:22:20] (03PS2) 10Cwhite: nutcracker: remove diamond collector resource [puppet] - 10https://gerrit.wikimedia.org/r/466698 (https://phabricator.wikimedia.org/T183454) [17:22:58] ACKNOWLEDGEMENT - Blazegraph Port on wdqs1009 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused Mathew.onipe wdqs-autodeployment causing this - T197187 [17:23:29] RECOVERY - Blazegraph Port on wdqs1009 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 [17:26:18] PROBLEM - WDQS HTTP Port on wdqs1009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time [17:26:20] (03PS1) 10Gehel: wdqs: run autodeploy on the hour, 4 times per day [puppet] - 10https://gerrit.wikimedia.org/r/466700 (https://phabricator.wikimedia.org/T206597) [17:27:27] RECOVERY - WDQS HTTP Port on wdqs1009 is OK: HTTP OK: HTTP/1.1 200 OK - 434 bytes in 0.058 second response time [17:28:24] fix coming up for wdqs1009, again, test server not critical [17:28:40] (03CR) 10Mathew.onipe: [C: 031] wdqs: run autodeploy on the hour, 4 times per day [puppet] - 10https://gerrit.wikimedia.org/r/466700 (https://phabricator.wikimedia.org/T206597) (owner: 10Gehel) [17:29:25] (03CR) 10Gehel: [C: 
032] wdqs: run autodeploy on the hour, 4 times per day [puppet] - 10https://gerrit.wikimedia.org/r/466700 (https://phabricator.wikimedia.org/T206597) (owner: 10Gehel) [17:29:31] (03CR) 10Herron: [C: 04-1] "TIL what mjolnir is! For starters we will want some ferm rules to permit connections from the prometheus servers to the mjolnir metrics l" [puppet] - 10https://gerrit.wikimedia.org/r/454644 (owner: 10EBernhardson) [17:31:09] RECOVERY - High lag on wdqs1003 is OK: (C)3600 ge (W)1200 ge 1188 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [17:36:57] !log repooling wdqs1003, catched up on lag [17:36:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:37] (03CR) 10Volans: [C: 031] "LGTM, thanks for this. Just double verify with the compiler on a couple of hosts." [puppet] - 10https://gerrit.wikimedia.org/r/466697 (owner: 10Gehel) [17:38:49] (03CR) 10Volans: [C: 031] "LGTM, thanks for this. Just double verify with the compiler on a couple of hosts." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/466693 (owner: 10Gehel) [17:41:20] (03CR) 10Volans: [C: 031] "LGTM, thanks for this. Just double verify with the compiler on a couple of hosts." 
[puppet] - 10https://gerrit.wikimedia.org/r/466687 (owner: 10Gehel) [17:43:48] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on einsteinium is CRITICAL: 56.96 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:44:48] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) is CRITICAL: Test Retrieve all events for Jan 15 returned the unexpected status 504 (expecting: 200): /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is WARNING: Test Retrieve aggregated feed content for April 29, 2016 [17:44:48] pected value at path = Missing keys: [umostread] [17:45:57] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [17:48:07] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on einsteinium is OK: (C)60 le (W)70 le 75.74 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [18:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Time to snap out of that daydream and deploy Morning SWAT (Max 6 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181011T1800). [18:00:04] stephanebisson: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[18:00:20] hello [18:00:38] I'll run the SWAT [18:01:22] (03CR) 10Sbisson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466694 (https://phabricator.wikimedia.org/T206731) (owner: 10Kaldari) [18:02:47] (03Merged) 10jenkins-bot: Add copyviobot group management to relevant wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466694 (https://phabricator.wikimedia.org/T206731) (owner: 10Kaldari) [18:08:06] (03PS1) 10Marostegui: db-eqiad,db-codfw.php: Repool db1101:3318, db2085:3318, db2083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466710 (https://phabricator.wikimedia.org/T206743) [18:09:11] stephanebisson: I need to deploy ^ for a high priority ticket, can I sneak in and deploy? [18:09:22] !log sbisson@deploy1001 Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:466694|Add copyviobot group management to relevant wikis]] (duration: 00m 49s) [18:09:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:37] marostegui: yep, go ahead I'll continue after [18:09:50] stephanebisson: thank you, it should take a minute [18:09:53] (03CR) 10Marostegui: [C: 032] db-eqiad,db-codfw.php: Repool db1101:3318, db2085:3318, db2083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466710 (https://phabricator.wikimedia.org/T206743) (owner: 10Marostegui) [18:12:09] (03Merged) 10jenkins-bot: db-eqiad,db-codfw.php: Repool db1101:3318, db2085:3318, db2083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466710 (https://phabricator.wikimedia.org/T206743) (owner: 10Marostegui) [18:12:59] (03PS2) 10Sbisson: Remove config for RCFilters variables being removed from Core [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463319 (https://phabricator.wikimedia.org/T196033) [18:13:12] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1101:3318 (duration: 00m 49s) [18:13:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:41] 10Operations, 10Wikidata, 
10Wikidata-Query-Service, 10cloud-services-team: Provide a way to have test servers on real hardware, isolated from production - https://phabricator.wikimedia.org/T206636 (10Andrew) Hello @Gehel! We're unlikely to support bare metal on Labs in the near future, largely because our... [18:14:13] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2083 and db2085:3318 (duration: 00m 48s) [18:14:13] (03CR) 10Catrope: [C: 031] Remove config for RCFilters variables being removed from Core [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463319 (https://phabricator.wikimedia.org/T196033) (owner: 10Sbisson) [18:14:14] stephanebisson: I am all done, thanks a lot! [18:14:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:49] marostegui: no worries, resuming SWAT [18:15:27] (03CR) 10Sbisson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463319 (https://phabricator.wikimedia.org/T196033) (owner: 10Sbisson) [18:15:42] 10Operations, 10DBA, 10MediaWiki-Database, 10Datacenter-Switchover-2018: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10akosiaris) >>! In T206740#4658660, @jcrespo wrote: > Forgetting the codfw -> eqiad replication was the most likely cause of overload on the applicatio... 
[18:16:14] (03CR) 10jenkins-bot: Add copyviobot group management to relevant wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466694 (https://phabricator.wikimedia.org/T206731) (owner: 10Kaldari) [18:16:16] (03CR) 10jenkins-bot: db-eqiad,db-codfw.php: Repool db1101:3318, db2085:3318, db2083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466710 (https://phabricator.wikimedia.org/T206743) (owner: 10Marostegui) [18:17:07] (03CR) 10Herron: [C: 031] "The check/retry intervals and prometheus query time frame seem a big long (in particular a bit of long time to wait for recovery) but from" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/465450 (https://phabricator.wikimedia.org/T206114) (owner: 10Mathew.onipe) [18:17:14] 10Operations, 10DBA, 10MediaWiki-Database, 10Datacenter-Switchover-2018: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Marostegui) >>! In T206740#4659202, @akosiaris wrote: >>>! In T206740#4658660, @jcrespo wrote: >> Forgetting the codfw -> eqiad replication was the mo... [18:18:16] (03Merged) 10jenkins-bot: Remove config for RCFilters variables being removed from Core [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463319 (https://phabricator.wikimedia.org/T196033) (owner: 10Sbisson) [18:18:37] 10Operations, 10ops-eqiad, 10Datacenter-Switchover-2018, 10Patch-For-Review: rack/setup/install mwmaint1002.eqiad.wmnet - https://phabricator.wikimedia.org/T201343 (10Dzahn) Yes. mariadb: remove mwmaint1001 from prod-m5 SQL grants - https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/465645/ network::c... [18:21:19] 10Operations, 10ops-eqiad, 10Datacenter-Switchover-2018, 10Patch-For-Review: rack/setup/install mwmaint1002.eqiad.wmnet - https://phabricator.wikimedia.org/T201343 (10Marostegui) >>! In T201343#4659214, @Dzahn wrote: > Yes. > > mariadb: remove mwmaint1001 from prod-m5 SQL grants - https://gerrit.wikimedia... 
[18:22:08] !log sbisson@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:463319|Remove config for RCFilters variables being removed from Core]] (duration: 00m 49s) [18:22:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:39] (03CR) 10jenkins-bot: Remove config for RCFilters variables being removed from Core [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463319 (https://phabricator.wikimedia.org/T196033) (owner: 10Sbisson) [18:32:24] 10Operations, 10ops-eqiad, 10Datacenter-Switchover-2018, 10Patch-For-Review: rack/setup/install mwmaint1002.eqiad.wmnet - https://phabricator.wikimedia.org/T201343 (10Dzahn) There is no urgency at all and it wasn't expected. I only listed what is left to be done. Please dont worry about this at all, especi... [18:34:49] 10Operations, 10ops-eqiad, 10Datacenter-Switchover-2018, 10Patch-For-Review: rack/setup/install mwmaint1002.eqiad.wmnet - https://phabricator.wikimedia.org/T201343 (10Dzahn) p:05High>03Normal lowering priority because mwmaint1002 is in production and the remaining steps are all just cleanup [18:34:57] !log sbisson@deploy1001 Synchronized php-1.32.0-wmf.24/extensions/PageTriage/modules/ext.pageTriage.views.list/ext.pageTriage.listControlNav.js: SWAT: [[gerrit:465677|Default to deleted and others when no type is selected on mode switch]] (duration: 00m 50s) [18:34:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:06] (03PS3) 10EBernhardson: Collect prometheus metrics from mjolnir [puppet] - 10https://gerrit.wikimedia.org/r/454644 [18:39:17] (03CR) 10EBernhardson: "ferm rules added, and port changed to 9170/9171" [puppet] - 10https://gerrit.wikimedia.org/r/454644 (owner: 10EBernhardson) [18:39:18] PROBLEM - High lag on wdqs1003 is CRITICAL: 3603 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [18:39:57] 10Operations, 10ops-eqiad, 10Datacenter-Switchover-2018, 
10Patch-For-Review: rack/setup/install mwmaint1002.eqiad.wmnet - https://phabricator.wikimedia.org/T201343 (10Dzahn) 05Open>03Resolved Ok, let's keep the ticket within the original focus.. setting up mwmaint1002. That is done. Normally there wo... [18:40:16] 10Operations, 10ops-eqiad, 10Datacenter-Switchover-2018: rack/setup/install mwmaint1002.eqiad.wmnet - https://phabricator.wikimedia.org/T201343 (10Dzahn) [18:46:16] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): rack/setup/install cloudvirt102[34] - https://phabricator.wikimedia.org/T199125 (10RobH) a:05MoritzMuehlenhoff>03RobH [18:46:22] andrewbogott: Heyas, you about? [18:46:22] !log sbisson@deploy1001 Synchronized php-1.32.0-wmf.24/extensions/PageTriage/: SWAT: [[gerrit:465676|Handle page that are unnominated for deletion]] (duration: 00m 50s) [18:46:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:35] 10Operations, 10Performance-Team, 10Traffic, 10Wikimedia-General-or-Unknown, and 2 others: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (10Imarlier) 05Open>03Resolved [18:47:36] and that concludes the SWAT [19:01:46] (03PS3) 10Dzahn: Revert "mw_maintenance: temp hack to avoid duplicate crons on switch to eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/465645 [19:01:51] (03CR) 10Dzahn: [C: 031] "Compilation results for mwmaint2001.codfw.wmnet: no change" [puppet] - 10https://gerrit.wikimedia.org/r/465645 (owner: 10Dzahn) [19:03:55] (03CR) 10Dzahn: [C: 032] Revert "mw_maintenance: temp hack to avoid duplicate crons on switch to eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/465645 (owner: 10Dzahn) [19:10:01] (03PS1) 10Ladsgroup: ores: Add logstash config [puppet] - 10https://gerrit.wikimedia.org/r/466716 (https://phabricator.wikimedia.org/T181546) [19:10:48] (03CR) 10jerkins-bot: [V: 04-1] ores: Add logstash config [puppet] - 
10https://gerrit.wikimedia.org/r/466716 (https://phabricator.wikimedia.org/T181546) (owner: 10Ladsgroup) [19:14:47] (03PS2) 10Ladsgroup: ores: Add logstash config [puppet] - 10https://gerrit.wikimedia.org/r/466716 (https://phabricator.wikimedia.org/T181546) [19:15:40] (03CR) 10jerkins-bot: [V: 04-1] ores: Add logstash config [puppet] - 10https://gerrit.wikimedia.org/r/466716 (https://phabricator.wikimedia.org/T181546) (owner: 10Ladsgroup) [19:19:56] (03PS3) 10Ladsgroup: ores: Add logstash config [puppet] - 10https://gerrit.wikimedia.org/r/466716 (https://phabricator.wikimedia.org/T181546) [19:20:25] (03CR) 10Dzahn: [C: 032] "noop and double-checked:" [puppet] - 10https://gerrit.wikimedia.org/r/465645 (owner: 10Dzahn) [19:28:24] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10cloud-services-team: Provide a way to have test servers on real hardware, isolated from production - https://phabricator.wikimedia.org/T206636 (10Gehel) The main contention point for WDQS (or investigating alternatives) seems to be IOPS. We tried settin... 
[19:29:18] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 34 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [19:35:24] (03PS1) 10RobH: cloudvirt1023 to attempt to use jessie [puppet] - 10https://gerrit.wikimedia.org/r/466721 [19:35:59] (03CR) 10RobH: [C: 032] cloudvirt1023 to attempt to use jessie [puppet] - 10https://gerrit.wikimedia.org/r/466721 (owner: 10RobH) [19:36:21] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 44 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [19:38:32] (03PS1) 10Gehel: wdqs: use recent change poller on public cluster instead of kafka [puppet] - 10https://gerrit.wikimedia.org/r/466722 (https://phabricator.wikimedia.org/T206423) [19:48:32] PROBLEM - High lag on wdqs1003 is CRITICAL: 6999 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [19:48:52] (03CR) 10Gehel: "> Patch Set 4: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/465450 (https://phabricator.wikimedia.org/T206114) (owner: 10Mathew.onipe) [19:50:40] 10Operations, 10monitoring, 10Discovery-Search (Current work), 10Patch-For-Review: Create an Icinga check to alert on packet dropped - https://phabricator.wikimedia.org/T206114 (10Gehel) General question on how to deploy this kind of change: This will most probably trip on a number of nodes (I know that a... 
[19:50:42] PROBLEM - High lag on wdqs1003 is CRITICAL: 7103 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [19:51:31] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 26 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [19:54:48] 10Operations, 10monitoring, 10Discovery-Search (Current work), 10Patch-For-Review: Create an Icinga check to alert on packet dropped - https://phabricator.wikimedia.org/T206114 (10Volans) >>! In T206114#4659596, @Gehel wrote: > How do we ensure that a check like this does not generate too much noise when w... [19:55:11] PROBLEM - High lag on wdqs1003 is CRITICAL: 7341 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [19:55:36] (03PS1) 10Alexandros Kosiaris: uwsgi: Remove the uwsgi-dbg package [puppet] - 10https://gerrit.wikimedia.org/r/466723 [19:55:41] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Wikidata-Query-Service-Sprint: Define an SLO for Wikidata Query Service public endpoint and communicate it - https://phabricator.wikimedia.org/T199228 (10Gehel) >>! In T199228#4655321, @Smalyshev wrote: > I think update lag is not the biggest... 
[19:58:31] PROBLEM - High lag on wdqs1003 is CRITICAL: 7491 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [19:58:51] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 58 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [19:59:08] (03PS3) 10Mforns: Add druid_load jobs to analytics refinery [puppet] - 10https://gerrit.wikimedia.org/r/465692 (https://phabricator.wikimedia.org/T206342) [20:01:30] (03CR) 10Legoktm: [C: 031] "Seems fine, I didn't review all of the bash logic. I'm not sure whether it's worth maintaining all of that logic vs just hardcoding php7.0" [puppet] - 10https://gerrit.wikimedia.org/r/462748 (https://phabricator.wikimedia.org/T205313) (owner: 10Thcipriani) [20:04:43] (03CR) 10Gehel: "puppet compiler looks good: https://puppet-compiler.wmflabs.org/compiler1002/12877/" [puppet] - 10https://gerrit.wikimedia.org/r/466722 (https://phabricator.wikimedia.org/T206423) (owner: 10Gehel) [20:04:51] (03PS2) 10Gehel: wdqs: use recent change poller on public cluster instead of kafka [puppet] - 10https://gerrit.wikimedia.org/r/466722 (https://phabricator.wikimedia.org/T206423) [20:05:17] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): rack/setup/install cloudvirt102[34] - https://phabricator.wikimedia.org/T199125 (10RobH) So, cloudvirt1023 is now installed and has puppet signed and running with jessie. 
[20:05:35] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): rack/setup/install cloudvirt102[34] - https://phabricator.wikimedia.org/T199125 (10RobH) [20:06:43] PROBLEM - High lag on wdqs1003 is CRITICAL: 7819 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [20:07:39] (03CR) 10Mathew.onipe: [C: 031] wdqs: use recent change poller on public cluster instead of kafka [puppet] - 10https://gerrit.wikimedia.org/r/466722 (https://phabricator.wikimedia.org/T206423) (owner: 10Gehel) [20:07:54] (03CR) 10Gehel: [C: 032] wdqs: use recent change poller on public cluster instead of kafka [puppet] - 10https://gerrit.wikimedia.org/r/466722 (https://phabricator.wikimedia.org/T206423) (owner: 10Gehel) [20:10:14] (03PS5) 10Mathew.onipe: base::monitoring::host: added icinga prometheus check for network drops [puppet] - 10https://gerrit.wikimedia.org/r/465450 (https://phabricator.wikimedia.org/T206114) [20:16:53] PROBLEM - puppet last run on lvs5003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:20:20] 10Operations, 10Traffic, 10Patch-For-Review: Renew unified certificates 2017 - https://phabricator.wikimedia.org/T178173 (10BBlack) 05Open>03Resolved a:03BBlack Yes, these certs are long-deployed :) [20:22:01] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): rack/setup/install cloudvirt102[34] - https://phabricator.wikimedia.org/T199125 (10RobH) a:05RobH>03Cmjohnson Chris, Please install the replacement h730P when it arrives this Friday into cloudvirt1024, then assign this task back t... [20:26:11] (03CR) 10Alexandros Kosiaris: [C: 031] "Will merge it tomorrow EU morning." 
[puppet] - 10https://gerrit.wikimedia.org/r/466716 (https://phabricator.wikimedia.org/T181546) (owner: 10Ladsgroup) [20:26:23] (03CR) 10Mathew.onipe: "> Patch Set 4: Code-Review+1" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/465450 (https://phabricator.wikimedia.org/T206114) (owner: 10Mathew.onipe) [20:27:09] (03Abandoned) 10Dzahn: Revert "rename wmf6936 from mw1297 to mwmaint1001" [dns] - 10https://gerrit.wikimedia.org/r/465689 (https://phabricator.wikimedia.org/T192457) (owner: 10Dzahn) [20:27:40] (03Restored) 10Dzahn: Revert "rename wmf6936 from mw1297 to mwmaint1001" [dns] - 10https://gerrit.wikimedia.org/r/465689 (https://phabricator.wikimedia.org/T192457) (owner: 10Dzahn) [20:29:01] (03PS3) 10Dzahn: Revert "rename wmf6936 from mw1297 to mwmaint1001" [dns] - 10https://gerrit.wikimedia.org/r/465689 (https://phabricator.wikimedia.org/T192457) [20:29:04] (03CR) 10Mforns: "After chat with Andrew, this seems ready. However, still waiting to deploy this change first: https://gerrit.wikimedia.org/r/#/c/analytics" [puppet] - 10https://gerrit.wikimedia.org/r/465692 (https://phabricator.wikimedia.org/T206342) (owner: 10Mforns) [20:29:04] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 33 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [20:29:56] (03PS4) 10Dzahn: Revert "rename wmf6936 from mw1297 to mwmaint1001" [dns] - 10https://gerrit.wikimedia.org/r/465689 (https://phabricator.wikimedia.org/T192457) [20:35:32] (03PS1) 10Dzahn: mediawiki_maintenance: switch home rsync to 1002->2001 [puppet] - 10https://gerrit.wikimedia.org/r/466731 (https://phabricator.wikimedia.org/T201343) [20:36:10] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 52 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts 
[20:37:02] !log add IPv6 to mr1-ulsfo OOB - T206778 [20:37:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:05] T206778: Configure v6 OOB for ulsfo - https://phabricator.wikimedia.org/T206778 [20:40:32] (03PS2) 10MarcoAurelio: Disable CongressLookup everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462173 (https://phabricator.wikimedia.org/T205049) [20:42:00] RECOVERY - puppet last run on lvs5003 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [20:42:18] 10Operations, 10Traffic: Renew GlobalSign Unified in 2018 - https://phabricator.wikimedia.org/T206804 (10BBlack) p:05Triage>03Normal [20:45:43] (03PS1) 10Dzahn: re-add mw1297 to site.pp and DHCP, formerly mwmaint1001 [puppet] - 10https://gerrit.wikimedia.org/r/466773 (https://phabricator.wikimedia.org/T192185) [20:46:11] (03PS1) 10MarcoAurelio: Enable FileExporter to Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466774 (https://phabricator.wikimedia.org/T205201) [20:47:39] (03CR) 10Muehlenhoff: re-add mw1297 to site.pp and DHCP, formerly mwmaint1001 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/466773 (https://phabricator.wikimedia.org/T192185) (owner: 10Dzahn) [20:49:23] (03PS2) 10MarcoAurelio: Enable FileExporter to Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466774 (https://phabricator.wikimedia.org/T205201) [20:51:11] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 26 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [20:51:11] (03CR) 10Dzahn: "and also removing mwmaint1001 from site in the same patch" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/466773 (https://phabricator.wikimedia.org/T192185) (owner: 10Dzahn) [20:52:50] (03PS2) 10Dzahn: re-add mw1297 to site.pp and DHCP, remove mwmaint1001 [puppet] - 10https://gerrit.wikimedia.org/r/466773 
(https://phabricator.wikimedia.org/T192185) [20:53:17] (03PS1) 10MarcoAurelio: Enable ShortURL on gom.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466778 (https://phabricator.wikimedia.org/T206741) [20:55:15] (03CR) 10Dzahn: "so... schedule downtime, shut down, change DNS.. wait 1H .., change DHCP and site, run reimage script? sounds right?" [puppet] - 10https://gerrit.wikimedia.org/r/466773 (https://phabricator.wikimedia.org/T192185) (owner: 10Dzahn) [20:55:37] (03PS2) 10MarcoAurelio: Enable ShortURL on gom.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466778 (https://phabricator.wikimedia.org/T206741) [20:56:00] (03PS3) 10MarcoAurelio: Enable ShortURL on gom.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466778 (https://phabricator.wikimedia.org/T206741) [20:56:30] (03PS7) 10Herron: smarthost: create mail smarthost role/profile [puppet] - 10https://gerrit.wikimedia.org/r/463143 (https://phabricator.wikimedia.org/T41785) [20:56:59] (03CR) 10Herron: smarthost: create mail smarthost role/profile (039 comments) [puppet] - 10https://gerrit.wikimedia.org/r/463143 (https://phabricator.wikimedia.org/T41785) (owner: 10Herron) [20:57:20] (03CR) 10Muehlenhoff: [C: 031] re-add mw1297 to site.pp and DHCP, remove mwmaint1001 [puppet] - 10https://gerrit.wikimedia.org/r/466773 (https://phabricator.wikimedia.org/T192185) (owner: 10Dzahn) [20:58:05] (03CR) 10jerkins-bot: [V: 04-1] smarthost: create mail smarthost role/profile [puppet] - 10https://gerrit.wikimedia.org/r/463143 (https://phabricator.wikimedia.org/T41785) (owner: 10Herron) [20:58:31] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 57 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [20:59:47] jouncebot: refresh [20:59:48] I refreshed my knowledge about deployments. 
[20:59:51] jouncebot: next [20:59:51] In 2 hour(s) and 0 minute(s): Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181011T2300) [21:11:52] (03PS1) 10Ayounsi: Add v6 OOB IP for mr1-ulsfo [dns] - 10https://gerrit.wikimedia.org/r/466783 (https://phabricator.wikimedia.org/T206778) [21:12:53] hey, I have a quick question and honestly I have no idea where to ask. I think that some time ago I heard that the web crawlers (like google search) are logged in when they crawl wikipedias. Is it true or did I mishear something? [21:17:16] orly [21:18:23] why would they need to be logged-in? it wouldn't really matter either way? [21:19:27] in general, crawlers aren't logged-in, no. at least not the big ones we observe [21:20:43] pmiazga: Some of them don't even have useful user agents [21:21:10] (03PS1) 10Ayounsi: Icinga, add mr1-ulsfo IPv6 OOB [puppet] - 10https://gerrit.wikimedia.org/r/466787 (https://phabricator.wikimedia.org/T206778) [21:21:32] I think it's the first time I hear that [21:21:40] they are probably *identified* [21:21:44] ok, so I must have misheard something [21:21:56] as in using a proper User-Agent and from their company ip addresses [21:22:04] 10Operations, 10Wikimedia-Mailing-lists, 10User-Urbanecm: Non-working archive for wikimediacz-l list - https://phabricator.wikimedia.org/T205380 (10Urbanecm) Started to work magically... [21:22:08] so we could know they are them [21:22:20] maybe that was the source of the confusion [21:22:22] np, nah, we're working on a feature that is related to the search engines (we're adding the json-ld schema definition) [21:22:50] and I heard that we can identify the search engines, that was most probably it. I don't know why I thought they are logged in. thanks for the quick answer guys! 
[21:23:51] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 30 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [21:27:59] (03CR) 10Dzahn: [C: 032] re-add mw1297 to site.pp and DHCP, remove mwmaint1001 [puppet] - 10https://gerrit.wikimedia.org/r/466773 (https://phabricator.wikimedia.org/T192185) (owner: 10Dzahn) [21:30:49] (03CR) 10Dzahn: [C: 032] mediawiki_maintenance: switch home rsync to 1002->2001 [puppet] - 10https://gerrit.wikimedia.org/r/466731 (https://phabricator.wikimedia.org/T201343) (owner: 10Dzahn) [21:31:10] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 47 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [21:41:29] !log mwmaint2001 - deleting 60G of unneeded files from home [21:41:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:48] wow 60G [21:44:05] Hauskatze: server side uploads of large videos to commons i think , heh [21:44:18] when they are too large for normal upload [21:44:32] yep, I know the process :) [21:44:42] still limited to 5G even server-side, right? [21:45:01] [[Uploading large files]] was the wikitech doc iirc [21:45:25] i am not sure what the current limit is.. *nod* [21:45:32] "MediaWiki currently doesn't support files greater than 4 GB (as size is stored as a 32 bits unsigned integer) while our swift backend storage is limited to 5 Gb. See phab:T191804 and phab:T191802 for discussion to extend this limit respectively to 5 GB and beyond." [21:45:32] T191804: Allow to store files between 4 and 5 Gb - https://phabricator.wikimedia.org/T191804 [21:45:33] T191802: [Epic] Determine a strategy to store files between 5 and 100 Gb - https://phabricator.wikimedia.org/T191802 [21:45:52] whether that is true or not, I cannot tell [21:46:01] cfr. 
https://wikitech.wikimedia.org/wiki/Uploading_large_files [21:47:28] !log mwmaint2001 - rsyncing home dirs from mwmaint1002 to /root/home-mwmaint1002 (which includes home-terbium even!) in case anyone is missing anything from one of mwaint* [21:47:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:02] these were Astronomy things in HD, B.rion used to deal with the huge files [21:48:21] i had confirmed they were not needed anymore [21:48:32] slashes the size in half or so [21:49:59] (03PS4) 10MarcoAurelio: Close chairwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443585 (https://phabricator.wikimedia.org/T184961) [21:53:57] !log mwmaint1001 - schduled downtime, is being renamed back to mw1297 and reinstalled [21:53:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:19] Hauskatze: one of them is "Thermonuclear Art in Ultra-HD", The Sun , heh https://commons.wikimedia.org/wiki/File:NASA_Thermonuclear_Art_%E2%80%93_The_Sun_In_Ultra-HD_(4K)_(1080p).webm [21:58:23] mutante: so that's what'd happen if we nuked a country... right :P [21:59:04] https://svs.gsfc.nasa.gov/12034 [22:01:00] mutante: some day we will be able to store those locally w/o the need to reconvert or resize them :) [22:02:21] Hauskatze: on IPFS? [22:02:43] would be fitting especially for Astronomy [22:04:11] mutante: I mean, on commons [22:04:29] jouncebot: next [22:04:30] In 0 hour(s) and 55 minute(s): Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181011T2300) [22:04:44] sigh [22:04:47] still an hour [22:04:53] Hauskatze: ah, yes :) [22:04:54] * Hauskatze feels asleep [22:06:48] rsyncs from 2 mw servers and needs de-duplication [22:17:31] gzipping some large files and finding more large files worth asking about [22:20:12] mutante: we're back at eqiad right? 
[22:21:07] Hauskatze: yes [22:21:38] mutante: okay, it's because a patch I have for swat needs a maintenance script run to add some tables [22:21:55] Hauskatze: the right server will be mmaint1002 [22:22:01] mwmaint1002 [22:22:25] that being said i dont know if "add some tables" is ok [22:22:49] it's a swattable change, I've done it before [22:22:54] for ShortUrl [22:23:41] is it a schema change? there is a special process for those [22:25:23] mutante: no, no schema change [22:25:33] ok, good [22:25:45] https://phabricator.wikimedia.org/diffusion/EWMA/browse/master/createExtensionTables.php$117 [22:29:46] (03CR) 10MarcoAurelio: "Requires:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466778 (https://phabricator.wikimedia.org/T206741) (owner: 10MarcoAurelio) [22:30:02] !log mwmaint1001 - shutting down after final backup of /home, renaming back to mw1297 in DNS and DHCP, and reinstalling (T192457) [22:30:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:30:08] T192457: Reallocate former image scalers - https://phabricator.wikimedia.org/T192457 [22:41:45] * Krinkle staging on mwdebug1001 [22:42:36] (03PS5) 10Dzahn: Revert "rename wmf6936 from mw1297 to mwmaint1001" [dns] - 10https://gerrit.wikimedia.org/r/465689 (https://phabricator.wikimedia.org/T192457) [22:45:14] !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.24/includes/Revision/RenderedRevision.php: I553dba13486 (duration: 00m 51s) [22:45:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:47:03] (03CR) 10Dzahn: [C: 032] Revert "rename wmf6936 from mw1297 to mwmaint1001" [dns] - 10https://gerrit.wikimedia.org/r/465689 (https://phabricator.wikimedia.org/T192457) (owner: 10Dzahn) [22:48:40] (03CR) 10Dzahn: [C: 032] "gone from Icinga and shut down" [dns] - 10https://gerrit.wikimedia.org/r/465689 (https://phabricator.wikimedia.org/T192457) (owner: 10Dzahn) [22:50:35] !log netbox - renamed mwmaint1001 to mw1297, changed status to inventory,
renamed in DNS - T192457 [22:50:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:50:41] T192457: Reallocate former image scalers - https://phabricator.wikimedia.org/T192457 [22:53:17] !log netbox - correction, mwmaint1001 to status "Staged", following new lifecycle docs T192457 [22:53:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:54:27] (03CR) 10MarcoAurelio: "Also:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466778 (https://phabricator.wikimedia.org/T206741) (owner: 10MarcoAurelio) [22:56:18] !log dzahn@neodymium conftool action : set/pooled=inactive; selector: name=mwmaint1001.eqiad.wmnet [22:56:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear deployers, time to do the Evening SWAT (Max 6 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181011T2300). [23:00:05] Hauskatze: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:11] Yes I'm here. [23:01:08] Sorry, I'm rolling out a backport still. 1/2 is out already, the other is stuck in Jenkins for 27 minutes. [23:01:33] :| [23:01:35] 33 min* [23:01:38] jenkins... [23:01:45] lol [23:01:57] I want to go to bed :( [23:02:59] Krinkle: how much do you estimate it'll take to finish? [23:03:18] I don't know. I thought 20min was long. [23:03:39] Is yours mw or config? [23:04:06] k, done, rolling out now [23:04:31] Krinkle: mediawiki-config [23:04:36] Scheduling a SWAT is not suitable before when you go to bed or another important schedule because sometimes SWAT is cancelled or troubled. 
[23:05:16] rxy: I can wait, but in any case this is the only time in the day I can attend a window [23:05:24] so it's this or nothing [23:05:40] !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.24/includes/jobqueue/jobs/ThumbnailRenderJob.php: T203135 - Ib4640eb13ca93f (duration: 00m 49s) [23:05:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:05:43] T203135: ThumbnailRender job fails with 429 errors - https://phabricator.wikimedia.org/T203135 [23:06:21] * Krinkle is done [23:06:30] Whoever takes swat today, go ahead :) [23:06:47] --- and then the silence was made --- [23:07:03] :) [23:07:42] most recently active were RoanKattouw and dereckson. If no ping in 10min, I can do it as well. I'm just taking a break for a few minutes first. [23:08:47] Sure, take care [23:09:00] take care :) [23:10:13] 10Operations, 10Patch-For-Review: Reallocate former image scalers - https://phabricator.wikimedia.org/T192457 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on neodymium.eqiad.wmnet for hosts: ``` ['mw1297.eqiad.wmnet'] ``` The log can be found in `/var/log/wmf-auto-reimage/201810112309_d... [23:14:07] 10Operations, 10monitoring, 10Discovery-Search (Current work), 10Patch-For-Review: Create an Icinga check to alert on packet dropped - https://phabricator.wikimedia.org/T206114 (10faidon) How do these packet losses manifest? Are we talking about packets being lost in flight, error counters in interfaces, o... [23:17:51] 10Operations, 10DBA, 10JADE, 10MW-1.32-notes (WMF-deploy-2018-09-25 (1.32.0-wmf.23)), and 3 others: Write our anticipated "phase two" schemas and submit for review - https://phabricator.wikimedia.org/T202596 (10awight) @Marostegui These are the proposed indexes, if you want to discuss something concrete: h... 
[23:17:58] (03CR) 10Reedy: [C: 032] Disable CongressLookup everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462173 (https://phabricator.wikimedia.org/T205049) (owner: 10MarcoAurelio) [23:18:35] 10Operations, 10Patch-For-Review: Reallocate former image scalers - https://phabricator.wikimedia.org/T192457 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on neodymium.eqiad.wmnet for hosts: ``` ['mw1297.eqiad.wmnet'] ``` The log can be found in `/var/log/wmf-auto-reimage/201810112318_d... [23:20:24] (03Merged) 10jenkins-bot: Disable CongressLookup everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462173 (https://phabricator.wikimedia.org/T205049) (owner: 10MarcoAurelio) [23:21:14] (03CR) 10Faidon Liambotis: [C: 04-1] smarthost: create mail smarthost role/profile (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/463143 (https://phabricator.wikimedia.org/T41785) (owner: 10Herron) [23:21:25] (03PS3) 10Reedy: Enable FileExporter to Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466774 (https://phabricator.wikimedia.org/T205201) (owner: 10MarcoAurelio) [23:21:54] !log reedy@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Disable CongressLookup (duration: 00m 49s) [23:21:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:04] (03CR) 10Reedy: [C: 032] Enable FileExporter to Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466774 (https://phabricator.wikimedia.org/T205201) (owner: 10MarcoAurelio) [23:24:58] (03Merged) 10jenkins-bot: Enable FileExporter to Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466774 (https://phabricator.wikimedia.org/T205201) (owner: 10MarcoAurelio) [23:25:23] Reedy: thx [23:26:56] !log reedy@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable FileExporter to Meta-Wiki (duration: 00m 49s) [23:26:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:27:05] 
RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 33 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [23:27:52] (03PS4) 10Reedy: Enable ShortURL on gom.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466778 (https://phabricator.wikimedia.org/T206741) (owner: 10MarcoAurelio) [23:27:54] (03CR) 10Faidon Liambotis: [C: 04-1] smarthost: create mail smarthost role/profile (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/463143 (https://phabricator.wikimedia.org/T41785) (owner: 10Herron) [23:28:58] (03CR) 10Reedy: [C: 032] Enable ShortURL on gom.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466778 (https://phabricator.wikimedia.org/T206741) (owner: 10MarcoAurelio) [23:30:17] !log created shorturl table on gomwiki T206741 [23:30:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:30:20] T206741: Enable extension ShortURL for the Konkani Wikipedia - https://phabricator.wikimedia.org/T206741 [23:31:12] (03Merged) 10jenkins-bot: Enable ShortURL on gom.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466778 (https://phabricator.wikimedia.org/T206741) (owner: 10MarcoAurelio) [23:32:23] !log reedy@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable shorturl on gomwiki (duration: 00m 48s) [23:32:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:33:20] !log ran mwscript extensions/ShortUrl/populateShortUrlTable.php --wiki=gomwiki T206741 [23:33:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:33:33] all set Reedy ? 
[23:33:36] (03CR) 10jenkins-bot: Disable CongressLookup everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462173 (https://phabricator.wikimedia.org/T205049) (owner: 10MarcoAurelio) [23:33:38] (03CR) 10jenkins-bot: Enable FileExporter to Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466774 (https://phabricator.wikimedia.org/T205201) (owner: 10MarcoAurelio) [23:33:39] Yeah, all done [23:33:40] (03CR) 10jenkins-bot: Enable ShortURL on gom.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466778 (https://phabricator.wikimedia.org/T206741) (owner: 10MarcoAurelio) [23:33:47] Reedy: thanks a lot :) [23:34:15] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 45 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [23:51:39] 10Operations, 10Patch-For-Review: Reallocate former image scalers - https://phabricator.wikimedia.org/T192457 (10Dzahn) mw1297: done, renamed in DNS/DHCP, reinstalled, in Icinga again, renamed in netbox, changed netbox status to "Staged" per new lifecycle docs https://icinga.wikimedia.org/cgi-bin/icinga/statu... [23:52:25] 10Operations, 10Patch-For-Review: Reallocate former image scalers - https://phabricator.wikimedia.org/T192457 (10Dzahn) [23:53:15] 10Operations, 10Patch-For-Review: Reallocate former image scalers - https://phabricator.wikimedia.org/T192457 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['mw1297.eqiad.wmnet'] ``` and were **ALL** successful. [23:54:04] 10Operations, 10Patch-For-Review: Reallocate former image scalers - https://phabricator.wikimedia.org/T192457 (10Dzahn) [23:54:18] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 34 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts