[00:07:27] (03PS1) 10Dzahn: mediawiki_maintenance: add inactive warning in motd [puppet] - 10https://gerrit.wikimedia.org/r/461011 (https://phabricator.wikimedia.org/T204604) [00:13:25] (03PS2) 10Dzahn: mediawiki_maintenance: add inactive warning in motd [puppet] - 10https://gerrit.wikimedia.org/r/461011 (https://phabricator.wikimedia.org/T204604) [00:14:37] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [00:15:37] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [00:17:34] (03PS1) 10Dzahn: mediawiki_maintenance: remove mw_primary remnant [puppet] - 10https://gerrit.wikimedia.org/r/461012 [00:19:02] (03CR) 10Dzahn: [C: 032] "https://puppet-compiler.wmflabs.org/compiler1002/12486/mwmaint1001.eqiad.wmnet/change.mwmaint1001.eqiad.wmnet.pson" [puppet] - 10https://gerrit.wikimedia.org/r/461011 (https://phabricator.wikimedia.org/T204604) (owner: 10Dzahn) [00:19:09] (03PS3) 10Dzahn: mediawiki_maintenance: add inactive warning in motd [puppet] - 10https://gerrit.wikimedia.org/r/461011 (https://phabricator.wikimedia.org/T204604) [00:22:12] (03PS2) 10Dzahn: mediawiki_maintenance: remove mw_primary remnant [puppet] - 10https://gerrit.wikimedia.org/r/461012 [00:23:08] (03PS3) 10Dzahn: mediawiki_maintenance: remove mw_primary remnant [puppet] - 10https://gerrit.wikimedia.org/r/461012 (https://phabricator.wikimedia.org/T199124) [00:24:30] (03CR) 10Dzahn: "https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/461012/3/modules/profile/manifests/mediawiki/maintenance.pp" [puppet] - 10https://gerrit.wikimedia.org/r/457491 (owner: 10Giuseppe Lavagetto) [00:28:32] (03CR) 10Dzahn: [C: 032] "to my surprise, as opposed to the compiler output linked above, in prod this did it exactly the wrong way around. 
inactive warning on 2001" [puppet] - 10https://gerrit.wikimedia.org/r/461011 (https://phabricator.wikimedia.org/T204604) (owner: 10Dzahn) [00:29:47] RECOVERY - Filesystem available is greater than filesystem size on ms-be2042 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2042&var-datasource=codfw%2520prometheus%252Fops [00:33:48] (03CR) 10Dzahn: [C: 032] "the name of the active maint server in Hiera was removed in https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/457492/ so cant be used" [puppet] - 10https://gerrit.wikimedia.org/r/461011 (https://phabricator.wikimedia.org/T204604) (owner: 10Dzahn) [00:34:37] (03CR) 10Dzahn: profile::mediawiki::maintenance: depend on mediawiki config, not hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/457492 (owner: 10Giuseppe Lavagetto) [00:35:20] (03PS4) 10Dzahn: mediawiki_maintenance: remove mw_primary remnant [puppet] - 10https://gerrit.wikimedia.org/r/461012 (https://phabricator.wikimedia.org/T199124) [00:35:27] (03CR) 10Dzahn: [C: 032] mediawiki_maintenance: remove mw_primary remnant [puppet] - 10https://gerrit.wikimedia.org/r/461012 (https://phabricator.wikimedia.org/T199124) (owner: 10Dzahn) [00:41:37] (03PS1) 10Dzahn: mediawiki_maintenance: reverse absent/present for inactive motd [puppet] - 10https://gerrit.wikimedia.org/r/461013 (https://phabricator.wikimedia.org/T199124) [00:42:19] (03CR) 10Dzahn: [C: 032] "brain fart, of course it needs to be the other way around for an _in_active warning: https://gerrit.wikimedia.org/r/#/c/operations/puppet/" [puppet] - 10https://gerrit.wikimedia.org/r/461011 (https://phabricator.wikimedia.org/T204604) (owner: 10Dzahn) [00:43:43] (03PS2) 10Dzahn: mediawiki_maintenance: reverse absent/present for inactive motd [puppet] - 10https://gerrit.wikimedia.org/r/461013 (https://phabricator.wikimedia.org/T199124) [00:44:06] (03CR) 10Dzahn: [C: 032] mediawiki_maintenance: reverse absent/present for 
inactive motd [puppet] - 10https://gerrit.wikimedia.org/r/461013 (https://phabricator.wikimedia.org/T199124) (owner: 10Dzahn) [00:46:58] 10Operations, 10Datacenter-Switchover-2018, 10Patch-For-Review: Add "do not use this server" login message to non active mwmaint* server - https://phabricator.wikimedia.org/T204604 (10Dzahn) follow-up accidentally linked to wrong ticket: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/461013/ [00:50:09] 10Operations, 10Datacenter-Switchover-2018, 10Patch-For-Review: Add "do not use this server" login message to non active mwmaint* server - https://phabricator.wikimedia.org/T204604 (10Dzahn) mwmaint1001 now has a warning, mwmaint2001 does not, as it should. ``` Linux mwmaint1001 4.9.0-8-amd64 #1 SMP Debian... [02:35:21] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.20) (duration: 13m 58s) [02:35:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:46:09] !log l10nupdate@deploy1001 ResourceLoader cache refresh completed at Tue Sep 18 02:46:09 UTC 2018 (duration 10m 49s) [02:46:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:32:02] (03PS1) 10Andrew Bogott: wmcs pdns-recursor: comma-delimit reverse lookup zones [puppet] - 10https://gerrit.wikimedia.org/r/461024 (https://phabricator.wikimedia.org/T202886) [03:32:34] (03PS2) 10Andrew Bogott: wmcs pdns-recursor: comma-delimit reverse lookup zones [puppet] - 10https://gerrit.wikimedia.org/r/461024 (https://phabricator.wikimedia.org/T202886) [03:34:13] (03CR) 10Andrew Bogott: [C: 032] wmcs pdns-recursor: comma-delimit reverse lookup zones [puppet] - 10https://gerrit.wikimedia.org/r/461024 (https://phabricator.wikimedia.org/T202886) (owner: 10Andrew Bogott) [04:00:08] (03CR) 10Subramanya Sastry: [C: 031] "I don't understand the nuances of profile vs role and how this makes a difference, but, if it works and ruthenium and puppet are happy, wo" [puppet] - 10https://gerrit.wikimedia.org/r/460605 
(https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn) [04:12:16] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:12:57] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:28:56] PROBLEM - Disk space on elastic1028 is CRITICAL: DISK CRITICAL - free space: /srv 46821 MB (9% inode=99%) [04:46:17] RECOVERY - Disk space on elastic1028 is OK: DISK OK [04:58:32] 10Operations, 10DBA, 10Wikidata, 10Wikidata-Query-Service, and 4 others: Wikidata produces a lot of failed requests for recentchanges API - https://phabricator.wikimedia.org/T202764 (10Marostegui) >>! In T202764#4590949, @Smalyshev wrote: > I am a bit confused by now - is the original problem because recen... 
[04:59:48] (03PS1) 10Marostegui: db-codfw.php: Depool db2080 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461025 (https://phabricator.wikimedia.org/T202764) [05:03:01] (03PS2) 10Marostegui: db-codfw.php: Depool db2080 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461025 (https://phabricator.wikimedia.org/T202764) [05:05:04] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2080 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461025 (https://phabricator.wikimedia.org/T202764) (owner: 10Marostegui) [05:06:56] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2080 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461025 (https://phabricator.wikimedia.org/T202764) (owner: 10Marostegui) [05:08:25] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2080 - T202764 (duration: 00m 51s) [05:08:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:08:34] T202764: Wikidata produces a lot of failed requests for recentchanges API - https://phabricator.wikimedia.org/T202764 [05:08:56] !log Drop tmp_2 and tmp_3 index from wikidatawiki.recentchanges on db2080 - T202764 [05:09:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:09:53] (03PS1) 10Marostegui: Revert "db-codfw.php: Depool db2080" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461026 [05:11:47] (03CR) 10Marostegui: [C: 032] Revert "db-codfw.php: Depool db2080" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461026 (owner: 10Marostegui) [05:13:31] (03Merged) 10jenkins-bot: Revert "db-codfw.php: Depool db2080" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461026 (owner: 10Marostegui) [05:14:50] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2080 - T202764 (duration: 00m 49s) [05:14:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:14:57] T202764: Wikidata produces a lot of failed requests for recentchanges API - 
https://phabricator.wikimedia.org/T202764 [05:15:12] (03PS1) 10Marostegui: db-codfw.php: Depool db2081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461027 (https://phabricator.wikimedia.org/T202764) [05:20:00] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461027 (https://phabricator.wikimedia.org/T202764) (owner: 10Marostegui) [05:20:02] (03CR) 10jenkins-bot: db-codfw.php: Depool db2080 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461025 (https://phabricator.wikimedia.org/T202764) (owner: 10Marostegui) [05:20:04] (03CR) 10jenkins-bot: Revert "db-codfw.php: Depool db2080" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461026 (owner: 10Marostegui) [05:21:41] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461027 (https://phabricator.wikimedia.org/T202764) (owner: 10Marostegui) [05:22:45] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2081 - T202764 (duration: 00m 49s) [05:22:49] !log Drop tmp_2 and tmp_3 index from wikidatawiki.recentchanges on db2081 - T202764 [05:22:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:22:53] T202764: Wikidata produces a lot of failed requests for recentchanges API - https://phabricator.wikimedia.org/T202764 [05:22:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:23:20] (03PS1) 10Marostegui: Revert "db-codfw.php: Depool db2081" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461028 [05:26:18] (03CR) 10Marostegui: [C: 032] Revert "db-codfw.php: Depool db2081" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461028 (owner: 10Marostegui) [05:26:24] 10Operations, 10DBA, 10Wikidata, 10Wikidata-Query-Service, and 4 others: Wikidata produces a lot of failed requests for recentchanges API - https://phabricator.wikimedia.org/T202764 (10Marostegui) @Smalyshev - the indexes have been removed 
from the API hosts. The queries on those two servers (db2080 and db... [05:27:56] (03Merged) 10jenkins-bot: Revert "db-codfw.php: Depool db2081" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461028 (owner: 10Marostegui) [05:28:56] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2081 - T202764 (duration: 00m 49s) [05:29:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:29:04] T202764: Wikidata produces a lot of failed requests for recentchanges API - https://phabricator.wikimedia.org/T202764 [05:30:05] (03PS1) 10Marostegui: db-eqiad.php: Depool es1012 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461029 [05:31:48] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool es1012 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461029 (owner: 10Marostegui) [05:33:28] (03Merged) 10jenkins-bot: db-eqiad.php: Depool es1012 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461029 (owner: 10Marostegui) [05:34:29] (03CR) 10jenkins-bot: db-codfw.php: Depool db2081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461027 (https://phabricator.wikimedia.org/T202764) (owner: 10Marostegui) [05:34:31] (03CR) 10jenkins-bot: Revert "db-codfw.php: Depool db2081" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461028 (owner: 10Marostegui) [05:34:33] (03CR) 10jenkins-bot: db-eqiad.php: Depool es1012 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461029 (owner: 10Marostegui) [05:34:55] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool es1012 for kernel and mariadb upgrade (duration: 00m 49s) [05:34:57] !log Stop MySQL on es1012 to upgrade mariadb & kernel [05:35:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:35:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:44:14] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool es1012" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461030 [05:45:44] (03CR) 
10Marostegui: [C: 032] Revert "db-eqiad.php: Depool es1012" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461030 (owner: 10Marostegui) [05:47:22] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool es1012" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461030 (owner: 10Marostegui) [05:48:32] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool es1012 after kernel and mariadb upgrade (duration: 00m 50s) [05:48:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:49:02] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool es1012" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461030 (owner: 10Marostegui) [05:50:17] (03PS1) 10Marostegui: db-eqiad.php: Depool es1013 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461031 [05:53:19] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool es1013 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461031 (owner: 10Marostegui) [05:54:30] (03Merged) 10jenkins-bot: db-eqiad.php: Depool es1013 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461031 (owner: 10Marostegui) [05:55:39] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool es1013 for kernel and mariadb upgrade (duration: 00m 49s) [05:55:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:56:07] !log Stop MySQL on es1013 to upgrade mariadb & kernel [05:56:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:02:47] PROBLEM - Check systemd state on wdqs2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[06:03:18] (03CR) 10jenkins-bot: db-eqiad.php: Depool es1013 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461031 (owner: 10Marostegui) [06:03:53] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool es1013" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461032 [06:05:38] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool es1013" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461032 (owner: 10Marostegui) [06:06:07] RECOVERY - Check systemd state on wdqs2003 is OK: OK - running: The system is fully operational [06:06:58] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool es1013" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461032 (owner: 10Marostegui) [06:09:11] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool es1013 after kernel and mariadb upgrade (duration: 00m 49s) [06:09:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:13:21] 10Operations, 10DBA, 10Wikidata, 10Wikidata-Query-Service, and 4 others: Wikidata produces a lot of failed requests for recentchanges API - https://phabricator.wikimedia.org/T202764 (10Smalyshev) The API requests for recentchanges now seem to be faster, but I still get exceptions in the log :( I also get a... [06:17:11] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool es1013" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461032 (owner: 10Marostegui) [06:17:55] 10Operations, 10DBA, 10Wikidata, 10Wikidata-Query-Service, and 4 others: Wikidata produces a lot of failed requests for recentchanges API - https://phabricator.wikimedia.org/T202764 (10Marostegui) >>! In T202764#4592236, @Smalyshev wrote: > The API requests for recentchanges now seem to be faster, but I st... 
[06:21:10] !log Drop tmp_2 and tmp_3 index from wikidatawiki.recentchanges on dbstore2001, db2079, db2082,db2083 - T202764 [06:21:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:21:18] T202764: Wikidata produces a lot of failed requests for recentchanges API - https://phabricator.wikimedia.org/T202764 [06:24:16] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:24:17] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:25:20] 10Operations, 10DBA, 10Wikidata, 10Wikidata-Query-Service, and 4 others: Wikidata produces a lot of failed requests for recentchanges API - https://phabricator.wikimedia.org/T202764 (10Marostegui) @Smalyshev eqiad and codfw are not the same. The index only exists on recentchanges replicas and the masters (... [06:30:07] PROBLEM - puppet last run on mw1319 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/php/7.0/cli/php.ini] [06:31:27] 10Operations, 10DBA, 10Patch-For-Review: mysql user and group should be a system user/group - https://phabricator.wikimedia.org/T100501 (10Marostegui) >>! In T100501#3371873, @jcrespo wrote: > Blocked on full stretch migration. So only pending labsdb1004,labsdb1005, dbstore1002 and the parsercache? 
[06:45:54] 10Operations, 10OCG-General, 10Readers-Community-Engagement, 10Epic, and 3 others: [EPIC] (Proposal) Replicate core OCG features and sunset OCG service - https://phabricator.wikimedia.org/T150871 (10Mholloway) [06:46:06] 10Operations, 10Cleanup, 10Gerrit, 10GitHub-Mirrors, and 7 others: Archive mediawiki/extensions/Collection and others - https://phabricator.wikimedia.org/T183891 (10Mholloway) 05declined>03Open The Reading Infrastructure team discussed this in a weekly meeting yesterday (after consulting with Reading W... [06:46:55] 10Operations, 10Cleanup, 10Gerrit, 10GitHub-Mirrors, and 7 others: Archive mediawiki/extensions/Collection/OfflineContentGenerator and all OCG-related repos - https://phabricator.wikimedia.org/T183891 (10Mholloway) [06:47:06] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:47:07] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:47:57] 10Operations, 10Cleanup, 10Gerrit, 10GitHub-Mirrors, and 7 others: Archive mediawiki/extensions/Collection/OfflineContentGenerator and all OCG-related repos - https://phabricator.wikimedia.org/T183891 (10Mholloway) [06:48:14] 10Operations, 10Cleanup, 10Gerrit, 10GitHub-Mirrors, and 7 others: Archive mediawiki/extensions/Collection/OfflineContentGenerator and all OCG-related repos - https://phabricator.wikimedia.org/T183891 (10Mholloway) [06:55:07] !log installin zsh security updates [06:55:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:56:47] PROBLEM - MariaDB Slave SQL: s3 on dbstore2002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1146, Errmsg: Error Table 
test2wiki.echo_event doesnt exist on query. Default database: test2wiki. [Query snipped] [06:57:20] PROBLEM - MariaDB Slave SQL: s3 on db2057 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1146, Errmsg: Error Table test2wiki.echo_event doesnt exist on query. Default database: test2wiki. [Query snipped] [06:57:30] PROBLEM - MariaDB Slave SQL: s3 on db2074 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1146, Errmsg: Error Table test2wiki.echo_event doesnt exist on query. Default database: test2wiki. [Query snipped] [06:58:11] uh, what's up? [06:59:30] marostegui: need help? I need 5 minutes as my laptop crashed on resume from sleep [06:59:32] maybe icinga downtimes expired? [06:59:38] marostegui, banyek ^^^ [07:00:36] RECOVERY - puppet last run on mw1319 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [07:00:55] I am fixing that yeah [07:00:56] :( [07:01:29] fixing what? [07:01:36] the replication broken [07:02:18] Table 'test2wiki.echo_event' doesn't exist' on query [07:02:19] how many replicas all? [07:02:30] just those [07:02:36] so some s3 slaves did not have that table ? [07:02:50] yeah [07:02:52] RECOVERY - MariaDB Slave SQL: s3 on db2057 is OK: OK slave_sql_state Slave_SQL_Running: Yes [07:02:55] how did that happen ? [07:03:11] Some slaves didn't have that table (and I was doing an alter) [07:03:11] RECOVERY - MariaDB Slave SQL: s3 on db2074 is OK: OK slave_sql_state Slave_SQL_Running: Yes [07:03:35] Another case of data drifts \o/ [07:04:47] RECOVERY - MariaDB Slave SQL: s3 on dbstore2002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [07:05:03] Sorry for the noise [07:05:36] can someone log what was done? 
[07:05:45] * volans back in business, sorry for the delay :( [07:06:06] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:06:07] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:24:52] (03PS7) 10Jcrespo: mariadb: Enable read_only monitoring on core mariadb hosts [puppet] - 10https://gerrit.wikimedia.org/r/450228 (https://phabricator.wikimedia.org/T172489) [07:25:44] (03PS2) 10Jcrespo: mariadb: Depool db1105 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460919 [07:27:38] 10Operations, 10Cleanup, 10Gerrit, 10GitHub-Mirrors, and 7 others: Archive mediawiki/extensions/Collection/OfflineContentGenerator and all OCG-related repos - https://phabricator.wikimedia.org/T183891 (10hashar) [07:28:02] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db1105 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460919 (owner: 10Jcrespo) [07:29:29] 10Operations, 10Cleanup, 10Gerrit, 10GitHub-Mirrors, and 7 others: Archive mediawiki/extensions/Collection/OfflineContentGenerator and all OCG-related repos - https://phabricator.wikimedia.org/T183891 (10hashar) [07:29:51] (03Merged) 10jenkins-bot: mariadb: Depool db1105 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460919 (owner: 10Jcrespo) [07:33:17] +1 jynus, log what was done :) [07:34:34] 10Operations, 10DBA, 10Patch-For-Review: mysql user and group should be a system user/group - https://phabricator.wikimedia.org/T100501 (10jcrespo) > So only pending labsdb1004,labsdb1005, dbstore1002 and the parsercache For those, a non-system user is used, that is why https://gerrit.wikimedia.org/r/454291... 
[07:35:42] (03PS2) 10Jcrespo: mysql user: Remove exception for mysql user being removed [puppet] - 10https://gerrit.wikimedia.org/r/454291 (https://phabricator.wikimedia.org/T134476) [07:36:39] (03CR) 10Jcrespo: [C: 032] mysql user: Remove exception for mysql user being removed [puppet] - 10https://gerrit.wikimedia.org/r/454291 (https://phabricator.wikimedia.org/T134476) (owner: 10Jcrespo) [07:40:53] (03PS1) 10Jcrespo: mariadb: Remove conditional for system user [puppet] - 10https://gerrit.wikimedia.org/r/461035 (https://phabricator.wikimedia.org/T100501) [07:43:19] (03CR) 10Jcrespo: [C: 04-2] "Bloqued on all systems on stretch or all mariadb host manually moved its group to system (most have already donem but the tail may take a " [puppet] - 10https://gerrit.wikimedia.org/r/461035 (https://phabricator.wikimedia.org/T100501) (owner: 10Jcrespo) [07:45:39] (03CR) 10jenkins-bot: mariadb: Depool db1105 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460919 (owner: 10Jcrespo) [07:45:51] (03PS2) 10Jcrespo: mariadb: Disallow reimage to all hosts except the test db [puppet] - 10https://gerrit.wikimedia.org/r/460890 (https://phabricator.wikimedia.org/T204311) [07:46:25] (03PS3) 10Jcrespo: mariadb: Disallow reimage to all hosts except the test db [puppet] - 10https://gerrit.wikimedia.org/r/460890 (https://phabricator.wikimedia.org/T204311) [07:53:48] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1105 (duration: 00m 50s) [07:53:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:01:10] (03CR) 10Thiemo Kreuz (WMDE): [C: 031] Fix a typo in zhwikiversity's importsources definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460947 (https://phabricator.wikimedia.org/T201328) (owner: 10Urbanecm) [08:02:13] !log stop and restart db1105 for upgrade [08:02:17] PROBLEM - rsyslog TLS listener on port 6514 on wezen is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer 
[08:02:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:34] !log bounce rsyslog on wezen/lithium, tls listener was timing out in icinga [08:02:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:04] (03CR) 10Banyek: [V: 031 C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/460890 (https://phabricator.wikimedia.org/T204311) (owner: 10Jcrespo) [08:03:06] RECOVERY - rsyslog TLS listener on port 6514 on wezen is OK: SSL OK - Certificate wezen.codfw.wmnet valid until 2021-08-21 20:09:05 +0000 (expires in 1068 days) [08:04:13] (03CR) 10Jcrespo: [C: 032] mariadb: Disallow reimage to all hosts except the test db [puppet] - 10https://gerrit.wikimedia.org/r/460890 (https://phabricator.wikimedia.org/T204311) (owner: 10Jcrespo) [08:05:17] (03PS9) 10ArielGlenn: dumps: monitor generation nfs server hosts for nfsd cpu usage [puppet] - 10https://gerrit.wikimedia.org/r/363548 (https://phabricator.wikimedia.org/T169680) [08:06:08] (03PS28) 10Mathew.onipe: Elasticsearch module is coming up. 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) [08:16:57] (03PS1) 10Jcrespo: mariadb: Fully depool db1105 (including its s2 load) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461087 [08:17:46] (03PS10) 10ArielGlenn: dumps: monitor generation nfs server hosts for nfsd cpu usage [puppet] - 10https://gerrit.wikimedia.org/r/363548 (https://phabricator.wikimedia.org/T169680) [08:18:30] (03CR) 10jerkins-bot: [V: 04-1] dumps: monitor generation nfs server hosts for nfsd cpu usage [puppet] - 10https://gerrit.wikimedia.org/r/363548 (https://phabricator.wikimedia.org/T169680) (owner: 10ArielGlenn) [08:19:52] (03PS11) 10ArielGlenn: dumps: monitor generation nfs server hosts for nfsd cpu usage [puppet] - 10https://gerrit.wikimedia.org/r/363548 (https://phabricator.wikimedia.org/T169680) [08:20:07] (03CR) 10Jcrespo: [C: 032] mariadb: Fully depool db1105 (including its s2 load) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461087 (owner: 10Jcrespo) [08:21:30] (03Merged) 10jenkins-bot: mariadb: Fully depool db1105 (including its s2 load) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461087 (owner: 10Jcrespo) [08:22:37] (03PS1) 10DCausse: [cirrus] cleanup drop wgCirrusSearchInterwikiCacheTime [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461088 (https://phabricator.wikimedia.org/T191961) [08:31:26] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:31:26] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:31:43] (03CR) 10jenkins-bot: mariadb: Fully depool db1105 (including its s2 load) [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/461087 (owner: 10Jcrespo) [08:39:37] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [08:41:07] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:41:16] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:41:47] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [08:43:13] so much stuff happening, all fixed by the time I see it [08:43:49] paravoid: I am not sure the network is in a great state, alerts flapping? [08:44:39] cr1-eqord and cr2-eqiad issues? 
[08:45:10] !log stop replication & stop mysql on db1119 (preparing to clone db1114) [08:45:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:17] I know you are not on networking, but may have more visibility than me [08:46:01] jynus: these are alerts for a port going down on eqiad, and another port going down on eqord [08:46:25] usually what that means is that a circuit connecting those two ports is down [08:46:45] looking at the maintenance calendar, it looks like we're in a maintenance window for "eqiad-eqord maintenance", so that fits pretty well [08:46:53] how do I differentiate a non-impacting issue from an impacting one [08:46:56] ah, ok [08:47:00] I didn't see that [08:47:32] it's ok :) [08:47:38] plus last time I assumed that, there were real issues [08:47:40] thanks for alerting me anyway, doesn't hurt! [08:48:06] sorry for the ping, after your advice I prefer that, also knowing you were around [08:48:14] indeed [08:48:19] no need to be sorry [08:48:34] I'm looking at the logs, it fits the circuit mentioned on the calendar [08:48:39] great [08:48:59] to your question, the network has quite a few redundancies but it depends on what you mean by "impacting" [08:49:25] such an event will on many occasions result in a) a reduced redundancy, b) rerouting over higher-latency circuits [08:49:27] so remember when you tell me it is not immediately clear what is going on when a db* host goes down [08:49:30] :-) [08:49:42] that may give you a perspective on how I see routers [08:50:38] that said, there are a few ways in which this kind of issue could result in service impact [08:50:52] I guess if a second circuit goes down? 
[08:50:56] or capacity issues [08:51:02] one is if such a circuit flaps a lot, in which case the network convergence may not be able to catch up as quickly, and then there's packet loss [08:51:22] yeah, I pinged because I saw it happen for a second time [08:51:31] the second is indeed if we lose multiple circuits, which could result in either much-increased latency (think eqiad->codfw via ulsfo) [08:51:46] capacity not impacted a lot? [08:51:50] or a complete loss of redundancy, which would result in a split brain or a site becoming an island [08:51:55] just redundancy? [08:51:59] 10Operations, 10ops-codfw: ms-be2030 spontaneous reboot - https://phabricator.wikimedia.org/T204567 (10fgiunchedi) @Papaul thanks! indeed looks like the server just reset. I'm skeptical upgrading bios/controller will help but no harm in trying either, let me know when you are online later today and I'll shut t... [08:52:08] we usually have plenty of headroom in capacity, so that's usually not a concern [08:52:13] usually! 
[08:52:25] cool, those are thing I don't have a lot of knowledge on [08:52:46] yup, understandably so [08:54:30] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Fully depool db1105 (duration: 00m 49s) [08:54:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:34] (03PS1) 10Muehlenhoff: Reinstate poolcounter2002 as a pool counter [puppet] - 10https://gerrit.wikimedia.org/r/461092 [09:08:36] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:08:36] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:10:35] !log stop mariadb (both instances) and restart db1105 for upgrade [09:10:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:47] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:10:47] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:19:35] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/461092 (owner: 10Muehlenhoff) [09:26:20] (03PS2) 10Ema: cache_misc: cleanup leftovers [puppet] - 10https://gerrit.wikimedia.org/r/460922 (https://phabricator.wikimedia.org/T164609) [09:27:21] (03CR) 10Ema: [C: 032] cache_misc: cleanup leftovers [puppet] - 10https://gerrit.wikimedia.org/r/460922 (https://phabricator.wikimedia.org/T164609) (owner: 10Ema) [09:28:27] 10Operations, 
10cloud-services-team: Ferm leftovers on labtestnet2003 - https://phabricator.wikimedia.org/T204667 (10MoritzMuehlenhoff) [09:34:16] 10Operations, 10ORES, 10Scoring-platform-team, 10Traffic: Pass on name of the node serving ORES requests as response header to the user - https://phabricator.wikimedia.org/T204600 (10ema) p:05Triage>03Normal [09:37:10] (03CR) 10Alexandros Kosiaris: [C: 032] "Sigh. My mistake, thanks for spotting it and sorry." [puppet] - 10https://gerrit.wikimedia.org/r/461092 (owner: 10Muehlenhoff) [09:37:18] (03PS2) 10Alexandros Kosiaris: Reinstate poolcounter2002 as a pool counter [puppet] - 10https://gerrit.wikimedia.org/r/461092 (owner: 10Muehlenhoff) [09:37:21] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Reinstate poolcounter2002 as a pool counter [puppet] - 10https://gerrit.wikimedia.org/r/461092 (owner: 10Muehlenhoff) [09:37:38] !log Drop T153638_echo_XXX tables from db2057 - T153638 [09:37:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:46] T153638: Drop echo tables from local wiki databases - https://phabricator.wikimedia.org/T153638 [09:44:02] (03PS1) 10Jcrespo: mariadb: Repool db1105 after kernel and package upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461096 [09:59:39] (03PS1) 10Jcrespo: mariadb: Tuning the load of s3 and s5, still showing some errors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461097 [10:01:02] !log Rename echo_XXX tables on dbstore2002:3313 - T153638 [10:01:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:10] T153638: Drop echo tables from local wiki databases - https://phabricator.wikimedia.org/T153638 [10:02:27] (03PS1) 10Jcrespo: mariadb: Depool db2041 to recover dbstore2002 s2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461098 (https://phabricator.wikimedia.org/T204593) [10:02:57] (03CR) 10Jcrespo: [C: 032] mariadb: Repool db1105 after kernel and package upgrade [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/461096 (owner: 10Jcrespo) [10:04:13] (03Merged) 10jenkins-bot: mariadb: Repool db1105 after kernel and package upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461096 (owner: 10Jcrespo) [10:04:36] marostegui: give a look at my codfw patches above^ [10:04:44] checking [10:04:55] !log ci: updating Quibble Jenkins jobs to 0.0.26 [10:04:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:31] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1105 (duration: 00m 49s) [10:06:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:49] (03CR) 10Marostegui: [C: 031] mariadb: Tuning the load of s3 and s5, still showing some errors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461097 (owner: 10Jcrespo) [10:07:21] (03PS1) 10Ema: ores: add Server response header [puppet] - 10https://gerrit.wikimedia.org/r/461100 (https://phabricator.wikimedia.org/T204600) [10:07:29] (03PS2) 10Jcrespo: mariadb: Tuning the load of s3 and s5, still showing some errors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461097 [10:07:50] (03CR) 10jenkins-bot: mariadb: Repool db1105 after kernel and package upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461096 (owner: 10Jcrespo) [10:07:57] (03CR) 10jerkins-bot: [V: 04-1] ores: add Server response header [puppet] - 10https://gerrit.wikimedia.org/r/461100 (https://phabricator.wikimedia.org/T204600) (owner: 10Ema) [10:09:26] !log Drop T153638_echo_XXX tables from dbstore2002:3313 - T153638 [10:09:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:33] T153638: Drop echo tables from local wiki databases - https://phabricator.wikimedia.org/T153638 [10:09:44] (03CR) 10Jcrespo: [C: 032] mariadb: Tuning the load of s3 and s5, still showing some errors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461097 (owner: 10Jcrespo) [10:11:12] (03Merged) 10jenkins-bot: 
mariadb: Tuning the load of s3 and s5, still showing some errors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461097 (owner: 10Jcrespo) [10:11:28] (03PS2) 10Ema: ores: add Server response header [puppet] - 10https://gerrit.wikimedia.org/r/461100 (https://phabricator.wikimedia.org/T204600) [10:12:26] (03PS2) 10Jcrespo: mariadb: Depool db2041 to recover dbstore2002 s2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461098 (https://phabricator.wikimedia.org/T204593) [10:12:52] !log jynus@deploy1001 Synchronized wmf-config/db-codfw.php: Tune the s3 and s5 database loads (duration: 00m 49s) [10:12:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:00] (03CR) 10Ema: "pcc output looks good to me https://puppet-compiler.wmflabs.org/compiler1002/12489/ores1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/461100 (https://phabricator.wikimedia.org/T204600) (owner: 10Ema) [10:13:09] will wait a bit and then try deploying the s2 depool [10:15:05] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Minor inline comment, otherwise LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/461100 (https://phabricator.wikimedia.org/T204600) (owner: 10Ema) [10:15:30] (03PS1) 10Arturo Borrero Gonzalez: cloudvps: horizon: grantreview project is now in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/461102 [10:16:35] (03CR) 10Arturo Borrero Gonzalez: [C: 032] cloudvps: horizon: grantreview project is now in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/461102 (owner: 10Arturo Borrero Gonzalez) [10:20:45] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db2041 to recover dbstore2002 s2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461098 (https://phabricator.wikimedia.org/T204593) (owner: 10Jcrespo) [10:22:06] (03Merged) 10jenkins-bot: mariadb: Depool db2041 to recover dbstore2002 s2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461098 (https://phabricator.wikimedia.org/T204593) (owner: 10Jcrespo) [10:22:44] 
(03CR) 10Alexandros Kosiaris: [C: 031] "Commented retracted." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/461100 (https://phabricator.wikimedia.org/T204600) (owner: 10Ema) [10:22:56] (03CR) 10jenkins-bot: mariadb: Tuning the load of s3 and s5, still showing some errors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461097 (owner: 10Jcrespo) [10:22:57] (03CR) 10jenkins-bot: mariadb: Depool db2041 to recover dbstore2002 s2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461098 (https://phabricator.wikimedia.org/T204593) (owner: 10Jcrespo) [10:23:28] (03PS3) 10Ema: ores: add Server response header [puppet] - 10https://gerrit.wikimedia.org/r/461100 (https://phabricator.wikimedia.org/T204600) [10:24:53] !log Rename echo_XXX tables on db1077 and db1078  - T153638 [10:24:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:00] T153638: Drop echo tables from local wiki databases - https://phabricator.wikimedia.org/T153638 [10:25:10] (03CR) 10Ema: [C: 032] ores: add Server response header [puppet] - 10https://gerrit.wikimedia.org/r/461100 (https://phabricator.wikimedia.org/T204600) (owner: 10Ema) [10:25:52] !log jynus@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2041 (duration: 00m 49s) [10:25:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:29] (03PS5) 10Marostegui: Get only active replicas to page for mysqld process number [puppet] - 10https://gerrit.wikimedia.org/r/459764 (https://phabricator.wikimedia.org/T200509) [10:28:27] (03CR) 10Jcrespo: [C: 04-1] Get only active replicas to page for mysqld process number [puppet] - 10https://gerrit.wikimedia.org/r/459764 (https://phabricator.wikimedia.org/T200509) (owner: 10Marostegui) [10:29:20] (03CR) 10Jcrespo: [C: 04-1] "I don't mind the change, but core and core::multiinstance should have the same config." 
[puppet] - 10https://gerrit.wikimedia.org/r/459764 (https://phabricator.wikimedia.org/T200509) (owner: 10Marostegui) [10:31:50] !log stop db2041 to clone to dbstore2002 [10:31:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:24] !log stopping dbstore2002:s2 mariadb instance [10:36:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:43] (03PS6) 10Marostegui: Get only active replicas to page for mysqld process number [puppet] - 10https://gerrit.wikimedia.org/r/459764 (https://phabricator.wikimedia.org/T200509) [10:41:14] (03PS7) 10Marostegui: WIP: Get only active replicas to page for mysqld process number [puppet] - 10https://gerrit.wikimedia.org/r/459764 (https://phabricator.wikimedia.org/T200509) [10:41:57] (03CR) 10jerkins-bot: [V: 04-1] WIP: Get only active replicas to page for mysqld process number [puppet] - 10https://gerrit.wikimedia.org/r/459764 (https://phabricator.wikimedia.org/T200509) (owner: 10Marostegui) [10:42:32] 10Operations, 10cloud-services-team: Ferm leftovers on labtestnet2003 - https://phabricator.wikimedia.org/T204667 (10aborrero) hey @GTirloni would you like to do this reimage? I can guide you in the process. [10:44:01] (03PS8) 10Marostegui: WIP: Get only active replicas to page for mysqld process number [puppet] - 10https://gerrit.wikimedia.org/r/459764 (https://phabricator.wikimedia.org/T200509) [10:45:29] PROBLEM - Host ms-be2030 is DOWN: PING CRITICAL - Packet loss = 100% [10:45:39] godog: ^^ [10:46:36] he's out for lunch I think [10:46:53] * volans checking console [10:48:45] nothing there [10:49:44] all non-responsive, no ping, I'll reboot it [10:50:16] ah found T204567 [10:50:16] T204567: ms-be2030 spontaneous reboot - https://phabricator.wikimedia.org/T204567 [10:54:45] 10Operations, 10ops-codfw: ms-be2030 spontaneous reboot - https://phabricator.wikimedia.org/T204567 (10MoritzMuehlenhoff) Server went down again at 10:45 UTC. 
[10:55:53] moritzm: do you think it's ok to leave it down until godog comes back from lunch? [10:56:00] (03PS9) 10Marostegui: WIP: Get only active replicas to page for mysqld process number [puppet] - 10https://gerrit.wikimedia.org/r/459764 (https://phabricator.wikimedia.org/T200509) [10:56:29] volans: seems fine [10:56:57] ack, thnx [10:57:14] 10Operations, 10ops-codfw: ms-be2030 spontaneous reboot - https://phabricator.wikimedia.org/T204567 (10Volans) Console unresponsive, nothing in `show /system1/log1/`, no ping. As agreed on IRC I'm leaving it down for now until @fgiunchedi comes back from lunch to allow further investigation. [10:59:51] (03CR) 10Marostegui: "Works as expected: https://puppet-compiler.wmflabs.org/compiler1002/12491/" [puppet] - 10https://gerrit.wikimedia.org/r/459764 (https://phabricator.wikimedia.org/T200509) (owner: 10Marostegui) [11:03:36] 10Operations, 10ops-codfw, 10DBA: db2042 RAID battery failed - https://phabricator.wikimedia.org/T202051 (10jcrespo) a:05jcrespo>03None [11:10:00] PROBLEM - MariaDB read only s2 on dbstore2002 is CRITICAL: Could not connect to localhost:3312 [11:15:53] that is s2 being under maintenance, apparently I forgot to down that [11:24:21] (03CR) 10Jcrespo: "So what I would do is to keep the profile and adapt it to use it everywhere (so we avoid duplicate code) but I have nothing against this c" [puppet] - 10https://gerrit.wikimedia.org/r/459764 (https://phabricator.wikimedia.org/T200509) (owner: 10Marostegui) [11:24:48] (03CR) 10Jcrespo: "BTW, we could merge this and the read_only change at the same time for testing purposes." 
[puppet] - 10https://gerrit.wikimedia.org/r/459764 (https://phabricator.wikimedia.org/T200509) (owner: 10Marostegui) [11:25:59] 10Operations, 10ORES, 10Scoring-platform-team, 10Traffic, 10Patch-For-Review: Pass on name of the node serving ORES requests as response header to the user - https://phabricator.wikimedia.org/T204600 (10ema) 05Open>03Resolved a:03ema Done: ``` $ curl -v https://ores.wikimedia.org/v3/scores/wikidat... [11:31:49] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [11:34:00] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [11:35:21] 10Operations, 10SRE-Access-Requests, 10Release-Engineering-Team (Watching / External): Add contint-roots to releases{1,2}001 - https://phabricator.wikimedia.org/T201470 (10mark) Although we didn't manage to discuss this in our SRE meeting yesterday I discussed it with relevant people afterwards. There are n... [11:35:31] 10Operations, 10ORES, 10Scoring-platform-team, 10Traffic, 10Patch-For-Review: Pass on name of the node serving ORES requests as response header to the user - https://phabricator.wikimedia.org/T204600 (10Ladsgroup) Thank you! [11:35:35] 10Operations, 10SRE-Access-Requests, 10Release-Engineering-Team (Watching / External): Add contint-roots to releases{1,2}001 - https://phabricator.wikimedia.org/T201470 (10mark) a:05mark>03RobH [11:44:59] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [11:46:28] thanks volans / paravoid ! I'll take a look [11:46:45] godog: great, thanks! 
[11:46:48] all yours :) [11:46:57] I've closed my mgmt shell [11:47:09] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [11:49:29] bah indeed nothing on console, nor in remote syslog afaics [11:51:38] what happened to jouncebot? I've just noticed it did not ping me for EU SWAT [11:52:25] good thing is there were no patches for swat, I guess somebody would ping me otherwise :) [12:12:09] !log rebooting labnodepool1001 for kernel security update [12:12:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:50] 10Operations, 10ops-codfw: ms-be2030 spontaneous reboot - https://phabricator.wikimedia.org/T204567 (10fgiunchedi) Also `power` commands from ilo don't seem to respond timely or work at all: ``` hpiLO-> power off status=0 status_tag=COMMAND COMPLETED Tue Sep 18 12:09:16 2018... [12:16:10] RECOVERY - Host ms-be2030 is UP: PING OK - Packet loss = 0%, RTA = 38.31 ms [12:23:12] 10Operations, 10ops-codfw: ms-be2030 spontaneous reboot - https://phabricator.wikimedia.org/T204567 (10fgiunchedi) The host is back up now, clearly not stable enough for production though. 
[12:23:17] I've restarted jouncebot, welcome back [12:36:31] (03PS1) 10Bmansurov: Enable logging for CitationUsage and CitationUsagePageLoad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461117 (https://phabricator.wikimedia.org/T191086) [12:40:32] (03CR) 10Gehel: [C: 031] "LGTM, let's wait to see if volans has something to add" (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe) [12:42:25] (03PS7) 10Gehel: Icinga disk space check for old elasticsearch servers [puppet] - 10https://gerrit.wikimedia.org/r/460763 (https://phabricator.wikimedia.org/T204361) (owner: 10Mathew.onipe) [12:45:46] (03CR) 10Gehel: [C: 032] Icinga disk space check for old elasticsearch servers [puppet] - 10https://gerrit.wikimedia.org/r/460763 (https://phabricator.wikimedia.org/T204361) (owner: 10Mathew.onipe) [12:47:50] PROBLEM - Filesystem available is greater than filesystem size on ms-be2041 is CRITICAL: cluster=swift device=/dev/sdd1 fstype=xfs instance=ms-be2041:9100 job=node mountpoint=/srv/swift-storage/sdd1 site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2041&var-datasource=codfw%2520prometheus%252Fops [12:47:58] (03CR) 10Gehel: [C: 031] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/460426 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [12:49:02] (03CR) 10Gehel: [C: 031] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/460425 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [12:52:37] 10Operations, 10Maps-Sprint, 10Traffic, 10Maps (Tilerator), and 2 others: Tilerator should purge Varnish cache - https://phabricator.wikimedia.org/T109776 (10Mholloway) Confirmed that events are being sent by Tilerator upon tile generation, and received and produced to kafka by `eventlogging-service` on be... 
[12:55:45] (03CR) 10Marostegui: "> BTW, we could merge this and the read_only change at the same time" [puppet] - 10https://gerrit.wikimedia.org/r/459764 (https://phabricator.wikimedia.org/T200509) (owner: 10Marostegui) [12:57:10] PROBLEM - kubelet operational latencies on kubernetes2003 is CRITICAL: instance=kubernetes2003.codfw.wmnet operation_type={create_container,remove_container,start_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [12:58:19] RECOVERY - kubelet operational latencies on kubernetes2003 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [13:00:04] hashar: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - European version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180918T1300). [13:02:13] (03PS2) 10Bmansurov: Enable logging for CitationUsage and CitationUsagePageLoad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461117 (https://phabricator.wikimedia.org/T191086) [13:10:20] !log Drop T153638_echo_XXX tables from db1077 - T153638 [13:10:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:30] T153638: Drop echo tables from local wiki databases - https://phabricator.wikimedia.org/T153638 [13:12:14] 10Operations, 10Discovery-Search, 10Elasticsearch: Resolve elasticsearch shard size alert - https://phabricator.wikimedia.org/T204362 (10Mathew.onipe) p:05Triage>03Normal [13:12:47] 10Operations, 10Elasticsearch, 10Discovery-Search (Current work): Resolve elasticsearch shard size alert - https://phabricator.wikimedia.org/T204362 (10Mathew.onipe) [13:13:55] 10Operations, 10CirrusSearch, 10Discovery, 10Elasticsearch, 10Discovery-Search (Current work): Resolve elasticsearch shard size alert - https://phabricator.wikimedia.org/T204362 (10Mathew.onipe) [13:16:14] (03PS1) 10Banyek: MariaDB: Enable notifications db1114 & db1119 [puppet] - 
10https://gerrit.wikimedia.org/r/461124 (https://phabricator.wikimedia.org/T203565) [13:18:02] (03CR) 10Marostegui: [C: 031] MariaDB: Enable notifications db1114 & db1119 [puppet] - 10https://gerrit.wikimedia.org/r/461124 (https://phabricator.wikimedia.org/T203565) (owner: 10Banyek) [13:19:12] (03CR) 10Banyek: [C: 032] MariaDB: Enable notifications db1114 & db1119 [puppet] - 10https://gerrit.wikimedia.org/r/461124 (https://phabricator.wikimedia.org/T203565) (owner: 10Banyek) [13:21:22] !log Cutting branches wmf/1.32.0-wmf.22 | T191068 [13:21:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:29] T191068: 1.32.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T191068 [13:24:16] !log Drop T153638_echo_XXX tables from db1078 - T153638 [13:24:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:24] T153638: Drop echo tables from local wiki databases - https://phabricator.wikimedia.org/T153638 [13:26:33] fatal: remote error: mediawiki/extenisons/EUCopyrightCampaign unavailable [13:26:34] pfff [13:26:52] !log reimaging mw2245 (spare host) to test reimages from cumin2001 [13:26:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:37] extenisons [13:27:38] ahah [13:27:40] (03PS2) 10Muehlenhoff: Enable cumin2001 as mysql maintenance client [puppet] - 10https://gerrit.wikimedia.org/r/460323 (https://phabricator.wikimedia.org/T177385) [13:27:54] (03PS1) 10Andrew Bogott: wmcs pdns: added forwarding for floating IP ptr records [puppet] - 10https://gerrit.wikimedia.org/r/461126 (https://phabricator.wikimedia.org/T202886) [13:30:57] (03CR) 10Andrew Bogott: [C: 032] wmcs pdns: added forwarding for floating IP ptr records [puppet] - 10https://gerrit.wikimedia.org/r/461126 (https://phabricator.wikimedia.org/T202886) (owner: 10Andrew Bogott) [13:31:54] !log Rename echo_XXX tables on db1095 and db1123 [13:31:59] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:02] !log repair sdd on ms-be2041 - T199198 [13:32:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:11] T199198: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 [13:39:35] RECOVERY - MariaDB read only s2 on dbstore2002 is OK: Version 10.1.35-MariaDB, Uptime 82s, read_only: True, 56.70 QPS, connection latency: 0.005405s, query latency: 0.001170s [13:40:58] (03PS1) 10Banyek: db-equiad: repool db1114 and db1119 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461128 (https://phabricator.wikimedia.org/T203565) [13:41:14] RECOVERY - MariaDB Slave SQL: s2 on dbstore2002 is OK: OK slave_sql_state not a slave [13:41:38] (03CR) 10Marostegui: [C: 04-1] db-equiad: repool db1114 and db1119 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461128 (https://phabricator.wikimedia.org/T203565) (owner: 10Banyek) [13:41:44] RECOVERY - MariaDB Slave IO: s2 on dbstore2002 is OK: OK slave_io_state not a slave [13:43:39] !log installing redis security updates on maps* servers [13:43:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:45] (03PS2) 10Banyek: db-equiad: repool db1114 and db1119 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461128 (https://phabricator.wikimedia.org/T203565) [13:47:43] (03CR) 10Jcrespo: "See inline" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461128 (https://phabricator.wikimedia.org/T203565) (owner: 10Banyek) [13:47:52] (03CR) 10Marostegui: "Nitpick on the commit message: it should be db-eqiad.php db-equiad doesn't exist :-)" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461128 (https://phabricator.wikimedia.org/T203565) (owner: 10Banyek) [13:47:54] RECOVERY - Filesystem available is greater than filesystem size on ms-be2041 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2041&var-datasource=codfw%2520prometheus%252Fops [13:48:17] !log updating intel-microcode on Debian jessie/stretch to 3.20180807a.1 [13:48:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:54] (03PS3) 10Banyek: db-eqiad: repool db1114 and db1119 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461128 (https://phabricator.wikimedia.org/T203565) [13:53:58] (03CR) 10Marostegui: [C: 031] db-eqiad: repool db1114 and db1119 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461128 (https://phabricator.wikimedia.org/T203565) (owner: 10Banyek) [13:54:13] jouncebot: next [13:54:13] In 2 hour(s) and 5 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180918T1600) [13:54:24] (03CR) 10Banyek: [C: 032] db-eqiad: repool db1114 and db1119 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461128 (https://phabricator.wikimedia.org/T203565) (owner: 10Banyek) [13:55:36] (03Merged) 10jenkins-bot: db-eqiad: repool db1114 and db1119 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461128 (https://phabricator.wikimedia.org/T203565) (owner: 10Banyek) [13:56:18] (03PS1) 10Andrew Bogott: wmfkeystonehooks: add eqiad1 security group rules to new projects [puppet] - 10https://gerrit.wikimedia.org/r/461132 [13:56:20] (03PS1) 10Andrew Bogott: toolforge k8s: allow access to eqiad1-r IPs. 
[puppet] - 10https://gerrit.wikimedia.org/r/461133 [13:56:30] !log poweroff ms-be2030 - T204567 [13:56:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:38] T204567: ms-be2030 spontaneous reboot - https://phabricator.wikimedia.org/T204567 [13:56:40] papaul: ^ all yours [13:57:44] (03CR) 10Andrew Bogott: [C: 032] wmfkeystonehooks: add eqiad1 security group rules to new projects [puppet] - 10https://gerrit.wikimedia.org/r/461132 (owner: 10Andrew Bogott) [13:57:49] !log banyek@deploy1001 Synchronized wmf-config/db-eqiad.php: T203565: Repool db1114 and db1119 (duration: 00m 49s) [13:57:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:57] T203565: Reclone db1114 (s1 api) from another API host - https://phabricator.wikimedia.org/T203565 [13:58:23] godog: thanks [13:58:51] !log upgrading BIOS on ms-be2030 [13:58:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:20] (03CR) 10jenkins-bot: db-eqiad: repool db1114 and db1119 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461128 (https://phabricator.wikimedia.org/T203565) (owner: 10Banyek) [13:59:26] (03CR) 10Andrew Bogott: [C: 032] toolforge k8s: allow access to eqiad1-r IPs. 
[puppet] - 10https://gerrit.wikimedia.org/r/461133 (owner: 10Andrew Bogott) [14:02:50] !log Drop T153638_echo_XXX tables from db1095:3313 - T153638 [14:02:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:59] T153638: Drop echo tables from local wiki databases - https://phabricator.wikimedia.org/T153638 [14:03:00] jynus: ^ 1500 tables less to backup :p [14:06:59] (03PS1) 10Hashar: admin: hashar: drop git pushInsteadOf [puppet] - 10https://gerrit.wikimedia.org/r/461134 (https://phabricator.wikimedia.org/T204710) [14:13:08] (03PS1) 10Jcrespo: Revert "mariadb: Depool db2041 to recover dbstore2002 s2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461136 [14:14:44] (03PS2) 10Jcrespo: Revert "mariadb: Depool db2041 to recover dbstore2002 s2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461136 [14:15:20] (03PS3) 10Jcrespo: Revert "mariadb: Depool db2041 to recover dbstore2002 s2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461136 [14:16:07] marostegui: lol [14:16:53] (03CR) 10Volans: [C: 04-1] "Very nice! Much cleaner that few days ago! There is still one thing that is missing from my previous comments (the use of remote instead o" (0317 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe) [14:19:11] !log Drop T153638_echo_XXX tables from db1123 [14:19:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:58] (03CR) 10Jcrespo: [C: 04-2] "db2041 hasn't cached up yet, may pool tomorrow." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461136 (owner: 10Jcrespo) [14:23:09] can someone merge a dummy change for my cluster ~/.gitconfig please? 
:] https://gerrit.wikimedia.org/r/461134 [14:24:18] (03CR) 10Ema: [C: 032] admin: hashar: drop git pushInsteadOf [puppet] - 10https://gerrit.wikimedia.org/r/461134 (https://phabricator.wikimedia.org/T204710) (owner: 10Hashar) [14:24:40] ema thanks :) [14:24:46] yw! [14:29:03] (03PS2) 10C. Scott Ananian: Use core default for Parser preprocessor class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460202 [14:29:05] 10Operations, 10Wikimedia-Logstash, 10Goal, 10Patch-For-Review: Audit log producers across the infrastructure and plan their transition to centralized logging. - https://phabricator.wikimedia.org/T198756 (10fgiunchedi) Another source of logs that are not in logstash nowadays is logs on disk, I've ran a cru... [14:29:29] :q [14:29:32] heh [14:30:11] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review: Systemd restart loop of timer filled the disk on tegmen - https://phabricator.wikimedia.org/T199413 (10Volans) a:05Volans>03None [14:30:34] (03PS1) 10Hashar: Group0 to 1.32.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461140 [14:31:39] !log hashar@deploy1001 Started scap: testwiki to php-1.32.0-wmf.20 [14:31:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:14] 10Operations, 10ops-codfw: ms-be2030 spontaneous reboot - https://phabricator.wikimedia.org/T204567 (10Papaul) @fgiunchedi getting "The last firmware update attempt was not successful. Ready for the next update." error when trying to update the BIOS. so I email the HP engineer and waiting for response. Sin... 
[14:38:04] 10Operations, 10Wikidata, 10wikidata-tech-focus: Move dispatching of wikidata to a dedicated node - https://phabricator.wikimedia.org/T193733 (10Addshore) [14:38:55] 10Operations, 10Wikidata, 10wikidata-tech-focus: 503 error raises again while trying to load a Wikidata page - https://phabricator.wikimedia.org/T140879 (10Addshore) [15:00:54] 10Operations, 10Wikimedia-Logstash, 10User-fgiunchedi, 10User-herron: Logstash hardware expansion - https://phabricator.wikimedia.org/T203169 (10fgiunchedi) I did a quick spreadsheet with some back of envelope calculations for Logstash disk requirements at https://docs.google.com/spreadsheets/d/18RJKd5-bF3... [15:07:26] !log hashar@deploy1001 Finished scap: testwiki to php-1.32.0-wmf.20 (duration: 35m 46s) [15:07:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:29] (03PS1) 10Muehlenhoff: Enable ferm for role::analytics_cluster::hadoop::client [puppet] - 10https://gerrit.wikimedia.org/r/461143 [15:09:22] 10Operations, 10DBA, 10Wikidata, 10Wikidata-Query-Service, and 4 others: Wikidata produces a lot of failed requests for recentchanges API - https://phabricator.wikimedia.org/T202764 (10Marostegui) @Smalyshev have you noticed any improvements since the above comment was done, and the index is gone from ever... 
[15:17:15] !log hashar@deploy1001 Started scap: Sync again testwiki to php-1.32.0-wmf.20, I might have screwed up l10ncache [15:17:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:43] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/12493/" [puppet] - 10https://gerrit.wikimedia.org/r/461143 (owner: 10Muehlenhoff) [15:20:13] !log hashar scap for testwiki is actually php-1.32.0-wmf.22 [15:20:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:47] !log hashar@deploy1001 Finished scap: Sync again testwiki to php-1.32.0-wmf.20, I might have screwed up l10ncache (duration: 06m 31s) [15:23:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:19] !log add cumin2001 to labs-in4 on cr1/2-eqiad [15:26:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:09] (03PS2) 10Herron: mx: enable gnutls %SERVER_PRECEDENCE in exim [puppet] - 10https://gerrit.wikimedia.org/r/460961 (https://phabricator.wikimedia.org/T203260) [15:28:32] (03CR) 10Herron: [C: 032] mx: enable gnutls %SERVER_PRECEDENCE in exim [puppet] - 10https://gerrit.wikimedia.org/r/460961 (https://phabricator.wikimedia.org/T203260) (owner: 10Herron) [15:31:04] 10Operations, 10SRE-Access-Requests, 10Discovery-Search (Current work), 10Patch-For-Review: Onboarding Mathew Onipe - https://phabricator.wikimedia.org/T202708 (10Gehel) @Mathew.onipe has access to the elastic and wdqs clusters, which is what we need at the moment. We'll reopen specific tasks for specific... 
[15:31:33] 10Operations, 10netops: Enable cumin2001 in router ACLs - https://phabricator.wikimedia.org/T204730 (10MoritzMuehlenhoff) [15:35:56] (03CR) 10Subramanya Sastry: [C: 031] Set $wgSiteMatrixNonGlobalSites global for SiteMatrix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460034 (owner: 10Arlolra) [15:41:57] (03PS2) 10BBlack: authdns-local-update: gdnsd-3.x compat [puppet] - 10https://gerrit.wikimedia.org/r/460937 [15:43:50] (03CR) 10BBlack: [C: 032] authdns-local-update: gdnsd-3.x compat [puppet] - 10https://gerrit.wikimedia.org/r/460937 (owner: 10BBlack) [15:46:21] !log elukey@deploy1001 Started deploy [analytics/refinery@1a6235a]: Fix cron scrips from the NYC offsite [15:46:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:26] !log installing spice security updates [15:47:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:11] (03PS1) 10RobH: adding contint-roots to releases servers with sudo rights [puppet] - 10https://gerrit.wikimedia.org/r/461148 (https://phabricator.wikimedia.org/T201470) [15:51:18] I am 99% sure that patchset does what I need (add contint-roots to release systems) [15:51:23] anyone wanna review and +1? [15:51:50] robh: having a look [15:51:55] thanks! [15:52:05] PROBLEM - HHVM rendering on mwdebug1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:52:20] group already exists and was approved to be added, so i appended to the hiera data for the releases role (which is only applied to the two releases servers) [15:53:02] (03CR) 10Muehlenhoff: [C: 031] adding contint-roots to releases servers with sudo rights [puppet] - 10https://gerrit.wikimedia.org/r/461148 (https://phabricator.wikimedia.org/T201470) (owner: 10RobH) [15:53:04] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb={PATCH,PUT} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:53:13] thx!! 
[15:53:25] (03CR) 10RobH: [C: 032] adding contint-roots to releases servers with sudo rights [puppet] - 10https://gerrit.wikimedia.org/r/461148 (https://phabricator.wikimedia.org/T201470) (owner: 10RobH) [15:53:37] !log add cumin2001 to mr* security policies - T204730 [15:53:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:44] T204730: Enable cumin2001 in router ACLs - https://phabricator.wikimedia.org/T204730 [15:54:05] RECOVERY - HHVM rendering on mwdebug1002 is OK: HTTP OK: HTTP/1.1 200 OK - 75384 bytes in 0.135 second response time [15:55:35] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): Add contint-roots to releases{1,2}001 - https://phabricator.wikimedia.org/T201470 (10RobH) 05stalled>03Resolved a:05RobH>03None Ok, with @mark's approval I've gone ahead and merged a patchset... [15:55:53] !log elukey@deploy1001 Finished deploy [analytics/refinery@1a6235a]: Fix cron scrips from the NYC offsite (duration: 09m 32s) [15:55:55] PROBLEM - etcd request latencies on neon is CRITICAL: instance=10.64.0.40:6443 operation={compareAndSwap,get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:55:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:05] RECOVERY - etcd request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:59:34] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:00:04] godog and _joe_: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Puppet SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180918T1600). [16:00:05] thcipriani: A patch you scheduled for Puppet SWAT(Max 6 patches) is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:12] o/ [16:01:19] my puppet swat changes: couple of small html/js changes for adding CoC in footer of gerrit [16:02:30] !log installing policykit-1 security updates on jessie [16:02:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:47] I wonder thcipriani if we should do the pg change too. (it has several commits for that) https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/458523/ https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/458833/ and https://gerrit.wikimedia.org/r/#/c/operations/software/gerrit/+/458524/ [16:04:24] paladox: yep, those are the ones I scheduled [16:04:31] polygerrit and old-ui [16:04:33] ah ok :) [16:04:59] figured I do it all at once [16:05:18] :) [16:07:53] o/ thcipriani [16:07:57] jouncebot: now [16:07:57] For the next 0 hour(s) and 52 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180918T1600) [16:08:17] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install mwmaint1002.eqiad.wmnet - https://phabricator.wikimedia.org/T201343 (10Dzahn) I'll take this. We should do it while eqiad is still not active, that will make it a lot easier than doing it later after the switch back. [16:08:21] \o addshore [16:08:41] * addshore is going to prepare a backport for https://phabricator.wikimedia.org/T204729 [16:09:13] addshore and anomie thanks a lot [16:10:18] * addshore waits for CI.... [16:11:28] https://upload.wikimedia.org/wikipedia/commons/thumb/7/7b/Basilica_Santa_Maria_della_Salute_Dorsoduro_Venezia.jpg/3000px-Basilica_Santa_Maria_della_Salute_Dorsoduro_Venezia.jpg [16:11:50] any reason we can get such a thumbnail? 
[16:11:55] *can't [16:12:17] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install mwmaint1002.eqiad.wmnet - https://phabricator.wikimedia.org/T201343 (10Dzahn) a:05RobH>03Dzahn [16:13:34] jouncebot: next [16:13:34] In 0 hour(s) and 46 minute(s): Services – Graphoid / Parsoid / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180918T1700) [16:16:12] this is a big file: 10,405 × 11,267 pixels [16:16:40] but yet, is there a limitation for JPEG thumbs? [16:19:20] 10Operations, 10SRE-Access-Requests, 10Discovery-Search (Current work), 10Patch-For-Review: Onboarding Mathew Onipe - https://phabricator.wikimedia.org/T202708 (10Dzahn) @Mathew.onipe Let's meet on IRC and finish the Icinga part together if you like. [16:19:30] it seems there is no hard limit on the image dimensions, but it could still be failing due to exceeding memory or time limits while trying to thumbnail it [16:20:21] <_joe_> MatmaRex: nah I don't think that's the problem, it fails too fast at any resolution [16:20:41] right now it's also failing with a 429 (rate limit exceeded) [16:20:49] so we all probably refreshed that page too many times [16:20:56] <_joe_> yes [16:20:57] <_joe_> eheh [16:21:15] <_joe_> the file is huge btw [16:21:36] <_joe_> I wouldn't be surprised if our infrastructure times out trying to render it [16:21:41] 10Operations, 10Commons, 10MediaWiki-Database, 10Multimedia, and 4 others: Storage backend errors on commons when deleting/restoring pages - https://phabricator.wikimedia.org/T141704 (10Krinkle) @MusikAnimal Was that message shown within the wiki interface? I would expect that to correlate with an "ERROR"-... 
[16:21:53] 10Operations, 10Commons, 10Multimedia, 10media-storage, and 2 others: Storage backend errors on commons when deleting/restoring pages - https://phabricator.wikimedia.org/T141704 (10Krinkle) [16:22:50] !log delete `term cumin` from cr1/2-eqiad analytics filter (already permitted by established-tcp term) [16:22:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:59] <_joe_> it took my google chrome browser a couple of minutes to resize it to 600px width to show it to me [16:23:25] 10Operations, 10netops: Enable cumin2001 in router ACLs - https://phabricator.wikimedia.org/T204730 (10ayounsi) 05Open>03Resolved a:03ayounsi Done! [16:25:33] * addshore will be doing 2 small mediawiki syncs for https://phabricator.wikimedia.org/T204729 in the next 15 mins [16:26:33] thanks :-) [16:27:12] addshore@mwlog2001:~$ fatalmonitor [16:27:16] ^^ that doesn't work? O_o [16:27:40] I guess i should still be using mwlog1001? [16:28:05] PROBLEM - Check size of conntrack table on mwmaint2001 is CRITICAL: Return code of 255 is out of bounds [16:28:15] PROBLEM - Disk space on mwmaint2001 is CRITICAL: Return code of 255 is out of bounds [16:28:24] PROBLEM - configured eth on mwmaint2001 is CRITICAL: Return code of 255 is out of bounds [16:28:24] PROBLEM - nutcracker port on mwmaint2001 is CRITICAL: Return code of 255 is out of bounds [16:28:34] PROBLEM - dhclient process on mwmaint2001 is CRITICAL: Return code of 255 is out of bounds [16:28:35] PROBLEM - Check systemd state on mwmaint2001 is CRITICAL: Return code of 255 is out of bounds [16:28:44] PROBLEM - MD RAID on mwmaint2001 is CRITICAL: Return code of 255 is out of bounds [16:28:55] PROBLEM - mcrouter process on mwmaint2001 is CRITICAL: Return code of 255 is out of bounds [16:28:55] PROBLEM - nutcracker process on mwmaint2001 is CRITICAL: Return code of 255 is out of bounds [16:29:04] PROBLEM - Check whether ferm is active by checking the default input chain on mwmaint2001 is CRITICAL: Return 
code of 255 is out of bounds [16:29:05] PROBLEM - DPKG on mwmaint2001 is CRITICAL: Return code of 255 is out of bounds [16:29:35] PROBLEM - puppet last run on mwmaint2001 is CRITICAL: Return code of 255 is out of bounds [16:30:03] ^ wfm [16:32:15] RECOVERY - DPKG on mwmaint2001 is OK: All packages OK [16:32:16] !log mwmaint2001 - starting nagios-nrpe-server [16:32:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:24] RECOVERY - Check size of conntrack table on mwmaint2001 is OK: OK: nf_conntrack is 1 % full [16:32:34] !log radon - re-enabled disabled puppet without reason (decom) T202040 [16:32:34] RECOVERY - Disk space on mwmaint2001 is OK: DISK OK [16:32:35] RECOVERY - nutcracker port on mwmaint2001 is OK: TCP OK - 0.001 second response time on 127.0.0.1 port 11212 [16:32:35] RECOVERY - configured eth on mwmaint2001 is OK: OK - interfaces up [16:32:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:41] T202040: Decommission radon - https://phabricator.wikimedia.org/T202040 [16:32:45] RECOVERY - dhclient process on mwmaint2001 is OK: PROCS OK: 0 processes with command name dhclient [16:32:47] gilles: it looks like you have a merged but undeployed change on mw wmf.20 ? [16:32:53] * addshore looks at the SAL... 
[16:32:55] RECOVERY - MD RAID on mwmaint2001 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [16:33:15] RECOVERY - mcrouter process on mwmaint2001 is OK: PROCS OK: 1 process with UID = 114 (mcrouter), command name mcrouter [16:33:15] RECOVERY - nutcracker process on mwmaint2001 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker [16:33:15] RECOVERY - Check whether ferm is active by checking the default input chain on mwmaint2001 is OK: OK ferm input default policy is set [16:33:45] 10Operations, 10wikidiff2, 10Patch-For-Review: Create releasers-wikidiff2 group, split from releasers-mediawiki - https://phabricator.wikimedia.org/T202473 (10RobH) [16:33:49] !log mwmaint2001 - nagios-nrpe-server had status 'failed' and caused all NRPE Icinga checks to fail but recovered after simply starting it again [16:33:50] 10Operations, 10SRE-Access-Requests, 10wikidiff2, 10Patch-For-Review, 10User-Addshore: Give thiemowmde permission to upload wikidiff2 releases (releasers-wikidiff2) - https://phabricator.wikimedia.org/T202476 (10RobH) 05Open>03Resolved This has sat pending @thiemowmde's acknowledgement of access. Ho... [16:33:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:58] !log delete `filter common-infrastructure4` on cr1/2-eqiad, unused/obsolete after T198623 [16:34:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:06] T198623: Review analytics-in4/6 rules on cr1/cr2 eqiad - https://phabricator.wikimedia.org/T198623 [16:34:44] RECOVERY - puppet last run on mwmaint2001 is OK: OK: Puppet is currently enabled, last run 12 minutes ago with 0 failures [16:34:48] thcipriani: mind helping me figure out what to do with this undeployed change on .20? 
:) [16:34:56] * thcipriani looks [16:35:21] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to restricted production access and analytics-privatedata-users for Kalliope Tsouroupidou - https://phabricator.wikimedia.org/T202486 (10RobH) 05Open>03Resolved a:05ayounsi>03Kalliope @kalliope, This should be all set for you... [16:35:55] thcipriani: also on .20 .. i guess they use the same wmf_deploy branch thing [16:35:59] *.22 [16:37:46] addshore: you're talking about the CentralNotice change? [16:37:55] yup [16:39:02] it looks harmless. I'm not sure why it was merged to wmf_deploy outside of a window. AndyRussG is https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/CentralNotice/+/461151/ deployed anywhere? [16:39:41] thcipriani: okay, I'll just continue and forget I saw it [16:40:14] I just know if I move forward and then don't sync that dir that the next person that does a full scap or syncs it will get it without noticing... [16:40:15] addshore: in the interim, just rebase and sync your change and we'll get to the bottom of the centralnotice change in parallel [16:40:22] thcipriani: ack [16:40:31] indeed, good looking out [16:40:43] always better to make noise [16:40:50] yes :) [16:42:57] syncing :) [16:42:59] marostegui: ^^ [16:43:36] thcipriani: arrg I didn't notice the train is earlier now [16:43:51] !log addshore@deploy1001 Synchronized php-1.32.0-wmf.20/includes/watcheditem/WatchedItemStore.php: [[gerrit:461154|WatchedItemStore::countVisitingWatchersMultiple() fix]] T204729 (duration: 00m 59s) [16:43:52] addshore: ^ [16:43:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:59] was hoping to put that on today's train [16:44:06] but I can put it on a SWAT later instead [16:44:14] marostegui: that should be enwiki patched [16:44:17] apologies for the bother [16:44:21] I'll do the other branch now [16:44:33] AndyRussG: also remain mindful that because centralnotice isn't branched like the other 
extensions (i.e., uses a wmf_deploy branch) it shows up in all live branches [16:44:46] thcipriani: yep [16:45:00] that's why I usually try to do those merges before the branch is cut [16:45:07] addshore: ok, let me remove the killers and we can see where we are at [16:45:12] sorry for the mess [16:45:40] !log addshore@deploy1001 Synchronized php-1.32.0-wmf.22/includes/watcheditem/WatchedItemStore.php: [[gerrit:461155|WatchedItemStore::countVisitingWatchersMultiple() fix]] T204729 (duration: 00m 57s) [16:45:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:58] (03PS3) 10Dzahn: parsoid: role/profile refactoring [puppet] - 10https://gerrit.wikimedia.org/r/460605 (https://phabricator.wikimedia.org/T201366) [16:46:22] addshore: query killers stopped [16:46:25] let's monitor the graphs [16:46:30] marostegui: ack [16:46:51] I immediately noticed a drop in exception / hhvm .log for mediawiki which was a good sign [16:48:01] marostegui: your looking at https://grafana.wikimedia.org/dashboard/db/mysql?panelId=3&fullscreen&orgId=1&var-dc=codfw%20prometheus%2Fops&var-server=db2088&var-port=13311&from=now-3h&to=now ? [16:48:18] AndyRussG: I'm going to revert for the time being and fetch that down to the deployment machines just to ensure consistency and so the next scap sync doesn't deploy something that isn't intended. [16:48:39] addshore: yep, and also for db2085 too, and looking good! [16:48:44] marostegui: woo! [16:48:54] marostegui: I'll let you finish up the ticket etc. [16:49:09] addshore: thanks a lot :) [16:49:25] 10Operations, 10Commons, 10Multimedia, 10media-storage, and 2 others: Storage backend errors on commons when deleting/restoring pages - https://phabricator.wikimedia.org/T141704 (10MusikAnimal) >>! In T141704#4594564, @Krinkle wrote: > @MusikAnimal Was that message shown within the wiki interface? I would... [16:49:28] thcipriani: ok for sure, thx and apologies again! 
[16:49:44] no worries :) [16:49:48] ;) [16:53:24] 10Operations, 10Commons, 10Multimedia, 10media-storage, and 2 others: Unable to delete certain files due to "inconsistent state within the internal storage backends" - https://phabricator.wikimedia.org/T141704 (10Krinkle) [16:53:24] RECOVERY - Check systemd state on mwmaint2001 is OK: OK - running: The system is fully operational [16:53:39] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 3 others: Unable to delete certain files due to "inconsistent state within the internal storage backends" - https://phabricator.wikimedia.org/T141704 (10Krinkle) [17:00:04] cscott, arlolra, subbu, halfak, and Amir1: That opportune time is upon us again. Time for a Services – Graphoid / Parsoid / Citoid / ORES deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180918T1700). [17:01:52] thcipriani: the test failure there is a flapping test [17:02:16] it's been around for a while but somehow has just gotten more frequent for some reason [17:02:27] pls don't worry about it [17:02:56] (https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/CentralNotice/+/461163/) [17:05:07] okie doke, I'll re +2 and see if it goes away [17:05:44] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on einsteinium is CRITICAL: 52.02 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:06:15] addshore: I got a notification from you about an undeployed change but it seems like my bouncer swallowed some of the history [17:06:50] addshore: did you figure it out? looks like a CN change AndyRussG told me he would deploy as part of this week's train [17:07:06] gilles: I goofed... 
[17:07:17] gilles: we got it figured out, just reverting for now, will SWAT later [17:07:35] the train schedule changed and I tried to deploy it too late [17:08:55] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on einsteinium is OK: (C)60 le (W)70 le 77.13 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:09:08] alright, glad you guys figure it out :) thanks for taking care of this, AndyRussG [17:09:13] figured [17:09:32] (03PS1) 10Dzahn: icinga: give privs to run commands to Matt Onipe [puppet] - 10https://gerrit.wikimedia.org/r/461166 (https://phabricator.wikimedia.org/T202708) [17:10:59] gilles: no worries, apologies for the confusion just now [17:18:34] AndyRussG: all good now, FYI: fetched and staged revert [17:18:58] thcipriani: okok :) [17:19:40] (03CR) 10Dzahn: [C: 032] parsoid: role/profile refactoring [puppet] - 10https://gerrit.wikimedia.org/r/460605 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn) [17:23:23] (03CR) 10Dzahn: [C: 032] "no change on wtp1025, ruthenium only motd change confirmed" [puppet] - 10https://gerrit.wikimedia.org/r/460605 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn) [17:23:44] (03CR) 10Dzahn: "this is now unblocked after https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/460605/" [puppet] - 10https://gerrit.wikimedia.org/r/454443 (https://phabricator.wikimedia.org/T201366) (owner: 10RobH) [17:23:57] (03PS2) 10Dzahn: pushing scandium into parsoid test service [puppet] - 10https://gerrit.wikimedia.org/r/454443 (https://phabricator.wikimedia.org/T201366) (owner: 10RobH) [17:24:42] (03CR) 10jerkins-bot: [V: 04-1] pushing scandium into parsoid test service [puppet] - 10https://gerrit.wikimedia.org/r/454443 (https://phabricator.wikimedia.org/T201366) (owner: 10RobH) [17:26:17] (03PS3) 10Dzahn: pushing scandium into parsoid test service [puppet] - 10https://gerrit.wikimedia.org/r/454443 (https://phabricator.wikimedia.org/T201366) (owner: 
10RobH) [17:27:22] (03PS1) 10Volans: CLI: improve help message [software/cumin] - 10https://gerrit.wikimedia.org/r/461168 (https://phabricator.wikimedia.org/T204680) [17:27:26] (03CR) 10Dzahn: "fixed.. see how it gets +1 now and just a single role is used.. going ahead" [puppet] - 10https://gerrit.wikimedia.org/r/454443 (https://phabricator.wikimedia.org/T201366) (owner: 10RobH) [17:27:50] (03CR) 10Dzahn: [C: 032] pushing scandium into parsoid test service [puppet] - 10https://gerrit.wikimedia.org/r/454443 (https://phabricator.wikimedia.org/T201366) (owner: 10RobH) [17:29:19] !log scandium - move from role(spare) to role(parsoid_testing), making it equal to ruthenium (T201366) [17:29:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:28] T201366: rack/setup/install scandium.eqiad.wmnet (parsoid test box) - https://phabricator.wikimedia.org/T201366 [17:31:25] (03CR) 10Dzahn: "merged parsoid_testing role refactoring change. ruthenium does _not_ include the role(test) anymore now. instead it directly includes stan" [puppet] - 10https://gerrit.wikimedia.org/r/460064 (owner: 10Dzahn) [17:35:04] 10Operations, 10DBA, 10JADE, 10Scoring-platform-team (Current), 10User-Joe: Write our anticipated "phase two" schemas and submit for review - https://phabricator.wikimedia.org/T202596 (10awight) See also {T204250}. [17:37:16] (03PS1) 10Dzahn: parsoid::testing: move Hiera values from host to role level [puppet] - 10https://gerrit.wikimedia.org/r/461169 (https://phabricator.wikimedia.org/T201366) [17:37:58] PROBLEM - puppet last run on scandium is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 4 minutes ago with 4 failures. Failed resources (up to 3 shown): Package[npm],Package[uprightdiff],File[/etc/my.cnf],File[/srv/deployment/parsoid/deploy] [17:40:07] did somebody disable notifications for scandium a minute ago? 
just wondering where that came from [17:41:00] ACKNOWLEDGEMENT - puppet last run on scandium is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 4 minutes ago with 4 failures. Failed resources (up to 3 shown): Package[npm],Package[uprightdiff],File[/etc/my.cnf],File[/srv/deployment/parsoid/deploy] daniel_zahn new host, WIP [17:41:43] (03CR) 10Dzahn: [C: 032] parsoid::testing: move Hiera values from host to role level [puppet] - 10https://gerrit.wikimedia.org/r/461169 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn) [17:42:39] (03PS1) 10Ayounsi: Allow mgmt hosts to send syslog to central syslog servers [puppet] - 10https://gerrit.wikimedia.org/r/461170 [17:42:58] (03PS1) 10BBlack: prometheus-gdnsd-stats: add gdnsd-3.x support [puppet] - 10https://gerrit.wikimedia.org/r/461171 [17:44:30] (03CR) 10Dzahn: [C: 032] "this created all the shell users on scandium" [puppet] - 10https://gerrit.wikimedia.org/r/461169 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn) [17:45:28] PROBLEM - parsoid on scandium is CRITICAL: connect to address 10.64.48.94 and port 8142: Connection refused [17:46:31] 10Operations, 10Cleanup, 10Gerrit, 10GitHub-Mirrors, and 7 others: Archive mediawiki/extensions/Collection/OfflineContentGenerator and all OCG-related repos - https://phabricator.wikimedia.org/T183891 (10MarcoAurelio) [17:46:39] 10Operations, 10ops-eqiad, 10Operations-Software-Development: rack/setup/install clustermgmt1001.eqiad.wmnet (new cumin master) - https://phabricator.wikimedia.org/T201346 (10Volans) a:05RobH>03Volans I'm taking over this to use it for a live-session with the new hires. I'll take care of fixing the mgmt... 
[17:50:18] (03CR) 10BBlack: [C: 032] prometheus-gdnsd-stats: add gdnsd-3.x support [puppet] - 10https://gerrit.wikimedia.org/r/461171 (owner: 10BBlack) [17:51:19] 10Operations, 10Parsoid, 10Patch-For-Review: rack/setup/install scandium.eqiad.wmnet (parsoid test box) - https://phabricator.wikimedia.org/T201366 (10Dzahn) @ssastry @RobH I have fixed the issue with multiple roles being applied on ruthenium by refactoring the puppet code. Now there is just "parsoid_test... [17:56:44] 10Operations, 10Parsoid, 10Patch-For-Review: rack/setup/install scandium.eqiad.wmnet (parsoid test box) - https://phabricator.wikimedia.org/T201366 (10Dzahn) @Muehlenhoff How are chances to get npm and uprightdiff packages on stretch? @RobH It might have to be reinstalled with jessie (for now). @ssastry D... [17:57:40] 10Operations, 10ops-eqiad, 10netops: Ensure scs-c1-eqiad:eth1 is not connected - https://phabricator.wikimedia.org/T204743 (10ayounsi) p:05Triage>03Low [18:04:36] !log ppchelko@deploy1001 Started deploy [restbase/deploy@55100d4]: Remove restrictions table T203835, Split 25% PDF traffic to Proton T186748, Metrics endpoints improvements [18:04:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:49] T186748: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 [18:07:02] 10Operations, 10Parsoid, 10Patch-For-Review: rack/setup/install scandium.eqiad.wmnet (parsoid test box) - https://phabricator.wikimedia.org/T201366 (10Arlolra) > Aware of not having npm in stretch? 
The nodejs package should include the npm bin [18:08:01] 10Operations, 10Parsoid, 10Patch-For-Review: rack/setup/install scandium.eqiad.wmnet (parsoid test box) - https://phabricator.wikimedia.org/T201366 (10Dzahn) https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=857986 [18:08:13] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@55100d4]: Remove restrictions table T203835, Split 25% PDF traffic to Proton T186748, Metrics endpoints improvements (duration: 03m 37s) [18:08:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:37] jouncebot: !next [18:08:52] Eh, ruddy commands. [18:09:36] jouncebot: next [18:09:36] In 0 hour(s) and 50 minute(s): MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180918T1900) [18:10:06] So… nothing for five hours? [18:12:51] * Krinkle is messing on mwdebug1001 [18:13:31] * Krinkle is now messing on mwdebug2001 [18:22:09] !log ppchelko@deploy1001 Started deploy [restbase/deploy@4c3128f]: Remove restrictions table T203835, Split 25% PDF traffic to Proton T186748, Metrics endpoints improvements, don't monitor PDF endpoints [18:22:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:18] T186748: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 [18:27:40] (03PS2) 10Dzahn: icinga: give privs to run commands to Matt Onipe [puppet] - 10https://gerrit.wikimedia.org/r/461166 (https://phabricator.wikimedia.org/T202708) [18:28:13] 10Operations, 10Parsoid, 10Patch-For-Review: rack/setup/install scandium.eqiad.wmnet (parsoid test box) - https://phabricator.wikimedia.org/T201366 (10Arlolra) Ok :/ [18:29:37] (03CR) 10Dzahn: [C: 032] icinga: give privs to run commands to Matt Onipe [puppet] - 10https://gerrit.wikimedia.org/r/461166 (https://phabricator.wikimedia.org/T202708) (owner: 10Dzahn) [18:31:04] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@4c3128f]: Remove restrictions 
table T203835, Split 25% PDF traffic to Proton T186748, Metrics endpoints improvements, don't monitor PDF endpoints (duration: 08m 55s) [18:31:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:12] T186748: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 [18:32:52] !log ppchelko@deploy1001 Started deploy [restbase/deploy@4c3128f]: Remove restrictions table T203835, Split 25% PDF traffic to Proton T186748, Metrics endpoints improvements, don't monitor PDF endpoints, take 2, feed timing out [18:32:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:39] 10Operations, 10Performance-Team, 10Traffic, 10Wikimedia-General-or-Unknown, and 2 others: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (10Imarlier) @Nemo_bis Regarding your comment above, about Google not respectin... [18:37:49] 10Operations, 10SRE-Access-Requests, 10Discovery-Search (Current work), 10Patch-For-Review: Onboarding Mathew Onipe - https://phabricator.wikimedia.org/T202708 (10Dzahn) [18:37:59] (03PS1) 10Andrew Bogott: firewall: add new eqiad1-r bastions for VM ssh access [puppet] - 10https://gerrit.wikimedia.org/r/461180 [18:40:28] (03CR) 10Paladox: [C: 031] firewall: add new eqiad1-r bastions for VM ssh access [puppet] - 10https://gerrit.wikimedia.org/r/461180 (owner: 10Andrew Bogott) [18:40:49] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@4c3128f]: Remove restrictions table T203835, Split 25% PDF traffic to Proton T186748, Metrics endpoints improvements, don't monitor PDF endpoints, take 2, feed timing out (duration: 07m 58s) [18:40:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:57] T186748: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 [18:41:08] (03CR) 10Andrew Bogott: [C: 032] firewall: add new eqiad1-r 
bastions for VM ssh access [puppet] - 10https://gerrit.wikimedia.org/r/461180 (owner: 10Andrew Bogott) [18:41:32] 10Operations, 10Performance-Team, 10Traffic, 10Wikimedia-General-or-Unknown, and 2 others: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (10Imarlier) @kaldari The >500,000 number is based on Google Search Console, wh... [18:43:40] !log ppchelko@deploy1001 Started deploy [restbase/deploy@4c3128f]: Remove restrictions table T203835, Split 25% PDF traffic to Proton T186748, Metrics endpoints improvements, don't monitor PDF endpoints, take 2, feed timing out [18:43:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:13] 10Operations, 10SRE-Access-Requests, 10Discovery-Search (Current work), 10Patch-For-Review: Onboarding Mathew Onipe - https://phabricator.wikimedia.org/T202708 (10Dzahn) 05Open>03Resolved To the best of my knowledge this is resolved now. We went through the Icinga part and added permissions and confir... [18:49:54] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@4c3128f]: Remove restrictions table T203835, Split 25% PDF traffic to Proton T186748, Metrics endpoints improvements, don't monitor PDF endpoints, take 2, feed timing out (duration: 06m 14s) [18:50:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:03] T186748: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 [18:55:47] 10Operations, 10Electron-PDFs, 10Proton, 10Patch-For-Review, and 3 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 (10Pchelolo) I've deployed a change that splits traffic in RESTBase and also sends requests to Proton (25% of traffic) Also, for testing, you c... 
[18:59:39] 10Operations, 10Parsoid, 10Patch-For-Review: rack/setup/install scandium.eqiad.wmnet (parsoid test box) - https://phabricator.wikimedia.org/T201366 (10Dzahn) We should try putting this role on a cloud VPS and then manually install the jessie packages on stretch. As pointed out by Moritz this might be a valid... [19:00:04] Deploy window MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180918T1900) [19:03:07] !sal [19:03:07] https://wikitech.wikimedia.org/wiki/Server_Admin_Log https://tools.wmflabs.org/sal/production See it and you will know all you need. [19:03:15] doing group0 on US time [19:03:25] took me too long to cut branches this afternoon [19:05:47] (03CR) 10Hashar: [C: 032] Group0 to 1.32.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461140 (owner: 10Hashar) [19:07:06] (03Merged) 10jenkins-bot: Group0 to 1.32.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461140 (owner: 10Hashar) [19:09:08] (03CR) 10jenkins-bot: Group0 to 1.32.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461140 (owner: 10Hashar) [19:10:13] (03PS1) 10Krinkle: tests: Remove obsolete exceptions from allDblist test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461191 [19:10:54] 10Operations, 10Cloud-VPS (Project-requests): Request removal of puppet3-diffs VPS project - https://phabricator.wikimedia.org/T204532 (10Dzahn) added Andrew, i know he is usually looking for projects that can be deleted to get resources back. 
[19:11:47] !log ppchelko@deploy1001 Started deploy [restbase/deploy@4c3128f] (dev-cluster): Remove restrictions table T203835, Split 25% PDF traffic to Proton T186748, Metrics endpoints improvements, don't monitor PDF endpoints, take 2, feed timing out [19:11:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:55] T186748: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 [19:18:18] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@4c3128f] (dev-cluster): Remove restrictions table T203835, Split 25% PDF traffic to Proton T186748, Metrics endpoints improvements, don't monitor PDF endpoints, take 2, feed timing out (duration: 06m 32s) [19:18:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:27] T186748: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 [19:20:00] !log hashar@deploy1001 rebuilt and synchronized wikiversions files: group0 to 1.32.0-wmf.22 [19:20:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:35] (03PS1) 10Herron: mx: remove local ip dns lookup and wiki-mail.wikimedia.org default [puppet] - 10https://gerrit.wikimedia.org/r/461193 [19:25:02] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler1002/12494/" [puppet] - 10https://gerrit.wikimedia.org/r/461193 (owner: 10Herron) [19:26:08] (03CR) 10Krinkle: [C: 032] "Yay, that was easy." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/461191 (owner: 10Krinkle) [19:27:24] (03Merged) 10jenkins-bot: tests: Remove obsolete exceptions from allDblist test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461191 (owner: 10Krinkle) [19:31:27] (03CR) 10Herron: "> Still appears used by wikimedia/puppet – exim4.conf.mx.erb:" [dns] - 10https://gerrit.wikimedia.org/r/143762 (owner: 10Faidon Liambotis) [19:32:25] 10Operations, 10Datacenter-Switchover-2018, 10Patch-For-Review: Add "do not use this server" login message to non active mwmaint* server - https://phabricator.wikimedia.org/T204604 (10Dzahn) 05Open>03Resolved please feel free to reopen if you think it also needs to have the server name and/or have a sugg... [19:33:23] 10Operations, 10Datacenter-Switchover-2018: Add "do not use this server" login message to non active mwmaint* server - https://phabricator.wikimedia.org/T204604 (10Dzahn) [19:33:59] 10Operations, 10Datacenter-Switchover-2018: Add "do not use this server" login message to non active mwmaint* server - https://phabricator.wikimedia.org/T204604 (10Dzahn) 05Resolved>03Open [19:34:14] so [19:34:17] https://www.mediawiki.org/wiki/Special:OAuthListConsumers fatals out :\ [19:34:23] 10Operations, 10Datacenter-Switchover-2018: Add "do not use this server" login message to non active mwmaint* server - https://phabricator.wikimedia.org/T204604 (10Dzahn) 05Open>03Resolved [19:34:30] that is T204757 [19:34:31] T204757: [OAuth] PHP Fatal Error: Call to undefined method MediaWiki\Extensions\OAuth\MWOAuthDAOAccessControl::getForHtml() - https://phabricator.wikimedia.org/T204757 [19:38:35] (03CR) 10jenkins-bot: tests: Remove obsolete exceptions from allDblist test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461191 (owner: 10Krinkle) [19:40:39] hashar: https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/OAuth/+/461196 [19:40:43] wait [19:40:57] HOW THE HELL DO DEVELOPERS HAVE A PATCH BEFORE I EVEN HAD TIME TO COMPLETE THE TASK!!!!!!!! 
[19:40:59] :D [19:41:53] tgr: note that previous that line finished with a call to ->text() and now that is ->escaped() , so maybe there are too many escapes? [19:43:00] the escape is for the raw parameter [19:43:18] !log uploaded gdnsd-2.99.9-beta1-1+wmf1 to reprepro for stretch-wikimedia [19:43:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:19] tgr: :] [19:44:59] My bad, i reviewed the original change :S. Sorry folks [19:45:07] it happens :] [19:45:12] (03PS1) 10Dzahn: tor: class to extract fingerprints of multiple relays (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/461197 [19:45:35] what really matter is that a couple minute after I filled the task / noticed the issue, I already have a patch and two person showing up ! [19:46:46] otherwise, train looks fine so far [19:46:55] (03CR) 10Dzahn: [C: 04-2] "nah.. need something more like https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/461197/ because this would still be manual and we ne" [puppet] - 10https://gerrit.wikimedia.org/r/459876 (owner: 10Dzahn) [19:47:02] (03PS1) 10Andrew Bogott: mwopenstackclients: fix the 'allregions' case of allinstances() [puppet] - 10https://gerrit.wikimedia.org/r/461198 [19:48:29] (03CR) 10Andrew Bogott: [C: 032] mwopenstackclients: fix the 'allregions' case of allinstances() [puppet] - 10https://gerrit.wikimedia.org/r/461198 (owner: 10Andrew Bogott) [19:50:20] (03CR) 10Dzahn: "We could parse the fingerprint for each relay out of the config files, like this POC/WIP: https://gerrit.wikimedia.org/r/#/c/operations/pu" [puppet] - 10https://gerrit.wikimedia.org/r/399972 (owner: 10Faidon Liambotis) [19:50:21] (03CR) 10Jforrester: [C: 031] Set MCR migration to write-both/read-new on testwiki. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/454534 (https://phabricator.wikimedia.org/T198309) (owner: 10Daniel Kinzler) [19:50:58] bawolff: tgr: I am backporting the oauth patch to wmf.22 , will deploy and resolve task [19:54:07] (03PS1) 10Andrew Bogott: novastats: make a bunch of these scripts handle multi-region cases [puppet] - 10https://gerrit.wikimedia.org/r/461201 [19:55:24] !log ppchelko@deploy1001 Started deploy [cpjobqueue/deploy@58f9ed3]: Log errors for HTTP error T203929 [19:55:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:31] T203929: cpjobqueue should log a warning when there is an HTTP error - https://phabricator.wikimedia.org/T203929 [19:56:13] !log ppchelko@deploy1001 Finished deploy [cpjobqueue/deploy@58f9ed3]: Log errors for HTTP error T203929 (duration: 00m 49s) [19:56:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:59] (03CR) 10Andrew Bogott: [C: 032] novastats: make a bunch of these scripts handle multi-region cases [puppet] - 10https://gerrit.wikimedia.org/r/461201 (owner: 10Andrew Bogott) [19:58:10] (03PS1) 10Dzahn: mwmaint1002: add prod DNS entries (v4) [dns] - 10https://gerrit.wikimedia.org/r/461202 (https://phabricator.wikimedia.org/T201343) [20:00:22] jouncebot: next [20:00:22] In 2 hour(s) and 59 minute(s): Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180918T2300) [20:01:34] train window still running? 
[20:04:47] (03PS1) 10Dzahn: DHCP: add mwmaint1002 MAC address [puppet] - 10https://gerrit.wikimedia.org/r/461204 (https://phabricator.wikimedia.org/T201343) [20:04:59] jouncebot: now [20:04:59] For the next 0 hour(s) and 55 minute(s): MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180918T1900) [20:05:23] i believe technically yes [20:05:34] hashar: ^ [20:05:34] !log hashar@deploy1001 Synchronized php-1.32.0-wmf.22/extensions/OAuth/frontend/specialpages/SpecialMWOAuthListConsumers.php: Fix escapeForHtml method name - T204757 (duration: 00m 58s) [20:05:40] oh, that answers it :) [20:05:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:43] T204757: [OAuth] PHP Fatal Error: Call to undefined method MediaWiki\Extensions\OAuth\MWOAuthDAOAccessControl::getForHtml() - https://phabricator.wikimedia.org/T204757 [20:06:18] still running, bugfixing [20:07:59] ok [20:12:17] 10Operations, 10Maps, 10Maps-Sprint, 10decommission: Decommission maps-test cluster - https://phabricator.wikimedia.org/T202898 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for maps-test2001.codfw.wmnet and performed the following actions: - Revoked Puppet certificate - Removed from Pu... [20:12:28] 10Operations, 10Maps, 10Maps-Sprint, 10decommission: Decommission maps-test cluster - https://phabricator.wikimedia.org/T202898 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for maps-test2002.codfw.wmnet and performed the following actions: - Revoked Puppet certificate - Removed from Pu... [20:12:41] 10Operations, 10Maps, 10Maps-Sprint, 10decommission: Decommission maps-test cluster - https://phabricator.wikimedia.org/T202898 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for maps-test2003.codfw.wmnet and performed the following actions: - Revoked Puppet certificate - Removed from Pu... 
[20:12:47] (03PS1) 10Mathew.onipe: maps postgresql slow log settings [puppet] - 10https://gerrit.wikimedia.org/r/461206 (https://phabricator.wikimedia.org/T204106) [20:12:52] 10Operations, 10Maps, 10Maps-Sprint, 10decommission: Decommission maps-test cluster - https://phabricator.wikimedia.org/T202898 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for maps-test2004.codfw.wmnet and performed the following actions: - Revoked Puppet certificate - Removed from Pu... [20:19:07] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: rack/setup/install analytics-master100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T201939 (10elukey) ping :) [20:19:07] PROBLEM - Host elastic2004 is DOWN: PING CRITICAL - Packet loss = 100% [20:19:58] PROBLEM - Host elastic2006 is DOWN: PING CRITICAL - Packet loss = 100% [20:20:17] PROBLEM - Host elastic2005 is DOWN: PING CRITICAL - Packet loss = 100% [20:23:15] hu? ^ [20:24:04] 10Operations, 10Maps, 10Maps-Sprint, 10decommission: Decommission maps-test cluster - https://phabricator.wikimedia.org/T202898 (10RobH) a:05RobH>03Papaul All switch ports added to the disabled group. Once the disks are wiped and these are unracked, @papaul can delete the description off of each of th... [20:24:19] 10Operations, 10ops-codfw, 10Maps, 10Maps-Sprint, 10decommission: Decommission maps-test cluster - https://phabricator.wikimedia.org/T202898 (10RobH) [20:24:43] 10Operations, 10ops-codfw, 10Maps, 10Maps-Sprint, 10decommission: Decommission maps-test cluster - https://phabricator.wikimedia.org/T202898 (10RobH) [20:24:47] so trains looks fine [20:24:54] there are a few glitches but nothing worrying imho [20:25:53] 10Operations, 10DBA, 10Wikidata, 10Wikidata-Query-Service, and 4 others: Wikidata produces a lot of failed requests for recentchanges API - https://phabricator.wikimedia.org/T202764 (10Smalyshev) Looking at logstash: https://logstash.wikimedia.org/goto/39a6fe9edd787798129b66ae9d61ed90 there's definitely a... 
[20:26:00] hashar: In trying to do something I saw that trying to limit special:activeusers to one group fatals: https://www.mediawiki.org/wiki/Special:ActiveUsers [20:26:28] oh nice [20:26:31] https://www.mediawiki.org/wiki/Special:ActiveUsers?username=&groups%5B%5D=bot&wpFormIdentifier=specialactiveusers [20:26:36] select a group and search and you get: https://www.mediawiki.org/wiki/Special:ActiveUsers?username=&groups%5B%5D=sysop&wpFormIdentifier=specialactiveusers or https://test.wikipedia.org/wiki/Special:ActiveUsers?username=&groups%5B%5D=sysop&wpFormIdentifier=specialactiveusers with the full backtrace [20:26:52] yeah so [20:27:01] that is a new actor table [20:27:07] seems it is missing a few bits :\ [20:28:55] 10Operations, 10ops-codfw, 10Maps, 10Maps-Sprint, 10decommission: Decommission maps-test cluster - https://phabricator.wikimedia.org/T202898 (10RobH) a:05Papaul>03RobH [20:30:17] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [20:31:08] hashar: you or me for the bug filing? 
[20:34:25] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [20:35:25] greg-g: I have filled it [20:35:44] greg-g: in short Anomie is pushing a new change which is quite challenging to do right, and there is little hope beside testing it in prod [20:35:47] BUT [20:35:56] it is behind a feature flag, only enabled on group0 for now [20:37:14] oh, that's good at least [20:37:19] but dang :( [20:38:01] greg-g: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/460923/1/wmf-config/InitialiseSettings.php [20:38:02] :] [20:39:30] gotcha [20:39:44] and yeah, I saw the email for this earlier, I just fialed to link the two in my head until just now [20:43:48] Not even group0, just the three test wikis and mediawikiwiki. [20:46:45] 10Operations, 10Discovery-Search, 10Elasticsearch: elastic200[456] suddenly offlined - https://phabricator.wikimedia.org/T204772 (10RobH) p:05Triage>03High [20:49:14] 10Operations, 10Discovery-Search, 10Elasticsearch: elastic200[456] suddenly offlined - https://phabricator.wikimedia.org/T204772 (10RobH) The underlying OS is online for all three systems, accessible via mgmt/serial. 
[20:50:30] (03PS1) 10Anomie: Set ActorTableSchemaMigrationStage => OLD on test wikis, mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461209 (https://phabricator.wikimedia.org/T188327) [20:52:48] (03CR) 10Catrope: [C: 032] Set ActorTableSchemaMigrationStage => OLD on test wikis, mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461209 (https://phabricator.wikimedia.org/T188327) (owner: 10Anomie) [20:53:29] (03CR) 10Hashar: [C: 031] Set ActorTableSchemaMigrationStage => OLD on test wikis, mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461209 (https://phabricator.wikimedia.org/T188327) (owner: 10Anomie) [20:53:57] (03Merged) 10jenkins-bot: Set ActorTableSchemaMigrationStage => OLD on test wikis, mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461209 (https://phabricator.wikimedia.org/T188327) (owner: 10Anomie) [20:54:55] RECOVERY - Host elastic2004 is UP: PING OK - Packet loss = 0%, RTA = 36.25 ms [20:54:55] RECOVERY - Host elastic2006 is UP: PING OK - Packet loss = 0%, RTA = 36.10 ms [20:54:55] RECOVERY - Host elastic2005 is UP: PING OK - Packet loss = 0%, RTA = 36.09 ms [20:55:35] PROBLEM - Logstash rate of ingestion percent change compared to yesterday on einsteinium is CRITICAL: 131.6 ge 130 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [20:56:00] 10Operations, 10Discovery-Search, 10Elasticsearch: elastic200[456] suddenly offlined - https://phabricator.wikimedia.org/T204772 (10RobH) 05Open>03Resolved a:03RobH This was my fault due to a bad delete range within a vlan, had Arzhel rollback. [20:57:14] PROBLEM - Check systemd state on elastic2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [20:57:24] PROBLEM - puppet last run on elastic2004 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [20:57:43] !log catrope@deploy1001 sync-file aborted: Set ActorTableSchemaMigrationStage back to OLD on test wikis, mediawikiwiki (duration: 00m 10s) [20:57:45] PROBLEM - Check whether ferm is active by checking the default input chain on elastic2004 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [20:57:45] PROBLEM - puppet last run on elastic2005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:57:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:57] (03CR) 10jenkins-bot: Set ActorTableSchemaMigrationStage => OLD on test wikis, mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461209 (https://phabricator.wikimedia.org/T188327) (owner: 10Anomie) [20:58:52] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Set ActorTableSchemaMigrationStage back to OLD on test wikis, mediawikiwiki (T188327, T204669) (duration: 00m 57s) [20:59:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:02] T188327: Deploy refactored actor storage - https://phabricator.wikimedia.org/T188327 [20:59:03] T204669: Slow access to Special:Contributions on mediawiki.org (due to enabling actor table WRITE_BOTH mode) - https://phabricator.wikimedia.org/T204669 [21:02:54] RECOVERY - puppet last run on elastic2005 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [21:05:25] PROBLEM - HHVM rendering on mw1223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:06:25] RECOVERY - HHVM rendering on mw1223 is OK: HTTP OK: HTTP/1.1 200 OK - 75384 bytes in 0.165 second response time [21:12:42] 10Operations, 10Wikimedia-Logstash, 10User-fgiunchedi, 10User-herron: Logstash hardware expansion - https://phabricator.wikimedia.org/T203169 (10herron) >>! 
In T203169#4594186, @fgiunchedi wrote: > If for some reason we're under-provisioning (or over-utilizing) we have some knobs that are cheap to tune, na... [21:13:50] (03PS1) 10Ayounsi: Remove new IPs for cr3/4-ulsfo [dns] - 10https://gerrit.wikimedia.org/r/461215 [21:16:04] (03CR) 10Ayounsi: [C: 032] Remove new IPs for cr3/4-ulsfo [dns] - 10https://gerrit.wikimedia.org/r/461215 (owner: 10Ayounsi) [21:16:30] 10Operations, 10decommission: decom radium - https://phabricator.wikimedia.org/T203861 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for radon.wikimedia.org and performed the following actions: - Revoked Puppet certificate - Removed from PuppetDB - Downtimed host on Icinga - Downtimed mgmt... [21:18:00] is train stuff in a stable state now? [21:21:01] (03PS1) 10RobH: decom maps-test cluster in codfw [puppet] - 10https://gerrit.wikimedia.org/r/461221 (https://phabricator.wikimedia.org/T202898) [21:22:00] (03CR) 10RobH: [C: 032] decom maps-test cluster in codfw [puppet] - 10https://gerrit.wikimedia.org/r/461221 (https://phabricator.wikimedia.org/T202898) (owner: 10RobH) [21:22:45] RECOVERY - puppet last run on elastic2004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [21:23:54] (03PS1) 10RobH: decom maps-test cluser prod dns [dns] - 10https://gerrit.wikimedia.org/r/461222 (https://phabricator.wikimedia.org/T202898) [21:24:21] (03CR) 10RobH: [C: 032] decom maps-test cluser prod dns [dns] - 10https://gerrit.wikimedia.org/r/461222 (https://phabricator.wikimedia.org/T202898) (owner: 10RobH) [21:25:11] !log authdns1001: testing gdnsd version update (2.99.9-beta) [21:25:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:46] (03PS1) 10BBlack: dns: no-op test change [dns] - 10https://gerrit.wikimedia.org/r/461223 [21:31:58] (03CR) 10BBlack: [C: 032] dns: no-op test change [dns] - 10https://gerrit.wikimedia.org/r/461223 (owner: 10BBlack) [21:33:20] (03PS1) 10BBlack: Revert "dns: no-op 
test change" [dns] - 10https://gerrit.wikimedia.org/r/461225 [21:33:25] 10Operations, 10decommission: decom radium - https://phabricator.wikimedia.org/T203861 (10RobH) radon network port asw2-c-eqiad:ge-4/0/25 [21:33:41] (03CR) 10BBlack: [C: 032] Revert "dns: no-op test change" [dns] - 10https://gerrit.wikimedia.org/r/461225 (owner: 10BBlack) [21:35:08] (03PS1) 10RobH: decom radon [puppet] - 10https://gerrit.wikimedia.org/r/461226 (https://phabricator.wikimedia.org/T203861) [21:36:02] (03CR) 10RobH: [C: 032] decom radon [puppet] - 10https://gerrit.wikimedia.org/r/461226 (https://phabricator.wikimedia.org/T203861) (owner: 10RobH) [21:38:31] 10Operations, 10ops-eqiad, 10decommission: decom radium - https://phabricator.wikimedia.org/T203861 (10RobH) a:03Cmjohnson [21:38:37] 10Operations, 10ops-eqiad, 10decommission: decom radium - https://phabricator.wikimedia.org/T203861 (10Dzahn) @RobH This ticket is about radium but there is also a decom ticket for radon at the same time. I think they got mixed up above. Just wanted to let you know. [21:40:18] robh: radon vs radium .. but both are to be decomed [21:40:27] they had different roles though [21:40:29] oh goddamn it [21:40:33] am i loggin to wrong task? [21:40:37] yup [21:40:49] well shit. [21:40:56] let me do the other one now so its all done at leat [21:41:02] heh, cool [21:41:22] (03PS1) 10Ayounsi: cr1/2-ulsfo -> cr3/4-ulsfo renaming [dns] - 10https://gerrit.wikimedia.org/r/461228 (https://phabricator.wikimedia.org/T189552) [21:41:42] 10Operations, 10ops-eqiad, 10decommission: decom radium - https://phabricator.wikimedia.org/T203861 (10RobH) a:05Cmjohnson>03RobH Please note I did indeed swap references around, all the entries for radon should have gone to T202040 so stealing this back for its radium decom. [21:42:00] 10Operations, 10ops-eqiad, 10decommission: decom radium - https://phabricator.wikimedia.org/T203861 (10RobH) [21:42:08] i read some of the backlog about decom workflow.. 
incl the part where we should avoid having them with role(spare) but not reinstalled [21:42:17] made me want to reinstall that one (radium).. [21:42:26] but was just following old (current) workflow [21:42:41] if you decom it now i won't worry though [21:42:41] 10Operations, 10ops-eqiad, 10Traffic, 10decommission, 10Patch-For-Review: Decommission radon - https://phabricator.wikimedia.org/T202040 (10RobH) a:03Cmjohnson Please note I did all the decom steps on the wrong task, putting them on radium task not radon. This is ready for disk wipe. >>! In T203861#4... [21:42:47] !log authdns1001 seems stable/fine running gdnsd-2.99.9-beta so far. If issues crop up later, don't hesitate to (a) downgrade back to stretch-backports gdnsd-2.3.0-1~bpo9+1 or (b) call me! [21:42:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:02] 10Operations, 10ops-eqiad, 10Traffic, 10decommission: Decommission radon - https://phabricator.wikimedia.org/T202040 (10RobH) [21:43:04] mutante: good catch [21:43:14] (03PS1) 10Ppchelko: RPC/RunSingleJob.php - send X-Readonly header. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461229 (https://phabricator.wikimedia.org/T204154) [21:43:57] bblack will gdnsd 3.0 eventually head to stretch-backports or will it be in sid / next release? :) [21:44:23] (03CR) 10jerkins-bot: [V: 04-1] RPC/RunSingleJob.php - send X-Readonly header. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461229 (https://phabricator.wikimedia.org/T204154) (owner: 10Ppchelko) [21:45:20] (03PS2) 10Ppchelko: RPC/RunSingleJob.php - send X-Readonly header. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/461229 (https://phabricator.wikimedia.org/T204154) [21:45:24] 10Operations, 10ops-eqiad, 10decommission: decom radium - https://phabricator.wikimedia.org/T203861 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for radium.wikimedia.org and performed the following actions: - Revoked Puppet certificate - Removed from PuppetDB - Downtimed host on Icinga -... [21:46:10] paladox: I really don't know, but generally new major versions don't automagically make it to a backports [21:46:23] ok. [21:48:08] robh: wow @ new bot edits. https://phabricator.wikimedia.org/T203861#4595958 [21:54:22] 10Operations, 10ops-eqiad, 10decommission: decom radium - https://phabricator.wikimedia.org/T203861 (10RobH) radium network port is asw-a-eqiad:ge-3/0/0 [21:54:46] 10Operations, 10ops-eqiad, 10decommission: decom radium - https://phabricator.wikimedia.org/T203861 (10RobH) [21:54:59] mutante: yeah so far what the script does is neat [21:55:31] already chatted with volans about adding the puppet disable and system shutdown, then all thats left in non-interrupt steps that isnt scripted is the switch port disable and repo removals (ad honesly the repo removals can be interrupted) [21:56:03] (03PS1) 10Ayounsi: Puppet, rename all instances of cr1/2-ulsfo to cr3/4 [puppet] - 10https://gerrit.wikimedia.org/r/461233 (https://phabricator.wikimedia.org/T189552) [21:56:44] (03PS1) 10RobH: decom radium prod dns [dns] - 10https://gerrit.wikimedia.org/r/461234 (https://phabricator.wikimedia.org/T203861) [21:57:52] (03PS1) 10RobH: decom radium puppet repo entries [puppet] - 10https://gerrit.wikimedia.org/r/461235 (https://phabricator.wikimedia.org/T203861) [21:58:04] (03CR) 10RobH: [C: 032] decom radium prod dns [dns] - 10https://gerrit.wikimedia.org/r/461234 (https://phabricator.wikimedia.org/T203861) (owner: 10RobH) [21:58:43] (03CR) 10RobH: [C: 032] decom radium puppet repo entries [puppet] - 
10https://gerrit.wikimedia.org/r/461235 (https://phabricator.wikimedia.org/T203861) (owner: 10RobH) [21:59:43] well, that was fun [21:59:44] 10Operations, 10ops-eqiad, 10decommission, 10Patch-For-Review: decom radium - https://phabricator.wikimedia.org/T203861 (10RobH) a:05RobH>03Cmjohnson Ok, this is now all set, radium is ready for onsite steps for decom. [21:59:46] radon and radium [21:59:53] thats not annoying or confusing. [22:06:02] PROBLEM - Host cr2-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [22:06:50] did cr2 died? [22:06:51] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 124, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:06:59] I can't ssh to it neither [22:07:02] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 56, down: 3, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:07:22] PROBLEM - Host cr2-ulsfo IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [22:08:02] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 95 probes of 317 (alerts on 25) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [22:08:12] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is CRITICAL: cluster={cache_text,cache_upload} site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [22:08:31] PROBLEM - HTTP availability for Varnish at ulsfo on einsteinium is CRITICAL: job=varnish-upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [22:08:54] should I depool ulsfo? 
[22:09:09] wtf, I think so, yea [22:09:51] (03PS1) 10BBlack: Depool ulsfo [dns] - 10https://gerrit.wikimedia.org/r/461239 [22:10:02] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [22:10:06] (03CR) 10BBlack: [C: 032] Depool ulsfo [dns] - 10https://gerrit.wikimedia.org/r/461239 (owner: 10BBlack) [22:10:21] PROBLEM - Host re0.cr2-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [22:10:25] !log depooled ulsfo, some uknown router issue [22:10:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:10:43] (03PS1) 10Gergő Tisza: Allow wikitech bureaucrats to promote to interface-admin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461240 [22:11:15] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [22:11:22] PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 75 probes of 342 (alerts on 25) - https://atlas.ripe.net/measurements/1791307/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [22:11:22] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [22:11:41] RECOVERY - HTTP availability for Varnish at ulsfo on einsteinium is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [22:13:02] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 20 probes of 317 (alerts on 25) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [22:14:12] (03PS2) 10Gergő Tisza: Allow wikitech bureaucrats to promote to interface-admin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461240 [22:16:12] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [22:16:15] 10Operations, 10decommission: decom silver (was silver has trouble rebooting) - https://phabricator.wikimedia.org/T168559 (10RobH) 05Open>03Resolved duplicate of T168559 [22:16:22] RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 3 probes of 342 (alerts on 25) - https://atlas.ripe.net/measurements/1791307/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [22:16:31] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10Patch-For-Review: decom silver/WMF3434 - https://phabricator.wikimedia.org/T191357 (10RobH) a:03Cmjohnson [22:17:12] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [22:18:32] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is CRITICAL: 56.12 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [22:24:59] 10Operations, 10ops-ulsfo, 10netops: cr2-ulsfo crash - https://phabricator.wikimedia.org/T204782 (10ayounsi) p:05Triage>03High 
[22:25:01] PROBLEM - PyBal BGP sessions are established on lvs4007 is CRITICAL: 0 le 0 https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=ulsfo%2520prometheus%252Fops
[22:27:23] ACKNOWLEDGEMENT - Host cr2-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% Ayounsi https://phabricator.wikimedia.org/T204782
[22:27:23] ACKNOWLEDGEMENT - Host cr2-ulsfo IPv6 is DOWN: PING CRITICAL - Packet loss = 100% Ayounsi https://phabricator.wikimedia.org/T204782
[22:27:23] ACKNOWLEDGEMENT - Host re0.cr2-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% Ayounsi https://phabricator.wikimedia.org/T204782
[22:28:43] ACKNOWLEDGEMENT - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 124, down: 1, dormant: 0, excluded: 0, unused: 0: Ayounsi https://phabricator.wikimedia.org/T204782 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[22:28:43] ACKNOWLEDGEMENT - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 56, down: 3, dormant: 0, excluded: 0, unused: 0: Ayounsi https://phabricator.wikimedia.org/T204782 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[22:28:43] ACKNOWLEDGEMENT - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is CRITICAL: 11.36 le 60 Ayounsi https://phabricator.wikimedia.org/T204782 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[22:28:43] ACKNOWLEDGEMENT - PyBal BGP sessions are established on lvs4007 is CRITICAL: 0 le 0 Ayounsi https://phabricator.wikimedia.org/T204782 https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=ulsfo%2520prometheus%252Fops
[22:35:57] Operations, Growth-Team, Mail, Notifications, User-herron: SRE query: Is it possible to measure how many e-mails are sent to "black hole" e-mail addresses? - https://phabricator.wikimedia.org/T202329 (MMiller_WMF) a:herron>nettrom_WMF Assigning this to @nettrom_WMF so he can drive it...
[22:36:10] Operations, Growth-Team, Mail, Notifications, and 2 others: SRE query: Is it possible to measure how many e-mails are sent to "black hole" e-mail addresses? - https://phabricator.wikimedia.org/T202329 (MMiller_WMF)
[22:36:17] Operations, Cloud-Services, Wikibase-Quality, Wikibase-Quality-Constraints, and 3 others: Flood of WDQS requests from wbqc - https://phabricator.wikimedia.org/T204267 (Smalyshev) a:Smalyshev>None > Is Retry-after always provided? Yes, as far as I know WDQS will always do it on throttling...
[22:37:18] Operations, Cloud-Services, Wikibase-Quality, Wikibase-Quality-Constraints, and 3 others: Flood of WDQS requests from wbqc - https://phabricator.wikimedia.org/T204267 (Smalyshev) p:High>Normal As the immediate problem ceased, resetting to Normal priority.
[22:45:13] (PS1) Ayounsi: Smokeping: comment out cr2-ulsfo while down [puppet] - https://gerrit.wikimedia.org/r/461244 (https://phabricator.wikimedia.org/T204782)
[22:46:14] (CR) Ayounsi: [C: 2] Smokeping: comment out cr2-ulsfo while down [puppet] - https://gerrit.wikimedia.org/r/461244 (https://phabricator.wikimedia.org/T204782) (owner: Ayounsi)
[22:51:23] Operations, Gadgets, MediaWiki-Cache, Performance-Team, and 2 others: Mcrouter periodically reports soft TKOs for mc[1,2]035 leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (elukey) >>! In T203786#4590064, @Krinkle wrote: > I can see that the errors in the logs are...
[22:58:48] Operations, Electron-PDFs, Proton, Patch-For-Review, and 3 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 (Pchelolo) After sending 25% of traffic to Proton, there's not much to gather - a couple of crashers, which is ok for a new service, but nothi...
[23:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180918T2300).
[23:00:04] ebernhardson and Krinkle: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[23:02:05] I can deploy
[23:02:41] ebernhardson, Krinkle, around?
[23:02:48] twentyafterfour: yup, just finishing a meeting
[23:03:18] ok I'll get things prepped and wait for you to be finished
[23:03:41] (CR) 20after4: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/460430 (https://phabricator.wikimedia.org/T204135) (owner: EBernhardson)
[23:04:07] (PS2) 20after4: Add CirrusSearch cluster name to siteinfo [mediawiki-config] - https://gerrit.wikimedia.org/r/460430 (https://phabricator.wikimedia.org/T204135) (owner: EBernhardson)
[23:05:18] twentyafterfour: ok i'm here
[23:07:47] ebernhardson: should these be done 1 by 1 or deployed together all at once?
[23:08:22] twentyafterfour: they can all go together. Two of the three don't even really do anything until we run maint scripts later
[23:08:49] (CR) 20after4: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/461088 (https://phabricator.wikimedia.org/T191961) (owner: DCausse)
[23:08:53] ok cool
[23:09:35] (CR) 20after4: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/454588 (owner: DCausse)
[23:09:42] RECOVERY - Host re0.cr2-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 77.33 ms
[23:10:16] (CR) Dzahn: [C: 2] mwmaint1002: add prod DNS entries (v4) [dns] - https://gerrit.wikimedia.org/r/461202 (https://phabricator.wikimedia.org/T201343) (owner: Dzahn)
[23:10:20] (PS2) Dzahn: mwmaint1002: add prod DNS entries (v4) [dns] - https://gerrit.wikimedia.org/r/461202 (https://phabricator.wikimedia.org/T201343)
[23:10:23] (Merged) jenkins-bot: [cirrus] cleanup drop wgCirrusSearchInterwikiCacheTime [mediawiki-config] - https://gerrit.wikimedia.org/r/461088 (https://phabricator.wikimedia.org/T191961) (owner: DCausse)
[23:10:40] Operations, Gadgets, MediaWiki-Cache, Performance-Team, and 2 others: Mcrouter periodically reports soft TKOs for mc[1,2]035 leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (elukey) So to summarize: mcrouter seems to get 3 consecutive timeouts (after 1s each) for a...
[23:10:52] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 126, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:11:02] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 65, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:11:11] RECOVERY - Host cr2-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 74.76 ms
[23:11:18] (CR) jenkins-bot: Add CirrusSearch cluster name to siteinfo [mediawiki-config] - https://gerrit.wikimedia.org/r/460430 (https://phabricator.wikimedia.org/T204135) (owner: EBernhardson)
[23:11:20] (CR) jenkins-bot: [cirrus] cleanup drop wgCirrusSearchInterwikiCacheTime [mediawiki-config] - https://gerrit.wikimedia.org/r/461088 (https://phabricator.wikimedia.org/T191961) (owner: DCausse)
[23:11:48] (PS3) 20after4: [cirrus] Increase number of shards for wikidata content and commons file [mediawiki-config] - https://gerrit.wikimedia.org/r/454588 (owner: DCausse)
[23:11:53] (PS2) Dzahn: DHCP: add mwmaint1002 MAC address [puppet] - https://gerrit.wikimedia.org/r/461204 (https://phabricator.wikimedia.org/T201343)
[23:12:12] RECOVERY - Host cr2-ulsfo IPv6 is UP: PING OK - Packet loss = 0%, RTA = 75.55 ms
[23:12:40] (CR) Dzahn: [C: 2] DHCP: add mwmaint1002 MAC address [puppet] - https://gerrit.wikimedia.org/r/461204 (https://phabricator.wikimedia.org/T201343) (owner: Dzahn)
[23:12:41] RECOVERY - PyBal BGP sessions are established on lvs4007 is OK: (C)0 le (W)0 le 1 https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=ulsfo%2520prometheus%252Fops
[23:12:44] (CR) 20after4: [V: 2 C: 2] [cirrus] Increase number of shards for wikidata content and commons file [mediawiki-config] - https://gerrit.wikimedia.org/r/454588 (owner: DCausse)
[23:13:52] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[23:14:07] Operations, ops-eqiad, Patch-For-Review: rack/setup/install mwmaint1002.eqiad.wmnet - https://phabricator.wikimedia.org/T201343 (Dzahn)
[23:14:23] should I be concerned with those icinga alerts?
[23:14:32] no, you should not. ulsfo is depooled
[23:14:34] ACKing
[23:14:45] as long as it's "ulsfo-only" that is
[23:14:49] wait, cr2-ulsfo came back?!
[23:15:16] robh: ^
[23:15:18] eh, it looks like it. i just referred to the new alert for Nginx
[23:15:22] !log SWAT: deploying 3 patches for ebernhardson: 77deae12f 7009fe473 and 12cc420fc
[23:15:23] but did nothing
[23:15:24] 11:15PM up 9 mins, 1 user, load averages: 2.47, 1.87, 1.00
[23:15:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:16:02] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 284 probes of 317 (alerts on 25) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[23:16:11] .... odd
[23:16:28] XioNoX: glad you didnt start pedaling?
[23:16:29] it also doesnt like "Servers dns4002.wikimedia.org are marked down but pooled:"
[23:17:02] ebernhardson: is this even anything you can test on mwdebug?
[23:17:12] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[23:17:13] siteinfo yes
[23:17:21] I did scap pull on mwdebug1001 if you want to test
[23:17:26] ACKNOWLEDGEMENT - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 284 probes of 317 (alerts on 25) - https://atlas.ripe.net/measurements/1791309/#!map daniel_zahn ulsfo is depooled currently https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[23:17:38] "wmgCirrusSearchDefaultCluster": "local"
[23:17:42] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[23:18:07] hmm?
[23:18:22] I haven't deployed anything beyond mwdebug1001 so far... not sure if that's related at all?
[23:18:22] LGTM on mwdebug1001 for gerrit/460430
[23:18:29] volans: thanks
[23:18:33] twentyafterfour: yup looks good
[23:18:35] not sure what's up with the memcached error
[23:18:38] volans: go to bed :P
[23:18:48] fwiw I did a:
[23:18:48] curl -s -H 'X-Forwarded-Proto: https' -H 'Host: en.wikipedia.org' "http://mwdebug1001.eqiad.wmnet/w/api.php?action=query&meta=siteinfo&format=json&formatversion=2" | jq -r '.query.general["wmf-config"]'
[23:19:22] PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 306 probes of 342 (alerts on 25) - https://atlas.ripe.net/measurements/1791307/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[23:19:30] ebernhardson: yeah, almost there, waiting on an upgrade :D
[23:19:43] ACKNOWLEDGEMENT - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 306 probes of 342 (alerts on 25) - https://atlas.ripe.net/measurements/1791307/#!map daniel_zahn ulsfo is depooled https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[23:19:52] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[23:20:51] syncing
[23:21:14] !log twentyafterfour@deploy1001 Synchronized wmf-config/: SWAT: tested on mwdebug1001, now syncing wmf-config settings to the whole cluster (duration: 00m 59s)
[23:21:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:22:27] Operations, Datacenter-Switchover-2018, Discovery-Search (Current work), Patch-For-Review: Warn when CirrusSearch is not configured to use local DCfor an extended time - https://phabricator.wikimedia.org/T204135 (Volans) Great! Thanks for adding that :) ``` $ curl -s -H 'X-Forwarded-Proto: https'...
[23:25:39] (PS1) Krinkle: tests: Remove redundant $description from all-in-one-exactly test [mediawiki-config] - https://gerrit.wikimedia.org/r/461250
[23:26:08] (CR) jenkins-bot: [cirrus] Increase number of shards for wikidata content and commons file [mediawiki-config] - https://gerrit.wikimedia.org/r/454588 (owner: DCausse)
[23:31:12] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 20 probes of 317 (alerts on 25) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[23:33:07] Operations, ops-ulsfo, netops, Patch-For-Review: cr2-ulsfo crash - https://phabricator.wikimedia.org/T204782 (ayounsi)
[23:33:22] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is OK: (C)60 le (W)70 le 75.35 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[23:34:32] RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 0 probes of 342 (alerts on 25) - https://atlas.ripe.net/measurements/1791307/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[23:43:11] (PS2) Krinkle: tests: Remove redundant $description from all-in-one-exactly test [mediawiki-config] - https://gerrit.wikimedia.org/r/461250
[23:43:14] (CR) Krinkle: [C: 2] tests: Remove redundant $description from all-in-one-exactly test [mediawiki-config] - https://gerrit.wikimedia.org/r/461250 (owner: Krinkle)
[23:43:51] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[23:44:31] (Merged) jenkins-bot: tests: Remove redundant $description from all-in-one-exactly test [mediawiki-config] - https://gerrit.wikimedia.org/r/461250 (owner: Krinkle)
[23:55:16] (CR) jenkins-bot: tests: Remove redundant $description from all-in-one-exactly test [mediawiki-config] - https://gerrit.wikimedia.org/r/461250 (owner: Krinkle)