[00:00:04] twentyafterfour: How many deployers does it take to do Phabricator update deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180524T0000). [00:00:37] !log ebernhardson@tin Synchronized php-1.32.0-wmf.4/extensions/VisualEditor/modules/ve-mw/ui/dialogs/ve.ui.MWSaveDialog.js: SWAT: T195323: MWSaveDialog: Fix typo in no-categories branch (duration: 01m 07s) [00:00:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:00:41] T195323: Visual Editor changes dialog hangs when clicking on preview on pages without categories - https://phabricator.wikimedia.org/T195323 [00:00:47] James_F: all your stuff should be synced to prod now [00:06:24] ebernhardson: Thank you so much! [00:07:24] !log ebernhardson@tin Synchronized php-1.32.0-wmf.4/extensions/CirrusSearch/: SWAT: Convert cirrus metastore to single type (duration: 01m 24s) [00:07:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:07:47] (03PS1) 10Papaul: DNS: Add production and mgmt DNS entries for db209[45] [dns] - 10https://gerrit.wikimedia.org/r/434830 (https://phabricator.wikimedia.org/T194781) [00:09:06] !log ebernhardson@tin Synchronized php-1.32.0-wmf.5/extensions/CirrusSearch/: SWAT: Convert cirrus metastore to single type (duration: 01m 24s) [00:09:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:10:44] !log upgrade cirrussearch metastore to 1.0 on eqiad and codfw [00:10:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:12:58] 10Operations, 10Move-Files-To-Commons, 10TCB-Team, 10Wikimedia-Extension-setup, 10Wikimedia-extension-review-queue: Deploy FileExporter and FileImporter to group0 - https://phabricator.wikimedia.org/T195370#4227346 (10Jdforrester-WMF) Sounds good. Consider this sign-off. [00:14:07] i know there is some log spam coming out from the metastore switch, fixing momentarily. It's only happining from maintenance jobs we run to check index correctness not user facing [00:14:29] !log ebernhardson@tin Synchronized php-1.32.0-wmf.5/extensions/CirrusSearch/includes/Job/CheckerJob.php: Drop cirrus checker job metastore transition check (duration: 01m 08s) [00:14:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:15:56] !log ebernhardson@tin Synchronized php-1.32.0-wmf.4/extensions/CirrusSearch/includes/Job/CheckerJob.php: Drop cirrus checker job metastore transition check (duration: 01m 08s) [00:16:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:18:59] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [00:23:09] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [00:28:09] ^ that alert might be a bit too slow ... the exceptions would have started 10 minutes prior to the alert (and it alerted after the fix had been deployed and was not generating new errors) [01:00:36] ebernhardson: Is the deployment finished or still checking/finishing? [01:00:52] Wanted to roll out a patch sometime after you're done. [01:06:35] * Krinkle reads backscroll [01:06:52] I'll do it tomorrow. Getting late over here. 
[01:06:57] night o/ [01:09:40] (03PS2) 10Chelsyx: Blacklisting new iOS eventlogging schemas on MySQL [puppet] - 10https://gerrit.wikimedia.org/r/434424 (https://phabricator.wikimedia.org/T192819) [01:10:28] (03CR) 10Chelsyx: "> Jenkins doesn't like your commit message (lines too long!) but" [puppet] - 10https://gerrit.wikimedia.org/r/434424 (https://phabricator.wikimedia.org/T192819) (owner: 10Chelsyx) [03:01:12] !log l10nupdate@tin scap sync-l10n completed (1.32.0-wmf.4) (duration: 12m 48s) [03:01:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:06:41] !log added wmde-fisch to LDAP group nda (T195223) [03:06:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:06:49] T195223: Add Christoph Jauera (WMDE-Fisch) to the ldap/nda group - https://phabricator.wikimedia.org/T195223 [03:57:19] !log l10nupdate@tin scap sync-l10n completed (1.32.0-wmf.5) (duration: 13m 29s) [03:57:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:04:35] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu May 24 04:04:35 UTC 2018 (duration 7m 16s) [04:04:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:37:51] (03PS5) 10ArielGlenn: Monitor dump output file production [puppet] - 10https://gerrit.wikimedia.org/r/434709 [04:38:57] (03CR) 10ArielGlenn: [C: 032] Monitor dump output file production [puppet] - 10https://gerrit.wikimedia.org/r/434709 (owner: 10ArielGlenn) [04:49:57] 10Operations, 10Code-Stewardship-Reviews, 10Services (watching): zotero translation server: code stewardship request - https://phabricator.wikimedia.org/T187194#4227503 (10danstillman) Just a quick update from our end. We're now using translation-server for ZoteroBib (https://zbib.org), a new web-based serv... 
[05:18:04] (03PS1) 10Marostegui: db-eqiad.php: Repool db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434860 (https://phabricator.wikimedia.org/T194273) [05:19:47] (03PS2) 10Marostegui: db-eqiad.php: Repool db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434860 (https://phabricator.wikimedia.org/T194273) [05:21:08] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434860 (https://phabricator.wikimedia.org/T194273) (owner: 10Marostegui) [05:22:38] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434860 (https://phabricator.wikimedia.org/T194273) (owner: 10Marostegui) [05:22:54] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434860 (https://phabricator.wikimedia.org/T194273) (owner: 10Marostegui) [05:24:54] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1092 (duration: 01m 09s) [05:24:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:30:34] (03PS2) 10Marostegui: DNS: Add production and mgmt DNS entries for db209[45] [dns] - 10https://gerrit.wikimedia.org/r/434830 (https://phabricator.wikimedia.org/T194781) (owner: 10Papaul) [05:33:52] (03CR) 10Marostegui: [C: 032] DNS: Add production and mgmt DNS entries for db209[45] [dns] - 10https://gerrit.wikimedia.org/r/434830 (https://phabricator.wikimedia.org/T194781) (owner: 10Papaul) [05:50:06] (03PS1) 10Marostegui: mariadb: Add the new sanitarium hosts to the config [puppet] - 10https://gerrit.wikimedia.org/r/434863 (https://phabricator.wikimedia.org/T194780) [05:51:49] (03PS2) 10Marostegui: mariadb: Add the new sanitarium hosts to the config [puppet] - 10https://gerrit.wikimedia.org/r/434863 (https://phabricator.wikimedia.org/T194780) [06:12:13] (03PS1) 10Marostegui: db-eqiad.php: Depool db1087 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434868 (https://phabricator.wikimedia.org/T190148) [06:14:41] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1087 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434868 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [06:16:08] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1087 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434868 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [06:17:35] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1087 for alter table (duration: 01m 09s) [06:17:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:17:39] !log Deploy schema change on db1087, this will generate lag on labs on s8 - T191519 T188299 T190148 T194270 [06:17:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:17:46] T191519: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519 [06:17:46] T194270: Drop 'tmp1' index from wb_terms table in production - https://phabricator.wikimedia.org/T194270 [06:17:46] T190148: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148 [06:17:46] T188299: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299 [06:18:39] 10Operations, 10Phabricator, 10Release-Engineering-Team (Watching / External): Reimage both phab1001 and phab2001 to stretch - https://phabricator.wikimedia.org/T190568#4227557 (10elukey) @Dzahn it would be really great if T182832 was resolved as soon as 
possible, it has been "stable" so far but having one o... [06:19:54] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1087 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434868 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [06:20:20] 10Operations, 10Phabricator, 10Release-Engineering-Team (Watching / External): Reimage both phab1001 and phab2001 to stretch - https://phabricator.wikimedia.org/T190568#4227558 (10mmodell) @elukey if we have a php 7.2 package ready to use then I'm all for it. I don't think there is much in the way of moving... [06:21:50] ebernhardson: it's definitely deployed [06:23:19] ebernhardson: looks like you got it solved already :) [06:24:38] !log installing remaining curl security updates in eqiad [06:24:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:28:59] PROBLEM - puppet last run on mw1323 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/mysql-ps1.sh] [06:29:29] PROBLEM - puppet last run on analytics1028 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/R/update-library.R] [06:29:46] 10Operations, 10Phabricator, 10Release-Engineering-Team (Watching / External): Reimage both phab1001 and phab2001 to stretch - https://phabricator.wikimedia.org/T190568#4227560 (10elukey) @mmodell nice! As far as I can see from https://gerrit.wikimedia.org/r/#/c/410245 I think that everything needs to wait u... [06:30:00] PROBLEM - puppet last run on mw1278 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/apache2/sites-available/00-nonexistent.conf] [06:31:00] PROBLEM - puppet last run on labvirt1017 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-puppet-agent-stats] [06:32:19] PROBLEM - puppet last run on mw1319 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ImageMagick-6/policy.xml] [06:34:23] 10Operations, 10Phabricator, 10Release-Engineering-Team (Watching / External): Reimage both phab1001 and phab2001 to stretch - https://phabricator.wikimedia.org/T190568#4077108 (10MoritzMuehlenhoff) PHP 7.2 packages for stretch are available since early March via thirdparty/php72, let me know if anything is... 
[06:34:26] (03PS1) 10Elukey: Set druid* PXE boot to Debian Stretch [puppet] - 10https://gerrit.wikimedia.org/r/434870 (https://phabricator.wikimedia.org/T192636) [06:34:34] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1087" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434871 [06:35:56] (03CR) 10Elukey: [C: 032] Set druid* PXE boot to Debian Stretch [puppet] - 10https://gerrit.wikimedia.org/r/434870 (https://phabricator.wikimedia.org/T192636) (owner: 10Elukey) [06:36:46] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1087" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434871 (owner: 10Marostegui) [06:38:13] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1087" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434871 (owner: 10Marostegui) [06:38:59] PROBLEM - MariaDB Slave Lag: s8 on db2092 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 670.01 seconds [06:39:39] ^ that is the alter table [06:39:45] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1087 after alter table (duration: 01m 08s) [06:39:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:40:27] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1087" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434871 (owner: 10Marostegui) [06:41:31] (03PS1) 10Marostegui: db-eqiad.php: Depool db1104 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434872 (https://phabricator.wikimedia.org/T190148) [06:41:40] PROBLEM - Disk space on elastic1019 is CRITICAL: DISK CRITICAL - free space: /srv 57545 MB (12% inode=99%) [06:42:10] (03PS2) 10Jcrespo: mariadb: Set up db1117:3325 as the backup host for m5 database section [puppet] - 10https://gerrit.wikimedia.org/r/434740 (https://phabricator.wikimedia.org/T192979) [06:43:17] (03CR) 10Jcrespo: [C: 032] mariadb-hosts: Add db1117 instances to m1,m2,m3 and m5 [software] - 10https://gerrit.wikimedia.org/r/434675 (https://phabricator.wikimedia.org/T192979) (owner: 10Jcrespo) [06:44:03] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1104 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434872 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [06:45:26] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1104 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434872 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [06:46:44] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1104 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434872 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [06:46:58] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1104 for alter table (duration: 01m 08s) [06:47:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:47:02] !log Deploy schema change on db1104 - T191519 T188299 T190148 T194270 [06:47:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:47:08] T191519: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519 [06:47:08] T194270: Drop 'tmp1' index from wb_terms table in production - https://phabricator.wikimedia.org/T194270 [06:47:09] T190148: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148 [06:47:09] T188299: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299 [06:50:10] RECOVERY - Disk space on elastic1019 is OK: DISK OK [06:55:20] RECOVERY - puppet last run on mw1278 is OK: OK: Puppet is currently 
enabled, last run 1 second ago with 0 failures [06:56:29] RECOVERY - puppet last run on labvirt1017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:39] RECOVERY - puppet last run on mw1319 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:59:20] RECOVERY - puppet last run on mw1323 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [06:59:50] RECOVERY - puppet last run on analytics1028 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [07:01:29] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb=PATCH https://grafana.wikimedia.org/dashboard/db/kubernetes-api [07:02:30] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [07:02:40] RECOVERY - MariaDB Slave Lag: s8 on db2092 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [07:02:52] (03PS1) 10Muehlenhoff: Add library hint for procps [puppet] - 10https://gerrit.wikimedia.org/r/434873 [07:05:11] (03CR) 10Jcrespo: [C: 04-1] "This is not what we agreed or prepared, not what we purchased for (only 265GB), why make things more complicated?" [puppet] - 10https://gerrit.wikimedia.org/r/434863 (https://phabricator.wikimedia.org/T194780) (owner: 10Marostegui) [07:06:55] (03CR) 10Jcrespo: [C: 04-1] mariadb: Add the new sanitarium hosts to the config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/434863 (https://phabricator.wikimedia.org/T194780) (owner: 10Marostegui) [07:08:29] (03PS2) 10Muehlenhoff: Add library hint for procps [puppet] - 10https://gerrit.wikimedia.org/r/434873 [07:10:05] (03CR) 10Marostegui: "Ups! Yeah, I had my mind somewhere else, I will ammend" [puppet] - 10https://gerrit.wikimedia.org/r/434863 (https://phabricator.wikimedia.org/T194780) (owner: 10Marostegui) [07:11:41] (03CR) 10Muehlenhoff: [C: 032] Add library hint for procps [puppet] - 10https://gerrit.wikimedia.org/r/434873 (owner: 10Muehlenhoff) [07:17:37] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1104" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434874 [07:17:43] !log installing procps security updates [07:17:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:21:40] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1104" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434874 (owner: 10Marostegui) [07:22:58] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1104" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434874 (owner: 10Marostegui) [07:23:14] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1104" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434874 (owner: 10Marostegui) [07:27:18] (03PS3) 10Marostegui: mariadb: Add the new sanitarium hosts to the config [puppet] - 10https://gerrit.wikimedia.org/r/434863 (https://phabricator.wikimedia.org/T194780) [07:29:28] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1104 after alter table (duration: 01m 29s) [07:29:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:56] (03CR) 10Jcrespo: mariadb: Add the new sanitarium hosts to the config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/434863 (https://phabricator.wikimedia.org/T194780) (owner: 10Marostegui) [07:30:13] (03PS1) 10Marostegui: db-eqiad.php: Depool db1109 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434875 (https://phabricator.wikimedia.org/T190148) 
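The db-eqiad.php changes being merged and synced above — depool db1092/db1087/db1104 and now db1109, run the schema change, then revert to repool — all touch the same thing: the replica's entry in the section load array of wmf-config/db-eqiad.php is taken out and later restored, and scap pushes the file out, which is what produces the "Synchronized wmf-config/db-eqiad.php" entries. A minimal sketch of that depool pattern, assuming the LBFactoryMulti-style section-load layout; the host weights and the exact surrounding structure are illustrative, not the real production values:

```lang=php
<?php
// Sketch of the depool pattern used in wmf-config/db-eqiad.php (illustrative values only).
$sectionLoads = [
    's8' => [
        'db1071' => 0,      // section master; weight 0 so it takes no general read load
        'db1087' => 100,
        'db1104' => 100,
        // 'db1109' => 100, // depooled: commented out while the ALTER TABLE runs,
                            // restored by the matching "Revert" commit afterwards
    ],
];
```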
[07:33:10] (03CR) 10Marostegui: ">" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/434863 (https://phabricator.wikimedia.org/T194780) (owner: 10Marostegui) [07:33:46] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1109 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434875 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [07:34:56] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1109 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434875 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [07:36:21] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1109 for alter table (duration: 01m 08s) [07:36:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:36:26] !log Deploy schema change on db1109 - T191519 T188299 T190148 T194270 [07:36:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:36:33] T191519: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519 [07:36:33] T194270: Drop 'tmp1' index from wb_terms table in production - https://phabricator.wikimedia.org/T194270 [07:36:33] T190148: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148 [07:36:33] T188299: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299 [07:39:37] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1109 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434875 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [07:40:43] 10Operations, 10Phabricator, 10Release-Engineering-Team (Watching / External): Reimage both phab1001 and phab2001 to stretch - https://phabricator.wikimedia.org/T190568#4227638 (10Paladox) I’ve been running it on https://phab-stretch.wmflabs.org/ with performance improvements noticeable :) The puppet class... [07:48:46] !log stop db2042 to clone it to db2078 and upgrade [07:48:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:01] (03PS4) 10Marostegui: mariadb: Add the new sanitarium hosts to the config [puppet] - 10https://gerrit.wikimedia.org/r/434863 (https://phabricator.wikimedia.org/T194780) [07:53:50] PROBLEM - DPKG on mx2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [07:54:01] 10Operations, 10Move-Files-To-Commons, 10TCB-Team, 10Wikimedia-Extension-setup, 10Wikimedia-extension-review-queue: Deploy FileExporter and FileImporter to group0 - https://phabricator.wikimedia.org/T195370#4227654 (10Lea_WMDE) [07:56:31] ACKNOWLEDGEMENT - MegaRAID on db1065 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T195444 [07:56:37] 10Operations, 10ops-eqiad: Degraded RAID on db1065 - https://phabricator.wikimedia.org/T195444#4227657 (10ops-monitoring-bot) [07:57:30] PROBLEM - DPKG on mwdebug1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [07:58:51] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1065 - https://phabricator.wikimedia.org/T195444#4227660 (10Marostegui) a:03Cmjohnson @Cmjohnson let's get this disk replaced Thanks! 
[07:59:53] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1109" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434878 [08:02:16] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1109" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434878 (owner: 10Marostegui) [08:03:34] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1109" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434878 (owner: 10Marostegui) [08:04:00] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1109" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434878 (owner: 10Marostegui) [08:05:09] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1109 after alter table (duration: 01m 09s) [08:05:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:15] !log Deploy schema change on s8 primary master (db1071) - T191519 T188299 T190148 T194270 [08:05:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:21] T191519: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519 [08:05:22] T194270: Drop 'tmp1' index from wb_terms table in production - https://phabricator.wikimedia.org/T194270 [08:05:22] T190148: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148 [08:05:22] T188299: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299 [08:06:18] (03PS1) 10Gilles: Backend-Timing Varnish mtail program [puppet] - 10https://gerrit.wikimedia.org/r/434879 (https://phabricator.wikimedia.org/T131894) [08:06:51] (03CR) 10jerkins-bot: [V: 04-1] Backend-Timing Varnish mtail program [puppet] - 10https://gerrit.wikimedia.org/r/434879 (https://phabricator.wikimedia.org/T131894) (owner: 10Gilles) [08:08:40] (03PS2) 10Gilles: Backend-Timing Varnish mtail program [puppet] - 10https://gerrit.wikimedia.org/r/434879 (https://phabricator.wikimedia.org/T131894) [08:09:58] 10Operations, 10Performance-Team, 10Traffic, 10Patch-For-Review, 10Wikimedia-Incident: Collect Backend-Timing in Prometheus - https://phabricator.wikimedia.org/T131894#4227676 (10Gilles) [08:16:13] !log stop db2037 to clone it to db2078 and upgrade [08:16:15] (03PS2) 10Gilles: Launch performance survey on cawiki and enwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434641 (https://phabricator.wikimedia.org/T187299) [08:16:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:31] !log Deployment of cawiki and enwikivoyage performance survey [08:21:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:40] (03CR) 10Gilles: [C: 032] Launch performance survey on cawiki and enwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434641 (https://phabricator.wikimedia.org/T187299) (owner: 10Gilles) [08:23:09] (03Merged) 10jenkins-bot: Launch performance survey on cawiki and enwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434641 (https://phabricator.wikimedia.org/T187299) (owner: 10Gilles) [08:23:24] (03CR) 10jenkins-bot: Launch performance survey on cawiki and enwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434641 (https://phabricator.wikimedia.org/T187299) (owner: 10Gilles) [08:32:21] !log gilles@tin Synchronized wmf-config/InitialiseSettings.php: T187299 Launch performance survey on cawiki and enwikivoyage (duration: 01m 08s) [08:32:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:26] T187299: 
User-perceived page load performance study - https://phabricator.wikimedia.org/T187299 [08:35:19] (03PS1) 10ArielGlenn: fix verbose mode off for dumps job watcher [puppet] - 10https://gerrit.wikimedia.org/r/434883 [08:35:31] !log pnorman@tin Started deploy [tilerator/deploy@9e40702]: Deploy scap fixes to cleartables map test server go 3 [08:35:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:54] !log pnorman@tin Finished deploy [tilerator/deploy@9e40702]: Deploy scap fixes to cleartables map test server go 3 (duration: 00m 23s) [08:35:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:25] (03CR) 10ArielGlenn: [C: 032] fix verbose mode off for dumps job watcher [puppet] - 10https://gerrit.wikimedia.org/r/434883 (owner: 10ArielGlenn) [08:39:34] !log ayounsi@tin Started deploy [netbox/deploy@ac54feb]: Adding service name in scap.cfg [08:39:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:07] !log ayounsi@tin Finished deploy [netbox/deploy@ac54feb]: Adding service name in scap.cfg (duration: 00m 33s) [08:40:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:18] !log pnorman@tin Started deploy [tilerator/deploy@9e40702]: Deploy scap fixes to cleartables map test server in verbose mode [08:45:19] !log pnorman@tin Finished deploy [tilerator/deploy@9e40702]: Deploy scap fixes to cleartables map test server in verbose mode (duration: 00m 01s) [08:45:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:55] !log Deploy schema change on s1 codfw primary master (db2048), this will generate lag on codfw - T191519 T188299 T190148 [08:46:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:00] T191519: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519 [08:46:00] T190148: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148 [08:46:01] T188299: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299 [08:47:11] !log pnorman@tin Started deploy [tilerator/deploy@9e40702]: Deploy scap fixes to cleartables map test server in verbose mode [08:47:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:16] !log pnorman@tin Finished deploy [tilerator/deploy@9e40702]: Deploy scap fixes to cleartables map test server in verbose mode (duration: 00m 05s) [08:47:18] (03CR) 10Giuseppe Lavagetto: profile::mediawiki::mcrouter_wancache: add ssl, proxy support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/431737 (https://phabricator.wikimedia.org/T192370) (owner: 10Giuseppe Lavagetto) [08:47:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:35] (03PS6) 10Giuseppe Lavagetto: profile::mediawiki::mcrouter_wancache: add ssl, proxy support [puppet] - 10https://gerrit.wikimedia.org/r/431737 (https://phabricator.wikimedia.org/T192370) [08:51:32] !log pnorman@tin Started deploy [tilerator/deploy@9e40702]: Deploy scap fixes to cleartables map test server in verbose mode [08:51:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:38] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install db209[45].codfw.wmnet (sanitarium expansion) - https://phabricator.wikimedia.org/T194781#4227813 (10Marostegui) 
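The "Launch performance survey on cawiki and enwikivoyage" change synced at 08:32 above (T187299) is the other recurring kind of config deploy in this log: a per-wiki toggle in wmf-config/InitialiseSettings.php rather than a load-balancer edit. A minimal sketch of that per-wiki override pattern; `wmgExamplePerformanceSurvey` is a hypothetical key used only for illustration, the real setting name behind T187299 is not shown in this log:

```lang=php
<?php
// Sketch of the per-wiki override pattern from wmf-config/InitialiseSettings.php.
// 'wmgExamplePerformanceSurvey' is a placeholder name, not the actual T187299 setting.
$wgConf->settings['wmgExamplePerformanceSurvey'] = [
    'default'      => false, // off on every wiki unless overridden below
    'cawiki'       => true,  // enabled for the survey launch
    'enwikivoyage' => true,
];
```

A setting declared this way only takes effect once the file is synced to the application servers, hence the scap "Synchronized wmf-config/InitialiseSettings.php" entry above.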
[08:51:45] !log pnorman@tin Finished deploy [tilerator/deploy@9e40702]: Deploy scap fixes to cleartables map test server in verbose mode (duration: 00m 13s) [08:51:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:48] (03PS5) 10Marostegui: mariadb: Add the new sanitarium hosts to the config [puppet] - 10https://gerrit.wikimedia.org/r/434863 (https://phabricator.wikimedia.org/T194780) [08:55:31] !log pnorman@tin Started deploy [tilerator/deploy@9e40702] (cleartables--force): Deploy scap fixes to cleartables map test server in verbose mode [08:55:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:36] !log pnorman@tin Finished deploy [tilerator/deploy@9e40702] (cleartables--force): Deploy scap fixes to cleartables map test server in verbose mode (duration: 00m 05s) [08:55:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:43] !log pnorman@tin Started deploy [tilerator/deploy@9e40702] (cleartables): Deploy scap fixes to cleartables map test server in verbose mode [08:55:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:55] !log pnorman@tin Finished deploy [tilerator/deploy@9e40702] (cleartables): Deploy scap fixes to cleartables map test server in verbose mode (duration: 00m 13s) [08:55:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:06] (03PS7) 10Giuseppe Lavagetto: profile::mediawiki::mcrouter_wancache: add ssl, proxy support [puppet] - 10https://gerrit.wikimedia.org/r/431737 (https://phabricator.wikimedia.org/T192370) [09:03:20] (03PS3) 10Jcrespo: mariadb: Set up db1117:3325 as the backup host for m5 database section [puppet] - 10https://gerrit.wikimedia.org/r/434740 (https://phabricator.wikimedia.org/T192979) [09:03:22] (03PS1) 10Jcrespo: mariadb: Reenable notifications on db2078 [puppet] - 10https://gerrit.wikimedia.org/r/434885 (https://phabricator.wikimedia.org/T192979) [09:08:33] (03CR) 10Giuseppe Lavagetto: [C: 032] profile::mediawiki::mcrouter_wancache: add ssl, proxy support [puppet] - 10https://gerrit.wikimedia.org/r/431737 (https://phabricator.wikimedia.org/T192370) (owner: 10Giuseppe Lavagetto) [09:19:03] (03PS6) 10Marostegui: mariadb: Add the new sanitarium hosts to the config [puppet] - 10https://gerrit.wikimedia.org/r/434863 (https://phabricator.wikimedia.org/T194780) [09:19:14] (03PS7) 10Marostegui: mariadb: Add the new sanitarium hosts to the config [puppet] - 10https://gerrit.wikimedia.org/r/434863 (https://phabricator.wikimedia.org/T194780) [09:22:23] (03CR) 10Marostegui: [C: 032] mariadb: Add the new sanitarium hosts to the config [puppet] - 10https://gerrit.wikimedia.org/r/434863 (https://phabricator.wikimedia.org/T194780) (owner: 10Marostegui) [09:24:09] (03CR) 10Jcrespo: [C: 032] mariadb: Reenable notifications on db2078 [puppet] - 10https://gerrit.wikimedia.org/r/434885 (https://phabricator.wikimedia.org/T192979) (owner: 10Jcrespo) [09:24:15] (03PS2) 10Jcrespo: mariadb: Reenable notifications on db2078 [puppet] - 10https://gerrit.wikimedia.org/r/434885 (https://phabricator.wikimedia.org/T192979) [09:24:50] !log ppchelko@tin Started deploy [cpjobqueue/deploy@f66dacb]: Correctly commit offsets for multi-topic rules [09:24:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:14] (03PS4) 10Giuseppe Lavagetto: puppet_ecdsacert: allow IP-based SANs [puppet] - 10https://gerrit.wikimedia.org/r/431738 (https://phabricator.wikimedia.org/T192370) [09:25:39] !log 
ppchelko@tin Finished deploy [cpjobqueue/deploy@f66dacb]: Correctly commit offsets for multi-topic rules (duration: 00m 49s) [09:25:39] !log pnorman@tin Started deploy [kartotherian/deploy@9fc09ef]: Do test deploy of kartotherian to maps-test2004 [09:25:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:02] !log pnorman@tin Finished deploy [kartotherian/deploy@9fc09ef]: Do test deploy of kartotherian to maps-test2004 (duration: 00m 23s) [09:26:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:02] !log pnorman@tin Started deploy [kartotherian/deploy@9fc09ef]: Do test deploy of kartotherian to maps-test2004 [09:27:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:24] !log pnorman@tin Finished deploy [kartotherian/deploy@9fc09ef]: Do test deploy of kartotherian to maps-test2004 (duration: 00m 22s) [09:27:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:00] (03PS5) 10Giuseppe Lavagetto: puppet_ecdsacert: allow IP-based SANs [puppet] - 10https://gerrit.wikimedia.org/r/431738 (https://phabricator.wikimedia.org/T192370) [09:29:54] * mobrovac on tin for mw-config [09:30:09] (03PS5) 10Mobrovac: Switch all jobs for everything except wikipedia, commons and wikidata. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429980 (https://phabricator.wikimedia.org/T190327) (owner: 10Ppchelko) [09:30:47] <_joe_> mobrovac, Pchelolo does that include videoscalers? you need me to set up that vip? [09:31:00] no no, no videoscalers _joe_ [09:31:05] we are still waiting for you on that :P [09:31:08] <_joe_> ok [09:31:12] !log pnorman@tin Started deploy [kartotherian/deploy@9fc09ef]: Do a fresh deploy of Kartotherian to production [09:31:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:16] <_joe_> I'll unblock you then asap [09:31:16] but yeah, that lvs would be nice [09:31:23] <_joe_> lemme finish the installation of mcrouter in prod [09:31:28] _joe_: no, videoscalers are excluded [09:31:34] (03CR) 10Giuseppe Lavagetto: [C: 032] puppet_ecdsacert: allow IP-based SANs [puppet] - 10https://gerrit.wikimedia.org/r/431738 (https://phabricator.wikimedia.org/T192370) (owner: 10Giuseppe Lavagetto) [09:31:43] <_joe_> I should be done by EOW [09:32:19] cool, thnx _joe_! [09:32:24] (03CR) 10Mobrovac: [C: 032] Switch all jobs for everything except wikipedia, commons and wikidata. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429980 (https://phabricator.wikimedia.org/T190327) (owner: 10Ppchelko) [09:32:26] awesome :) we're switching other jobs now, there's a bunch of excetional cases, so we have things to do for now [09:33:34] (03Merged) 10jenkins-bot: Switch all jobs for everything except wikipedia, commons and wikidata. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429980 (https://phabricator.wikimedia.org/T190327) (owner: 10Ppchelko) [09:33:48] !log pnorman@tin Finished deploy [kartotherian/deploy@9fc09ef]: Do a fresh deploy of Kartotherian to production (duration: 02m 35s) [09:33:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:46] <_joe_> let me know when I can blacklist all jobs but cirrus on the old jobrunner [09:36:31] _joe_: early next week I think. 
There's some more exceptions apart from cirrus though [09:36:59] _joe_: we are now switching everything but cirrus and couple of ones that fail due to serialisation for everything but wp, commons and wd [09:37:25] !log mobrovac@tin Synchronized wmf-config/jobqueue.php: Switch all jobs to EventBus for everything except wikipedia, commons and wikidata, file 1/2 - T190327 (duration: 01m 09s) [09:37:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:30] T190327: FY17/18 Q4 Program 8 Services Goal: Complete the JobQueue transition to EventBus - https://phabricator.wikimedia.org/T190327 [09:39:23] PROBLEM - tileratorui on maps-test2004 is CRITICAL: connect to address 10.192.16.35 and port 6535: Connection refused [09:39:33] PROBLEM - Check systemd state on maps-test2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:40:01] ^ maps-test2004 issues are expected, I'm scheduling downtime [09:40:06] 10Operations, 10Traffic, 10netops: cp intermittent IPsec MTU issue - https://phabricator.wikimedia.org/T195365#4228012 (10ayounsi) I looked into `charon.plugins.kernel-netlink.mtu` but for what I read it is only applied to routes added by ipsec in tunnel mode, while we use it in transport (transparent) mode.... [09:42:31] (03CR) 10Mobrovac: [C: 031] VCL: move RB Accept header normalization to text-fe [puppet] - 10https://gerrit.wikimedia.org/r/434706 (owner: 10Ema) [09:42:55] (03CR) 10Mobrovac: [C: 04-1] VCL: Normalise the Accept-Language header for the REST API (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/434558 (https://phabricator.wikimedia.org/T195327) (owner: 10Mobrovac) [09:44:10] (03PS1) 10Ppchelko: Make cirrusSearch jobTypeConf set explicitly and not rely on default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434889 [09:45:16] (03CR) 10jerkins-bot: [V: 04-1] Make cirrusSearch jobTypeConf set explicitly and not rely on default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434889 (owner: 10Ppchelko) [09:47:06] (03PS2) 10Ppchelko: Make cirrusSearch jobTypeConf set explicitly and not rely on default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434889 [09:50:07] (03CR) 10Mobrovac: [C: 032] Make cirrusSearch jobTypeConf set explicitly and not rely on default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434889 (owner: 10Ppchelko) [09:50:43] RECOVERY - tileratorui on maps-test2004 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.096 second response time [09:50:53] RECOVERY - Check systemd state on maps-test2004 is OK: OK - running: The system is fully operational [09:51:40] (03Merged) 10jenkins-bot: Make cirrusSearch jobTypeConf set explicitly and not rely on default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434889 (owner: 10Ppchelko) [09:53:42] !log mobrovac@tin Synchronized wmf-config/jobqueue.php: Switch all jobs to EventBus for everything except wikipedia, commons and wikidata, file 1/2, take #2 - T190327 (duration: 01m 08s) [09:53:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:46] T190327: FY17/18 Q4 Program 8 Services Goal: Complete the JobQueue transition to EventBus - https://phabricator.wikimedia.org/T190327 [09:54:25] !log ppchelko@tin Started deploy [cpjobqueue/deploy@b537fa1]: Switch all non-special jobs for everything except wikipedia, commons and wikidata T190327 [09:54:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:07] !log ppchelko@tin Finished deploy 
[cpjobqueue/deploy@b537fa1]: Switch all non-special jobs for everything except wikipedia, commons and wikidata T190327 (duration: 00m 42s) [09:55:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:32] !log mobrovac@tin Synchronized wmf-config/InitialiseSettings.php: Switch all jobs to EventBus for everything except wikipedia, commons and wikidata, file 2/2 - T190327 (duration: 01m 06s) [09:55:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:15] (03PS1) 10Giuseppe Lavagetto: puppet-ecdsacert: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/434890 [09:57:46] * mobrovac is done with tin [09:57:57] (03CR) 10jenkins-bot: Switch all jobs for everything except wikipedia, commons and wikidata. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429980 (https://phabricator.wikimedia.org/T190327) (owner: 10Ppchelko) [09:58:01] (03CR) 10jenkins-bot: Make cirrusSearch jobTypeConf set explicitly and not rely on default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434889 (owner: 10Ppchelko) [09:58:15] (03CR) 10Giuseppe Lavagetto: [C: 032] puppet-ecdsacert: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/434890 (owner: 10Giuseppe Lavagetto) [09:58:56] (03PS1) 10Ppchelko: Switch all job apart from exceptions for everything. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434891 (https://phabricator.wikimedia.org/T190327) [10:00:50] (03PS5) 10Alexandros Kosiaris: otrs: base::service_unit -> systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/433491 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [10:00:55] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] otrs: base::service_unit -> systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/433491 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [10:03:49] (03PS1) 10Alexandros Kosiaris: Reimage ganeti2001, ganeti2005 as stretch [puppet] - 10https://gerrit.wikimedia.org/r/434892 [10:11:16] (03PS4) 10Volans: wmf-auto-reimage: validate certificate fingerprint [puppet] - 10https://gerrit.wikimedia.org/r/433928 [10:11:18] (03PS1) 10Volans: wmf-auto-reimage: improve donwtime of reimaged host [puppet] - 10https://gerrit.wikimedia.org/r/434894 (https://phabricator.wikimedia.org/T195423) [10:14:17] 10Operations, 10monitoring, 10Patch-For-Review: Reduce false positive icinga alerts during host reimages - https://phabricator.wikimedia.org/T195423#4228173 (10Volans) The proposed approach don't take into account hosts installed for the first time. As for detecting the newly added host on the Icinga configu... 
[10:17:57] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): rename naos to deploy2001 and reinstall with stretch - https://phabricator.wikimedia.org/T193916#4228181 (10ayounsi) [10:18:00] 10Operations, 10netops: update switch port label from naos.codfw.wmnet to deploy2001.codfw.wmnet - https://phabricator.wikimedia.org/T195422#4228178 (10ayounsi) 05Open>03Resolved a:03ayounsi ```lang=diff [edit interfaces ge-5/0/15] - description naos; + description deploy2001; ``` [10:18:42] (03CR) 10Alexandros Kosiaris: [C: 032] Reimage ganeti2001, ganeti2005 as stretch [puppet] - 10https://gerrit.wikimedia.org/r/434892 (owner: 10Alexandros Kosiaris) [10:20:28] (03PS1) 10Giuseppe Lavagetto: puppet-ecdsacert: fix regex [puppet] - 10https://gerrit.wikimedia.org/r/434895 [10:21:29] (03CR) 10Giuseppe Lavagetto: [C: 032] puppet-ecdsacert: fix regex [puppet] - 10https://gerrit.wikimedia.org/r/434895 (owner: 10Giuseppe Lavagetto) [10:21:35] (03PS2) 10Giuseppe Lavagetto: puppet-ecdsacert: fix regex [puppet] - 10https://gerrit.wikimedia.org/r/434895 [10:27:15] 10Operations, 10Citoid, 10VisualEditor, 10Services (watching), 10User-mobrovac: Some requests for DOIs are failing or very slow; if we have a DOI and the request is taking too long, just use CrossRef data instead. - https://phabricator.wikimedia.org/T165105#4228213 (10Mvolz) >>! In T165105#4149139, @mobr... [10:32:34] * Krinkle staging on mwdebug1002 [10:40:43] (03PS3) 10Alexandros Kosiaris: keyholder: base::service_unit -> systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/434535 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [10:40:47] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] keyholder: base::service_unit -> systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/434535 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [10:41:39] (03CR) 10Vgutierrez: [C: 031] "some minor comments, LGTM otherwise" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/434894 (https://phabricator.wikimedia.org/T195423) (owner: 10Volans) [10:41:55] 10Operations, 10MediaWiki-Platform-Team, 10HHVM, 10Patch-For-Review, 10User-ArielGlenn: Run all jobs on PHP7 or HHVM - https://phabricator.wikimedia.org/T195393#4228235 (10Reedy) >>! In T195393#4226973, @Jdforrester-WMF wrote: > If you're testing anyway, how long does `maintenance/rebuildLocalisationCach... 
[10:51:15] !log krinkle@tin Synchronized php-1.32.0-wmf.5/includes/resourceloader/ResourceLoaderUserModule.php: T195380 (duration: 01m 08s) [10:51:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:20] T195380: Logged-out views must not make request for empty modules=user script - https://phabricator.wikimedia.org/T195380 [10:53:09] !log Unexpected dirty git status at tin:/srv/mediawiki-staging/php-1.32.0-wmf.4/extensions/JADE (1 file is locally deleted, but not committed) [10:53:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:32] !log krinkle@tin Synchronized php-1.32.0-wmf.4/includes/resourceloader/ResourceLoaderUserModule.php: T195380 (duration: 01m 08s) [10:55:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:44] (03PS5) 10Volans: wmf-auto-reimage: validate certificate fingerprint [puppet] - 10https://gerrit.wikimedia.org/r/433928 [10:55:46] (03PS2) 10Volans: wmf-auto-reimage: improve donwtime of reimaged host [puppet] - 10https://gerrit.wikimedia.org/r/434894 (https://phabricator.wikimedia.org/T195423) [10:55:48] (03PS1) 10Volans: wmf-auto-reimage: use absolute path for subprocess [puppet] - 10https://gerrit.wikimedia.org/r/434896 [10:55:59] (03CR) 10Volans: "Addressed comments" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/434894 (https://phabricator.wikimedia.org/T195423) (owner: 10Volans) [11:04:18] (03CR) 10Alexandros Kosiaris: [C: 031] Client self-update capability [software/debmonitor] - 10https://gerrit.wikimedia.org/r/432394 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [11:05:46] !log GTT work on eqiad-esams link starting soon [11:05:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:12] !log Deploy schema change on dbstore1002:s1 - T191519 T188299 T190148 [11:10:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:18] T191519: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519 [11:10:18] T190148: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148 [11:10:18] T188299: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299 [11:12:11] jouncebot: next [11:12:11] In 1 hour(s) and 47 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180524T1300) [11:13:29] 10Operations, 10Traffic, 10netops: cp intermittent IPsec MTU issue - https://phabricator.wikimedia.org/T195365#4228312 (10BBlack) >>! In T195365#4228012, @ayounsi wrote: > Raising the MTU above standard everywhere is indeed another can of worms and out of scope here. > With careful testing, raising it on so... 
[11:13:54] (03PS1) 10Marostegui: db-eqiad.php: Depool db1080 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434899 (https://phabricator.wikimedia.org/T190148) [11:15:27] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1080 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434899 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [11:16:44] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1080 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434899 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [11:18:51] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1080 for alter table (duration: 01m 08s) [11:18:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:07] !log Deploy schema change on db1080 - T191519 T188299 T190148 [11:19:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:12] T191519: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519 [11:19:12] T190148: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148 [11:19:13] T188299: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299 [11:20:24] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1080 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434899 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [11:20:52] 10Operations, 10Citoid, 10VisualEditor, 10Services (watching), 10User-mobrovac: Some requests for DOIs are failing or very slow; if we have a DOI and the request is taking too long, just use CrossRef data instead. - https://phabricator.wikimedia.org/T165105#4228334 (10mobrovac) Set the config manually to... [11:27:22] 10Operations, 10Cloud-Services, 10netops: Allocate public v4 IPs for Neutron setup in eqiad - https://phabricator.wikimedia.org/T193496#4228340 (10ayounsi) https://apps.db.ripe.net/db-web-ui/#/lookup?source=ripe&key=185.15.56.0%2F24AS14907&type=route created. IPv6 is tracked in T187929 and can indeed wait.... [11:31:13] !log rebooting mw1261, mw1276, mw1319, mw1312, mw1258, mw1221 to use Intel microcode updates [11:31:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:56] (03CR) 10Volans: [C: 032] Client self-update capability [software/debmonitor] - 10https://gerrit.wikimedia.org/r/432394 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [11:33:39] (03Merged) 10jenkins-bot: Client self-update capability [software/debmonitor] - 10https://gerrit.wikimedia.org/r/432394 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [11:34:09] (03PS3) 10Volans: CLI: use lsb_release for OS detection [software/debmonitor] - 10https://gerrit.wikimedia.org/r/432395 [11:35:52] (03CR) 10Volans: [C: 032] CLI: use lsb_release for OS detection [software/debmonitor] - 10https://gerrit.wikimedia.org/r/432395 (owner: 10Volans) [11:36:46] (03Merged) 10jenkins-bot: CLI: use lsb_release for OS detection [software/debmonitor] - 10https://gerrit.wikimedia.org/r/432395 (owner: 10Volans) [11:37:29] (03PS1) 10Gergő Tisza: Add WMDS support question feed to mediawikiwiki RSS whitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434901 (https://phabricator.wikimedia.org/T185087) [11:38:35] 10Operations, 10Citoid, 10VisualEditor, 10Services (watching), 10User-mobrovac: Some requests for DOIs are failing or very slow; if we have a DOI and the request is taking too long, just use CrossRef data instead. 
- https://phabricator.wikimedia.org/T165105#4228373 (10Mvolz) >>! In T165105#4228334, @mobr... [11:39:06] 10Operations, 10Developer-Relations, 10Discourse, 10Epic: Bring discourse.mediawiki.org to production - https://phabricator.wikimedia.org/T180853#4228375 (10Tgr) [11:53:06] 10Operations, 10WMDE-QWERTY-Team, 10wikidiff2, 10Patch-For-Review: Update wikidiff2 library on the WMF production cluster - https://phabricator.wikimedia.org/T190717#4228419 (10Lea_WMDE) @MoritzMuehlenhoff we are going forward with the deploy, the bug was only found in one of 300+ cases and does not break... [11:58:57] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942#4228437 (10Krinkle) [12:01:46] 10Operations, 10WMDE-QWERTY-Team, 10wikidiff2, 10Patch-For-Review: Update wikidiff2 library on the WMF production cluster - https://phabricator.wikimedia.org/T190717#4228440 (10MoritzMuehlenhoff) @Lea_WMDE Ack, I'll start upgrading the mediawiki canaries later the (CEST) afternoon. [12:07:31] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942#4228453 (10Krinkle) @ema ResourceLoader dashboards in Grafana have been updated to use Prometheus for all Varnish metrics. The varnishrls deamon for Grap... [12:14:03] (03PS4) 10Mark Bergsma: Cleanup monitor shutdown handler (invoking stop) after run [debs/pybal] - 10https://gerrit.wikimedia.org/r/433369 [12:19:51] (03CR) 10Mark Bergsma: [C: 031] Cleanup monitor shutdown handler (invoking stop) after run (031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/433369 (owner: 10Mark Bergsma) [12:20:00] (03CR) 10Mark Bergsma: [C: 032] Cleanup monitor shutdown handler (invoking stop) after run [debs/pybal] - 10https://gerrit.wikimedia.org/r/433369 (owner: 10Mark Bergsma) [12:20:40] (03Merged) 10jenkins-bot: Cleanup monitor shutdown handler (invoking stop) after run [debs/pybal] - 10https://gerrit.wikimedia.org/r/433369 (owner: 10Mark Bergsma) [12:25:42] (03PS4) 10Mark Bergsma: Split monitor tests into separate modules [debs/pybal] - 10https://gerrit.wikimedia.org/r/433370 [12:26:36] jouncebot: refresh [12:26:37] I refreshed my knowledge about deployments. [12:26:41] jouncebot: next [12:26:41] In 0 hour(s) and 33 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180524T1300) [12:27:50] are you serious, jouncebot? nothing for EU SWAT today?! [12:29:40] (03PS1) 10Gehel: maps: add fonts-noto and fonts-noto-cjk [puppet] - 10https://gerrit.wikimedia.org/r/434904 (https://phabricator.wikimedia.org/T195474) [12:29:42] (03CR) 10Mark Bergsma: [C: 032] Split monitor tests into separate modules [debs/pybal] - 10https://gerrit.wikimedia.org/r/433370 (owner: 10Mark Bergsma) [12:29:49] huh. 
i guess no one has yet found all the bugs we caused during the hackathon ;) [12:30:17] (03Merged) 10jenkins-bot: Split monitor tests into separate modules [debs/pybal] - 10https://gerrit.wikimedia.org/r/433370 (owner: 10Mark Bergsma) [12:30:24] (03CR) 10jerkins-bot: [V: 04-1] maps: add fonts-noto and fonts-noto-cjk [puppet] - 10https://gerrit.wikimedia.org/r/434904 (https://phabricator.wikimedia.org/T195474) (owner: 10Gehel) [12:31:30] (03PS2) 10Gehel: maps: add fonts-noto and fonts-noto-cjk [puppet] - 10https://gerrit.wikimedia.org/r/434904 (https://phabricator.wikimedia.org/T195474) [12:41:47] 10Operations, 10MediaWiki-Platform-Team, 10HHVM, 10Patch-For-Review, 10User-ArielGlenn: Run all jobs on PHP7 or HHVM - https://phabricator.wikimedia.org/T195393#4228562 (10Reedy) ``` reedy@mwmaint1001:/tmp$ php7.0 --version PHP 7.0.27-0+deb9u1 (cli) (built: Jan 5 2018 13:51:52) ( NTS ) Copyright (c) 199... [12:51:04] (03PS2) 10Alexandros Kosiaris: mathoid: Install ingress networkpolicy policy if enabled [deployment-charts] - 10https://gerrit.wikimedia.org/r/434473 [12:57:53] (03PS3) 10Ottomata: Blacklisting new iOS eventlogging schemas on MySQL [puppet] - 10https://gerrit.wikimedia.org/r/434424 (https://phabricator.wikimedia.org/T192819) (owner: 10Chelsyx) [12:58:00] (03CR) 10Ottomata: [V: 032 C: 032] Blacklisting new iOS eventlogging schemas on MySQL [puppet] - 10https://gerrit.wikimedia.org/r/434424 (https://phabricator.wikimedia.org/T192819) (owner: 10Chelsyx) [13:00:05] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180524T1300). [13:00:05] No GERRIT patches in the queue for this window AFAICS. [13:00:29] Yeah! :D [13:00:49] We are done! No moar deployments! Evar! ;P [13:02:58] (03CR) 10Vgutierrez: [C: 031] wmf-auto-reimage: use absolute path for subprocess [puppet] - 10https://gerrit.wikimedia.org/r/434896 (owner: 10Volans) [13:04:35] 10Operations, 10MediaWiki-Platform-Team, 10HHVM, 10Patch-For-Review, 10User-ArielGlenn: Run all jobs on PHP7 or HHVM - https://phabricator.wikimedia.org/T195393#4228588 (10Reedy) >>! In T191921#4160131, @Legoktm wrote: > {meme, src="full-steam-ahead", above="PHP7", below="full steam ahead"} [13:09:17] (03PS3) 10Volans: wmf-auto-reimage: improve downtime of reimaged host [puppet] - 10https://gerrit.wikimedia.org/r/434894 (https://phabricator.wikimedia.org/T195423) [13:10:02] zeljkof: in this case would maybe have a second to have a look at: https://gerrit.wikimedia.org/r/#/c/434011/ ? [13:10:11] sorry for the shameless self-promotion [13:10:58] leszek_wmde: no problem :) it's on my todo list for today, already working on patch for T167432 [13:10:59] T167432: Run Wikibase daily browser tests on Jenkins - https://phabricator.wikimedia.org/T167432 [13:11:14] zeljkof: amazing! thank you very much [13:12:31] leszek_wmde: see T167432#4228595 [13:13:42] zeljkof: oh. thanks! 
[13:13:54] !log upgrading mw1261 (canary host) to wikidiff 1.7.0 [13:13:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:32] (03CR) 10Vgutierrez: Use passed-in reactor in all monitors (031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/434684 (owner: 10Mark Bergsma) [13:20:09] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1080" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434909 [13:20:40] (03CR) 10Mark Bergsma: Use passed-in reactor in all monitors (031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/434684 (owner: 10Mark Bergsma) [13:21:30] (03PS4) 10Jcrespo: mariadb: Set up db1117:3325 as the backup host for m5 database section [puppet] - 10https://gerrit.wikimedia.org/r/434740 (https://phabricator.wikimedia.org/T192979) [13:24:55] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1080" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434909 (owner: 10Marostegui) [13:25:36] !log upgrading mw1262-mw1265 (canary hosts) to wikidiff 1.7.0 [13:25:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:09] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1080" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434909 (owner: 10Marostegui) [13:27:03] !log Running deduplicateArchiveRevId.php on group 1 for T193180 [13:27:05] (03PS1) 10Marostegui: db-eqiad.php: Depool db1067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434911 (https://phabricator.wikimedia.org/T190148) [13:27:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:07] T193180: Clean up archive rows with duplicate revision IDs - https://phabricator.wikimedia.org/T193180 [13:27:21] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1080 after alter table (duration: 00m 56s) [13:27:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:17] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434911 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [13:29:39] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1080" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434909 (owner: 10Marostegui) [13:29:58] (03CR) 10Gehel: "looks good in principle, minor comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/431860 (https://phabricator.wikimedia.org/T193766) (owner: 10Herron) [13:30:21] (03CR) 10Jcrespo: [C: 032] mariadb: Set up db1117:3325 as the backup host for m5 database section [puppet] - 10https://gerrit.wikimedia.org/r/434740 (https://phabricator.wikimedia.org/T192979) (owner: 10Jcrespo) [13:30:57] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434911 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [13:31:34] (03CR) 10Vgutierrez: [C: 031] "LGTM" [debs/pybal] - 10https://gerrit.wikimedia.org/r/434684 (owner: 10Mark Bergsma) [13:32:01] PROBLEM - Apache HTTP on mw1262 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.002 second response time [13:32:10] PROBLEM - Nginx local proxy to apache on mw1262 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.013 second response time [13:32:28] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1067 for alter table (duration: 01m 00s) [13:32:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:52] 
(03CR) 10Gehel: "LGTM, minor comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/431830 (https://phabricator.wikimedia.org/T193766) (owner: 10Herron) [13:33:15] !log Deploy schema change on db1067 - T191519 T188299 T190148 [13:33:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:20] T191519: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519 [13:33:21] T190148: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148 [13:33:21] T188299: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299 [13:33:28] (03PS1) 10Alexandros Kosiaris: mathoid: Do not if guard config-volume volume [deployment-charts] - 10https://gerrit.wikimedia.org/r/434915 [13:33:30] PROBLEM - Check systemd state on mw1262 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:33:31] PROBLEM - HHVM processes on mw1262 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm [13:34:37] ^ silencing [13:34:50] (03CR) 10Gehel: [V: 032 C: 032] "LGTM, checksum verified" [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/432136 (https://phabricator.wikimedia.org/T193734) (owner: 10DCausse) [13:35:35] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434911 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [13:35:46] k [13:36:31] RECOVERY - Check systemd state on mw1262 is OK: OK - running: The system is fully operational [13:36:31] RECOVERY - HHVM processes on mw1262 is OK: PROCS OK: 6 processes with command name hhvm [13:37:03] !log rebalance row_A codfw ganeti nodegroup. Fully upgrade to stretch now [13:37:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:10] RECOVERY - Apache HTTP on mw1262 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.606 second response time [13:37:11] RECOVERY - Nginx local proxy to apache on mw1262 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.070 second response time [13:38:57] (03PS1) 10Jcrespo: mariadb: Failover m2 master to db1065 instad of db2044 [puppet] - 10https://gerrit.wikimedia.org/r/434916 (https://phabricator.wikimedia.org/T195484) [13:39:48] (03CR) 10Jcrespo: [C: 032] mariadb: Failover m2 master to db1065 instad of db2044 [puppet] - 10https://gerrit.wikimedia.org/r/434916 (https://phabricator.wikimedia.org/T195484) (owner: 10Jcrespo) [13:41:32] 10Operations, 10MediaWiki-Platform-Team, 10HHVM, 10Patch-For-Review, 10User-ArielGlenn: Run all jobs on PHP7 or HHVM - https://phabricator.wikimedia.org/T195393#4228651 (10Reedy) And for some more fun and games... hhvm on tin ``` real 40m32.809s ``` [13:43:32] 10Operations, 10Scap, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): mwscript rebuildLocalisationCache.php takes 40 minutes - https://phabricator.wikimedia.org/T191921#4228654 (10Reedy) Interestingly for 1 branch of rebuildLocalisationCache.php tin hhvm ``` real 40m32.809s ``` tin php5 ``` r... [13:45:52] (03CR) 10Bstorm: "It appears this is already affecting users. I'll merge the patch since it worked well in the last script I had to update with pagination." 
[puppet] - 10https://gerrit.wikimedia.org/r/434755 (owner: 10Bstorm) [13:46:00] (03PS2) 10Bstorm: wiki replicas: maintain-dbusers to page through ldap [puppet] - 10https://gerrit.wikimedia.org/r/434755 [13:47:14] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] mathoid: Install ingress networkpolicy policy if enabled [deployment-charts] - 10https://gerrit.wikimedia.org/r/434473 (owner: 10Alexandros Kosiaris) [13:47:23] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] mathoid: Do not if guard config-volume volume [deployment-charts] - 10https://gerrit.wikimedia.org/r/434915 (owner: 10Alexandros Kosiaris) [13:47:54] (03CR) 10Bstorm: [C: 032] wiki replicas: maintain-dbusers to page through ldap [puppet] - 10https://gerrit.wikimedia.org/r/434755 (owner: 10Bstorm) [13:51:12] (03PS2) 10Alexandros Kosiaris: mathoid: Disable monitoring by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/434474 [13:51:32] <_joe_> ottomata: I have questions on cergen! [13:52:00] PROBLEM - puppet last run on lvs4007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:54:13] (03CR) 10Vgutierrez: [C: 031] "looks good! Thx Riccardo! :)" [puppet] - 10https://gerrit.wikimedia.org/r/434032 (owner: 10Volans) [13:54:41] (03Abandoned) 10Krinkle: multiversion: Move vendor/autoload from MWMultiVersion to profiler.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/432012 (owner: 10Krinkle) [13:54:44] <_joe_> ottomata: specifically, there doesn't seem to be a way to add keyUsage restrictions anywhere [13:55:18] "morning" [13:55:19] jouncebot: next [13:55:19] In 2 hour(s) and 4 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180524T1600) [13:56:07] hi _joe_ , naos is gone. deploy2001 is here. i want to add it to scap master now. currently there is just one scap master [13:56:28] <_joe_> mutante: +1! [13:56:40] hm, ya _joe_ i don't remember needing it so I didn't do that...but we could add that! [13:56:42] thanks! there might just be one blocker.. mysql grants [13:56:46] looking [13:57:12] (03PS2) 10Giuseppe Lavagetto: s/php5/php/ in foreachwikiindblist [puppet] - 10https://gerrit.wikimedia.org/r/434754 (https://phabricator.wikimedia.org/T195393) (owner: 10Reedy) [13:57:24] yea, i still need https://gerrit.wikimedia.org/r/#/c/434821/ but i wonder what will be broken when deploy2001 can't talk to "labswiki" db [13:58:03] (03CR) 10Giuseppe Lavagetto: [C: 032] s/php5/php/ in foreachwikiindblist [puppet] - 10https://gerrit.wikimedia.org/r/434754 (https://phabricator.wikimedia.org/T195393) (owner: 10Reedy) [13:58:14] woo hoo! [13:58:57] (03PS4) 10Dzahn: scap: add deploy2001 as scap master and host [puppet] - 10https://gerrit.wikimedia.org/r/433616 (https://phabricator.wikimedia.org/T193916) [13:59:34] <_joe_> ottomata: also, I don't care about generating java keystores, is there a way not to build them? [14:00:29] !log deploy2001: scap pull to sync. then add as scap master and host (gerrit:433616) [14:00:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:19] _joe_: no, but does it hurt to generate them? 
[14:01:45] <_joe_> ottomata: just useless scrap :P [14:02:18] aye [14:02:22] we could add that too i suppose [14:02:32] i think that would not be hard [14:02:33] <_joe_> but that's ok, I don't particularly care about that [14:02:45] (03CR) 10Dzahn: [C: 032] scap: add deploy2001 as scap master and host [puppet] - 10https://gerrit.wikimedia.org/r/433616 (https://phabricator.wikimedia.org/T193916) (owner: 10Dzahn) [14:02:58] there's already some should_generate() logic, could add some bit to the config about which should be generated [14:03:01] <_joe_> I'll take a look at how to add the keyusage attributes [14:03:02] but ya [14:03:03] ok [14:03:35] prob could pass it as a dict that could be used as kwargs for x509.KeyUsage [14:03:39] https://cryptography.io/en/latest/x509/reference/#cryptography.x509.KeyUsage [14:03:46] then you just set them in the yaml [14:03:58] key_usage: [14:03:58] key_cert_sign: true [14:03:58] decipher_only: true [14:04:01] etc. [14:04:14] <_joe_> yes, that was my plan [14:04:17] 10Operations, 10WMDE-QWERTY-Team, 10wikidiff2, 10Patch-For-Review: Update wikidiff2 library on the WMF production cluster - https://phabricator.wikimedia.org/T190717#4228731 (10MoritzMuehlenhoff) @Lea_WMDE, @WMDE-Fisch : The canary application servers have been upgraded and so far everything looks fine in... [14:04:18] coooooo [14:05:41] (03PS1) 10Alexandros Kosiaris: mathoid: Allow autoallocating a service port [deployment-charts] - 10https://gerrit.wikimedia.org/r/434918 [14:06:08] moritzm: If it is on the canaries, does that mean, I could have a look at mwdebug1001 or something? [14:06:41] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): rename naos to deploy2001 and reinstall with stretch - https://phabricator.wikimedia.org/T193916#4228736 (10Dzahn) [14:07:00] (03PS2) 10Krinkle: Move multiversion/vendor/ to vendor/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/432013 [14:07:57] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] mathoid: Disable monitoring by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/434474 (owner: 10Alexandros Kosiaris) [14:08:12] (03CR) 10jerkins-bot: [V: 04-1] Move multiversion/vendor/ to vendor/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/432013 (owner: 10Krinkle) [14:08:18] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] mathoid: Allow autoallocating a service port [deployment-charts] - 10https://gerrit.wikimedia.org/r/434918 (owner: 10Alexandros Kosiaris) [14:08:30] (03PS2) 10Dzahn: remove naos.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/434802 (https://phabricator.wikimedia.org/T193916) [14:09:07] !log empty ganeti2004 for stretch reimage [14:09:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:27] (03CR) 10Dzahn: [C: 032] "host name has been completely removed from the puppet repo and host has been reinstalled as deploy2001." 
[dns] - 10https://gerrit.wikimedia.org/r/434802 (https://phabricator.wikimedia.org/T193916) (owner: 10Dzahn) [14:11:35] (03PS3) 10Krinkle: Move multiversion/vendor/ to vendor/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/432013 [14:11:44] (03PS1) 10Jcrespo: mariadb: Remove old references to db105* hosts at dns [dns] - 10https://gerrit.wikimedia.org/r/434920 (https://phabricator.wikimedia.org/T186320) [14:11:55] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): rename naos to deploy2001 and reinstall with stretch - https://phabricator.wikimedia.org/T193916#4228739 (10Dzahn) [14:11:58] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Remove old references to db105* hosts at dns [dns] - 10https://gerrit.wikimedia.org/r/434920 (https://phabricator.wikimedia.org/T186320) (owner: 10Jcrespo) [14:12:33] 10Operations, 10MediaWiki-Platform-Team, 10HHVM, 10Patch-For-Review, 10User-ArielGlenn: Run all jobs on PHP7 or HHVM - https://phabricator.wikimedia.org/T195393#4226190 (10Joe) As of now, any new script will only run on php7 or HHVM. We can consider this task resolved. [14:12:39] (03PS2) 10Jcrespo: mariadb: Remove old references to db105* and codfw hosts at dns [dns] - 10https://gerrit.wikimedia.org/r/434920 (https://phabricator.wikimedia.org/T186320) [14:12:50] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Remove old references to db105* and codfw hosts at dns [dns] - 10https://gerrit.wikimedia.org/r/434920 (https://phabricator.wikimedia.org/T186320) (owner: 10Jcrespo) [14:14:09] (03PS4) 10Krinkle: Move /multiversion/vendor to /vendor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/432013 [14:14:30] (03PS3) 10Jcrespo: mariadb: Remove old references to db105* and codfw hosts at dns [dns] - 10https://gerrit.wikimedia.org/r/434920 (https://phabricator.wikimedia.org/T186320) [14:15:18] (03CR) 10Krinkle: "For deployment: Requires full scap (fairly obviously)." 
(031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/432013 (owner: 10Krinkle) [14:15:20] 10Operations, 10MediaWiki-Platform-Team, 10HHVM, 10User-ArielGlenn: Run all jobs on PHP7 - https://phabricator.wikimedia.org/T195392#4228757 (10Joe) [14:15:37] 10Operations, 10MediaWiki-Platform-Team, 10HHVM, 10Patch-For-Review, 10User-ArielGlenn: Run all jobs on PHP7 or HHVM - https://phabricator.wikimedia.org/T195393#4228756 (10Joe) 05Open>03Resolved [14:16:01] (03CR) 10Jcrespo: "Needs careful review" [dns] - 10https://gerrit.wikimedia.org/r/434920 (https://phabricator.wikimedia.org/T186320) (owner: 10Jcrespo) [14:22:17] RECOVERY - puppet last run on lvs4007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:24:17] PROBLEM - Disk space on stat1004 is CRITICAL: DISK CRITICAL - /mnt/hdfs is not accessible: Software caused connection abort [14:25:17] PROBLEM - MD RAID on stat1005 is CRITICAL: Return code of 255 is out of bounds [14:25:27] PROBLEM - Check systemd state on stat1005 is CRITICAL: Return code of 255 is out of bounds [14:25:37] PROBLEM - Disk space on stat1005 is CRITICAL: Return code of 255 is out of bounds [14:25:47] PROBLEM - SSH on stat1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:26:07] PROBLEM - DPKG on stat1005 is CRITICAL: Return code of 255 is out of bounds [14:26:08] PROBLEM - dhclient process on stat1005 is CRITICAL: Return code of 255 is out of bounds [14:26:08] elukey: ^^^ FYI [14:26:17] PROBLEM - configured eth on stat1005 is CRITICAL: Return code of 255 is out of bounds [14:26:35] yes yes [14:27:05] heh [14:27:17] RECOVERY - Disk space on stat1004 is OK: DISK OK [14:27:40] for some reason fuse_dfs on stat100[45] was eating a ton of memory [14:27:47] RECOVERY - SSH on stat1005 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u3 (protocol 2.0) [14:27:58] PROBLEM - puppet last run on stat1005 is CRITICAL: Return code of 255 is out of bounds [14:28:36] ack [14:29:16] (03PS4) 10Marostegui: mariadb: add mwmaint1001 to grants for production-m5 [puppet] - 10https://gerrit.wikimedia.org/r/430524 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [14:29:53] (03PS2) 10Alexandros Kosiaris: scaffolding: Disabling monitoring by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/434476 [14:29:55] (03PS1) 10Alexandros Kosiaris: Allow autoallocate service port, use it under minikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/434924 [14:30:43] (03CR) 10Marostegui: [C: 032] mariadb: add mwmaint1001 to grants for production-m5 [puppet] - 10https://gerrit.wikimedia.org/r/430524 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [14:33:37] (03PS1) 10Alexandros Kosiaris: Reimage ganeti1001, ganeti1006 as stretch [puppet] - 10https://gerrit.wikimedia.org/r/434927 [14:38:48] 10Operations, 10ops-codfw, 10DBA, 10decommission, 10Patch-For-Review: db2064 crashed and totally broken - decommission it - https://phabricator.wikimedia.org/T195228#4228800 (10Marostegui) I had a chat with @mark and for now we will not buy a replacement. If we have some more issues with other servers an... 
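The cergen exchange above (roughly [13:51]–[14:04]) sketches driving keyUsage restrictions from the certificate YAML by passing a mapping through as kwargs to cryptography's x509.KeyUsage. A minimal sketch of that idea, assuming a hypothetical key_usage: block in the manifest rather than cergen's actual schema:

```python
# Sketch: turn a partial key_usage mapping from YAML into a cryptography
# x509.KeyUsage extension by merging it over all-False defaults.
from cryptography import x509

KEY_USAGE_DEFAULTS = {
    'digital_signature': False,
    'content_commitment': False,
    'key_encipherment': False,
    'data_encipherment': False,
    'key_agreement': False,
    'key_cert_sign': False,
    'crl_sign': False,
    'encipher_only': False,
    'decipher_only': False,
}

def build_key_usage(manifest_key_usage):
    """Build x509.KeyUsage from e.g. {'key_cert_sign': True, 'key_agreement': True}."""
    unknown = set(manifest_key_usage) - set(KEY_USAGE_DEFAULTS)
    if unknown:
        raise ValueError('unknown key_usage flags: ' + ', '.join(sorted(unknown)))
    flags = dict(KEY_USAGE_DEFAULTS, **manifest_key_usage)
    return x509.KeyUsage(**flags)

# Note: the library only accepts encipher_only/decipher_only when
# key_agreement is also true, so the exact pair floated in the channel
# would need key_agreement set as well.
ku = build_key_usage({'key_cert_sign': True, 'key_agreement': True, 'decipher_only': True})
```

The resulting object would then be attached wherever the CSR or certificate is built, e.g. with builder.add_extension(ku, critical=True).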
[14:39:47] (03CR) 10Dzahn: [C: 032] mariadb: grant deploy1001 access to labswiki [puppet] - 10https://gerrit.wikimedia.org/r/434821 (https://phabricator.wikimedia.org/T175288) (owner: 10Dzahn) [14:40:38] (03PS2) 10Dzahn: mariadb: grant deploy1001 access to labswiki [puppet] - 10https://gerrit.wikimedia.org/r/434821 (https://phabricator.wikimedia.org/T175288) [14:40:49] (03PS3) 10Dzahn: mariadb: grant deploy1001 access to labswiki [puppet] - 10https://gerrit.wikimedia.org/r/434821 (https://phabricator.wikimedia.org/T175288) [14:42:14] (03PS2) 10Alexandros Kosiaris: Reimage ganeti1001, ganeti1006 as stretch [puppet] - 10https://gerrit.wikimedia.org/r/434927 [14:42:30] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Reimage ganeti1001, ganeti1006 as stretch [puppet] - 10https://gerrit.wikimedia.org/r/434927 (owner: 10Alexandros Kosiaris) [14:43:57] PROBLEM - Check the NTP synchronisation status of timesyncd on stat1005 is CRITICAL: Return code of 255 is out of bounds [14:44:00] (03PS2) 10Dzahn: mariadb: update m5 grants after naos became deploy2001 [puppet] - 10https://gerrit.wikimedia.org/r/434803 (https://phabricator.wikimedia.org/T193916) [14:51:04] (03CR) 10Dzahn: [C: 032] mariadb: update m5 grants after naos became deploy2001 [puppet] - 10https://gerrit.wikimedia.org/r/434803 (https://phabricator.wikimedia.org/T193916) (owner: 10Dzahn) [14:51:30] !log upgrading mwdebug servers in eqiad to wikidiff 1.7.0 [14:51:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:25] 10Operations, 10WMDE-QWERTY-Team, 10wikidiff2, 10Patch-For-Review: Update wikidiff2 library on the WMF production cluster - https://phabricator.wikimedia.org/T190717#4228819 (10MoritzMuehlenhoff) The mwdebug have also been upgraded. [14:56:38] !log shutting down elastic2020 for maintenance [14:56:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:11] (03PS6) 10Volans: facter: refactor the net_driver fact [puppet] - 10https://gerrit.wikimedia.org/r/434032 [14:57:39] !log Deploy schema change on dbstore1001:s1 - T191519 T188299 T190148 [14:57:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:44] T191519: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519 [14:57:44] T190148: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148 [14:57:44] T188299: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299 [14:58:10] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): rename naos to deploy2001 and reinstall with stretch - https://phabricator.wikimedia.org/T193916#4228843 (10Dzahn) [14:58:59] (03PS1) 10Jcrespo: mariadb: Reenable notifications on db1117 [puppet] - 10https://gerrit.wikimedia.org/r/434929 (https://phabricator.wikimedia.org/T192979) [14:59:07] PROBLEM - Host elastic2020 is DOWN: PING CRITICAL - Packet loss = 100% [14:59:08] PROBLEM - IPMI Sensor Status on stat1005 is CRITICAL: Return code of 255 is out of bounds [14:59:38] RECOVERY - DPKG on stat1005 is OK: All packages OK [14:59:48] RECOVERY - dhclient process on stat1005 is OK: PROCS OK: 0 processes with command name dhclient [14:59:57] RECOVERY - configured eth on stat1005 is OK: OK - interfaces up [14:59:57] RECOVERY - MD RAID on stat1005 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [15:00:07] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational [15:00:18] RECOVERY - 
Disk space on stat1005 is OK: DISK OK [15:01:14] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288#4228866 (10Dzahn) [15:01:17] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): rename naos to deploy2001 and reinstall with stretch - https://phabricator.wikimedia.org/T193916#4228865 (10Dzahn) 05Open>03Resolved [15:01:54] 10Operations, 10WMDE-QWERTY-Team, 10wikidiff2, 10Patch-For-Review: Update wikidiff2 library on the WMF production cluster - https://phabricator.wikimedia.org/T190717#4228867 (10WMDE-Fisch) > The mwdebug have also been upgraded. Just checked the inline diff there and as expected the moved paragraph changes... [15:03:28] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [15:05:37] I've restarted the nagios daemon on stat1005 [15:05:46] nrpe? [15:06:48] !log installing glibc updates from stretch point release [15:06:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:17] RECOVERY - Host elastic2020 is UP: PING OK - Packet loss = 0%, RTA = 36.89 ms [15:09:12] (03CR) 10Volans: [C: 032] facter: refactor the net_driver fact [puppet] - 10https://gerrit.wikimedia.org/r/434032 (owner: 10Volans) [15:12:08] PROBLEM - haproxy failover on dbproxy1002 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [15:12:17] PROBLEM - Disk space on db1065 is CRITICAL: DISK CRITICAL - /srv is not accessible: Input/output error [15:12:28] looks like db1065 storage died? [15:12:33] PROBLEM - MariaDB disk space on db1065 is CRITICAL: DISK CRITICAL - /srv is not accessible: Input/output error [15:12:42] looks like [15:12:44] root@db1065:~# df -hT [15:12:45] -bash: /bin/df: Input/output error [15:12:58] PROBLEM - haproxy failover on dbproxy1007 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [15:12:58] PROBLEM - Check systemd state on db1065 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:13:08] It had a failed disk earlier today: T195444 [15:13:08] T195444: Degraded RAID on db1065 - https://phabricator.wikimedia.org/T195444 [15:13:12] marostegui: it was degraded [15:13:13] PROBLEM - mysqld processes on db1065 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [15:14:08] RECOVERY - Check the NTP synchronisation status of timesyncd on stat1005 is OK: OK: synced at Thu 2018-05-24 15:14:03 UTC. [15:14:25] cmjohnson: were you changing the disk on db1065 by any chance? [15:14:34] Just wondering if it died just by itself [15:16:23] (03CR) 10Jcrespo: [C: 032] mariadb: Reenable notifications on db1117 [puppet] - 10https://gerrit.wikimedia.org/r/434929 (https://phabricator.wikimedia.org/T192979) (owner: 10Jcrespo) [15:16:28] (03PS2) 10Jcrespo: mariadb: Reenable notifications on db1117 [puppet] - 10https://gerrit.wikimedia.org/r/434929 (https://phabricator.wikimedia.org/T192979) [15:17:02] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1065 - https://phabricator.wikimedia.org/T195444#4228898 (10Marostegui) Storage crashed: ``` root@db1065:~# df -hT -bash: /bin/df: Input/output error root@db1065:~# dmesg -bash: /bin/dmesg: Input/output error ``` [15:17:37] mmm [15:18:29] nothing on HW logs [15:20:06] btw, did you get paged? 
I didn't but got the email [15:20:22] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1065 - https://phabricator.wikimedia.org/T195444#4228905 (10Marostegui) @Cmjohnson can you visually check if there are more than 1 disk broken? [15:21:03] I got no page [15:21:07] just an email [15:21:17] (03PS1) 10Jcrespo: mariadb: Disable db1065 notifications [puppet] - 10https://gerrit.wikimedia.org/r/434933 (https://phabricator.wikimedia.org/T195444) [15:22:04] and while I've been looking at icinga often i'm not going to necessarily see it up to the minute, pages are better [15:22:43] (03PS2) 10Jcrespo: mariadb: Disable db1065 notifications [puppet] - 10https://gerrit.wikimedia.org/r/434933 (https://phabricator.wikimedia.org/T195444) [15:22:52] according to the config it should have paged, checking [15:23:15] (03CR) 10Jcrespo: [C: 032] mariadb: Disable db1065 notifications [puppet] - 10https://gerrit.wikimedia.org/r/434933 (https://phabricator.wikimedia.org/T195444) (owner: 10Jcrespo) [15:23:52] it would page for host down but the host is up, only services on it are down? [15:24:10] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for Burrow Prometheus exporters [puppet] - 10https://gerrit.wikimedia.org/r/434934 (https://phabricator.wikimedia.org/T135991) [15:24:15] mutante: [1527174793] SERVICE NOTIFICATION: andrew;db1065;mysqld processes;CRITICAL;notify-by-sms-gateway;PROCS CRITICAL: 0 processes with command name mysq [15:24:15] ah, it did page [15:24:26] did anything change recently on the gateway system? [15:24:27] i can see from the mails to alerts@wm.org [15:24:35] that means it really should have [15:24:38] (03CR) 10jerkins-bot: [V: 04-1] Enable base::service_auto_restart for Burrow Prometheus exporters [puppet] - 10https://gerrit.wikimedia.org/r/434934 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [15:24:41] I saw some CR regarding smtp config [15:24:46] not sure if anything changed there [15:25:27] (03CR) 10Jcrespo: [C: 032] "I am not sure this will take effection without being able to run puppet." 
[puppet] - 10https://gerrit.wikimedia.org/r/434933 (https://phabricator.wikimedia.org/T195444) (owner: 10Jcrespo) [15:25:43] the alerts that trigger the pages are both "mysqld processes" and "mariadb disk space" [15:25:51] let's see if Herron's change was merged [15:26:14] no, it's not [15:26:16] no not merged yet [15:26:21] ack [15:26:29] volans: ^ then i dont know of any changes [15:27:23] (03PS2) 10Muehlenhoff: Enable base::service_auto_restart for Burrow Prometheus exporters [puppet] - 10https://gerrit.wikimedia.org/r/434934 (https://phabricator.wikimedia.org/T135991) [15:28:06] the good news is that there is more redundancy as usual because I set up db1117 [15:28:15] :) [15:28:29] but apparently, hosts reject hosting otrs database [15:28:41] and they commit suicide before doing that, akosiaris [15:28:43] i am testing sending an SMS to myself directly from the icinga server, using direct email via my script [15:28:46] and that works [15:29:05] so it's not einsteinium having a general mail problem [15:29:14] RECOVERY - IPMI Sensor Status on stat1005 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [15:29:15] and also it shouldnt be the icinga config for this alert [15:29:24] we can do another test if everybody is ok with the spamming [15:29:30] i am [15:29:53] (03PS1) 10Cmjohnson: DNS entries db1124/db1125 [dns] - 10https://gerrit.wikimedia.org/r/434937 (https://phabricator.wikimedia.org/T194780) [15:29:57] also that test used the same mail2sms gateway... [15:30:35] (03PS2) 10Cmjohnson: DNS entries db1124/db1125 [dns] - 10https://gerrit.wikimedia.org/r/434937 (https://phabricator.wikimedia.org/T194780) [15:31:29] which service should I fail, the same on another host? [15:32:07] "mysqld process" on any host should do it [15:32:10] yeah [15:32:17] (03CR) 10Cmjohnson: [C: 032] DNS entries db1124/db1125 [dns] - 10https://gerrit.wikimedia.org/r/434937 (https://phabricator.wikimedia.org/T194780) (owner: 10Cmjohnson) [15:32:18] lets go with some in production [15:32:21] but depooled [15:32:32] production here means core [15:32:44] you can just disable active notification and then manually set the value on icinga [15:32:56] sorry s/notification/checks/ [15:33:17] oh, yea, also an ACK on the alert should nowadays send a proper ACK via SMS (if service is paging) [15:33:34] ok, I can try that [15:35:10] I'm checking on the external gateway in the meanwhile [15:35:44] i am watching the icinga log [15:35:50] and will check the alerts@ mail [15:37:34] icinga log says it did "notify-by-sms-gateway" for a bunch of you [15:37:45] not for me because it wasn't within my awake hours timeperiod [15:38:02] that could mean AQL is down [15:38:10] elukey: did you get a page for Hadoop HDFS Fuse earlier today? [15:38:20] I got a text [15:38:22] oh, here they are [15:38:25] I got it now the critical [15:38:29] yeah [15:38:31] same here [15:38:31] I just got one as well [15:38:44] too bad they don't have the timestamp [15:39:00] ouch, 25 minutes? [15:39:09] i did not. i guess it's daylight savings time and i would get it from 9am instead of 8am [15:39:29] yeah pages arrived [15:39:31] wait, it was sent 25 minutes delayed or this was the new test? 
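On the "disable active checks and set the value by hand" suggestion above ([15:32]–[15:33]): on an Icinga 1 style setup that is usually done by writing a PROCESS_SERVICE_CHECK_RESULT line into the external command file. A rough sketch, with the command-file path as an assumption (take the real one from icinga.cfg), and with active checks for the service disabled first so the next real check does not immediately overwrite the fake result:

```python
# Sketch: inject a fake CRITICAL result for a service through Icinga's
# external command pipe, e.g. to trigger a test page on a depooled host.
import time

CMD_FILE = '/var/lib/icinga/rw/icinga.cmd'  # assumed path, check icinga.cfg

def submit_passive_result(host, service, state, output):
    # state codes: 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN
    line = '[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s\n' % (
        int(time.time()), host, service, state, output)
    with open(CMD_FILE, 'w') as cmd:   # the command file is a FIFO
        cmd.write(line)

# submit_passive_result('some-depooled-host', 'mysqld processes', 2,
#                       'test page, please ignore')
```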
[15:39:39] that's my question too [15:39:43] they appeared now on the delivery log of the external gateway service with time '16:13:14' [15:40:05] jouncebot next [15:40:05] In 0 hour(s) and 19 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180524T1600) [15:40:08] so yeah, the had ~25 minutes of delay :( [15:40:37] last time it did the notification commands was at 1527174793 [15:40:50] marostegui: with your ok, I would like to do a hard restart of db1065 [15:41:01] +1 [15:41:02] I know you may want to wait for visual inspection [15:41:07] nah, go for it [15:41:20] but I want to be able to run for it, and there is probably not much to save anyway [15:41:26] *run puppet [15:42:07] Can I get a quick ping when y'all are done with the db stuff? I Need to do a gerrit restart, but don't want another thing moving while you're busy on that. [15:42:10] we may want to reconsider our renewal schedule for some hosts [15:42:27] 10Operations, 10Analytics, 10Research, 10Traffic, and 6 others: Referrer policy for browsers which only support the old spec - https://phabricator.wikimedia.org/T180921#4229008 (10TheDJ) Shall we consider this closed then ? [15:42:44] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [15:43:08] uh, that seems more important [15:43:14] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [15:43:14] ACKNOWLEDGEMENT - Disk space on db1065 is CRITICAL: DISK CRITICAL - /srv is not accessible: Input/output error daniel_zahn DBAs are on it (testing SMS) [15:44:12] 10Operations, 10Traffic, 10Browser-Support-Apple-Safari, 10Upstream: Fix broken referer categorization for visits from Safari browsers - https://phabricator.wikimedia.org/T154702#4229012 (10TheDJ) 05Open>03Resolved a:03TheDJ I think we can call this one resolved, per T180921#4226106 and {F18492888} [15:44:59] what is it, is it thumbs? [15:45:12] I see nothing in MW logs, and looks like it was just a spike? [15:45:27] * addshore has a 'pretty urgent' patch to backport for wikidata dispatching, if there is space before puppet swat... [15:45:35] as wikidata dispatching has been broken since the train yesterday [15:46:00] gerrit restart can wait [15:46:03] uploads at esams [15:46:05] https://grafana.wikimedia.org/dashboard/db/wikidata-dispatch?refresh=1m&orgId=1 :( [15:46:12] * no_justification is grumpy anyway [15:46:57] jouncebot: now [15:46:57] No deployments scheduled for the next 0 hour(s) and 13 minute(s) [15:47:59] * addshore will merge and sync https://gerrit.wikimedia.org/r/#/c/434940/ now (only touching the dispatch maint script) [15:48:12] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1067" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434942 [15:48:16] addshore: I will wait for you [15:48:30] !log swapped failed disk 0 db1065 [15:48:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:41] marostegui: thanks, just waiting for CI [15:48:56] cmjohnson: did you just swap it? or did you do it earlier? 
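The delay estimate above checks out against the timestamps in the log: the notification line quoted at [15:24:15] carries the epoch timestamp 1527174793, which is 15:13:13 UTC on 2018-05-24, the minute the db1065 mysqld alert fired, while the texts were only reported arriving around 15:38–15:39. A quick sanity check of that arithmetic:

```python
# Convert the epoch from the icinga notification and compare it with the
# time the SMSes were reported as arriving in the channel.
from datetime import datetime, timezone

sent = datetime.fromtimestamp(1527174793, tz=timezone.utc)
print(sent.isoformat())          # 2018-05-24T15:13:13+00:00

received = sent.replace(hour=15, minute=38, second=30)  # ~when texts landed
print(received - sent)           # 0:25:17, i.e. the ~25 minute delay

# The gateway delivery log's '16:13:14' matches the same send time rendered
# in a UTC+1 local zone, which suggests the delay was on the delivery side.
```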
[15:49:09] just now..well like 1 min ago [15:49:20] cmjohnson: cool - thanks! [15:49:32] cmjohnson: can you see some other failed disks on that host? [15:49:37] yep..once that rebuilds I will swap out the 2nd disk slot 1....we need to order more disks [15:49:40] the storage died like 30 minutes ago or so [15:50:18] I am not sure that will rebuild [15:50:25] or at least, to something usable [15:50:36] we probably have to reimage it [15:51:01] I can just replace the 2nd disk if you want to re-install? or wait and see what happens [15:51:13] so both disks of the same slot died or was a random crash? [15:51:14] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [15:51:44] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [15:52:05] cmjohnson: we should still have plenty of used disks from all the servers that we've decommissioned no? [15:52:23] 10Operations, 10Scap, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): mwscript rebuildLocalisationCache.php takes 40 minutes - https://phabricator.wikimedia.org/T191921#4229024 (10thcipriani) >>! In T191921#4228654, @Reedy wrote: > tin php5 > ``` > real 24m13.240s > ``` This result seems confusin... [15:52:34] I am asking becuse if 2 died, we just can reimage and forget, but if it is the controller, I don't think we can trust it [15:53:16] cmjohnson: mentioned two disks, so maybe that was it [15:53:33] My best guess is the disks...we have had multiple disk failures on that batch of servers. [15:53:53] it is 2 disk failures slot 0 and slot 1 [15:54:04] so that's probably the same span [15:54:08] let me check another similar server [15:55:01] yeah, 0 and 1 are normally the same span, so it makes sense the server died entirely [15:55:08] yes, sorry, I meant span [15:55:12] not slot [15:55:24] each disk is on a separate slot, of course [15:55:55] so definitely we need a reimage [15:56:01] yeah [15:56:08] however, we have lost the copy of m1 [15:56:11] 10Operations, 10Analytics, 10Research, 10Traffic, and 6 others: Referrer policy for browsers which only support the old spec - https://phabricator.wikimedia.org/T180921#4229032 (10Tgr) Would be nice if someone could test it on IE or Edge, now that Safari is fixed those are the only two browsers that need a... [15:57:16] I think it was copied to db1117 [15:57:30] so we should have still 2 old copies of m1 [15:58:01] syncing [15:58:11] \o/ [15:58:16] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1067" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434942 (owner: 10Marostegui) [15:58:25] jynus: should we get both disks replaced then? 
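The reasoning above ([15:53]–[15:55]) is about RAID10 layout: slots 0 and 1 normally form one mirrored span, so losing both kills that stripe segment and with it the whole array, whereas two failures spread across different spans would have been survivable. A toy illustration, not tied to the actual controller layout on db1065:

```python
# Toy RAID10 model with 2-disk mirror spans: the array survives a set of
# failed slots only as long as no span loses all of its members.
def raid10_survives(failed_slots, total_disks=12, span_size=2):
    spans = [set(range(i, i + span_size))
             for i in range(0, total_disks, span_size)]
    return all(span - set(failed_slots) for span in spans)

print(raid10_survives({0, 1}))   # False: both halves of span 0 gone -> array lost
print(raid10_survives({0, 2}))   # True: one disk each from two different spans
```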
[15:58:31] yeah [15:58:32] if we are going to reimage anyways [15:58:34] then reimage [15:58:44] then copy from db1117 [15:58:49] cmjohnson feel free to replace the other disk that failed whenever you have time, we will reimage db1065 anyways [15:59:09] I may rebuild the RAID, too [15:59:23] okay...will do in a few mins [15:59:30] jynus: yeah, good idea [15:59:32] cmjohnson: if you need 600GB disks [15:59:39] I can tell you where to get some [15:59:43] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1067" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434942 (owner: 10Marostegui) [15:59:58] yeah, there are quite a bunch of tasks for decommissioning, those should have 600GB disks [15:59:59] * addshore twiddels thumbs waiting for sync... [16:00:04] godog, moritzm, and _joe_: Dear deployers, time to do the Puppet SWAT(Max 6 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180524T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. [16:01:27] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1067" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434942 (owner: 10Marostegui) [16:01:29] (03PS1) 10Marostegui: db-eqiad.php: Depool db1099:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434944 (https://phabricator.wikimedia.org/T190148) [16:02:14] sync-masters seems to be taking an age :( [16:02:29] :( [16:02:44] 10Operations, 10Analytics, 10Research, 10Traffic, and 6 others: Referrer policy for browsers which only support the old spec - https://phabricator.wikimedia.org/T180921#4229047 (10TheDJ) P.S. I do think that fallback is indeed broken on Safari. Note how the error message says it is reverting the policy to... [16:03:01] marostegui the disk has been replaced...ready for your reinstall [16:03:28] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1065 storage crash - https://phabricator.wikimedia.org/T195444#4229050 (10jcrespo) [16:03:48] (03PS1) 10Jcrespo: auto_install: Allow full reimage of db1065, disallow most others [puppet] - 10https://gerrit.wikimedia.org/r/434946 (https://phabricator.wikimedia.org/T195444) [16:03:50] cmjohnson: thank you [16:03:56] 3 mins on sync-masters... ill cancel and try again [16:04:32] (03PS2) 10Jcrespo: auto_install: Allow full reimage of db1065, disallow most others [puppet] - 10https://gerrit.wikimedia.org/r/434946 (https://phabricator.wikimedia.org/T195444) [16:04:54] PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy2001 is CRITICAL: Improperly owned (0:0) files in /srv/mediawiki-staging [16:05:17] ^^ I imagine that has something to do with it? [16:05:20] mutante: ^ is that you? [16:06:25] uhm.. yes.. as in "i added deploy2001 to scap masters" [16:06:41] i didn't touch permissions of /srv stuff though.. looking [16:07:06] * addshore is currently trying to do a sync, and its stuck on sync-masters [16:07:30] ok to deploy https://gerrit.wikimedia.org/r/434946 ? [16:08:04] jynus: you added db2065, not db1065 [16:08:11] oh [16:08:36] addshore: how exactly is it stuck? 
[16:08:41] does it show any error [16:08:45] nope [16:08:45] sync-masters: 0% (ok: 0; fail: 0; left: 1) [16:08:58] been running this attempt for 3 mins again now [16:09:03] only syncing a single file also [16:09:30] marostegui: check now, please [16:09:46] checking [16:09:50] (03PS3) 10Jcrespo: auto_install: Allow full reimage of db1065, disallow most others [puppet] - 10https://gerrit.wikimedia.org/r/434946 (https://phabricator.wikimedia.org/T195444) [16:10:09] (03CR) 10Marostegui: [C: 031] auto_install: Allow full reimage of db1065, disallow most others [puppet] - 10https://gerrit.wikimedia.org/r/434946 (https://phabricator.wikimedia.org/T195444) (owner: 10Jcrespo) [16:10:22] (03CR) 10Jcrespo: [C: 032] auto_install: Allow full reimage of db1065, disallow most others [puppet] - 10https://gerrit.wikimedia.org/r/434946 (https://phabricator.wikimedia.org/T195444) (owner: 10Jcrespo) [16:13:23] mutante: 8 mins now, the background process is doing stuff with deploy2001.codfw.wmnet [16:13:48] mutante: Can we get it removed so we can deploy the stuff that is in the queue and do the tests later maybe? [16:13:53] !log deploy2001 - let mwdeploy own .~tmp~ in /srv/mediawiki-staging [16:13:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:01] oh :) [16:14:38] on 2001 it looks like cdb rebuild is running? [16:14:51] or was, sync-masters just finished [16:14:54] RECOVERY - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy2001 is OK: Files ownership is ok. [16:15:00] \o/ [16:15:13] !log addshore@tin Synchronized php-1.32.0-wmf.5/extensions/Wikibase/repo/maintenance/dispatchChanges.php: [[gerrit:434940|Dont use WikibaseRepo class in dispatchChanges constructor]] (duration: 10m 28s) [16:15:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:30] got 1 error with the sync [16:15:30] 16:14:51 Check 'Check endpoints for mwdebug1001.eqiad.wmnet' failed: /wiki/{title} (Special Version) timed out before a response was received [16:15:48] I am deploying now too, let's see how it goes [16:16:01] i am glad that fixed the icinga alert [16:16:13] addshore: that would make sense since it was the first deploy for this ever [16:16:15] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1099:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434944 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [16:16:31] mutante: ack! :) [16:16:36] if there are more issues i can remove it [16:16:44] (deploy2001 from scap lists) [16:16:46] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1067 after alter table (duration: 01m 17s) [16:16:48] mutante: My deploy was all good:-) [16:16:49] but i would love it if we dont have to :) [16:16:49] No errors [16:16:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:51] yay [16:16:58] that sync went much faster! woo :) [16:17:05] :)) [16:17:15] I wonder if it might be an idea to force the cdb rebuild when adding them in the future? 
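The "Improperly owned (0:0) files in /srv/mediawiki-staging" alert above is, in essence, a sweep for root-owned entries in the scap staging tree (here resolved by chowning the .~tmp~ directory to mwdeploy, after which the first sync through the new master still had to sit through a cdb rebuild). A rough sketch of that kind of check, with the path and the exact ownership rule as assumptions rather than the production plugin:

```python
# Sketch: flag paths under the staging directory owned by root (uid or gid 0).
import os
import sys

STAGING = '/srv/mediawiki-staging'  # path taken from the alert text

def improperly_owned(root):
    bad = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue  # e.g. temp files vanishing during a sync
            if st.st_uid == 0 or st.st_gid == 0:
                bad.append(path)
    return bad

if __name__ == '__main__':
    offenders = improperly_owned(STAGING)
    if offenders:
        print('CRITICAL: %d root-owned paths, e.g. %s' % (len(offenders), offenders[0]))
        sys.exit(2)
    print('OK: files ownership is ok')
```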
[16:17:27] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1099:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434944 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [16:17:42] yea, probably, i wasn't really aware of that [16:18:15] PROBLEM - Host db1065 is DOWN: PING CRITICAL - Packet loss = 100% [16:18:39] (03PS1) 10Cmjohnson: Add MAC addresses db1124/25 [puppet] - 10https://gerrit.wikimedia.org/r/434947 (https://phabricator.wikimedia.org/T194780) [16:19:00] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1099:3311 for alter table (duration: 01m 13s) [16:19:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:15] !log Deploy schema change on db1099:3311 - T191519 T188299 T190148 [16:19:17] (03CR) 10Cmjohnson: [C: 032] Add MAC addresses db1124/25 [puppet] - 10https://gerrit.wikimedia.org/r/434947 (https://phabricator.wikimedia.org/T194780) (owner: 10Cmjohnson) [16:19:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:21] T191519: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519 [16:19:21] T190148: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148 [16:19:21] T188299: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299 [16:19:25] PROBLEM - Disk space on tin is CRITICAL: DISK CRITICAL - free space: / 1393 MB (3% inode=61%) [16:19:34] heh, tin is running full [16:19:51] but it only has to survive until tomorrow [16:19:52] <_joe_> mutante: what's happening there? do you need me to take a look? [16:19:59] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1099:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434944 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [16:20:03] <_joe_> no, let's see what's wrong please [16:20:07] _joe_: no, it seems it'a already fixed [16:20:47] <_joe_> ottomata: can I ask you to review https://gerrit.wikimedia.org/r/434948 today? I really need to keep this going [16:21:01] 10Operations, 10Scap, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): mwscript rebuildLocalisationCache.php takes 40 minutes - https://phabricator.wikimedia.org/T191921#4229114 (10Reedy) ``` reedy@tin:~$ time PHP=php5 mwscript rebuildLocalisationCache.php --wiki=enwiki --outdir=/tmp/l10nstuff3 408... [16:21:16] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: rack/setup/install db112[45].eqiad.wmnet (sanitarium expansion) - https://phabricator.wikimedia.org/T194780#4229116 (10Cmjohnson) [16:21:35] RECOVERY - Disk space on tin is OK: DISK OK [16:21:53] _joe_: tin disk space issue is unrelated.. never ran full.. and i fixed with apt-get clean [16:22:56] !log tin apt-get clean saved 7% disk space on / - fixing disk space alert [16:22:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:55] dbproxy1002 and dbproxy1007 will complain until db1065 reimage [16:25:37] although I think I could failover them to the 3rd copy [16:25:40] 10Operations, 10Scap, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): mwscript rebuildLocalisationCache.php takes 40 minutes - https://phabricator.wikimedia.org/T191921#4229129 (10thcipriani) >>! In T191921#4229114, @Reedy wrote: > ``` > reedy@tin:~$ time PHP=php5 mwscript rebuildLocalisationCache... 
[16:26:55] 10Operations, 10Scap, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): mwscript rebuildLocalisationCache.php takes 40 minutes on HHVM (rather than ~5 on PHP 5) - https://phabricator.wikimedia.org/T191921#4229130 (10Jdforrester-WMF) [16:28:25] RECOVERY - haproxy failover on dbproxy1007 is OK: OK check_failover servers up 2 down 0 [16:28:44] !log manually failover the backup host for m2 to db1117:3322 [16:28:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:05] RECOVERY - haproxy failover on dbproxy1002 is OK: OK check_failover servers up 2 down 0 [16:29:25] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on einsteinium is CRITICAL: cluster={cache_text,cache_upload} site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:29:54] PROBLEM - HTTP availability for Varnish at esams on einsteinium is CRITICAL: job={varnish-text,varnish-upload} site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [16:30:25] there are several issues with varnish on esams bblack [16:30:57] I saw 5 hosts complaining, known? [16:31:15] !log Running deduplicateArchiveRevId.php on group 2 for T193180 [16:31:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:19] T193180: Clean up archive rows with duplicate revision IDs - https://phabricator.wikimedia.org/T193180 [16:31:34] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [16:31:45] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [16:32:04] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [16:33:04] RECOVERY - HTTP availability for Varnish at esams on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [16:33:45] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:35:58] <_joe_> uh [16:36:08] there was a single spike of 500s [16:36:10] already recovered [16:36:13] <_joe_> in esams? [16:36:25] <_joe_> or general, just more prominent in esams given the time of the day [16:37:08] esams only AFAICS [16:37:17] both text and upload [16:37:24] https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [16:38:06] 10Operations, 10Traffic, 10Browser-Support-Apple-Safari, 10Upstream: Fix broken referer categorization for visits from Safari browsers - https://phabricator.wikimedia.org/T154702#4229170 (10Nuria) @TheDJ is there a way to know when the fix landed in the safari version users have (not when it was merged and... 
[16:38:34] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [16:39:24] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [16:40:05] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [16:43:31] hi, im getting a 503 at [16:43:32] https://wikitech.wikimedia.org/wiki/Wikitech:Cloud_Services_Terms_of_use#If_my_tools_collect_Private_Information [16:43:39] If you report this error to the Wikimedia System Administrators, please include the details below. [16:43:39] Request from 2a00:23c4:ad0a:7d01:995e:9d67:b5e9:7563 via cp3010 cp3010, Varnish XID 61223376 [16:43:39] Error: 503, Backend fetch failed at Thu, 24 May 2018 16:43:12 GMT [16:44:00] <_joe_> andrewbogott: ^^ [16:44:02] Probably due to the esams errors above [16:44:03] ? [16:44:04] <_joe_> this is on wikitech [16:44:07] <_joe_> nope reedy [16:44:25] <_joe_> or yes, indeed [16:44:25] ACKNOWLEDGEMENT - MD RAID on bast3002 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T195501 [16:44:31] 10Operations, 10ops-esams: Degraded RAID on bast3002 - https://phabricator.wikimedia.org/T195501#4229204 (10ops-monitoring-bot) [16:44:34] PROBLEM - Host eeden is DOWN: PING CRITICAL - Packet loss = 100% [16:44:36] <_joe_> paladox: purge the cache :P [16:44:40] ok [16:44:42] <_joe_> WTF? [16:44:45] <_joe_> eeden down? [16:44:46] ah [16:44:48] works now [16:44:56] <_joe_> ema vgutierrez [16:44:57] I dont see any logspam? [16:45:01] wut? [16:45:14] PROBLEM - Host ns2-v4 is DOWN: PING CRITICAL - Packet loss = 100% [16:45:21] XioNoX too [16:45:29] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack frpig1001 - https://phabricator.wikimedia.org/T187365#4229207 (10Jgreen) [16:45:35] wut [16:45:54] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on einsteinium is CRITICAL: cluster={cache_text,cache_upload} site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:45:55] few errors on phab too [16:46:15] PROBLEM - HTTP availability for Varnish at esams on einsteinium is CRITICAL: job={varnish-text,varnish-upload} site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [16:46:15] <_joe_> ok [16:46:20] <_joe_> let's depool esams NOW [16:46:42] <_joe_> volans: do you concur? [16:46:51] I don't see anything going on on the LVSes or traffic graphs, but sure [16:46:54] better safe than sorry [16:46:54] RECOVERY - Host eeden is UP: PING OK - Packet loss = 0%, RTA = 83.72 ms [16:47:03] yeah [16:47:14] RECOVERY - Host ns2-v4 is UP: PING OK - Packet loss = 0%, RTA = 83.69 ms [16:47:16] XioNoX: did you do anything? [16:47:17] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack frdata1001 - https://phabricator.wikimedia.org/T187364#4229224 (10Jgreen) [16:47:22] want me to do it? 
[16:47:24] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [16:47:29] nop, barely looked [16:47:35] no I mean the recovery was due to you doing something :D [16:47:43] _joe_: I'm at a conference but looking on. Seems resolved? [16:47:48] <_joe_> andrewbogott: yeah sorry [16:47:59] np, thanks [16:48:07] (03PS1) 10Giuseppe Lavagetto: Depool esams [dns] - 10https://gerrit.wikimedia.org/r/434952 [16:48:14] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [16:48:23] * volans checking eeded [16:48:27] *eeden [16:48:28] <_joe_> XioNoX: if you want, I can depool [16:48:51] confd was spamming syslog with All the given peers are not reachable [16:48:55] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [16:48:57] seems network so far [16:49:04] (03PS1) 10Ayounsi: Depool esams - intermittent issues [dns] - 10https://gerrit.wikimedia.org/r/434953 [16:49:13] https://gerrit.wikimedia.org/r/#/c/434953/ [16:49:19] started at 16:26:12 [16:49:22] volans: _joe_ bblack ^ [16:49:33] (03CR) 10Ayounsi: [C: 032] Depool esams - intermittent issues [dns] - 10https://gerrit.wikimedia.org/r/434953 (owner: 10Ayounsi) [16:49:49] ack [16:50:06] <_joe_> XioNoX: commit yours :) [16:50:50] !log depooled esams - investigating issues [16:50:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:14] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:51:35] RECOVERY - HTTP availability for Varnish at esams on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [16:51:40] so far nothing on the host on eeden, XioNoX you might want to have a look at the network devices [16:51:52] yeah, looking [16:53:00] maybe this https://phabricator.wikimedia.org/T195501 ? 
[16:53:05] PROBLEM - Check systemd state on lawrencium is CRITICAL: Return code of 255 is out of bounds [16:53:26] 10Operations, 10ops-codfw, 10DBA: Swith port information for db209[4-5] - https://phabricator.wikimedia.org/T195507#4229282 (10Papaul) p:05Triage>03Normal [16:54:17] 10Operations, 10ops-esams: Degraded RAID on bast3002 - https://phabricator.wikimedia.org/T183814#4229307 (10Volans) [16:54:19] 10Operations, 10ops-esams: Degraded RAID on bast3002 - https://phabricator.wikimedia.org/T195501#4229309 (10Volans) [16:55:17] paladox: should not cause 503 [16:55:24] ah ok [16:56:13] unrealated, it's an old thing [16:56:37] (03PS5) 10Matěj Suchánek: Update Wikidata property blacklist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430045 [16:56:45] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [16:57:01] i was about to say.. that would be the second disk on bast3002. but that's old. yep [16:57:34] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [16:58:04] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [16:58:16] 10Operations, 10ops-codfw: Degraded RAID on elastic2020 - https://phabricator.wikimedia.org/T195306#4229352 (10Volans) It looks to me that the battery is broken/not recognized. [16:59:46] !log Running populateExternallinksIndex60.php on group 1 for T59176 [16:59:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:50] T59176: ApiQueryExtLinksUsage::run query has crazy limit - https://phabricator.wikimedia.org/T59176 [16:59:58] can't find anything wrong on the network side right now, but maybe a spike of network related logs [17:00:04] looking into that [17:00:04] cscott, arlolra, subbu, halfak, and Amir1: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Services – Graphoid / Parsoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180524T1700). 
[17:01:12] (03CR) 10Pnorman: [C: 031] maps: add fonts-noto and fonts-noto-cjk [puppet] - 10https://gerrit.wikimedia.org/r/434904 (https://phabricator.wikimedia.org/T195474) (owner: 10Gehel) [17:01:36] (03PS3) 10Gehel: maps: add fonts-noto and fonts-noto-cjk [puppet] - 10https://gerrit.wikimedia.org/r/434904 (https://phabricator.wikimedia.org/T195474) [17:02:43] (03CR) 10Gehel: [C: 032] maps: add fonts-noto and fonts-noto-cjk [puppet] - 10https://gerrit.wikimedia.org/r/434904 (https://phabricator.wikimedia.org/T195474) (owner: 10Gehel) [17:02:45] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install db209[45].codfw.wmnet (sanitarium expansion) - https://phabricator.wikimedia.org/T194781#4229379 (10Papaul) [17:03:34] 10Operations, 10ops-codfw: Degraded RAID on wasat - https://phabricator.wikimedia.org/T193394#4229384 (10Volans) [17:03:37] 10Operations, 10ops-codfw: Degraded RAID on wasat - https://phabricator.wikimedia.org/T195339#4229387 (10Volans) [17:04:20] Can anyone point me to some up to date data regarding redis access? it seems https://wikitech.wikimedia.org/wiki/Redis is out of date? [17:04:46] oh no... just redis-cli is only on tin! [17:06:05] 10Operations, 10ops-codfw, 10DBA, 10netops: Swtich port information for db209[4-5] - https://phabricator.wikimedia.org/T195507#4229394 (10Papaul) [17:06:15] oh, i wonder what installed that on tin [17:06:26] (03PS1) 10Cmjohnson: labvirt1019 changing dhcpd MAC [puppet] - 10https://gerrit.wikimedia.org/r/434957 (https://phabricator.wikimedia.org/T194964) [17:06:34] because it sure isnt on deploy2001 , which means not coming from puppet role [17:06:44] (03PS2) 10Cmjohnson: labvirt1019 changing dhcpd MAC [puppet] - 10https://gerrit.wikimedia.org/r/434957 (https://phabricator.wikimedia.org/T194964) [17:07:21] (03CR) 10Cmjohnson: [C: 032] labvirt1019 changing dhcpd MAC [puppet] - 10https://gerrit.wikimedia.org/r/434957 (https://phabricator.wikimedia.org/T194964) (owner: 10Cmjohnson) [17:07:27] 10Operations, 10ops-codfw, 10DBA, 10netops: switch port information for db209[4-5] - https://phabricator.wikimedia.org/T195507#4229282 (10Papaul) [17:07:29] network log look clean [17:08:07] :( [17:08:08] 10Operations, 10ops-codfw: Degraded RAID on wasat - https://phabricator.wikimedia.org/T193394#4229402 (10Volans) a:05RobH>03Volans [17:08:28] (03PS3) 10Muehlenhoff: Enable base::service_auto_restart for smartd [puppet] - 10https://gerrit.wikimedia.org/r/419769 (https://phabricator.wikimedia.org/T135991) [17:09:43] 10Operations, 10ops-codfw, 10netops: switch port information for db209[4-5] - https://phabricator.wikimedia.org/T195507#4229419 (10Marostegui) [17:10:09] mutante: interesting, the wikitech page said terbium, but it definitely isn't there [17:10:10] starting a ripe mesurement to see if there is anything funky between the world and esams [17:10:34] PROBLEM - Long running screen/tmux on lawrencium is CRITICAL: Return code of 255 is out of bounds [17:11:12] addshore: yea, and i can't find it in puppet or in git log [17:13:07] 04Critical Alert for device cr2-eqiad.wikimedia.org - Primary outbound port utilisation over 80% [17:13:07] 10Operations, 10ops-eqiad, 10Cloud-Services, 10Patch-For-Review: Connect or troubleshoot eth1 on labvirt1019 and labvirt1020 - https://phabricator.wikimedia.org/T194964#4229437 (10Cmjohnson) I have attempted to get the 10G NICS work but am having zero luck. I am able to enable them in the bios, set the P... [17:13:30] addshore: i will follow-up somehow.. 
either a patch or a mail or soemthing [17:13:44] okay! [17:13:50] for now you have tin.. right.. because tomorrow that will change , heh [17:14:09] guess i'll make a patch to install redis-cli [17:16:23] (03PS1) 10Papaul: DHCP: Add MAC address entries for db209[4-5] [puppet] - 10https://gerrit.wikimedia.org/r/434960 [17:16:34] can't find anything wrong with the ripe atlas neither [17:17:03] (03CR) 10jerkins-bot: [V: 04-1] DHCP: Add MAC address entries for db209[4-5] [puppet] - 10https://gerrit.wikimedia.org/r/434960 (owner: 10Papaul) [17:19:05] 10Operations, 10ops-codfw: Degraded RAID on wasat - https://phabricator.wikimedia.org/T193394#4229453 (10Volans) a:05Volans>03None I just discovered that this host is planned for reimage in the next few days, not bothering fixing the md array as the host is not seeing the replaced disk and might need anywa... [17:20:14] 10Operations, 10ops-eqiad, 10Cloud-Services, 10Patch-For-Review: Connect or troubleshoot eth1 on labvirt1019 and labvirt1020 - https://phabricator.wikimedia.org/T194964#4229456 (10Bstorm) Huh. The 19 and 20 are both already imaged fully, if that matters. [17:23:34] PROBLEM - MariaDB Slave Lag: s7 on db2047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 370.00 seconds [17:23:44] PROBLEM - MariaDB Slave Lag: s7 on db2086 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 376.02 seconds [17:23:55] PROBLEM - MariaDB Slave Lag: s7 on db2054 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 383.35 seconds [17:24:04] PROBLEM - MariaDB Slave Lag: s7 on db2077 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 384.48 seconds [17:24:05] PROBLEM - MariaDB Slave Lag: s7 on db2087 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 388.96 seconds [17:24:14] PROBLEM - MariaDB Slave Lag: s7 on db2068 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 391.24 seconds [17:24:15] PROBLEM - MariaDB Slave Lag: s7 on db2040 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 393.19 seconds [17:24:24] PROBLEM - MariaDB Slave Lag: s7 on db2061 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 394.84 seconds [17:26:05] ^ FYI, that codfw slave lag is probably the same maintenance script that caused it yesterday. [17:26:50] (03PS2) 10Dzahn: DHCP: Add MAC address entries for db209[4-5] [puppet] - 10https://gerrit.wikimedia.org/r/434960 (https://phabricator.wikimedia.org/T194781) (owner: 10Papaul) [17:27:06] (03CR) 10Dzahn: "commit message space inserted" [puppet] - 10https://gerrit.wikimedia.org/r/434960 (https://phabricator.wikimedia.org/T194781) (owner: 10Papaul) [17:28:13] 10Operations, 10MediaWiki-Platform-Team, 10HHVM, 10TechCom-RFC (TechCom-Approved), 10User-ArielGlenn: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370#4229472 (10Jdforrester-WMF) [17:28:16] volans: any other ideas of what could have caused that eqiad issue? [17:28:29] esams? 
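Background on the missing redis-cli noted a few minutes earlier: on Debian stretch the client is shipped in the redis-tools package, so the puppet follow-up being discussed most likely just needs to ensure that package on the deployment hosts. The lines below are an illustrative sketch under that assumption, not the actual change; REDIS_HOST is a placeholder, and production redis instances may additionally require a password (-a) not shown here.

    # Illustrative only: what "install redis-cli" amounts to on a stretch deploy host.
    # The real fix would be a puppet patch ensuring the package, not a one-off manual install.
    sudo apt-get install redis-tools          # provides /usr/bin/redis-cli on Debian
    redis-cli -h REDIS_HOST -p 6379 ping      # placeholder host; expect PONG if reachable
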
[17:29:22] papaul: db2095.mgmt is reachable but db2094.mgmt is not [17:29:35] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install db209[45].codfw.wmnet (sanitarium expansion) - https://phabricator.wikimedia.org/T194781#4229475 (10Papaul) [17:29:39] volans: yeah [17:29:58] XioNoX: nope, but look at eeden network graph: https://grafana.wikimedia.org/dashboard/db/prometheus-machine-stats?panelId=8&fullscreen&orgId=1&var-server=eeden&var-datasource=esams%20prometheus%2Fops&from=1527179192974&to=1527181729448 [17:30:11] mutante: checking [17:31:18] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install db209[45].codfw.wmnet (sanitarium expansion) - https://phabricator.wikimedia.org/T194781#4229479 (10RobH) [17:31:21] 10Operations, 10ops-codfw, 10netops: switch port information for db209[4-5] - https://phabricator.wikimedia.org/T195507#4229477 (10RobH) 05Open>03Resolved both ports have descriptions set, enabled, and they were both already in the private vlan [17:32:45] PROBLEM - Check systemd state on lawrencium is CRITICAL: Return code of 255 is out of bounds [17:32:52] mutante: check now [17:33:38] papaul: yes, works now. after some delay [17:33:46] mutante: ok [17:34:05] (03CR) 10Dzahn: [C: 032] DHCP: Add MAC address entries for db209[4-5] [puppet] - 10https://gerrit.wikimedia.org/r/434960 (https://phabricator.wikimedia.org/T194781) (owner: 10Papaul) [17:34:08] merging your change [17:34:15] mutante: thanks [17:34:25] i noticed when i wanted to actually check MAC [17:35:24] you can install now. applied on install2002 [17:35:55] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: rack/setup/install db112[45].eqiad.wmnet (sanitarium expansion) - https://phabricator.wikimedia.org/T194780#4229486 (10Cmjohnson) [17:36:43] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: rack/setup/install db112[45].eqiad.wmnet (sanitarium expansion) - https://phabricator.wikimedia.org/T194780#4208342 (10Cmjohnson) a:05Cmjohnson>03Marostegui @Marostegui These are installed and ready for you to take over. Assigning to you [17:36:55] win 45 [17:37:26] mutante: thanks waiting on Rob to kick up the switch configuration [17:37:39] *nod* [17:38:14] PROBLEM - Check systemd state on lawrencium is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [17:38:41] 10Operations, 10ops-eqiad, 10cloud-services-team: Degraded RAID on labvirt1019 - https://phabricator.wikimedia.org/T194907#4229497 (10Volans) I've double checked both the report script that populate this task and the Icinga check script that raised the alarm. The issue here seems to be that the controller in... [17:39:19] 10Operations, 10ops-eqiad, 10cloud-services-team: Degraded RAID on labvirt1019 - https://phabricator.wikimedia.org/T194907#4229500 (10Volans) Forgot to mention that the above message and output was taken on labvirt1020 as I cannot ssh to 1019 right now. [17:41:00] 10Operations, 10ops-eqiad, 10cloud-services-team: Degraded RAID on labvirt1019 - https://phabricator.wikimedia.org/T194907#4212678 (10Bstorm) Yes there's network work being done on 1019 at the moment. That said, they are identical machines. [17:43:05] 04Critical Alert for device cr2-eqiad.wikimedia.org - Primary outbound port utilisation over 80% [17:43:42] I'm aware of ^, we're good so far [17:43:54] XioNoX: I'm about to go offline, can I leave it to you to repool esams when seems stable? 
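For the db209[4-5] management-interface check earlier in this hour, the usual first step is plain reachability against the mgmt FQDN, then a console/SSH attempt to see whether the BMC itself answers. The hostnames below assume the standard <host>.mgmt.<site>.wmnet naming and are illustrative rather than a record of what was run.

    # Quick reachability check for a management interface (FQDN assumed, not verified here)
    ping -c 3 db2094.mgmt.codfw.wmnet
    # If ping answers but access still fails, an SSH attempt shows whether the BMC login service is up
    ssh root@db2094.mgmt.codfw.wmnet
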
[17:43:58] jouncebot: next [17:43:58] In 0 hour(s) and 16 minute(s): Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180524T1800) [17:44:43] volans: yep [17:44:47] thanks [17:44:47] dpkg is broken on mwdebug1001. tests with package installs? [17:44:50] thanks a lot! [17:45:21] mutante: what do you mean? [17:45:41] DPKG CRITICAL dpkg reports broken packages [17:45:54] it systemd 232-25+deb9u3 amd64 system and service manager [17:46:26] Commandline: apt-get install hhvm hhvm-dbg [17:48:37] mutante: the 'it' status means installed/Triggers-pending [17:48:54] thought that's T and not t [17:49:00] other packages are in similar status, either it or iU [17:49:04] installed/unpacked [17:49:12] see dpkg -l|grep '^[uirph]'|egrep -v '^(ii|rc)' [17:49:14] i see them [17:49:19] (from the script that alarms) [17:49:25] RECOVERY - DPKG on mwdebug1001 is OK: All packages OK [17:49:27] now, why is that... that's a good question :D [17:49:31] yea, i know that part. i fixed most with apt-get autoremove [17:49:46] now just apt-listchanges left [17:50:22] ack [17:50:33] !log mwdebug1001 - apt-get remove to clean up packages in "non ii" states [17:50:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:39] it recovered but didnt tell us yet [18:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Morning SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180524T1800). [18:00:05] No GERRIT patches in the queue for this window AFAICS. [18:00:31] wikibugs_: wake up [18:01:22] !log gerrit restarting on cobalt, back soon [18:01:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:24] mutante: https://www.youtube.com/watch?v=wauzrPn0cfg [18:04:06] PROBLEM - puppet last run on labsdb1011 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [18:04:42] greg-g: haha, i am not at that intensity yet.. i'll save that for emergency :) it was still a mild "would you please" [18:04:56] PROBLEM - puppet last run on labsdb1009 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [18:07:05] PROBLEM - puppet last run on labsdb1010 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [18:08:56] 10Operations, 10Phabricator, 10Release-Engineering-Team (Watching / External): Reimage both phab1001 and phab2001 to stretch - https://phabricator.wikimedia.org/T190568#4229568 (10Dzahn) @elukey Alright, yea that makes sense. I'll prioritize this right after the deploy1001 tomorrow. Will do it soon, though i... [18:09:53] twentyafterfour: regarding phab* stretch reinstalls. i could just do phab2001 .. eh.. any time .. ? [18:10:06] while phab1001 would need a full downtime [18:10:20] because we cant failover to 2001.. still blocked by lack of mysql [18:12:22] 10Operations, 10Phabricator, 10Release-Engineering-Team (Watching / External): Reimage both phab1001 and phab2001 to stretch - https://phabricator.wikimedia.org/T190568#4229576 (10Dzahn) I can do phab2001 first (as the ticket suggests).. 
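For readers following the mwdebug1001 cleanup above, the commands involved are stock dpkg/apt: the one-liner quoted from the alerting script lists every package that is not in a clean ii/rc state, and the usual remedies are sketched below. This is illustrative; the log only records that autoremove/remove was actually used here.

    # List packages stuck in a non-clean state (same filter as the alerting script quoted above)
    dpkg -l | grep '^[uirph]' | egrep -v '^(ii|rc)'
    # Typical remedies, roughly in order of escalation:
    sudo dpkg --configure -a       # finish pending configure/trigger steps (the "it"/"iU" states)
    sudo apt-get -f install        # let apt repair broken dependencies
    sudo apt-get autoremove        # drop leftover packages, as was done on mwdebug1001
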
i suppose anytime without further ado. right, Mukunda? [18:13:06] 04Critical Alert for device cr2-eqiad.wikimedia.org - Primary outbound port utilisation over 80% [18:19:17] (03PS1) 10Ayounsi: Revert "Depool esams - intermittent issues" [dns] - 10https://gerrit.wikimedia.org/r/434974 [18:20:20] (03CR) 10Ayounsi: [C: 032] Revert "Depool esams - intermittent issues" [dns] - 10https://gerrit.wikimedia.org/r/434974 (owner: 10Ayounsi) [18:21:18] !log repool esams [18:21:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:36] RECOVERY - Check systemd state on netmon1002 is OK: OK - running: The system is fully operational [18:31:36] RECOVERY - Check systemd state on netmon2001 is OK: OK - running: The system is fully operational [18:31:39] !log netmon1002, netmon2001: systemctl mask uwsgi; systemctl reset-failed - to fix Icinga alert about broken DPKG since last netbox deploy and to match existing status on labtestweb2001 [18:31:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:43] XioNoX: ^ [18:31:58] thx [18:32:15] RECOVERY - puppet last run on labsdb1010 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [18:33:04] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr2-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% [18:33:37] 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4081506 (10Marostegui) [18:33:41] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: rack/setup/install db112[45].eqiad.wmnet (sanitarium expansion) - https://phabricator.wikimedia.org/T194780#4229603 (10Marostegui) 05Open>03Resolved Thanks! I have confirmed I can access both servers and they look good. Going to continue the final... [18:34:25] RECOVERY - puppet last run on labsdb1011 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [18:34:30] 10Operations, 10Patch-For-Review: Re-add intel-microcode - https://phabricator.wikimedia.org/T127825#4229611 (10MoritzMuehlenhoff) Six out of the mw* servers have been switched to using microcode updates. [18:35:16] RECOVERY - puppet last run on labsdb1009 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [18:37:02] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install db209[45].codfw.wmnet (sanitarium expansion) - https://phabricator.wikimedia.org/T194781#4229627 (10Papaul) I can not pxe boot both servers >>Start PXE over IPv4. Station IP address is 10.192.0.101 Server IP address is 208.80.1... [18:39:35] ooh a new bot [18:39:52] haven't seen librenms-wmf around before [18:40:05] it's the netbops bot :) [18:40:08] netops [18:40:13] 10Operations, 10ops-codfw, 10DBA, 10netops, 10Patch-For-Review: rack/setup/install db209[45].codfw.wmnet (sanitarium expansion) - https://phabricator.wikimedia.org/T194781#4229634 (10Marostegui) Adding #netops to see if they can help out [18:40:15] how long has it been running? [18:40:51] hmm.. a couple weeks at least. 
but it doesn't alert that often [18:41:28] https://wikitech.wikimedia.org/wiki/LibreNMS#IRC_Alerting [18:41:33] hopefully :) [18:42:20] earliest from my logs: [18:42:23] #wikimedia-operations.0317.log:2018-03-17 15:38:05 librenms-wmf ̶C̶r̶i̶t̶i̶c̶a̶l Device cr2-codfw.wikimedia.org recovered from Primary inbound port utilisation over 80% [18:46:40] 10Operations, 10ops-codfw, 10DBA, 10netops, 10Patch-For-Review: rack/setup/install db209[45].codfw.wmnet (sanitarium expansion) - https://phabricator.wikimedia.org/T194781#4229648 (10Marostegui) I can the requests arriving fine (this is db2094) but looks like it is not going past that? : ``` May 24 18:25... [18:50:04] cool [19:00:05] twentyafterfour: Dear deployers, time to do the MediaWiki train deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180524T1900). [19:14:13] !log Starting the MediaWiki train for Thursday May 24, today I will be deploying wmf/1.32.0-wmf.5 to all wikis [19:14:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:05] 08Warning Alert for device cr3-ulsfo.mgmt.ulsfo.wmnet - Juniper environment status [19:27:05] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: /en.wikipedia.org/v1/page/metadata/{title}{/revision} (Get extended metadata of a test page) timed out before a response was received [19:27:12] PROBLEM - MariaDB Slave SQL: s8 on db1109 is CRITICAL: CRITICAL slave_sql_state could not connect [19:27:36] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/mobile-sections/{title}{/revision}{/tid} (retrieve en.wp main page via mobile-sections) timed out before a response was received: /{domain}/v1/page/metadata/{title}{/revision}{/tid} (retrieve extended metadata for Video article on English Wikipedia) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article) [19:27:36] etrieve a random article returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/mobile-sections-lead/{title}{/revision}{/tid} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp Altrincham page via mobile-sections-lead returned the unexpected status 404 (expecting: 200) [19:27:47] I am checking [19:27:48] <_joe_> uhm what'up with that db? [19:28:15] RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy [19:28:22] PROBLEM - MariaDB Slave IO: s8 on db1104 is CRITICAL: CRITICAL slave_io_state could not connect [19:28:25] too many connections [19:28:29] <_joe_> whoa [19:28:36] Have we deployed something? [19:28:37] <_joe_> s8, wikidata [19:28:39] yeah [19:28:45] PROBLEM - cxserver endpoints health on scb1001 is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) 
timed out before a response was received [19:28:45] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/page/mobile-sections/{title}{/revision}{/tid} (retrieve en.wp main page via mobile-sections) timed out before a response was received: /{domain}/v1/page/metadata/{title}{/revision}{/tid} (retrieve extended metadata for Video article on English Wikipedia) is CRITICAL: Test retrieve extended metadata for Video article on English Wikipedia returned the unex [19:28:45] expecting: 200): /{domain}/v1/page/mobile-sections-lead/{title}{/revision}{/tid} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp Altrincham page via mobile-sections-lead returned the unexpected status 404 (expecting: 200) [19:28:49] need help? [19:28:51] PROBLEM - MariaDB Slave IO: s8 on db1092 is CRITICAL: CRITICAL slave_io_state could not connect [19:28:54] I see 1124 and 1125 in icinga also [19:28:57] <_joe_> twentyafterfour: let's rollback the train I guess? [19:29:03] Looks like all wikidata is suffering [19:29:06] yes, let's revert [19:29:06] PROBLEM - cxserver endpoints health on scb2003 is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) timed out before a response was received [19:29:13] Dispatching started again, but that was some hours ago [19:29:14] <_joe_> twentyafterfour: hey, you there? [19:29:15] PROBLEM - cxserver endpoints health on scb1004 is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) timed out before a response was received [19:29:16] no those (1124 and 25) are older [19:29:22] PROBLEM - MariaDB Slave SQL: s8 on db1104 is CRITICAL: CRITICAL slave_sql_state could not connect [19:29:25] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/page/mobile-sections/{title}{/revision}{/tid} (retrieve en.wp main page via mobile-sections) is CRITICAL: Test retrieve en.wp main page via mobile-sections returned the unexpected status 404 (expecting: 200): /{domain}/v1/page/mobile-sections-lead/{title}{/revision}{/tid} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITIC [19:29:25] ead section of en.wp Altrincham page via mobile-sections-lead returned the unexpected status 404 (expecting: 200) [19:29:25] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/mobile-sections/{title}{/revision}{/tid} (retrieve en.wp main page via mobile-sections) is CRITICAL: Test retrieve en.wp main page via mobile-sections returned the unexpected status 404 (expecting: 200): /{domain}/v1/page/metadata/{title}{/revision}{/tid} (retrieve extended metadata for Video article on English Wikipedia) is CRITICAL: Te [19:29:25] d metadata for Video article on English Wikipedia returned the unexpected status 404 (expecting: 200): /{domain}/v1/page/random/title (retrieve a random article) is CRITICAL: Test retrieve a random article returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/mobile-sections-lead/{title}{/revision}{/tid} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test retrieve lead se [19:29:25] incham page via mobile-sections-lead returned the unexpected status 404 (expecting: 200) [19:29:28] <_joe_> ofc the services sugger too [19:29:31] <_joe_> *suffer [19:29:36] PROBLEM - cxserver endpoints health on 
scb2001 is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) timed out before a response was received [19:29:36] PROBLEM - restbase endpoints health on restbase2004 is CRITICAL: /en.wikipedia.org/v1/page/metadata/{title}{/revision} (Get extended metadata of a test page) is CRITICAL: Test Get extended metadata of a test page returned the unexpected status 404 (expecting: 200) [19:29:38] marostegui: all S8?? [19:29:45] PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) timed out before a response was received [19:29:46] PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: /en.wikipedia.org/v1/page/metadata/{title}{/revision} (Get extended metadata of a test page) timed out before a response was received [19:29:46] PROBLEM - cxserver endpoints health on scb2004 is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) timed out before a response was received [19:29:46] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/page/metadata/{title}{/revision} (Get extended metadata of a test page) timed out before a response was received [19:30:02] PROBLEM - MariaDB Slave IO: s8 on db1087 is CRITICAL: CRITICAL slave_io_state could not connect [19:30:02] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) timed out before a response was received [19:30:02] PROBLEM - cxserver endpoints health on scb2006 is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) timed out before a response was received [19:30:02] PROBLEM - cxserver endpoints health on scb2005 is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) 
timed out before a response was received [19:30:02] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/metadata/{title}{/revision}{/tid} (retrieve extended metadata for Video article on English Wikipedia) is CRITICAL: Test retrieve extended metadata for Video article on English Wikipedia returned the unexpected status 404 (expecting: 200): /{domain}/v1/page/mobile-sections-lead/{title}{/revision}{/tid} (retrieve lead section of en.wp Altrincham page [19:30:02] -lead) is CRITICAL: Test retrieve lead section of en.wp Altrincham page via mobile-sections-lead returned the unexpected status 404 (expecting: 200) [19:30:05] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/mobile-sections/{title}{/revision}{/tid} (retrieve en.wp main page via mobile-sections) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article) is CRITICAL: Test retrieve a random article returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/mobile-sections-lead/{title}{/revis [19:30:05] e lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp Altrincham page via mobile-sections-lead returned the unexpected status 404 (expecting: 200) [19:30:05] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/mobile-sections/{title}{/revision}{/tid} (retrieve en.wp main page via mobile-sections) is CRITICAL: Test retrieve en.wp main page via mobile-sections returned the unexpected status 404 (expecting: 200): /{domain}/v1/page/metadata/{title}{/revision}{/tid} (retrieve extended metadata for Video article on English Wikipedia) is CRITICAL: Test retrieve [19:30:05] or Video article on English Wikipedia returned the unexpected status 404 (expecting: 200): /{domain}/v1/page/random/title (retrieve a random article) is CRITICAL: Test retrieve a random article returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/mobile-sections-lead/{title}{/revision}{/tid} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) timed out before a response was received [19:30:11] PROBLEM - MariaDB Slave IO: s8 on db1109 is CRITICAL: CRITICAL slave_io_state could not connect [19:30:15] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) is CRITICAL: Test Random title redirect returned the unexpected status 504 (expecting: 303) [19:30:15] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/page/mobile-sections/{title}{/revision}{/tid} (retrieve en.wp main page via mobile-sections) is CRITICAL: Test retrieve en.wp main page via mobile-sections returned the unexpected status 404 (expecting: 200): /{domain}/v1/page/metadata/{title}{/revision}{/tid} (retrieve extended metadata for Video article on English Wikipedia) is CRITICAL: Test retrieve [19:30:15] or Video article on English Wikipedia returned the unexpected status 404 (expecting: 200): /{domain}/v1/page/random/title (retrieve a random article) is CRITICAL: Test retrieve a random article returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/mobile-sections-lead/{title}{/revision}{/tid} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en. 
[19:30:15] via mobile-sections-lead returned the unexpected status 404 (expecting: 200) [19:30:15] PROBLEM - restbase endpoints health on restbase2006 is CRITICAL: /en.wikipedia.org/v1/page/metadata/{title}{/revision} (Get extended metadata of a test page) is CRITICAL: Test Get extended metadata of a test page returned the unexpected status 404 (expecting: 200) [19:30:16] PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: /en.wikipedia.org/v1/page/metadata/{title}{/revision} (Get extended metadata of a test page) is CRITICAL: Test Get extended metadata of a test page returned the unexpected status 404 (expecting: 200) [19:30:16] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) is CRITICAL: Test Random title redirect returned the unexpected status 504 (expecting: 303) [19:30:16] PROBLEM - restbase endpoints health on restbase2007 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) is CRITICAL: Test Random title redirect returned the unexpected status 504 (expecting: 303): /en.wikipedia.org/v1/page/metadata/{title}{/revision} (Get extended metadata of a test page) is CRITICAL: Test Get extended metadata of a test page returned the unexpected status 404 (expecting: 200) [19:30:16] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) is CRITICAL: Test Random title redirect returned the unexpected status 504 (expecting: 303) [19:30:17] Cannot access the database: No working replica DB server: Unknown error (10.64.32.198:3318)) [19:30:17] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) is CRITICAL: Test Random title redirect returned the unexpected status 504 (expecting: 303): /en.wikipedia.org/v1/page/metadata/{title}{/revision} (Get extended metadata of a test page) is CRITICAL: Test Get extended metadata of a test page returned the unexpected status 404 (expecting: 200) [19:30:17] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/mobile-sections/{title}{/revision}{/tid} (retrieve en.wp main page via mobile-sections) is CRITICAL: Test retrieve en.wp main page via mobile-sections returned the unexpected status 404 (expecting: 200): /{domain}/v1/page/metadata/{title}{/revision}{/tid} (retrieve extended metadata for Video article on English Wikipedia) is CRITICAL: Test retrieve [19:30:18] or Video article on English Wikipedia returned the unexpected status 404 (expecting: 200): /{domain}/v1/page/random/title (retrieve a random article) is CRITICAL: Test retrieve a random article returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/mobile-sections-lead/{title}{/revision}{/tid} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en. 
[19:30:19] via mobile-sections-lead returned the unexpected status 404 (expecting: 200) [19:30:29] wth [19:30:35] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/random/{format} (Random title redirect) is CRITICAL: Test Random title redirect returned the unexpected status 504 (expecting: 303) [19:30:36] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) is CRITICAL: Test Random title redirect returned the unexpected status 504 (expecting: 303): /en.wikipedia.org/v1/page/metadata/{title}{/revision} (Get extended metadata of a test page) is CRITICAL: Test Get extended metadata of a test page returned the unexpected status 404 (expecting: 200) [19:30:36] PROBLEM - restbase endpoints health on restbase2003 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) is CRITICAL: Test Random title redirect returned the unexpected status 504 (expecting: 303): /en.wikipedia.org/v1/page/metadata/{title}{/revision} (Get extended metadata of a test page) is CRITICAL: Test Get extended metadata of a test page returned the unexpected status 404 (expecting: 200) [19:30:36] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/page/mobile-sections/{title}{/revision}{/tid} (retrieve en.wp main page via mobile-sections) is CRITICAL: Test retrieve en.wp main page via mobile-sections returned the unexpected status 404 (expecting: 200): /{domain}/v1/page/metadata/{title}{/revision}{/tid} (retrieve extended metadata for Video article on English Wikipedia) is CRITICAL: Test retrieve [19:30:36] or Video article on English Wikipedia returned the unexpected status 404 (expecting: 200): /{domain}/v1/page/random/title (retrieve a random article) is CRITICAL: Test retrieve a random article returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/mobile-sections-lead/{title}{/revision}{/tid} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en. [19:30:36] via mobile-sections-lead returned the unexpected status 404 (expecting: 200) [19:30:38] twentyafterfour: we need to revert [19:30:40] twentyafterfour: can you revert? [19:30:45] PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) timed out before a response was received [19:30:45] PROBLEM - cxserver endpoints health on scb2002 is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) timed out before a response was received [19:30:46] PROBLEM - MariaDB Slave SQL: s8 on db1101 is CRITICAL: CRITICAL slave_sql_state could not connect [19:30:49] 10Operations, 10Traffic, 10Browser-Support-Apple-Safari, 10Upstream: Fix broken referer categorization for visits from Safari browsers - https://phabricator.wikimedia.org/T154702#4229700 (10TheDJ) @nuria Safari 11.1 was released with iOS 11.3 and macOS 11.3.4 (as well as macOS 10.12.6 and 10.11.6) Both r... 
[19:30:52] _joe_ I didn't deploy anything significant [19:30:53] !log twentyafterfour@tin scap failed: average error rate on 3/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/2cc7028226a539553178454fc2f14459 for details) [19:30:55] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) timed out before a response was received [19:30:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:59] twentyafterfour: I can confirm it.wiki is broken [19:31:05] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/mobile-sections/{title}{/revision}{/tid} (retrieve en.wp main page via mobile-sections) is CRITICAL: Test retrieve en.wp main page via mobile-sections returned the unexpected status 404 (expecting: 200): /{domain}/v1/page/metadata/{title}{/revision}{/tid} (retrieve extended metadata for Video article on English Wikipedia) is CRITICAL: Test retrieve [19:31:05] or Video article on English Wikipedia returned the unexpected status 404 (expecting: 200): /{domain}/v1/page/random/title (retrieve a random article) is CRITICAL: Test retrieve a random article returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/mobile-sections-lead/{title}{/revision}{/tid} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en. [19:31:05] via mobile-sections-lead returned the unexpected status 404 (expecting: 200) [19:31:05] PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: /en.wikipedia.org/v1/page/metadata/{title}{/revision} (Get extended metadata of a test page) is CRITICAL: Test Get extended metadata of a test page returned the unexpected status 404 (expecting: 200) [19:31:05] PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) is CRITICAL: Test Random title redirect returned the unexpected status 504 (expecting: 303): /en.wikipedia.org/v1/page/metadata/{title}{/revision} (Get extended metadata of a test page) is CRITICAL: Test Get extended metadata of a test page returned the unexpected status 404 (expecting: 200) [19:31:06] PROBLEM - cxserver endpoints health on scb1003 is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) timed out before a response was received [19:31:06] PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) timed out before a response was received: /en.wikipedia.org/v1/page/metadata/{title}{/revision} (Get extended metadata of a test page) is CRITICAL: Test Get extended metadata of a test page returned the unexpected status 404 (expecting: 200) [19:31:06] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:31:06] doesn't matter what looks significant [19:31:11] scap failed. 
ugh [19:31:11] twentyafterfour: Let's revert anyways [19:31:13] revert anything you can that's recently deployed [19:31:15] RECOVERY - cxserver endpoints health on scb2006 is OK: All endpoints are healthy [19:31:15] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy [19:31:16] PROBLEM - HTTP availability for Varnish at esams on einsteinium is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [19:31:17] <_joe_> twentyafterfour: just revert please? [19:31:18] now seems to be back [19:31:20] things are horribly broken [19:31:21] I didn't deploy anything but one file [19:31:23] a js file [19:31:25] PROBLEM - MariaDB Slave IO: s8 on db1101 is CRITICAL: CRITICAL slave_io_state could not connect [19:31:25] PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/page/random/{format} (Random title redirect) is CRITICAL: Test Random title redirect returned the unexpected status 504 (expecting: 303) [19:31:25] PROBLEM - restbase endpoints health on restbase2001 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) is CRITICAL: Test Random title redirect returned the unexpected status 504 (expecting: 303): /en.wikipedia.org/v1/page/metadata/{title}{/revision} (Get extended metadata of a test page) timed out before a response was received [19:31:28] twentyafterfour: please don't argue, rollback now. [19:31:31] PROBLEM - MariaDB Slave IO: s8 on db1092 is CRITICAL: CRITICAL slave_io_state could not connect [19:31:33] I'm not arguing [19:31:35] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:31:44] I literally didn't deploy anything [19:31:50] I did a sync file which failed [19:31:51] PROBLEM - MariaDB Slave SQL: s8 on db1092 is CRITICAL: CRITICAL slave_sql_state could not connect [19:31:52] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy [19:31:52] PROBLEM - HTTP availability for Varnish at eqiad on einsteinium is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [19:32:02] <_joe_> it's possible it's unrelated [19:32:05] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on einsteinium is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:32:06] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on einsteinium is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:32:06] PROBLEM - HTTP availability for Varnish at ulsfo on einsteinium is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [19:32:06] possible? 
[19:32:15] always possible, but still [19:32:15] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on einsteinium is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:32:16] PROBLEM - cxserver endpoints health on scb1002 is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) timed out before a response was received [19:32:16] <_joe_> quite probable, een [19:32:20] Amir1: you around? [19:32:20] most likely [19:32:25] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) is CRITICAL: Test Random title redirect returned the unexpected status 504 (expecting: 303) [19:32:25] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [19:32:26] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/mobile-sections/{title}{/revision}{/tid} (retrieve en.wp main page via mobile-sections) is CRITICAL: Test retrieve en.wp main page via mobile-sections returned the unexpected status 404 (expecting: 200): /{domain}/v1/page/metadata/{title}{/revision}{/tid} (retrieve extended metadata for Video article on English Wikipedia) timed out before a response [19:32:26] main}/v1/page/random/title (retrieve a random article) is CRITICAL: Test retrieve a random article returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/mobile-sections-lead/{title}{/revision}{/tid} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) timed out before a response was received [19:32:26] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) timed out before a response was received: /en.wikipedia.org/v1/page/metadata/{title}{/revision} (Get extended metadata of a test page) timed out before a response was received [19:32:26] PROBLEM - restbase endpoints health on restbase2002 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) is CRITICAL: Test Random title redirect returned the unexpected status 504 (expecting: 303): /en.wikipedia.org/v1/page/metadata/{title}{/revision} (Get extended metadata of a test page) timed out before a response was received [19:32:29] <_joe_> bblack: I fear this has to do with some wikidata abuse [19:32:30] marostegui: is it wikidata dispatching causing issues? 
[19:32:42] addshore: not sure [19:32:45] PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) is CRITICAL: Test Random title redirect returned the unexpected status 504 (expecting: 303) [19:32:45] RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy [19:32:46] marostegui: yup [19:32:46] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy [19:32:47] PROBLEM - HTTP availability for Varnish at eqsin on einsteinium is CRITICAL: job=varnish-text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [19:32:57] Amir1: looks related to the query we were discussing earlier [19:33:05] PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/page/random/{format} (Random title redirect) is CRITICAL: Test Random title redirect returned the unexpected status 504 (expecting: 303) [19:33:05] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) is CRITICAL: Test Random title redirect returned the unexpected status 504 (expecting: 303) [19:33:05] Amir1: Or at least I am seeing it flooding the slaves [19:33:06] PROBLEM - HTTP availability for Varnish at codfw on einsteinium is CRITICAL: job=varnish-text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [19:33:07] Amir1: marostegui which query? [19:33:15] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [19:33:16] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [19:33:22] PROBLEM - MariaDB Slave SQL: s8 on db1087 is CRITICAL: CRITICAL slave_sql_state could not connect [19:33:25] SELECT /* Wikibase\Lib\Store\Sql\TermSqlIndex::getMatchingTerms [19:33:36] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) is CRITICAL: Test Random title redirect returned the unexpected status 504 (expecting: 303): /en.wikipedia.org/v1/page/metadata/{title}{/revision} (Get extended metadata of a test page) is CRITICAL: Test Get extended metadata of a test page returned the unexpected status 404 (expecting: 200) [19:33:36] hmmmm [19:33:42] RECOVERY - MariaDB Slave SQL: s8 on db1109 is OK: OK slave_sql_state Slave_SQL_Running: Yes [19:33:43] That is what I am seeing for now [19:33:44] marostegui: it was only one slave [19:33:45] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) is CRITICAL: Test Random title redirect returned the unexpected status 504 (expecting: 303): /en.wikipedia.org/v1/page/metadata/{title}{/revision} (Get extended metadata of a test page) timed out before a response was received [19:33:46] *looks* [19:33:52] Amir1: no, it is everywhere [19:34:02] PROBLEM - MariaDB Slave IO: s8 on db1087 is CRITICAL: CRITICAL slave_io_state could 
not connect [19:34:03] bloody NickServ [19:34:04] did we drop it from everywhere? [19:34:07] my scap failed due to canary checks so it definitely wasn't anything I deployed [19:34:11] o_O [19:34:15] PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/page/random/{format} (Random title redirect) is CRITICAL: Test Random title redirect returned the unexpected status 504 (expecting: 303) [19:34:15] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [19:34:16] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [19:34:16] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [19:34:21] wtf is happening [19:34:27] <_joe_> twentyafterfour: I'm pretty sure [19:34:31] freenode global service issues [19:34:34] great time for nickserv problems [19:34:46] Guest9040: it was twentyafterfour's patch :p [19:34:51] Amir1: what calls that? is this to do with that index that was being removed? [19:34:56] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [19:34:57] marostegui: did we drop tmp1 from everywhere? [19:35:02] Guest48088: yes [19:35:05] addshore: there was several indexes [19:35:09] 10Operations: Database error - https://phabricator.wikimedia.org/T195520#4229738 (10Xaosflux) [19:35:11] PROBLEM - MariaDB Slave Lag: s8 on db1104 is CRITICAL: CRITICAL slave_sql_lag could not connect [19:35:15] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) timed out before a response was received [19:35:15] PROBLEM - cxserver endpoints health on scb2006 is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) timed out before a response was received [19:35:16] the tmp1 one [19:35:31] PROBLEM - MariaDB Slave IO: s8 on db1092 is CRITICAL: CRITICAL slave_io_state could not connect [19:35:32] maybe this was some how critical to holding that table together? 
[19:35:32] Guest48088: tmp1 was dropped everywhere [19:35:46] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/random/{format} (Random title redirect) is CRITICAL: Test Random title redirect returned the unexpected status 504 (expecting: 303) [19:36:03] Lexeme doesn't use the terms table at all so there should be no increased use due to Lexeme [19:36:05] addshore: nope, it's because the property suggester uses wb_terms directly [19:36:10] to search [19:36:13] 10Operations: Multiple projects reporting Cannot access the database: No working replica DB server - https://phabricator.wikimedia.org/T195520#4229709 (10Xaosflux) [19:36:15] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [19:36:22] PROBLEM - MariaDB Slave IO: s8 on db1104 is CRITICAL: CRITICAL slave_io_state could not connect [19:36:26] 10Operations: Multiple projects reporting Cannot access the database: No working replica DB server - https://phabricator.wikimedia.org/T195520#4229709 (10Xaosflux) Duplicated on meta: at https://meta.wikimedia.org/wiki/Category:Deleteme [19:36:45] PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/page/random/{format} (Random title redirect) is CRITICAL: Test Random title redirect returned the unexpected status 504 (expecting: 303) [19:36:46] RECOVERY - restbase endpoints health on restbase2006 is OK: All endpoints are healthy [19:37:00] addshore: marostegui one very quick way is to disable property suggester until we fix it properly [19:37:03] wait, multiple shards? [19:37:05] RECOVERY - cxserver endpoints health on scb1004 is OK: All endpoints are healthy [19:37:09] Guest48088: might be a good plan [19:37:10] Guest48088: go for it [19:37:10] 10Operations, 10Wikimedia-log-errors: Multiple projects reporting Cannot access the database: No working replica DB server - https://phabricator.wikimedia.org/T195520#4229744 (10Framawiki) p:05Triage>03Unbreak! [19:37:21] PROBLEM - MariaDB Slave SQL: s8 on db1104 is CRITICAL: CRITICAL slave_sql_state could not connect [19:37:21] PROBLEM - MariaDB Slave SQL: s8 on db1101 is CRITICAL: CRITICAL slave_sql_state could not connect [19:37:24] +1 to disable [19:37:25] If it is indeed that query [19:37:32] Guest48088: you make the patch? [19:37:33] 10Operations, 10Wikimedia-General-or-Unknown, 10Wikimedia-log-errors: Multiple projects reporting Cannot access the database: No working replica DB server - https://phabricator.wikimedia.org/T195520#4229709 (10Marostegui) we are on it [19:37:36] 10Operations, 10Wikimedia-General-or-Unknown, 10Wikimedia-log-errors: Multiple projects reporting Cannot access the database: No working replica DB server - https://phabricator.wikimedia.org/T195520#4229749 (10Framawiki) [19:37:42] PROBLEM - MariaDB Slave SQL: s8 on db1109 is CRITICAL: CRITICAL slave_sql_state could not connect [19:37:49] doesn't matter. 
I do it ASAP [19:37:51] RECOVERY - MariaDB Slave Lag: s8 on db1104 is OK: OK slave_sql_lag Replication lag: 0.35 seconds [19:38:06] PROBLEM - MariaDB Slave IO: s8 on db1101 is CRITICAL: CRITICAL slave_io_state could not connect [19:38:14] Guest48088: yeah, disable it [19:38:15] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) is CRITICAL: Test Random title redirect returned the unexpected status 504 (expecting: 303): /en.wikipedia.org/v1/page/metadata/{title}{/revision} (Get extended metadata of a test page) is CRITICAL: Test Get extended metadata of a test page returned the unexpected status 404 (expecting: 200) [19:38:22] addshore: do we have a config for that? [19:38:26] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [19:38:38] <_joe_> let's disable that [19:38:45] PROBLEM - MariaDB Slave Lag: s8 on db1101 is CRITICAL: CRITICAL slave_sql_lag could not connect [19:38:49] PATCH INBOUND [19:38:52] heh caps [19:39:04] <_joe_> addshore: caps are ok in this situation :D [19:39:06] // wfLoadExtension( 'PropertySuggester' ); [19:39:06] (03PS1) 10Addshore: Dont load PropertySuggester [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434996 [19:39:08] FIXED [19:39:11] PROBLEM - MariaDB Slave Lag: s8 on db1109 is CRITICAL: CRITICAL slave_sql_lag could not connect [19:39:11] Reedy: indeed [19:39:19] (03CR) 10Reedy: [C: 032] Dont load PropertySuggester [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434996 (owner: 10Addshore) [19:39:24] Will that pass for PHPCS? [19:39:26] lol [19:39:29] what's up? [19:39:37] Reedy: i propose ignoring phpcs? [19:39:46] does it need a space after the //? [19:40:14] <_joe_> who is deploying? [19:40:17] in the mean time I work on the proper fix [19:40:17] <_joe_> :) [19:40:19] (03CR) 10jerkins-bot: [V: 04-1] Dont load PropertySuggester [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434996 (owner: 10Addshore) [19:40:32] PROBLEM - MariaDB Slave IO: s8 on db1087 is CRITICAL: CRITICAL slave_io_state could not connect [19:40:33] (03PS2) 10Addshore: Dont load PropertySuggester [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434996 [19:40:40] (03PS3) 10Addshore: Dont load PropertySuggester [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434996 [19:40:45] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/page/random/{format} (Random title redirect) is CRITICAL: Test Random title redirect returned the unexpected status 504 (expecting: 303) [19:40:45] PROBLEM - restbase endpoints health on restbase2006 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) is CRITICAL: Test Random title redirect returned the unexpected status 504 (expecting: 303): /en.wikipedia.org/v1/page/metadata/{title}{/revision} (Get extended metadata of a test page) is CRITICAL: Test Get extended metadata of a test page returned the unexpected status 404 (expecting: 200) [19:40:55] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy [19:40:57] (03PS4) 10Addshore: Dont load PropertySuggester [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434996 [19:40:57] ffs typos [19:41:04] PS4 it is... [19:41:05] PROBLEM - cxserver endpoints health on scb1004 is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) 
timed out before a response was received [19:41:05] srsly [19:41:12] Reedy: can you deploy? [19:41:13] (03CR) 10Reedy: [C: 032] Dont load PropertySuggester [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434996 (owner: 10Addshore) [19:41:21] looks like a yes [19:41:43] it can't bring down the databse like that [19:41:45] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [19:41:45] <_joe_> I can deploy if no one else can [19:41:51] I'm waiting for a merge [19:41:52] I can [19:41:55] Guest79402: It is weird it happened all of a sudden [19:42:06] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) timed out before a response was received [19:42:06] I'm not sure if that's the only cause [19:42:11] let's disable it first [19:42:18] Guest79402: Yeah, let's discard this first [19:42:27] (03Merged) 10jenkins-bot: Dont load PropertySuggester [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434996 (owner: 10Addshore) [19:42:28] Guest79402: it could be a combination of things [19:42:42] (03CR) 10jenkins-bot: Dont load PropertySuggester [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434996 (owner: 10Addshore) [19:42:45] merge done [19:42:49] dispatching started up again around 2 hours ago ish I think, and I was running an extra dispatcher [19:42:51] PROBLEM - MariaDB Slave IO: s8 on db1087 is CRITICAL: CRITICAL slave_io_state could not connect [19:42:52] <_joe_> let's first deploy and thing of other causes [19:42:54] but thats never done this before [19:43:08] the pig is flying [19:43:15] RECOVERY - Restbase edge codfw on text-lb.codfw.wikimedia.org is OK: All endpoints are healthy [19:43:16] <_joe_> Reedy: <3 [19:43:25] remind me how long it takes that that to get around [19:43:33] ...with hhvm [19:43:36] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [19:43:43] <_joe_> not much [19:43:54] 10Operations, 10MediaWiki-extensions-Translate, 10Language-2018-Apr-June, 10Wikimedia-Incident, 10Wikimedia-log-errors: 503 error attempting to open multiple projects (Wikipedia and meta wiki are loading very slowly) - https://phabricator.wikimedia.org/T195293#4222298 (101339861mzb) It appears to me in t... [19:43:55] <_joe_> but I fear the canary shit will stop the deployment [19:44:04] <_joe_> Reedy: what's taking so long? [19:44:05] scap sync-[whatever] --force [19:44:06] !log reedy@tin Synchronized wmf-config/Wikibase.php: Disable PropSuggester (duration: 01m 21s) [19:44:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:10] if it's needed [19:44:16] oof [19:44:16] ...which it isn't, I guess. [19:44:18] <_joe_> ok it's done now [19:44:25] Took it 1m21 [19:44:25] <_joe_> let's see if something recovers [19:44:25] https://fr.wikipedia.org/wiki/Clifford_Geertz is down too [19:44:26] if this is just s8 why does the ticket also say wikipedia & meta? [19:44:49] _joe_: maybe worth restarting apaches? 
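On the "--force" exchange above: the canary error-rate check that aborted the earlier train sync can be overridden when the deployer is confident the elevated error rate predates the change being synced; in this case it appears not to have been needed for the config sync to go through. A hedged sketch of the command shape only, with an illustrative path and log message rather than the exact invocation used:

    # Sync a single config file from the deployment host; --force skips the canary error-rate check
    # (only appropriate when the error spike is known to be unrelated to the change being synced)
    scap sync-file wmf-config/Wikibase.php 'Disable PropertySuggester' --force
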
[19:44:56] going to kill connections [19:45:18] <_joe_> marostegui: apaches won't do anything for connections [19:45:20] addshore: i suppose that every purged page that uses wikidata is concerned [19:45:28] <_joe_> kill connections without pity on your side, it's faster [19:45:34] oki [19:45:36] <_joe_> framawiki: probably [19:45:39] framawiki: perhaps! [19:46:05] framawiki: complaints to #wikimedia-tech please. this here is for people fixing, don't distract them [19:46:12] <_joe_> the pages now load btw [19:46:17] frwiki wfm, wikidata still having issues [19:46:20] <_joe_> https://fr.wikipedia.org/wiki/Clifford_Geertz?lsldls loads [19:46:26] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/random/{format} (Random title redirect) is CRITICAL: Test Random title redirect returned the unexpected status 504 (expecting: 303) [19:46:26] PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/page/random/{format} (Random title redirect) is CRITICAL: Test Random title redirect returned the unexpected status 504 (expecting: 303) [19:46:27] on call with Lydia [19:46:29] <_joe_> (note the cache-busting trick) [19:46:31] wikidata still not loading for me [19:46:36] 10Operations, 10MediaWiki-extensions-Translate, 10Language-2018-Apr-June, 10Wikimedia-Incident, 10Wikimedia-log-errors: 503 error attempting to open multiple projects (Wikipedia and meta wiki are loading very slowly) - https://phabricator.wikimedia.org/T195293#4229785 (10Urbanecm) @1339861mzb Are you sur... [19:46:40] <_joe_> wikidata is still down [19:47:06] 10Operations, 10Wikimedia-General-or-Unknown, 10Wikimedia-log-errors: Multiple projects reporting Cannot access the database: No working replica DB server - https://phabricator.wikimedia.org/T195520#4229788 (10Reedy) [19:47:09] yeah, slaves still with many connections, I am killing them [19:47:15] PROBLEM - MariaDB Slave Lag: s4 on db2091 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.04 seconds [19:47:16] PROBLEM - MariaDB Slave Lag: s4 on db2084 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.88 seconds [19:47:16] RECOVERY - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is OK: All endpoints are healthy [19:47:25] PROBLEM - MariaDB Slave Lag: s4 on db2065 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 305.00 seconds [19:47:45] PROBLEM - MariaDB Slave Lag: s4 on db2058 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 309.38 seconds [19:47:46] PROBLEM - MariaDB Slave Lag: s4 on db2073 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 309.77 seconds [19:48:04] 10Operations, 10MediaWiki-extensions-Translate, 10Language-2018-Apr-June, 10Wikimedia-Incident, 10Wikimedia-log-errors: 503 error attempting to open multiple projects (Wikipedia and meta wiki are loading very slowly) - https://phabricator.wikimedia.org/T195293#4229795 (10Urbanecm) EDIT: This can be T195520. [19:48:05] PROBLEM - MariaDB Slave Lag: s4 on db2051 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 315.59 seconds [19:48:15] PROBLEM - MariaDB Slave Lag: s4 on db2090 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 317.72 seconds [19:48:18] why there is a lag on s4, that part is strange [19:48:21] PROBLEM - MariaDB Slave IO: s8 on db1087 is CRITICAL: CRITICAL slave_io_state could not connect [19:48:23] <_joe_> marostegui: are the connections coming back? [19:48:23] ^ FYI, s4 codfw slave lag is probably unrelated. [19:48:30] <_joe_> Guest79402: that's codfw, unrelated [19:48:31] what is it related to? 
[19:48:31] I think there was a script running by anomie [19:48:32] _joe_: submitting an edit on dewiki fails as well [19:48:35] ok [19:48:37] k, noted [19:48:56] 10Operations, 10Wikimedia-General-or-Unknown, 10Wikimedia-log-errors: Multiple projects reporting Cannot access the database: No working replica DB server - https://phabricator.wikimedia.org/T195520#4229709 (10Urbanecm) Should this be #wikimedia-incident as well? [19:49:01] <_joe_> Sagan: but you can see uncached pages? [19:49:16] <_joe_> I can now see wikidata pages it seems [19:49:17] I am struggling to kill connections [19:49:21] the servers are super overloaded [19:49:26] it.wiki seems to be working fine now [19:49:27] also disabled those notifications for s4 (for now) [19:49:30] marostegui: can I help? [19:49:31] marostegui: just s8 i think? [19:49:34] _joe_: you mean if I purge a page? that works [19:49:45] <_joe_> Sagan: can you re-try to edit now? [19:49:53] <_joe_> I think the dbs are in a better shape [19:50:14] <_joe_> uhm not really [19:50:24] _joe_: that works, thanks [19:50:36] RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy [19:50:36] PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/page/random/{format} (Random title redirect) is CRITICAL: Test Random title redirect returned the unexpected status 504 (expecting: 303) [19:50:41] still no wikidata pages from here for me, still just Cannot access the database: No working replica DB server: Unknown error [19:50:55] _joe_: sorry. editing in my userspace worked, but not on a page at Wikipedia: [19:50:56] 10Operations, 10MediaWiki-extensions-Translate, 10Language-2018-Apr-June, 10Wikimedia-Incident, 10Wikimedia-log-errors: 503 error attempting to open multiple projects (Wikipedia and meta wiki are loading very slowly) - https://phabricator.wikimedia.org/T195293#4229809 (101339861mzb) {F18512790} yes here... 
[19:51:00] s8 aggregated DB stats: https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=s8&var-role=All [19:51:01] aside from the slave lag alerts, summary of other ongoing criticals that I guess we currently believe to be related: restbase endpoint health, cxserver translation checks, mobileapps (possible due to shared scb w/ cxserver?), text-lb level checks for RB random title redirects [19:51:06] "Cannot access the database: No working replica DB server: Unknown error (10.64.16.84:3318))" [19:51:21] 10Operations, 10Wikimedia-General-or-Unknown, 10Wikimedia-log-errors: Multiple projects reporting Cannot access the database: No working replica DB server - https://phabricator.wikimedia.org/T195520#4229811 (10alanajjar) Multiple screenshots from ar.wiki users {F18512789} {F18512793} [19:51:25] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [19:51:25] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy [19:51:45] RECOVERY - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is OK: All endpoints are healthy [19:51:45] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [19:51:45] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy [19:51:56] getting error when i tried to login on dewiki [19:52:05] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [19:52:09] 10Operations, 10Wikimedia-General-or-Unknown, 10Wikimedia-log-errors: Multiple projects reporting Cannot access the database: No working replica DB server - https://phabricator.wikimedia.org/T195520#4229813 (10ToBeFree) Might be solved / mitigated. Edits are coming back. [19:52:16] all of wikidata.org is still down for me "Sorry! This site is experiencing technical difficulties." [19:52:35] apache restart? [19:52:47] just wild guess here [19:52:50] I got one page after a very long time, but was transient [19:52:53] do you mean restarting the whole datacenter? lol [19:53:16] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [19:53:16] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy [19:53:19] <_joe_> apache restarts won't do anything [19:53:25] <_joe_> we need to restart hhvm [19:53:28] Things are slowly recovering [19:53:29] was able to login on wikidata now [19:53:29] at least [19:53:31] <_joe_> a rolling restart takes time [19:53:35] slow but happening [19:53:37] <_joe_> Reedy: I don't think that's the case [19:53:43] <_joe_> see the connections on wikidata [19:53:46] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy [19:53:48] <_joe_> we haven't solved the issue [19:53:55] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy [19:54:04] I am killing lots of connections now [19:54:06] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy [19:54:06] What are the connections coming from? app servers? [19:54:06] RECOVERY - Restbase edge codfw on text-lb.codfw.wikimedia.org is OK: All endpoints are healthy [19:54:10] but the tertiary fallouts seem to be slowly diminishing, so that's a good sign [19:54:23] I just had a wikidata page load? [19:54:24] <_joe_> what happened at 19:23? [19:54:27] de.wp is fixed for me. 
i made an edit [19:54:35] <_joe_> marostegui: are you still killing connections, correct? [19:54:36] Yes, wikidata has loaded once for me now [19:54:41] mutante: dewiki is on a different set of servers [19:54:44] _joe_: yeah [19:54:45] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) timed out before a response was received: /en.wikipedia.org/v1/page/metadata/{title}{/revision} (Get extended metadata of a test page) timed out before a response was received [19:54:52] query throughput is basically zero on s8 [19:54:59] fatal monitor seems to be decreasing [19:55:05] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) is CRITICAL: Test Random title redirect returned the unexpected status 504 (expecting: 303) [19:55:07] does RB use wikidata for something about the random title redirect stuff? [19:55:13] wb_terms is being used in all wikis, so it brings down all wikis [19:55:17] that can of worms [19:55:21] <_joe_> probably, bblack [19:55:22] marostegui: right, was just in response to earlier report about dewp. both wikidata and dewiki work now [19:55:40] <_joe_> I doubt wikidata really works [19:55:41] ah right :) [19:55:45] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy [19:55:53] I am recovering some of the servers now [19:56:05] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [19:56:07] but connections quickly pile up [19:56:09] page loads are slow for me, but they do load [19:56:16] starting to see some query throuhput [19:56:38] db1087 as soon as I stop killing its starts getting more and more connections until they reach the limit [19:56:38] <_joe_> marostegui: do you have any idea why they pile up? [19:56:40] people reloading? [19:56:48] <_joe_> nope [19:56:53] <_joe_> I don't think that's the case [19:57:02] PROBLEM - MariaDB Slave IO: s8 on db1087 is CRITICAL: CRITICAL slave_io_state could not connect [19:57:08] <_joe_> can we kill the dispatch for now? [19:57:13] _joe_: sure [19:57:18] its backlogged by 20 hours anyway [19:57:19] <_joe_> marostegui: is db1087 special in any way? [19:57:21] _joe_: editing on de works again now [19:57:21] Are we sure we have disabled the thing? [19:57:26] a few more wont hurt [19:57:26] PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/page/random/{format} (Random title redirect) is CRITICAL: Test Random title redirect returned the unexpected status 504 (expecting: 303) [19:57:35] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/random/{format} (Random title redirect) is CRITICAL: Test Random title redirect returned the unexpected status 504 (expecting: 303) [19:57:35] I still see: SELECT /* Wikibase\Lib\Store\Sql\TermSqlIndex::getMatchingTerms [19:57:45] 10Operations, 10Wikimedia-General-or-Unknown, 10Wikimedia-log-errors: Multiple projects reporting Cannot access the database: No working replica DB server - https://phabricator.wikimedia.org/T195520#4229835 (10Mainframe98) Still intermittently receiving this error while attempting to combat spam. [19:58:05] * addshore doesn't know which guest is amir... 
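The "SELECT /* ...TermSqlIndex::getMatchingTerms" fragment is readable in the processlist because MediaWiki's database layer embeds the caller name into each query; a minimal sketch of that convention (simplified call, not the actual Wikibase code):

```php
<?php
// Sketch of how the caller ends up in the SQL text: the $fname argument
// (conventionally __METHOD__) is inserted as a /* ... */ comment, so a DBA
// looking at SHOW PROCESSLIST can attribute a query to PHP code.
$dbr = wfGetDB( DB_REPLICA );
$rows = $dbr->select(
    'wb_terms',                                           // table
    [ 'term_full_entity_id', 'term_text' ],               // fields (illustrative)
    [ 'term_language' => 'en', 'term_type' => 'label' ],  // conditions
    __METHOD__                                            // becomes the SQL comment
);
```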
[19:58:06] 10Operations, 10Wikimedia-General-or-Unknown, 10Wikimedia-log-errors: Multiple projects reporting Cannot access the database: No working replica DB server - https://phabricator.wikimedia.org/T195520#4229836 (10ToBeFree) Quote of a message currently on wikidata: **Database locked** The database is currently... [19:58:07] <_joe_> ok, what does perform that query? [19:58:16] <_joe_> addshore: any idea? [19:58:24] 16:31 SMalyshev: starting wikidata full reindex for T163642 [19:58:24] T163642: Index Wikidata strings in statements in the search engine - https://phabricator.wikimedia.org/T163642 [19:58:25] ? [19:58:30] from yesterday [19:58:37] need any help? [19:58:39] no idea what db queries that might make? [19:58:47] marostegui: there is no part of the wikibase that does run TermSqlIndex::getMatchingTerms in production code [19:58:53] Guest79402: check this [19:58:53] <_joe_> they would come from a restricted set of servers [19:58:55] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/mobile-sections/{title}{/revision}{/tid} (retrieve en.wp main page via mobile-sections) is CRITICAL: Test retrieve en.wp main page via mobile-sections returned the unexpected status 404 (expecting: 200): /{domain}/v1/page/metadata/{title}{/revision}{/tid} (retrieve extended metadata for Video article on English Wikipedia) is CRITICAL: Te [19:58:56] d metadata for Video article on English Wikipedia returned the unexpected status 404 (expecting: 200): /{domain}/v1/page/random/title (retrieve a random article) is CRITICAL: Test retrieve a random article returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/mobile-sections-lead/{title}{/revision}{/tid} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test retrieve lead se [19:58:56] incham page via mobile-sections-lead returned the unexpected status 404 (expecting: 200) [19:58:56] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) timed out before a response was received [19:58:57] * Amir1 is now known as Guest48088 [19:59:05] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) timed out before a response was received: /en.wikipedia.org/v1/page/metadata/{title}{/revision} (Get extended metadata of a test page) is CRITICAL: Test Get extended metadata of a test page returned the unexpected status 404 (expecting: 200) [19:59:08] +------+-------------+----------+------+------------------+------------------+---------+-------+-----------+------------------------------------+ [19:59:10] | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | [19:59:14] +------+-------------+----------+------+------------------+------------------+---------+-------+-----------+------------------------------------+ [19:59:14] yes, I can't login with my nick sorry [19:59:16] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) 
timed out before a response was received [19:59:16] | 1 | SIMPLE | wb_terms | ref | term_search_full | term_search_full | 34 | const | 161154588 | Using index condition; Using where | [19:59:19] +------+-------------+----------+------+------------------+------------------+---------+-------+-----------+------------------------------------+ [19:59:21] addshore: reindex should not do anything with TermSqlIndex [19:59:23] that query is still running [19:59:25] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) is CRITICAL: Test Random title redirect returned the unexpected status 504 (expecting: 303): /en.wikipedia.org/v1/page/metadata/{title}{/revision} (Get extended metadata of a test page) is CRITICAL: Test Get extended metadata of a test page returned the unexpected status 404 (expecting: 200) [19:59:25] SMalyshev: ack [19:59:25] PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/page/random/{format} (Random title redirect) is CRITICAL: Test Random title redirect returned the unexpected status 504 (expecting: 303) [19:59:29] TermSqlIndex is wbsearchentities and alike [19:59:38] Guest79402: /msg NickServ ID [19:59:42] Guest79402: removing property suggestor doesnt wbsearchentities still use it? [19:59:43] can we just disable the API module entirely for now? [19:59:45] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) timed out before a response was received [19:59:48] legoktm: I can make a patch [19:59:48] I know, I saved the pass and I can't find it [19:59:50] (which shouldn't even be used anymore as it should use Elastic now?) [19:59:53] $wgAPIModules['wbsearchentities'] = 'ApiDisabled'; [19:59:57] but maybe from Lua/Client [20:00:01] although that will basically break most of wikidata [20:00:03] Wikibase client would still use TermSqlIndex [20:00:15] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) timed out before a response was received [20:00:21] PROBLEM - MariaDB Slave IO: s8 on db1087 is CRITICAL: CRITICAL slave_io_state could not connect [20:00:27] wbsearch entities use elastic as backend [20:00:29] Guest79402: if irccloud uses that for login, maybe just reconnect? 
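The statement behind that EXPLAIN is not pasted in full; a hedged reconstruction from the conditions quoted later in the discussion (the search text and selected columns are illustrative, not the literal query):

```php
<?php
// Hedged reconstruction only. The point is its shape: the WHERE clause no
// longer lines up well with the remaining wb_terms indexes, so MariaDB
// estimates ~161M rows to examine per call.
$sql = "
    SELECT /* Wikibase\Lib\Store\Sql\TermSqlIndex::getMatchingTerms */
        term_entity_type, term_type, term_language,
        term_text, term_entity_id, term_full_entity_id
    FROM wb_terms
    WHERE term_language = 'en'
      AND term_type = 'label'
      AND term_entity_type = 'property'
      AND term_search_key LIKE 'berlin%'   -- hypothetical search text
";
```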
[20:00:36] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [20:00:43] Guest79402: /nick Amir_________________ [20:00:45] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/page/random/{format} (Random title redirect) timed out before a response was received: /api/rest_v1/page/title/{title}{/revision} (Get rev by title from storage) timed out before a response was received [20:00:46] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy [20:00:46] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [20:00:46] RECOVERY - Restbase edge codfw on text-lb.codfw.wikimedia.org is OK: All endpoints are healthy [20:00:49] Guest79402: yes, but the analogues from client/Lua would not [20:01:05] Let's forget about Amir1's nicks troubles ;) [20:01:06] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy [20:01:06] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [20:01:13] marostegui: exactly [20:01:22] Guest79402: indeed, so where the hell are these queries coming from? [20:01:23] SMalyshev: we fixed that too AFAIK [20:01:23] 10Operations, 10MediaWiki-extensions-Translate, 10Language-2018-Apr-June, 10Wikimedia-Incident, 10Wikimedia-log-errors: 503 error attempting to open multiple projects (Wikipedia and meta wiki are loading very slowly) - https://phabricator.wikimedia.org/T195293#4229844 (10Urbanecm) Definitely T195520 [20:01:29] Guest79402: the explain I pasted above is still hitting us [20:01:31] addshore: property suggester [20:01:36] RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy [20:01:37] which is disabled? [20:01:40] good olde property suggester [20:01:45] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy [20:01:45] PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/page/random/{format} (Random title redirect) is CRITICAL: Test Random title redirect returned the unexpected status 504 (expecting: 303): /api/rest_v1/data/citation/{format}/{query} (Get citation for Darth Vader) is WARNING: Test Get citation for Darth Vader responds with unexpected value at path [0]/itemType = webpage [20:01:47] Guest79402: hmm I don't think so but maybe I missed something... the bug is still open [20:02:22] marostegui: can you see which ips are doing those queries? [20:02:24] Guest79402: https://phabricator.wikimedia.org/T194143 [20:02:32] PROBLEM - MariaDB Slave IO: s8 on db1087 is CRITICAL: CRITICAL slave_io_state could not connect [20:02:32] addshore: live hack a wfGetAllCallers() into a debug log? [20:02:46] We have to disable that query, it doesn't even finishh when I try to run it, so it is kilinng the servers [20:02:47] Guest79402: wait, not that one.. [20:02:50] mark: let me check [20:02:52] let me locate the right one [20:02:54] addshore: one very weird way would be to kill all those queries [20:02:54] (03CR) 10Aaron Schulz: profile::mediawiki::mcrouter_wancache: add ssl, proxy support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/431737 (https://phabricator.wikimedia.org/T192370) (owner: 10Giuseppe Lavagetto) [20:02:56] mark: a random one was mw1311 [20:03:10] marostegui: can we put a regex based query killer? 
[20:03:11] fyi: I'm working on incident documentation [20:03:24] twentyafterfour: thank you [20:03:26] mark: ie: mw1280.eqiad.wmnet. [20:03:37] volans: so a jobrunner [20:03:39] mw1275, mw1319, etc... [20:03:53] 1280 is api [20:03:55] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/page/metadata/{title}{/revision} (Get extended metadata of a test page) is CRITICAL: Test Get extended metadata of a test page returned the unexpected status 404 (expecting: 200) [20:03:56] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) is CRITICAL: Test Random title redirect returned the unexpected status 504 (expecting: 303) [20:04:06] mw1305, mw1273 [20:04:15] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/random/{format} (Random title redirect) is CRITICAL: Test Random title redirect returned the unexpected status 504 (expecting: 303) [20:04:23] Guest79402: I can try, but we better look for something else - not sure it will work well [20:04:26] I will check now [20:04:35] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) timed out before a response was received [20:04:35] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/mobile-sections/{title}{/revision}{/tid} (retrieve en.wp main page via mobile-sections) is CRITICAL: Test retrieve en.wp main page via mobile-sections returned the unexpected status 404 (expecting: 200): /{domain}/v1/page/metadata/{title}{/revision}{/tid} (retrieve extended metadata for Video article on English Wikipedia) is CRITICAL: Te [20:04:35] d metadata for Video article on English Wikipedia returned the unexpected status 404 (expecting: 200): /{domain}/v1/page/random/title (retrieve a random article) is CRITICAL: Test retrieve a random article returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/mobile-sections-lead/{title}{/revision}{/tid} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test retrieve lead se [20:04:35] incham page via mobile-sections-lead returned the unexpected status 404 (expecting: 200) [20:04:37] addshore: are you still trying to figure out where the queries are coming from? [20:04:46] legoktm: looking at code currently [20:04:51] PROBLEM - MariaDB Slave IO: s8 on db1087 is CRITICAL: CRITICAL slave_io_state could not connect [20:04:56] marostegui: very short solution so it doesn't bring down the whole thing [20:04:58] 10Operations, 10Wikimedia-General-or-Unknown, 10Wikimedia-log-errors: Multiple projects reporting Cannot access the database: No working replica DB server - https://phabricator.wikimedia.org/T195520#4229709 (10Jarekt) On Commons I suddenly see [[ https://commons.wikimedia.org/wiki/Category:Pages_with_script_... 
[20:05:05] PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/page/random/{format} (Random title redirect) is CRITICAL: Test Random title redirect returned the unexpected status 504 (expecting: 303) [20:05:30] let's see [20:05:40] just arrived home [20:05:46] Guest79402: this one still open: https://phabricator.wikimedia.org/T177453 [20:06:13] oh articles placeholder [20:06:13] addshore: https://paste.fedoraproject.org/paste/xD9WXsueb3d4eqe4bfR0Hw [20:06:15] RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy [20:06:17] deploy overloaded the database, creating 30-minute queries, or something else? [20:06:25] PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/page/random/{format} (Random title redirect) is CRITICAL: Test Random title redirect returned the unexpected status 504 (expecting: 303) [20:06:56] SMalyshev: what about lua modules? do you know about them? [20:06:56] <_joe_> jynus: a dropped index [20:06:56] jynus: no, the query we saw earlier on wikidata bringing down the DBs [20:06:59] jynus: so far seems unrelated to the train, might be related to an index that was dropped, only s8 [20:06:59] jynus: the deploy didn't actually happen. it seems related to wikidata property suggested [20:07:05] _joe_: called "tmp1" [20:07:05] Guest79402: Lua is client, not? [20:07:12] yes [20:07:23] Guest79402: so same applies I assume [20:07:32] that's the hardest part [20:07:53] addshore: disabling data access? [20:07:56] Guest79402: yeah the problem there is that we don't have configs we need for search, since they belong to another wiki [20:07:58] um [20:08:00] _joe_: I added another mcrouter comment. I didn't notice PS6/7, heh, it looks like you already did what I was thinking. [20:08:03] then it is https://phabricator.wikimedia.org/T194273#4228564 [20:08:11] why can't we just no-op the method that's failing?? [20:08:18] that would be idea [20:08:21] *ideal [20:08:21] Guest79402: if we had some kind of internal request mechanism or something... [20:08:21] <_joe_> AaronSchulz: ack, we're in the middle of an outage [20:08:25] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/page/random/{format} (Random title redirect) is CRITICAL: Test Random title redirect returned the unexpected status 504 (expecting: 303) [20:08:28] 10Operations, 10Wikimedia-General-or-Unknown, 10Wikimedia-log-errors: Multiple projects reporting Cannot access the database: No working replica DB server - https://phabricator.wikimedia.org/T195520#4229709 (10Classicwiki) Still getting this error on Twinkle rollbacks while trying to combat vandalism: "Grabb... [20:08:29] <_joe_> legoktm: I proposed that earlier [20:08:32] <_joe_> let's do that [20:08:35] ok [20:08:38] I'm going to do that [20:08:44] Guest79402: any objections? [20:08:44] <_joe_> please do [20:08:49] random thought: how about disabling the PropertySuggester extension [20:08:51] nope [20:08:52] TermSqlIndex::getMatchingTerms() return [] [20:09:01] <_joe_> yes [20:09:03] I can do that [20:09:03] legoktm: yeah disable the query generation [20:09:07] DanielK_WMDE_: we did and dind't help, client still uses wb_terms for search [20:09:08] no idea what it might fuck up though [20:09:17] DanielK_WMDE_: I'm Amir btw [20:09:18] wmf.4 or wmf.5? 
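A sketch of the "no-op the method that's failing" hot fix being proposed here; the real patch lives in Wikibase's TermSqlIndex.php and is only paraphrased in the log, so the signature below is simplified:

```php
<?php
// Hedged sketch of the hot fix: short-circuit before any SQL is built, so
// every caller gets an empty result instead of hammering wb_terms.
class TermSqlIndex {
    public function getMatchingTerms( array $criteria, $termType = null,
        $entityType = null, array $options = []
    ) {
        // HOT FIX (T195520): wb_terms is missing the index this lookup needs;
        // pretend nothing matches until the index is restored.
        return [];

        // ...the original wb_terms SELECT would follow here...
    }
}
```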
[20:09:19] (03Abandoned) 10Aaron Schulz: [WIP] Enable mcrouter on mediawiki memcached nodes [puppet] - 10https://gerrit.wikimedia.org/r/433913 (https://phabricator.wikimedia.org/T194225) (owner: 10Aaron Schulz) [20:09:25] PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/page/random/{format} (Random title redirect) is CRITICAL: Test Random title redirect returned the unexpected status 504 (expecting: 303) [20:09:27] Guest79402: hey Amir [20:09:35] <_joe_> Amir, give yourself a name like Amir12343542 [20:09:36] <_joe_> :P [20:09:40] .5 [20:09:42] given it's broken, we'll live with whatever it fucks up, I'd say [20:09:45] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) is CRITICAL: Test Random title redirect returned the unexpected status 504 (expecting: 303) [20:10:02] <_joe_> apergos: well it can create a spurious edit history in some cases, I fear [20:10:09] 10Operations, 10Wikimedia-General-or-Unknown, 10Wikimedia-log-errors: Multiple projects reporting Cannot access the database: No working replica DB server - https://phabricator.wikimedia.org/T195520#4229709 (10Pigsonthewing) In the UK, I'm still seeing "No working replica DB server errors", on Wikidata & Wik... [20:10:15] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [20:10:19] * AaronSchulz reads backscroll [20:10:23] Guest7762: what do you mean by "client still uses wb_terms for search"? what kind of search? [20:10:25] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy [20:10:27] apergos: It probably mean disabled data access for when people look up property names [20:10:35] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [20:10:36] RECOVERY - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is OK: All endpoints are healthy [20:10:36] RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy [20:10:36] Guest74533: looking up properties by name? is that it? [20:10:37] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [20:10:37] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy [20:10:40] it's acceptable loss I think [20:10:44] <_joe_> ok [20:10:46] RECOVERY - Restbase edge codfw on text-lb.codfw.wikimedia.org is OK: All endpoints are healthy [20:10:46] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy [20:10:47] it's not awesome but it will do for now [20:10:49] DanielK_WMDE_: and items [20:10:50] ok, I am now killing those queries only [20:10:52] <_joe_> let's go that way [20:10:55] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [20:11:03] Amir1111WTF: items cannot be looked up by label. we should not have code that does that. [20:11:04] syncing [20:11:06] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy [20:11:07] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [20:11:08] https://gerrit.wikimedia.org/r/#/c/435005/ [20:11:17] <_joe_> marostegui: that should mostly fix things [20:11:19] lots of red things are going green! 
[20:11:20] DanielK_WMDE_: so only properties [20:11:25] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy [20:11:42] !log legoktm@tin Synchronized php-1.32.0-wmf.4/extensions/Wikibase/lib/includes/Store/Sql/TermSqlIndex.php: Disable TermSqlIndex::getMatchingTerms (duration: 01m 20s) [20:11:45] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy [20:11:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:52] Amir1111WTF: that can be no-oped by a monkey patch. will break some pages that then will need too be purged. [20:11:53] legoktm: if you get time, add the log too? [20:11:58] <_joe_> marostegui: you should see those queries disappear now [20:12:04] addshore: yeah, doing that now [20:12:10] seeing a mass drop off in fatal monitor [20:12:13] db1109 is clean [20:12:24] <_joe_> if not, please do tell us [20:12:31] wikidata loads for me again now [20:13:07] DanielK_WMDE_: there are so many ways to fix that [20:13:11] I should write that down [20:13:13] <_joe_> tendril seems to agree things are now better [20:13:20] addshore: can you try to browse it normally? to make sure I am not killing what I shouldn't [20:13:30] Amir1111WTF: PropertyIdResolver [20:13:36] <_joe_> marostegui: stop killing now, the queries should have gone [20:13:36] the icinga criticals are now mostly just the slave lag, or graph-data-based checks that tend to be laggy on recovery [20:13:55] _joe_: they will come back [20:14:02] _joe_: going to stop it on db1109 for instance [20:14:04] 10Operations, 10Wikimedia-General-or-Unknown, 10Wikimedia-log-errors: Multiple projects reporting Cannot access the database: No working replica DB server - https://phabricator.wikimedia.org/T195520#4229709 (10Doc_James) Seems like most of us are having issues. Hope to see this resolved soon. [20:14:22] <_joe_> marostegui: why? the query is now removed [20:14:25] starting to re-enable notifications for things that turned green [20:14:29] !log legoktm@tin Synchronized php-1.32.0-wmf.4/extensions/Wikibase/lib/includes/Store/Sql/TermSqlIndex.php: Add debug logging (duration: 01m 19s) [20:14:31] those lags are at 1k seconds plus... will take a bit [20:14:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:48] marostegui: seems to be, ill keep checking around [20:14:57] so, it's a query against wb_terms that kileld the site? or the property suggester stuff? [20:14:57] marostegui: it seems so, having several safegaurds makes sense [20:15:01] addshore: k, your patch is basically deployed [20:15:03] _joe_: it is not removed [20:15:05] legoktm: thanks [20:15:11] I'm seeing Special:Search + ArticlePlaceholder in the logs [20:15:14] DanielK_WMDE_: several things [20:15:19] the query keeps coming [20:15:22] hmmm restbase/cxserver problems already creeping back in a little [20:15:29] marostegui: the query is still coming even now? [20:15:32] wait [20:15:33] wmf.5? [20:15:35] it all started with this [20:15:36] https://phabricator.wikimedia.org/T194273 [20:15:36] <_joe_> marostegui: you mean you still see that query? [20:15:36] .5 yes [20:15:38] I didn't hack that yet [20:15:45] <_joe_> ahah [20:15:46] Amir1111WTF: one is probably the trigger, the other stuff just piling on. we have had that before. hard to figure out what when wrong first [20:15:47] one moment [20:15:47] <_joe_> ok [20:16:00] marostegui: the stopped replicas were intentional? 
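The "Add debug logging" sync above is not shown as a diff; a sketch of the wfGetAllCallers() live hack suggested earlier, writing to the ad-hoc channel whose output appears further down (exact placement and wording are assumptions):

```php
<?php
// Hedged sketch: log every call chain reaching the disabled method so the
// offending entry points (special pages, hooks, Lua) can be ranked.
class TermSqlIndex {
    public function getMatchingTerms( array $criteria, $termType = null,
        $entityType = null, array $options = []
    ) {
        // wfGetAllCallers() returns the whole stack as one string; the
        // 'AdHocDebug' channel ends up in /srv/mw-log/AdHocDebug.log.
        wfDebugLog( 'AdHocDebug', __CLASS__ . wfGetAllCallers() );
        return []; // still no-oped while the index is being rebuilt
    }
}
```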
[20:16:00] _joe_: yep, the query isn't disabled [20:16:02] 10Operations, 10Wikimedia-General-or-Unknown, 10Wikimedia-log-errors: Multiple projects reporting Cannot access the database: No working replica DB server - https://phabricator.wikimedia.org/T195520#4229939 (10Bharel) I can confirm it happens on he.wikipedia.org too regarding admin actions. [20:16:06] volans: ? [20:16:07] DanielK_WMDE_: when I said we should fixed wb_terms I meant htis [20:16:14] marostegui: https://tendril.wikimedia.org/host/view/db1071.eqiad.wmnet/3306 [20:16:15] addshore: is it possible that dealing with the dispatch lag overloaded something? [20:16:27] Amir1111WTF: i'm all for it :) [20:16:28] syncing [20:16:36] <_joe_> legoktm: you disabled that method in the wrong version? [20:16:43] DanielK_WMDE_: no idea, possibly (why i mentioned it), but it hasn't done before [20:16:46] _joe_: it seems so [20:16:50] <_joe_> just checking, that would explain what marostegui saw [20:16:52] _joe_: well only half. I think most wikis are still on wmf.4 [20:17:03] volans: I guess they failed before, I have started them [20:17:04] <_joe_> legoktm: heh ok [20:17:19] https://tools.wmflabs.org/versions/ [20:17:24] marostegui: ack, there was no error so I was not sure if you stopped them on purpose ;) [20:17:26] volans: I started a bunch, which ones are still stopped? [20:17:41] marostegui: s8 is ok [20:17:42] all started [20:17:46] !log legoktm@tin Synchronized php-1.32.0-wmf.5/extensions/Wikibase/lib/includes/Store/Sql/TermSqlIndex.php: wmf.5 this time (duration: 01m 19s) [20:17:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:01] RECOVERY - MariaDB Slave IO: s8 on db1087 is OK: OK slave_io_state Slave_IO_Running: Yes [20:18:06] all wikis have the fix now [20:18:13] SpecialItemDisambiguation ? [20:18:21] thats what i see in the logs now legoktm [20:18:24] same [20:18:29] like.... a lot... [20:18:30] addshore: that's a radical patch :) that also disables uniqueness checks, right? [20:18:34] addshore: I remember we fixed it for conflict detection [20:18:37] DanielK_WMDE_: no idea [20:18:39] I'm pretty sure [20:18:47] should be just readonly mode also for a bit? :/ [20:18:57] I did say above I wasn't sure what that would end up killing [20:19:07] <_joe_> addshore: the issue is with read queries [20:19:10] DanielK_WMDE_: does this all mean we need to accelerate finding solution for https://phabricator.wikimedia.org/T177453 ? or it's unrelated? [20:19:12] legoktm: I think lets turn off SpecialItemDisambiguation [20:19:17] addshore: oh, i forget we *had* SpecialItemDisambiguation. [20:19:21] yup [20:19:25] is something hitting that hard right now? [20:19:28] thats where all of the callers are coming from [20:19:29] how are the database servers looking now? [20:19:46] Krenair: They are fine now, as I am killing on a loop the bad query [20:19:53] can this be an attack? just saying [20:20:11] no one uses ItemDisambiguation [20:20:13] I'm sure that discussion is being had in a different channel [20:20:21] <_joe_> the query throughput on s8 is almost back [20:20:21] Amir1111WTF: I am surprised it was like this all of a sudden [20:20:26] addshore: you broke it :) https://www.wikidata.org/wiki/Special:ItemDisambiguation?language=en&label=Berlin [20:20:30] marostegui: theoretically the queries should stop now... [20:20:32] looks at the web requests its being hammerd like hell [20:20:33] Do you guys want me to stop killing queries to see what happens? 
[20:20:34] <_joe_> https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=s8&var-role=All&from=now-3h&to=now [20:20:56] marostegui: please don't [20:21:05] addshore: *sigh* we should have rate limits for *everything*. [20:21:05] its 4 chan [20:21:06] heh [20:21:13] addshore: seriously? [20:21:17] Amir1111WTF: why not? [20:21:18] looking at what I see in the logs [20:21:46] the word 4 chan is in one of the requests to Special:ItemDisambiguation [20:21:48] :| [20:22:04] <_joe_> heh I feared something like this [20:22:09] :? [20:22:11] addshore: I thought it's getting through. If not, feel free [20:22:18] legoktm: ^ sorry [20:22:24] https://wikitech.wikimedia.org/wiki/Incident_documentation/20180524-wikidata [20:22:46] right [20:22:51] marostegui: I think it's ok to stop killing queries now, if new ones are coming through we have a different problem [20:22:55] proposal: wait for things to calm down, yank Special:ItemDisambiguation, undo addshore's patch, see what happens. [20:22:57] can we disable the special page and re enable that other method? [20:23:01] DanielK_WMDE_: yeh [20:23:06] legoktm: let me stop on one server then [20:23:09] and see how it goes [20:23:19] stopped on db1109 [20:23:21] IMO we might as well go readonly for a short period? [20:23:25] <_joe_> addshore: let's confirm we put off the fire [20:23:30] ack [20:23:38] DanielK_WMDE_: ArticlePlaceholder is still causing the queries as well [20:23:50] who uses ArticlePlaceholder [20:23:56] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [20:24:00] 2018-05-24 20:23:54 [WwcfWgpAICoAAEGEdAIAAABB] mw1321 bnwiki 1.32.0-wmf.4 AdHocDebug INFO: Wikibase\Lib\Store\Sql\TermSqlIndexinclude/MediaWiki->run/MediaWiki->main/MediaWiki->performRequest/SpecialPageFactory::executePath/SpecialPage->run/SpecialSearch->execute/SpecialSearch->showResults/Hooks::run/Hooks::callHook/ArticlePlaceholder\SearchHookHandler::onSpecialSearchResultsAppend/ArticlePlaceholder\SearchHookHandler->addToSearch/ [20:24:00] ArticlePlaceholder\SearchHookHandler->getTermSearchResults/ArticlePlaceholder\SearchHookHandler->searchEntities/Wikibase\Lib\Interactors\DispatchingTermSearchInteractor->searchForEntities/Wikibase\Lib\Interactors\TermIndexSearchInteractor->searchForEntities/Wikibase\Lib\Interactors\TermIndexSearchInteractor->getMatchingTermIndexEntries/Wikibase\Lib\Interactors\TermIndexSearchInteractor->getFallbackMatchedTermIndexEntries/Wikibase\Lib\ [20:24:00] Store\Sql\TermSqlIndex->getTopMatchingTerms/Wikibase\Lib\Store\Sql\TermSqlIndex->getMatchingTerms [20:24:05] Amir1111WTF: maybe something just started usuing it a lot. [20:24:07] legoktm: db1109 looking good [20:24:14] there's a SpecialSearch hook [20:24:17] marostegui: sweet [20:24:20] Going to stop all the killing now [20:24:26] legoktm: is ArticlePlaceholder hit more than usual? [20:24:41] it'S a separate extension, we could just disable it [20:24:44] Ok, I have stopped all the killings now [20:24:48] DanielK_WMDE_: it's been hit 526 times since we started logging [20:24:49] Let's monitor [20:24:59] legoktm: how long ago is that? [20:25:06] <_joe_> legoktm: have a dashboard on logstash to share? [20:25:12] 10 minutes maybe? [20:25:13] ah, a few minutes. hm. 
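A sketch of the kind of throttle the "rate limits for *everything*" remark is asking for, using MediaWiki's standard pingLimiter mechanism; the action key and the numbers are invented, and no such limit existed on Special:ItemDisambiguation at the time:

```php
<?php
// Hedged sketch: per-IP/per-user throttle for an expensive special page.
// In site config:
$wgRateLimits['wb-itemdisambiguation'] = [
    'ip'   => [ 10, 60 ],  // 10 requests per 60 seconds per IP
    'user' => [ 30, 60 ],
];

// ...and in the special page's execute(), before the expensive term lookup:
//   if ( $this->getUser()->pingLimiter( 'wb-itemdisambiguation' ) ) {
//       throw new ThrottledError(); // renders the standard "rate limited" page
//   }
```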
[20:25:28] We are looking good [20:25:28] _joe_: /srv/mw-log/AdHocDebug.log on mwlog1001 [20:25:35] it's not that much but it's way more than what I expected [20:25:35] <_joe_> ahahah ok [20:25:39] <_joe_> old-style [20:25:39] Can someone let me know what else was disabled? I have missed it? [20:25:43] In the logs I see Wikibase\Repo\Api\QuerySearchEntities->executeGenerator (API request, it looks like), Wikibase\Client\DataAccess\Scribunto\Scribunto_LuaWikibaseLibrary->resolvePropertyId (from Lua), Closure$Wikibase\Client\Hooks\ParserFunctionRegistrant::registerParserFunctions (some parser function?), ArticlePlaceholder\SearchHookHandler->searchEntities -> Wikibase\Lib\Interactors\DispatchingTermSearchInteractor->searchForEntities (via [20:25:44] Special:Search), and that Special:ItemDisambiguation [20:26:27] hmm SpecialSearch should not be using wikibase client? [20:26:46] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [20:26:50] that's scribunto [20:26:53] marostegui: the PropertySuggester extension was turned off, and then we turned function that makes the query into a no-op (TermSqlIndex::getMatchingTerms()) [20:26:55] * DanielK_WMDE_ suspects that ItemDisambiguation triggered an overload, causing resolvePropertyId and ArticlePlaceholder to pile on. [20:26:58] client lua modules [20:27:06] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [20:27:09] DanielK_WMDE_: seems that way [20:27:20] SMalyshev: Wikibase\Lib\Store\Sql\TermSqlIndexinclude → MediaWiki->run → MediaWiki->main → MediaWiki->performRequest → SpecialPageFactory::executePath → SpecialPage->run → SpecialSearch->execute → SpecialSearch->showResults → Hooks::run → Hooks::callHook → ArticlePlaceholder\SearchHookHandler::onSpecialSearchResultsAppend → ArticlePlaceholder\SearchHookHandler->addToSearch → ArticlePlaceholder\SearchHookHandler->getT [20:27:20] ermSearchResults → ArticlePlaceholder\SearchHookHandler->searchEntities → Wikibase\Lib\Interactors\DispatchingTermSearchInteractor->searchForEntities → Wikibase\Lib\Interactors\TermIndexSearchInteractor->searchForEntities → Wikibase\Lib\Interactors\TermIndexSearchInteractor->getMatchingTermIndexEntries → Wikibase\Lib\Interactors\TermIndexSearchInteractor->getFallbackMatchedTermIndexEntries → Wikibase\Lib\Store\Sql\TermSqlIndex-> [20:27:21] getTopMatchingTerms → Wikibase\Lib\Store\Sql\TermSqlIndex->getMatchingTerms [20:27:25] SMalyshev: it does for ArticlePLaceholder integration. [20:27:31] legoktm: ah right, thanks [20:27:35] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [20:27:40] Yay, good job everyone. Just wanting to say that I am appreciating your efficient style communication and collaborative process from here. 
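A quick sketch of how the callers quoted above can be tallied out of the ad-hoc log; the file path and caller names come from the log, while the counting itself is just an illustration of where numbers like "hit 526 times" can come from:

```php
<?php
// Hedged sketch: rank the entry points hitting the disabled method by
// counting which known caller appears in each AdHocDebug line.
$counts = [];
$callers = [
    'SpecialItemDisambiguation',
    'ArticlePlaceholder',
    'resolvePropertyId',
    'QuerySearchEntities',
];
foreach ( file( '/srv/mw-log/AdHocDebug.log' ) as $line ) {
    foreach ( $callers as $caller ) {
        if ( strpos( $line, $caller ) !== false ) {
            $counts[$caller] = ( $counts[$caller] ?? 0 ) + 1;
        }
    }
}
arsort( $counts );
print_r( $counts ); // shows which code path dominates the query volume
```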
:) [20:27:45] ohh hooks [20:27:56] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [20:28:06] RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [20:28:09] SMalyshev: great, aren't they? [20:28:26] that could probably be made to work via Elastic I assume [20:28:29] so is this related to the wikibase indext that was dropped? [20:28:38] yes [20:28:51] can we add the index back? [20:28:51] <_joe_> yeah so, we need to re-add that index? [20:29:00] <_joe_> legoktm: it will take 10 hours or so [20:29:03] so are we expecting people to no longer be seeing DB errors? [20:29:05] or use elastic for the queries [20:29:07] legoktm: it'll take time and also it's called tmp1 [20:29:07] <_joe_> marostegui, jynus ? [20:29:07] iirc there was an estimate of 10h to re-add [20:29:13] "marostegui: adding an index there will take more than 10h" [20:29:19] <_joe_> Krenair: in theory, no [20:29:30] Krenair: monkey patch is in place. will be remooved at some point. no telling before that [20:29:32] DanielK_WMDE_: as I said, that's https://phabricator.wikimedia.org/T177453 - now low priority, but we can raise it? [20:29:36] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [20:29:47] wb_terms table is more than 800G, adding an index takes almost a day [20:29:50] <_joe_> can we focus on the current issue please? [20:29:57] rfarrand: thanks for the kind words [20:29:58] SMalyshev: the whole 1.5B rows table should go away [20:30:00] <_joe_> marostegui: I think we should start ASAP [20:30:17] <_joe_> we can reason on all second-order issues later [20:30:21] Amir1111WTF: we have some work to do I think befor that can happen though [20:30:26] Amir1111WTF: I think that's going to take longer than 10 hours [20:30:28] <_joe_> let's get to the point where we can remove the monkey-patch [20:30:29] _joe_: agreed [20:30:30] _joe_: We have to depool/repool, but I can start if we think that is the way to go [20:30:48] so are we ok to keep getMatchingTerms() disabled until then? Or do we want to disable everything that calls that? [20:30:50] what is our alternative? [20:30:50] <_joe_> marostegui: do you see a viable alternative? I can help, we can take turns [20:30:54] legoktm: I can rework the code that they dont' look up the search parts [20:30:55] Amir1111WTF: Would you like to backport https://gerrit.wikimedia.org/r/434807 today? [20:30:55] SMalyshev: yes. if using wb_terms for this is no longer viable, then that task should get high prio. the question is what we do until it's implemented. [20:31:01] it takes some time [20:31:02] so. [20:31:12] I don't want to stay online so long today… [20:31:20] we can live without Special:ItemDisambiguation, ArticlePlaceholder, and maybe even without PropertySuggester [20:31:29] we probably do need resolvePropertyId [20:31:32] hoo: do we have a swat? 
will tackle it [20:31:44] Amir1111WTF: Not yet [20:31:44] DanielK_WMDE_: well I guess some monkey patches... but I think after fixing index it should work short-term? [20:32:11] I will write up actionables [20:32:13] _joe_: either we add the index or we fix it from code. adding the index will take a few days, but I am fine either way. Whatever fixes it faster for now [20:32:19] SMalyshev: i don't know if fixing the index is possible short term. may take days or weeks. [20:32:32] Amir1111WTF: https://wikitech.wikimedia.org/wiki/Incident_documentation/20180524-wikidata [20:32:33] DanielK_WMDE_: that's short term :) [20:32:35] https://etherpad.wikimedia.org/p/wb_terms_solution [20:32:37] <_joe_> Amir1111WTF, DanielK_WMDE_ do you see a way to fix this from the code? [20:32:40] DanielK_WMDE_: yes, I think disabling the 3 of those, but reenabeling that method could be okay? [20:32:48] <_joe_> as in, modifying the query so that it won't kill the db? [20:32:48] SMalyshev: can't live without resolvePropertyId for that long, I'm afraid [20:33:12] <_joe_> I don't think we can live with anything performing that query right now [20:33:44] DanielK_WMDE_: we probably won't have fix for T177453 in days. I mean, we can do local resolution with elastic (though the timing is super-unfortunate because I am out tomorrow till tuesday) but client talking to wikibase from another wiki is hard [20:33:45] T177453: Add wikibase client support for searching wikidata items - https://phabricator.wikimedia.org/T177453 [20:33:48] addshore: i think so. we should try to re-enable one at a time, in order of prio. first resolveEntityId. then PropertySuggester. then ArticlePlaceholder. Then ItemDisambig (or maybe not that one) [20:33:54] <_joe_> marostegui, jynus do you think we could live with that query, if the number of those was smaller? [20:34:15] <_joe_> DanielK_WMDE_: let's hear from the dbas [20:34:20] SMalyshev: i know. [20:34:27] _joe_: I don't know really as we don't really know what's the limit [20:34:52] _joe_: yes. i'd be interested to know *which* query is bad. I suspect it's not the one used by resolvePropertyId. [20:34:53] _joe_: from the explain I'm doing of a query I got from the processlist is checking 140M rows [20:35:00] <_joe_> ok so DanielK_WMDE_ let's try, one by one [20:35:10] the method that addshore disabled is very... amorphous, it can result in many different kinds of queries [20:35:18] DanielK_WMDE_: So far the one we pasted above and the one we were killing [20:35:21] <_joe_> volans: which query specifically? [20:35:24] Once that one was killed/disabled we were fine [20:35:28] <_joe_> marostegui: re-paste the query [20:35:28] DanielK_WMDE_: I have them in tendril [20:35:46] _joe_: we first have to actually *disabled* them individually. right now, we just cut off the querry interface they use. [20:35:58] _joe_: https://phabricator.wikimedia.org/T194273#4228564 [20:36:00] <_joe_> yes, I'm aware of that [20:36:16] i'd suggest to start by disabling the ArticlePlaceholder and PropertySuggester extensions. [20:36:19] marostegui: I've seen also a simplified version of it [20:36:26] I can make a patch that kills Special:ItemDisambig [20:36:26] 10Operations, 10Wikimedia-General-or-Unknown, 10Wikimedia-log-errors: Multiple projects reporting Cannot access the database: No working replica DB server - https://phabricator.wikimedia.org/T195520#4229709 (10Krenair) For the record a workaround is in place which effectively disables some Wikidata-related f... 
[20:36:39] DanielK_WMDE_: thats a 1 liner, just do't register it ;) [20:36:44] indeed [20:36:50] with only WHERE term_language = 'en' AND term_type = 'label' AND term_entity_type = 'property' [20:37:01] addshore: if you you have it open, go ahead. i don't even have master checked out on this maching [20:37:02] but the index is `term_language`,`term_full_entity_id`,`term_type`,`term_search_key`(16), so is not taking full advantage of it [20:37:02] that will cause any instance of SpecialPage::getTitleFor(...) to fail [20:37:04] tbh if we can disable the callers (the extensions) and ive with that for a day while indices are rebuilt, at least that's a clean path to a fix [20:37:07] because missing the term_full_entity_id [20:37:12] <_joe_> volans: well I guess at some point any query on wm_terms had the same fate [20:37:13] addshore: i'm at home, drinking beer ;) [20:37:13] I'm not sure if you can unregister the special page that safely [20:37:27] _joe_: I'm talking about explain done right now [20:37:30] legoktm: hmm? [20:37:42] addshore: DanielK_WMDE_ what about moving all of terms of propeties to a table and then drop the wb_terms [20:37:49] legoktm: aaah, okay [20:38:25] Amir1111WTF: maybe. if needed [20:38:45] most of the things in here are happening because of look up on properties and not itmes [20:38:54] we can index the bleep out that table [20:39:49] Amir1111WTF: you want to make a specialized table for resovlePropertyId? Fine, but make sure it also gets updated. that's not a quick hack. it' [20:39:53] 10Operations: status.wikimedia.org showing all lights green during major outage - https://phabricator.wikimedia.org/T195530#4230010 (10TheDJ) p:05Triage>03Normal [20:39:55] it's work for a couple of days [20:40:28] 10Operations: status.wikimedia.org showing all lights green during major outage - https://phabricator.wikimedia.org/T195530#4230010 (10Krenair) I assume that site just hits varnish on some basic, likely heavily-cached pages? [20:40:40] DanielK_WMDE_: I'm just thinking out loud right now [20:40:56] 10Operations: status.wikimedia.org showing all lights green during major outage - https://phabricator.wikimedia.org/T195530#4230027 (10TheDJ) Status also has checks for "UNCACHED", so... [20:41:02] addshore, legoktm: i didn't get the issue with not registering teh special page [20:41:12] that's what happens when you disable an extension, no? [20:41:25] 10Operations, 10Traffic, 10Browser-Support-Apple-Safari, 10Upstream: Fix broken referer categorization for visits from Safari browsers - https://phabricator.wikimedia.org/T154702#4230029 (10JKatzWMF) @TheDJ Thanks for monitoring this and letting us know! It's a huge relief. [20:41:30] !log legoktm@tin Synchronized php-1.32.0-wmf.4/extensions/Wikibase/lib/includes/Store/Sql/TermSqlIndex.php: no-op, sync with git state (duration: 01m 20s) [20:41:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:53] wtf [20:42:59] Amir1111WTF: so we are not adding the index back then for now, right? [20:43:00] <_joe_> legoktm: uh? 
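A sketch of the "just don't register it" one-liner for Special:ItemDisambiguation. How Wikibase actually registers the page is not shown in the log, so this uses the generic hook approach, and it carries the getTitleFor() caveat raised just above:

```php
<?php
// Hedged sketch, not the actual Gerrit change: drop the page from the
// special-page list so requests get a 404 instead of a wb_terms query.
$wgHooks['SpecialPage_initList'][] = function ( array &$list ) {
    unset( $list['ItemDisambiguation'] );
    return true;
};
// Caveat from the discussion above: anything still calling
// SpecialPage::getTitleFor( 'ItemDisambiguation' ) may then fail, which is
// why simply unregistering the page was not considered entirely safe.
```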
[20:43:03] why is there an unpulled submodule commit again [20:43:32] * Update extensions/VisualEditor from branch 'wmf/1.32.0-wmf.5' [20:43:32] to 402a2e5c957e8f1a1afb42af38924eae4721ed8e [20:43:32] - MobileArticleTarget: Include placeholder for references [20:43:41] marostegui: we should unfortunately, doing any of this will take lots of time [20:43:48] 10Operations, 10Wikimedia-General-or-Unknown, 10Wikimedia-log-errors: Multiple projects reporting Cannot access the database: No working replica DB server - https://phabricator.wikimedia.org/T195520#4230032 (10Framawiki) [20:43:53] <_joe_> Amir1111WTF: ok [20:43:58] these are mid-term solution [20:44:02] Amir1111WTF: ok, then I will start now with one server [20:44:09] <_joe_> DanielK_WMDE_: do you agree? we should re-add the index? [20:44:26] marostegui: you mean the ex tmp1 one? [20:44:36] I mean, ~10hours isn't acually that bad [20:44:45] if thats the short term solution [20:44:47] twentyafterfour: I don't know what you did, but https://gerrit.wikimedia.org/r/#/c/434992/ wasn't deployed properly. And this has been happening multiple times this week [20:44:48] volans: I gues so yeah [20:44:56] _joe_: sorry, just tin was in a messy state. [20:45:01] that's the one that would fix the other query I've seen passing by too [20:45:15] legoktm: I was mid-deploy when the outage happened [20:45:21] scap aborted due to canary checks [20:45:23] <_joe_> hey I need a moment of focus, everyone [20:45:42] <_joe_> should we re-add the index, and keep half of our team awake tonight? [20:45:42] twentyafterfour: no, I mean it was never pulled down properly. the submodule is out of sync with core [20:46:00] legoktm: because I didn't get to finish [20:46:02] <_joe_> I think so, but I need others to make a call too [20:46:09] _joe_: no need to keep anyone awake. The index won't be finished till tomorrow morning [20:46:15] twentyafterfour: so it's not deployed? [20:46:18] _joe_: If you ask my opinion that can happen tomorrow too [20:46:20] legoktm: right [20:46:21] queries would remain disabled until after that [20:46:26] twentyafterfour: it's pulled in on extensions/VisualEditor, but not in core [20:46:31] right [20:46:32] Amir1111WTF: I will prefer to get a server done tonight, so we are not "wasting" a night [20:46:33] so no need for babysitters [20:46:45] +1, so far I tend to think we leave disabled things disabled while waiting for recreation of index [20:46:53] twentyafterfour: that...shouldn't happen [20:46:59] marostegui: agreed [20:47:03] the fallout at present is fairly-well contained, even if it's undesirable [20:47:08] legoktm: I didn't pull from core I pulled the submodule first [20:47:09] yes, let's readd the index [20:47:22] twentyafterfour: why? you're supposed to pull core then git submodule update... [20:47:33] DanielK_WMDE_: Amir1111WTF we should come up with a solid list of things currently disabled / functioning badly so the community know? [20:48:01] addshore, legoktm: https://gerrit.wikimedia.org/r/q/I70a62e6d55e61160cec3c037febcd6fddeaf3b7d [20:48:02] addshore: I talked to Lydia and she said she will let the community know [20:48:08] tomorrow [20:48:10] but IMO none of them are 'super critical' for the next 10 hours [20:48:30] we just might have a small ammount of data to clean up, item constraints and what not after? [20:48:34] <_joe_> ok so, we're all in agreement not to reenable any functionality tonight? 
[20:48:39] (03PS1) 10Marostegui: db-eqiad.php: Depool db1104 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435045 [20:48:42] marostegui: is 10h per host right? [20:48:52] volans: that was a rough guess [20:48:52] addshore: having resolvePropertyId() return false negatives is going to cause confusion and broken renderings. [20:48:55] DanielK_WMDE_: have you tested that? I'm worried that calls to SpecialPage::getTitleFor('ItemDisambiguation') (if there are any) will fail [20:48:58] could be more [20:49:07] DanielK_WMDE_: what is it used in? [20:49:12] sure, but we can do one at a time I guess [20:49:46] I was trying to estimate how long before we can re-enable the feature [20:49:51] in theory adding an index can be done online, but we know metadata locking issues, so better to depool and do one at the time [20:49:58] addshore: all lua code that refers to a property by name [20:50:10] ouch [20:50:18] it's not 10h before re-enabling, was just trying to make clear this fact [20:50:22] that really wouldnt be cool [20:50:23] legoktm: not tested, wanted to make the code available asap. [20:50:23] (03CR) 10Aaron Schulz: [C: 031] mediawiki/hhvm: Move fatal-error.php to Puppet [puppet] - 10https://gerrit.wikimedia.org/r/379953 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle) [20:50:39] will test now. i don't have a decent dev environment set up on thiws box though [20:51:02] <_joe_> AaronSchulz: btw, I have alternative proposals for how to configure mcrouter that should be faster [20:51:13] https://etherpad.wikimedia.org/p/wb_terms_solution [20:51:21] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1104 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435045 (owner: 10Marostegui) [20:51:24] and then I move this to the incident report [20:51:26] addshore: yes - so my idea is to undeploy the suspect extensions and yank the special page, but undo your patch to make resolveEntityId work again [20:51:27] <_joe_> but we can discuss that after we have a working PoC, I still have issues with PKI and TLS certs to issue in a clean way [20:51:32] DanielK_WMDE_: addshore ^ [20:51:38] thanks for that, Amir1 [20:52:32] yw :) [20:52:50] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1104 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435045 (owner: 10Marostegui) [20:53:57] legoktm: 20:53:29 sync-file failed: Failed to acquire lock "/var/lock/scap.operations_mediawiki-config.lock"; owner is "legoktm"; reason is "no-op, sync with git state" [20:54:01] :) [20:54:25] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1104 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435045 (owner: 10Marostegui) [20:54:27] !log legoktm@tin Synchronized php-1.32.0-wmf.5/extensions/Wikibase/lib/includes/Store/Sql/TermSqlIndex.php: no-op, sync with git state (duration: 01m 20s) [20:54:30] marostegui: sorry, I'm getting rid of the live hacks [20:54:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:33] :) [20:54:34] marostegui: I have one more thing to sync [20:54:36] ok [20:54:39] go for it [20:54:56] meh [20:55:00] DanielK_WMDE_: the bad renderings on clients would just result in lack of data right? no data provided by lua opposed to errors etc? 
[20:56:24] !log legoktm@tin Synchronized php-1.32.0-wmf.5/extensions/VisualEditor/: no-op, sync with git state (duration: 01m 21s) [20:56:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:34] Addshore2: that would show itself in a very bad way [20:56:38] marostegui: all done [20:56:42] legoktm: deploying! [20:57:56] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1104 for alter table (duration: 01m 20s) [20:57:59] !log Add tmp1 index back on db1104 [20:58:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:21] 10Operations, 10Wikimedia-General-or-Unknown, 10Patch-For-Review, 10Wikimedia-log-errors: Multiple projects reporting Cannot access the database: No working replica DB server - https://phabricator.wikimedia.org/T195520#4229709 (10Legoktm) A hot fix has been applied to keep the sites up. Wikidata functions... [20:59:35] <_joe_> DanielK_WMDE_ Addshore2 Amir1 do we have a plan on how to selectively reenable the functionality? [20:59:38] <_joe_> any timeline? [20:59:47] _joe_: no timeline. [20:59:53] <_joe_> ok. [20:59:58] my plan is to first selectively *disable* [21:00:04] _joe_: I put this up to come to write things down and get a plan [21:00:04] https://etherpad.wikimedia.org/p/wb_terms_solution [21:00:08] <_joe_> It's going to take up to 2 days to re-add the index everywhere [21:00:19] <_joe_> DanielK_WMDE_: you have to first disable all, then reenable one by one [21:00:24] <_joe_> and see if the canary survives [21:00:33] i made a patch for the special page https://gerrit.wikimedia.org/r/q/I70a62e6d55e61160cec3c037febcd6fddeaf3b7d . [21:00:47] _joe_: well, yes :) that's what i'm working on [21:00:50] DanielK_WMDE_: I agree, selectively disable, then reenable the search method we butchered [21:01:09] 10Operations, 10Wikimedia-General-or-Unknown, 10Patch-For-Review, 10Wikimedia-log-errors: Multiple projects reporting Cannot access the database: No working replica DB server - https://phabricator.wikimedia.org/T195520#4229709 (10Ladsgroup) Also property suggester is disabled and article placeholder won't... [21:01:11] <_joe_> ok I misunderstood "selectively disable", sorry [21:01:32] addshore: so, pull the extensions, patch out the special püage, then revert your patch. can you do the swatting and config changes? i'm no good at this. [21:01:40] yes [21:02:33] wait [21:02:45] I can deploy [21:03:57] Amir1: that would be glorious, I have already had a horribly long week of deploying... [21:04:12] sure thing [21:04:18] let me get back to my desk [21:04:46] <_joe_> I'm here if you need me ofc [21:05:01] thanks [21:05:16] addshore: see pm [21:07:03] legoktm: should I stpo deploying? [21:07:23] DanielK_WMDE_: probably the issue with looking up properties by label isnt that big? [21:07:29] as most places will actually use the ids right? [21:08:08] addshore: i honestly don't know [21:08:08] Amir1: I'm uncomfortable trying to turn it back on. It seems that no one fully understands the dependencies here given that this all happened because it was assumed that this was totally unused. 
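With db1104 depooled, the missing index can be re-added on that one replica and the host repooled afterwards, repeating per host as agreed above. A sketch only: the log never gives the definition of the dropped tmp1 index, so the column list below is purely illustrative, and in practice the schema change would go through the DBA tooling rather than an ad-hoc client session.

```lang=bash
# Re-create the index on the depooled replica only.
# NOTE: the column list is a placeholder; use the original
# definition of the dropped index.
mysql --host db1104.eqiad.wmnet wikidatawiki <<'SQL'
ALTER TABLE wb_terms
  ADD INDEX tmp1 (term_language, term_type, term_search_key);
SQL
```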
[21:08:14] i don't think it has a big impact on the db [21:08:16] Personally I'd vote to wait until the index is back [21:08:23] i can't tell how bad it is to have it broken [21:08:28] Instead of hoping that the query volume will be low enough to not cause problems [21:08:56] If people's templates are broken and they can use P### instead of relying on labels, that seems like a good enough workaround for me [21:09:19] yeah [21:09:41] legoktm: ...until people start editing templates that are used on millions of pages, causing the servers to die :) [21:09:42] <_joe_> I would suggest to try to optimize the query maybe? [21:09:49] 10Operations, 10Wikimedia-General-or-Unknown, 10Patch-For-Review, 10Wikimedia-log-errors: Multiple projects reporting Cannot access the database: No working replica DB server - https://phabricator.wikimedia.org/T195520#4229709 (10Sunpriat2) I confirm the language for the aliens {F18514237} [21:09:52] if some of them look like they're actually unrelated to the index troubles and were just tossed out too quickly while digging for causes, we could make a case to re-enable. [21:10:05] but I'd be pretty hesitant to re-enable any disabled things we actually know are related [21:10:08] DanielK_WMDE_: Not really. That's already a situation we can handle and do handle well. [21:10:23] _joe_: the problem is the table, that's overly big for any functionality [21:10:27] editing lots of templates just causes pages to be out of date for a while, it doesn't bring the site down [21:10:39] bblack: that was my guess. but it's really just that: a guess. nobody knows *which* query first caused the trouble. we just saw that eventually, they all failed. [21:11:22] DanielK_WMDE_: but failures apart, if we do an explain of them, they are not optimized with the current indexes and take forever [21:11:25] so. can someone tell me how PropertySuggester factors into this? [21:11:31] how did it come under suspicion? [21:11:34] it seems unrelated to me [21:12:26] [12:36:05] addshore: nope, it's because the property suggester uses wb_terms directly [21:12:26] [12:36:59] addshore: marostegui one very quick way is to disable property suggester until we fix it properly [21:12:29] that's why [21:12:29] 10Operations, 10Wikimedia-General-or-Unknown, 10Patch-For-Review, 10Wikimedia-log-errors: Multiple projects reporting Cannot access the database: No working replica DB server - https://phabricator.wikimedia.org/T195520#4230064 (10matmarex) There is a separate task about the broken text: {T195525}. It alrea... [21:13:05] DanielK_WMDE_: when the errors were pointed out I looked at the code and saw property suggester uses it [21:13:08] let me find the code [21:13:12] hmm, SuggestionGenerator::getMatchingIDs, calls termIndex->getTopMatchingTerms [21:13:34] which is in turn called by the get suggestions api module [21:13:38] addshore: Which should be handled by Cirrus on WD [21:14:06] should be or is? [21:14:11] DanielK_WMDE_: https://github.com/wikimedia/mediawiki-extensions-PropertySuggester/blob/master/src/SuggestionGenerator.php#L123 [21:14:16] yup [21:14:17] addshore: oh, right, the non-smart suggestions use it. but i thought we switched that to elastic? [21:14:34] SMalyshev: completion search in entity suggester is using elastic, no? [21:14:44] yes [21:14:53] i thought property suggester would also use that, then. does it bypass elastic and hit wb_terms?
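The "not optimized with the current indexes" claim is checkable with EXPLAIN against the depooled replica. A hedged sketch; the statement below is only a stand-in with an invented shape, since the real offenders are listed in tendril and the etherpad rather than in this channel, and the column names are assumptions:

```lang=bash
mysql --host db1104.eqiad.wmnet wikidatawiki -e "
EXPLAIN
SELECT term_entity_id
FROM wb_terms
WHERE term_search_key LIKE 'berlin%'
  AND term_language = 'en'
  AND term_type = 'label'
LIMIT 2500;"
# A 'type: ALL' access plan or a rows estimate in the millions is
# what 'not optimized, takes forever' looks like in practice.
```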
[21:14:53] hoo: https://phabricator.wikimedia.org/T195490 [21:15:03] yup [21:15:10] suggester doesn't use elastic though, it has its own code [21:15:20] it's not mere search, it's doing some other stuff [21:15:45] I mean property suggester of course [21:15:48] * legoktm -> afk [21:15:58] Amir1: Ah, duh [21:16:05] It should use EntitySearchHelper [21:16:14] Should be easy to replace [21:16:47] (still a bit of work,…) [21:16:51] SMalyshev: oh, PropertySuggester needs to be ported to use EntitySearchHelper. *sigh* [21:17:01] will add that to the etherpad [21:17:03] hoo: it's not actually I was working on it this afternoon but the DI for newTermSearchInteractor needs langauge [21:17:09] it gets messy [21:17:11] hoo: indeed [21:17:13] So, looking at the adhoc logging, it looks like SpecialItemDisambiguation was hit 62k times in the last hour [21:17:29] the regular hits per hour should be closer to 20-30k [21:17:39] FYI [21:17:44] addshore: any patterns in referer or user-agent? [21:17:48] once things are failures, could be retries from timeouts too? [21:18:17] <_joe_> bblack: nope, we're now returning an empty result list [21:18:25] <_joe_> so it "succeeeds" fast [21:18:28] <_joe_> and logs usage [21:18:29] UA sheds nothing, different ips, probably web hosts, and a mix of terms being used [21:18:53] huh. so that thing actually *is* used that much? [21:19:02] how about making *it* use EntitySearchHelper? [21:19:39] I can work on that too but is it a way to see the pattern in the past couple of days? [21:20:07] addshore: ^ [21:20:10] <_joe_> Amir1: pivot I guess? [21:20:14] pattern of? [21:20:19] are we logging also the generated query? IIUIC this method could generate multiple queries, would be useful to get some stats on their type [21:20:28] number of requests to that special page [21:20:44] I wrote a hive query, but the data in hive only goes up to 2018-5-24T19:00 currently (before our issues) [21:21:04] what time did shit hit the spinning blades? [21:21:15] 19:35 [21:21:19] hehe [21:21:19] or 19:32 [21:21:41] wtf is the relationship between EntitySearchHelper and TermSearchInteractor? The interfaces are basically identical. 
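The 62k-hits-per-hour figure above comes from the ad-hoc logging; the Hive query mentioned here would answer the same question from webrequest data once that catches up. A rough sketch, assuming the standard wmf.webrequest table and partition layout (field names are from memory, not from this log):

```lang=bash
hive -e "
SELECT hour, COUNT(*) AS hits
FROM wmf.webrequest
WHERE year = 2018 AND month = 5 AND day = 24
  AND uri_host = 'www.wikidata.org'
  AND uri_path LIKE '/wiki/Special:ItemDisambiguation%'
GROUP BY hour
ORDER BY hour;"
```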
[21:21:54] * DanielK_WMDE_ probably wrote this [21:22:21] DanielK_WMDE_: TermSearchInteractor is the replacement [21:22:26] Other way round [21:22:27] bah [21:22:31] hmhm [21:22:33] other way around [21:22:46] I was working on it but it's not super fast [21:22:51] looks like ~19:27 earlier possible time, but there were other issues much earlier, which could've been early warning signs of impending doom we didn't realize [21:22:55] (to fix it) [21:23:20] addshore: EntitySearchHelper should be the "current" one [21:23:24] hoo: yes [21:24:34] <_joe_> things started around 19:23 [21:24:42] <_joe_> looking at the db monitoring data [21:24:44] 19:18 from tendril [21:26:25] yeh, around 19:20 and then a bigger one from 19:27 [21:27:19] <_joe_> the connections started skyrocketing around 19:23, up to 19:27, which is when we had a full blown outage [21:27:37] <_joe_> but the issues on databases happened at the time volans mentioned [21:28:13] https://wikitech.wikimedia.org/wiki/Incident_documentation/20180524-wikidata [21:28:47] ~19:22 database connections started rising, 19:27 first alert in irc [21:29:34] 10Operations, 10HHVM, 10Patch-For-Review, 10Vuln-DoS: Long running mediawiki web requests impacts service availability, specially databases - https://phabricator.wikimedia.org/T149421#4230074 (10Krinkle) [21:29:58] I'm querying hadoop [21:30:07] legoktm, _joe_: we have two choices: leave things as they are for now (see https://etherpad.wikimedia.org/p/wb_terms_solution), or start doing the things listed under "short term". [21:30:57] <_joe_> if only etherpad worked for me :P [21:31:00] can you re-paste the link of the etherpad please? [21:31:17] legoktm, _joe_: I'm tempted to try to get resolvePropertyId back, but there is of course some risk. There's also damage done by not getting it back, but I can't really assess how bad that is. [21:31:41] DanielK_WMDE_: do you know what queries it generates? [21:31:45] I'm about to zone out, it's late here. i suppose _joe_ feels the same. [21:32:19] <_joe_> DanielK_WMDE_: I agree with your strategy FWIW [21:32:26] <_joe_> disable things, see what keeps being logged [21:32:46] <_joe_> then maybe try to reenable the low-level functionality and be ready to rollback [21:32:56] Amir1: Can you do backport for the dump patch? [21:32:56] volans: not exactly. that method is very... flexible. it can generate a variety of queries. we could change the monkey patch to log the query though [21:32:59] <_joe_> but I'd like others' opinions, this is of course risky [21:33:04] I can prepare it, if you want [21:33:08] should be super-easy [21:33:15] https://etherpad.wikimedia.org/p/wb_terms_solution [21:33:20] * hoo doesn't want to stay around for SWAT [21:33:25] volans: [21:33:29] hoo: I'm awake [21:33:34] will do it [21:33:34] I think it would be wise to log the queries, to know what we're re-enabling [21:33:35] <_joe_> DanielK_WMDE_: yeah it's very late for almost all of us [21:33:36] Amir1: when have you last slept? [21:33:40] Amir1: THANKS! [21:33:49] thanks for the link apergos [21:33:50] except for me, it's fairly early for me [21:33:54] <_joe_> volans: that too, yes [21:34:00] DanielK_WMDE_: viva coffee [21:34:02] but this isn't my area of expertise. I can drive/watch simple things though. [21:34:07] (Mate atm) [21:34:07] <_joe_> and legoktm too :) [21:34:18] <_joe_> but he's currently afk [21:34:23] _joe_: so maybe leave it for tomorrow? unless bblack wants to play with it over the next couple of hours.
[21:34:40] volans: let me check if i can get the query into the log [21:34:41] I'm not touching anything on my own guesses, I'm too likely to make things worse in this area :) [21:34:54] <_joe_> DanielK_WMDE_: at least disabling higher-level things is harmless until you don't reenable the underlying function [21:34:59] DanielK_WMDE_: I can tell you what are the 2 main queries seen by the DB, if that helps (I guess you already know, but JIC) [21:36:20] in any case, I'm one of the few SREs that shouldn't be eating or sleeping right now, so if things re-blow-up and I'm not talking here on IRC, call me first. [21:37:29] perhaps we should put a simple statsd counter on that method too and monitor call rate as we reintroduce thing? [21:37:31] *things [21:38:12] (03PS1) 10Hoo man: Wikidata entity dumps: Only dump Items and Properties [puppet] - 10https://gerrit.wikimedia.org/r/435056 (https://phabricator.wikimedia.org/T195419) [21:38:15] Amir1 re hadoop, I tihnk you'll have to wait another hour or so for the relevant data to actually be in there [21:39:12] addshore: I'm going back in time [21:39:32] Amir1, addshore, _joe_, volans: https://gerrit.wikimedia.org/r/q/Id9fdc74829e6268ecc3861602adf6666c2eaffc4 [21:39:36] back in time ?:P [21:39:44] :D [21:39:50] today we had 30K request per hour since 00 [21:39:57] I'm checking yesterday [21:39:57] (03CR) 10Hoo man: "Note: Requires b9d0465ce37fb78d706cca6ec189f10296614705 to be in place." [puppet] - 10https://gerrit.wikimedia.org/r/435056 (https://phabricator.wikimedia.org/T195419) (owner: 10Hoo man) [21:40:11] Amir1: yes, that sounds normal [21:40:20] Amir1: I checked 3 days and its all around that level [21:40:47] https://www.irccloud.com/pastebin/PSXsbToO/ [21:40:49] Amir1: ^^ [21:40:56] still doesn't make any sense to me [21:41:04] volans: that patch should get the query into the logs. now, as to getting them out of the logs, don't ask me... i lost track of the monitoring infrastructure changes a couple of years ago :P [21:41:25] <_joe_> DanielK_WMDE_: AdHoc.log on mwlog1001 :P [21:41:26] so. [21:41:27] could all be red herrings [21:41:28] lol, yeah that's not that hard [21:41:34] who is going to tell lydia? [21:41:45] DanielK_WMDE_: tell her what? :p [21:42:00] <_joe_> I see... you're drawing the short straw right now [21:42:06] that wikidata went up in smoke, and several things are now emergency-disabled? [21:42:14] I think she noticed? [21:42:14] addshore: I'm giving up on that, yeah [21:42:23] addshore: did she? [21:42:31] DanielK_WMDE_: the patch sounds good to me, okay to deploy that? [21:42:43] DanielK_WMDE_: im talking to her on hangouts now [21:42:46] someone somewhere said she would inform the community tomorrow ;) [21:42:50] <_joe_> shouldnt you deploy that patch DanielK_WMDE_ ? or tomorrow :P [21:43:04] DanielK_WMDE_: Already called her several hours ago [21:43:18] mark: I was [21:43:24] ok [21:45:05] Amir1: DanielK_WMDE_ _joe_ I could sync that patch now [21:45:22] addshore: do you want me to do? [21:45:25] unless Amir1 wants to [21:45:29] I need to get up early, so I am also off to bed [21:45:38] thanks for working through this everyone :) [21:45:49] mark: sleep tight [21:45:53] Amir1: maybe it's best I leave you to tidy things up here and then I can do anything that needs to happen in the morning? [21:46:06] sure thing [21:46:08] Amir1, also, you have my number too right? :P [21:46:09] I'm going to check out also... 
see folks in the morning [21:46:19] addshore: nope I guess [21:47:40] seeing what queries are caused by what callers would probably be quite helpful [21:48:15] yup [21:48:33] 10Operations, 10Wikidata, 10Wikimedia-General-or-Unknown, 10Patch-For-Review, 10Wikimedia-log-errors: Multiple projects reporting Cannot access the database: No working replica DB server - https://phabricator.wikimedia.org/T195520#4230135 (10daniel) Pinging @Lydia_Pintscher. This mostly affected Wikidiat... [21:48:53] Amir1: FYI the way I estimated the current call rate for the special page was 1 hour after logging I did... addshore@mwlog1001:/srv/mw-log$ cat AdHocDebug.log |grep -c "SpecialItemDisambiguation->execute" [21:49:29] but i just realised that there are 2 lines per request in there? [21:49:58] DanielK_WMDE_: so wb_terms now is used for exact matches only? or has some advanced functions? [21:50:15] Amir1: which implies that the rate of hits on that page is no more than normal [21:50:23] https://gerrit.wikimedia.org/r/#/c/435057/ [21:50:33] This is getting merged DanielK_WMDE_ [21:50:40] which suggests something else 'really' caused this, or some combination of things [21:51:07] * hoo will leave soon… is there anything I should look at? [21:51:10] SMalyshev: prefix matches on case folded key. i thought we didn't use that any more, but i guess PropertySuggester still did. [21:51:15] * addshore will also leave now [21:51:27] SMalyshev: resolvePropertyId only does full matches (but still case folded, i think) [21:51:54] I've seen just two main offenders in the slow query log [21:51:55] https://gerrit.wikimedia.org/r/#/c/435042/ [21:51:58] Amir1: yes, go ahead [21:52:00] should we deploy this too? [21:52:13] volans: where can that be seen? [21:52:20] in logstash? [21:52:39] I got them from tendril, I can put them in the etherpad, give me a min [21:52:43] Amir1: i say yes, the special page is broken now. maybe want a +1 from someone else. but i see no problem with just not registering the special page [21:53:03] I agree [21:53:08] addshore: what do you think? [21:53:17] *reads* [21:53:18] Amir1, addshore: is ArticlePlaceholder disabled? [21:53:20] Fine with me [21:53:21] DanielK_WMDE_: ok, i just had a thought if we only have these matches we probably don't need all the fancy profiles we have in repo, we can have one fixed profile in client and always use it, not import profiles from repo... this will allow to do it much easier for client [21:53:25] I can do it [21:53:27] (but could also point to SpecialBlankPage) [21:53:32] DanielK_WMDE_: no, only PropertySuggester is currently disabled [21:53:33] I can go through the list one by one [21:53:43] ArticlePlaceholder search integration needs disabling [21:53:49] the special page should be just fine [21:53:59] SMalyshev: yes, that should be sufficient. from the client, all we need is straightforward lookups. [21:54:07] SMalyshev: may need language fallback, though. [21:54:33] SMalyshev: you pre-expand that, per client. it's always for the content language. it's used by the parser, not interactively. [21:54:37] that would work same way, same query [21:54:39] DanielK_WMDE_: Even language fallbacks are optional, I think [21:54:44] I doubt we have them currently [21:54:47] is it right that SpecialItemDisambiguation calls getMatchingTerms twice per request? [21:54:53] SMalyshev: oh... except by ArticlePlaceholder... that wants more fancy stuff...
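The grep above over-counts if the method writes two lines per request; bucketing the same log per minute on mwlog1001 makes the "no more than normal" conclusion easier to eyeball. A hedged sketch, assuming the adhoc log lines start with a date and a time (the exact log format is an assumption):

```lang=bash
cd /srv/mw-log
# Hits per minute for the suspect caller; halve the numbers if the
# method really does log two lines per request.
grep "SpecialItemDisambiguation->execute" AdHocDebug.log \
  | awk '{print substr($1 " " $2, 1, 16)}' \
  | sort | uniq -c | tail -n 60
```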
[21:54:58] DanielK_WMDE_: No [21:55:03] It's painfully simple [21:55:11] (that's why it's also hit so rarely) [21:55:24] addshore: added to the etherpad [21:55:25] not even prefix search [21:55:25] hoo: it only works with full matches? [21:55:28] Yeah [21:55:32] ok then [21:56:27] LIMIT 2500? heh [21:56:39] addshore, Amir1, hoo: I vote for also disabling ArticlePlaceholder for now [21:56:53] DanielK_WMDE_: sounds okay to me [21:57:00] let's do it [21:57:06] otoh, if we disable stuff, we can't log the query that this stuff generates [21:57:07] i can make a patch for you [21:57:12] DanielK_WMDE_: hmm, true [21:57:16] then i vote don't disable it [21:57:21] that would actualyl be an argument for re-enabling property suggester [21:57:27] DanielK_WMDE_: indeed [21:57:32] ...for a bit. [21:57:35] as long as the root function is still returning [] [21:57:37] ...tomorrow :) [21:57:38] mwdebug? [21:57:50] deploy the query logging patch now, and then we can turn things off one by one after [21:57:53] hoo: yes, the patch is already in. [21:58:04] but if we disable everything that calls the code, we won't see anything :) [21:58:17] anyway. killer patch stays in for the night. [21:58:22] let's figure this out tomorrow [21:58:26] +1 [21:58:35] okay, sounds good to me [21:58:35] yup [21:58:41] thanks all! [22:00:04] 10Operations, 10Wikidata, 10Wikimedia-General-or-Unknown, 10Patch-For-Review, 10Wikimedia-log-errors: Multiple projects reporting Cannot access the database: No working replica DB server - https://phabricator.wikimedia.org/T195520#4229709 (10MrFulano) Some users on ptwiki related that coudn't access the... [22:04:47] right im off [22:06:15] DanielK_WMDE_: jenkins is failing on your patch [22:06:28] of course it fails [22:06:33] phan? [22:06:44] phan currently always fails for Wikibase backports I believe [22:07:01] because lots of phpunit tests fails [22:07:13] not just phan [22:08:22] oh... [22:08:23] heh [22:08:39] wait, of course they will, we commented out some pretty important functionality :p [22:08:46] exactly [22:09:07] * addshore just realized it is friday tommorrow also [22:09:27] btw from what I understand right now Lexeme searches also use wb_terms for displaying data, but the lookup is id->term so it's probably ok? [22:10:51] ah wait it might be using ElasticTermLookup already [22:11:12] afaik lexemes just arent even stored in the terms table [22:11:19] ah ok :) [22:11:35] that means it probably does use elastic :) [22:11:36] afaik [22:11:42] !log ladsgroup@tin Synchronized php-1.32.0-wmf.4/extensions/Wikibase/lib/includes/Store/Sql/TermSqlIndex.php: Log the query that would hit wb_terms. (T195520) (duration: 01m 21s) [22:11:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:11:47] T195520: Multiple projects reporting Cannot access the database: No working replica DB server - https://phabricator.wikimedia.org/T195520 [22:12:46] I wait a little to be sure wmf.4 is fine [22:12:53] nothing goes bad [22:13:37] looks like the query ending with limit LIMIT 2500 is from special search [22:13:58] via ArticlePlaceholder [22:14:02] hmm don't see ElasticTermLookup being used... now I am confused [22:15:01] * hoo leaves for today [22:15:02] See you [22:15:06] * addshore waves [22:15:38] hoo: wmf.5 only? 
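With the logging patch on both branches, attributing queries to callers (for example confirming that the LIMIT 2500 statement really is ArticlePlaceholder via special search) comes down to aggregating the new log entries. A sketch only; it assumes the patch records the calling class next to the SQL, which is defined by the patch itself rather than by anything in this channel:

```lang=bash
# Rough per-caller breakdown of logged wb_terms queries on mwlog1001.
grep "wb_terms" /srv/mw-log/AdHocDebug.log \
  | grep -oE "(Wikibase|PropertySuggester|ArticlePlaceholder)[A-Za-z:>_-]*" \
  | sort | uniq -c | sort -rn | head
```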
[22:15:54] if you're still around [22:15:57] Amir1: hm [22:16:05] If we're sure Wikidata will be on that on Monday, yes [22:16:07] wikidata is already on 5 [22:16:16] yes [22:16:16] if rollback is possible [22:16:20] then both [22:16:51] Ok, so wmf.5 should be enough [22:17:10] in the absolute worst case we can still manually stop the dump to prevent Lexemes in there [22:17:12] we should sync to both [22:17:22] this is used by clients too, and half the stuff is still on .4 [22:17:46] oh, wait, for the dump patch [22:17:47] ignore me [22:18:00] 10Operations, 10Wikidata, 10Wikimedia-General-or-Unknown, 10Patch-For-Review, 10Wikimedia-log-errors: Multiple projects reporting Cannot access the database: No working replica DB server - https://phabricator.wikimedia.org/T195520#4230194 (101339861mzb) It appears to be resolved now in arabic wikipedia [22:19:31] addshore: I was waiting to be sure if nothing break [22:19:38] Already deploying to wmf.5 [22:20:27] im not sure if the log gives us a good representation of calls, because property suggester is disabled currently, but meh [22:20:56] hmm, should we enable it back? [22:20:59] I'm all for it [22:21:06] nah, lets leave it for tommorrow [22:21:11] but there are two offending methods [22:21:12] !log ladsgroup@tin Synchronized php-1.32.0-wmf.5/extensions/Wikibase/lib/includes/Store/Sql/TermSqlIndex.php: Log the query that would hit wb_terms. (T195520) (duration: 01m 20s) [22:21:14] still in the log SpecialItemDisambiguation is by far the biggest caller [22:21:15] * volans about to go off [22:21:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:21:17] T195520: Multiple projects reporting Cannot access the database: No working replica DB server - https://phabricator.wikimedia.org/T195520 [22:21:26] * addshore leaves [22:21:29] o/ [22:21:36] thank you addshore [22:21:38] o/ [22:22:20] yeah looks like LexemeSearch is still using wb_terms for fetching labells. It fetches Wikidata item labels, not lexeme labels, so it works [22:22:46] but we may want to move it to use ElasticTermLookup [22:24:39] * _joe_ off too [22:34:16] fatalmonitor is fine [22:35:33] I'm getting queries for the special page and also I'm working on to fix that thing and incident documentation and sending emails, making phab, etc. I'm around [22:35:42] * Amir1 plays eye of the tiger [23:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear deployers, time to do the Evening SWAT (Max 6 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180524T2300). [23:00:04] Amir1: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:07] 10Operations, 10Traffic, 10Browser-Support-Apple-Safari, 10Upstream: Fix broken referer categorization for visits from Safari browsers - https://phabricator.wikimedia.org/T154702#4230268 (10Nuria) So nice when things add up! I can check on the preview staff, let me know if you want that done [23:00:12] o/ [23:00:19] not testable, related to dumps [23:00:40] I don't think swats are allowed? 
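The logging change goes out twice because, as noted above, half of the wikis are still on wmf.4 while Wikidata is already on wmf.5. The scap invocations behind those two !log entries would look roughly like this (the file paths and summaries are the ones recorded in the log; the working directory is an assumption):

```lang=bash
cd /srv/mediawiki-staging
scap sync-file php-1.32.0-wmf.4/extensions/Wikibase/lib/includes/Store/Sql/TermSqlIndex.php \
  'Log the query that would hit wb_terms. (T195520)'
scap sync-file php-1.32.0-wmf.5/extensions/Wikibase/lib/includes/Store/Sql/TermSqlIndex.php \
  'Log the query that would hit wb_terms. (T195520)'
```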
[23:01:21] if this one doesn't go out by monday, dumps of wikidata will explode [23:01:32] (monday new dumps will be built) [23:04:10] this specific patch is fine and needed, you have my permission [23:04:33] that puppet patch is on someone's radar to get merged before it's too late, too, right? [23:04:37] the related puppet patch, that is [23:04:59] yes [23:05:20] Amir1: feel free to do the deploy yourself [23:05:32] thanks [23:06:24] what do you need for puppet? [23:07:11] mutante: https://gerrit.wikimedia.org/r/#/c/435056/ [23:07:15] hey :) [23:08:42] hi :) [23:08:50] adding these options means we are dumping fewer things? [23:11:25] mutante: in practice we means we don't dump lexemes (new entity types) [23:13:19] !log ladsgroup@tin Synchronized php-1.32.0-wmf.5/extensions/Wikibase: Dumps: Allow several --entity-type arguments (T195420) (duration: 02m 25s) [23:13:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:13:24] T195420: Allow including multiple specific entity types in a Wikibase dump - https://phabricator.wikimedia.org/T195420 [23:13:59] (03CR) 10Dzahn: [C: 032] Wikidata entity dumps: Only dump Items and Properties [puppet] - 10https://gerrit.wikimedia.org/r/435056 (https://phabricator.wikimedia.org/T195419) (owner: 10Hoo man) [23:14:35] (03CR) 10Dzahn: [C: 032] "this is to prevent the wikidata dump from exploding" [puppet] - 10https://gerrit.wikimedia.org/r/435056 (https://phabricator.wikimedia.org/T195419) (owner: 10Hoo man) [23:16:21] mutante: thank you [23:16:53] you're welcome [23:33:21] 10Operations, 10Traffic, 10Wikimedia-Hackathon-2018: Create and deploy a centralized letsencrypt service - https://phabricator.wikimedia.org/T194962#4230355 (10Krenair) So I've got it serving files to a puppet client successfully. Client just has this: ```lang=puppet file { '/etc/centralcerts/testing.pub... [23:40:28] Logs are all fine for a while now, I go get a nap and come back soon. I keep my phone close to me
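Both the Wikibase patch and the puppet change exist so that Monday's JSON dump skips Lexemes. With the now-repeatable flag, the dump run would be driven roughly like this; only --entity-type itself comes from the patch summary, while the script path, wiki name and output handling are assumptions about how the Wikidata dumps are normally invoked:

```lang=bash
# Dump only Items and Properties, leaving out the new Lexeme entities.
mwscript extensions/Wikibase/repo/maintenance/dumpJson.php --wiki wikidatawiki \
  --entity-type item \
  --entity-type property \
  | gzip > wikidata-items-and-properties.json.gz
```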