[00:17:33] PROBLEM - MariaDB Slave Lag: s4 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 969.80 seconds [00:18:05] PROBLEM - MariaDB Slave Lag: s8 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1025.50 seconds [00:18:53] PROBLEM - MariaDB Slave Lag: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 977.11 seconds [00:19:41] PROBLEM - MariaDB Slave Lag: s6 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [00:19:45] PROBLEM - MariaDB Slave Lag: s7 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [00:20:05] PROBLEM - MariaDB Slave Lag: s5 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [00:20:21] PROBLEM - MariaDB Slave Lag: s2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [00:21:31] PROBLEM - MariaDB Slave IO: s8 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [00:21:31] PROBLEM - MariaDB Slave SQL: s8 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [00:21:37] PROBLEM - MariaDB Slave IO: s1 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [00:21:41] PROBLEM - MariaDB Slave SQL: s2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [00:21:45] PROBLEM - MariaDB Slave SQL: s7 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [00:21:51] PROBLEM - MariaDB Slave SQL: s5 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [00:21:53] PROBLEM - MariaDB Slave IO: s6 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [00:21:53] PROBLEM - MariaDB Slave IO: s3 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [00:21:55] PROBLEM - MariaDB Slave IO: x1 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [00:22:01] PROBLEM - MariaDB Slave SQL: m2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [00:22:05] PROBLEM - MariaDB Slave SQL: s3 
on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [00:22:14] ? [00:22:15] PROBLEM - MariaDB Slave IO: s5 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [00:22:25] PROBLEM - MariaDB Slave IO: s4 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [00:22:25] PROBLEM - MariaDB Slave SQL: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [00:22:25] PROBLEM - MariaDB Slave SQL: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [00:22:25] PROBLEM - MariaDB Slave IO: m2 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [00:22:27] PROBLEM - MariaDB Slave IO: s7 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [00:22:29] PROBLEM - MariaDB Slave SQL: m3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [00:22:31] PROBLEM - MariaDB Slave SQL: s4 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [00:22:31] PROBLEM - MariaDB Slave IO: m3 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [00:22:35] PROBLEM - MariaDB Slave IO: s2 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [00:22:41] PROBLEM - MariaDB Slave SQL: s6 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [00:25:37] PROBLEM - Disk space on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused [00:25:51] PROBLEM - DPKG on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused [00:26:21] PROBLEM - dhclient process on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused [00:26:21] PROBLEM - MD RAID on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused [00:26:21] PROBLEM - Check systemd state on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused [00:26:29] PROBLEM - configured eth on stat1007 is CRITICAL: connect to 
address 10.64.21.118 port 5666: Connection refused [00:27:18] that's weird [00:28:04] it's been going on all week. [00:28:10] (dbstore1002) [00:28:22] is the stat1007 thing normal too? [00:30:11] PROBLEM - puppet last run on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused [00:30:39] PROBLEM - MariaDB Slave Lag: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [00:31:05] PROBLEM - MariaDB Slave Lag: m3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [00:31:13] PROBLEM - MariaDB Slave Lag: m2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [00:31:37] Krinkle nope [00:31:41] oh wrong user [00:31:43] * Krenair [00:35:25] RECOVERY - puppet last run on stat1007 is OK: OK: Puppet is currently enabled, last run 16 minutes ago with 0 failures [00:35:31] RECOVERY - Disk space on stat1007 is OK: DISK OK [00:35:45] RECOVERY - DPKG on stat1007 is OK: All packages OK [00:36:03] ^^^ restarted nagios-nrpe-server dunno if that is actually the problem but that's what the alerts were about [00:36:13] RECOVERY - dhclient process on stat1007 is OK: PROCS OK: 0 processes with command name dhclient [00:36:15] RECOVERY - MD RAID on stat1007 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [00:36:15] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational [00:36:23] RECOVERY - configured eth on stat1007 is OK: OK - interfaces up [00:38:07] chaomodus thanks, you may want to log that using !log. 
[00:38:12] okay [00:38:28] !log stat1007 nagios-nrpe-server was off and alerted, restarting fixed it [00:38:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:39:31] thanks [00:41:06] np :) [00:48:33] PROBLEM - tilerator on maps1002 is CRITICAL: connect to address 10.64.16.42 and port 6534: Connection refused [00:48:49] PROBLEM - tilerator on maps1001 is CRITICAL: connect to address 10.64.0.79 and port 6534: Connection refused [01:00:30] (03CR) 10JJMC89: Merge the "extended-uploader" and "autopatrolled" user groups on Commons (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485487 (https://phabricator.wikimedia.org/T214003) (owner: 10Zoranzoki21) [01:08:53] PROBLEM - MariaDB Slave Lag: s2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 3906.63 seconds [01:09:03] PROBLEM - MariaDB Slave Lag: s8 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 4083.71 seconds [01:09:09] PROBLEM - MariaDB Slave Lag: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 3012.47 seconds [01:09:17] RECOVERY - MariaDB Slave SQL: m2 on dbstore1002 is OK: OK slave_sql_state not a slave [01:09:25] PROBLEM - MariaDB Slave Lag: s6 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 4002.19 seconds [01:09:27] PROBLEM - MariaDB Slave Lag: s7 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 4001.66 seconds [01:09:41] RECOVERY - MariaDB Slave Lag: m3 on dbstore1002 is OK: OK slave_sql_lag not a slave [01:09:41] RECOVERY - MariaDB Slave IO: m2 on dbstore1002 is OK: OK slave_io_state not a slave [01:09:43] RECOVERY - MariaDB Slave IO: m3 on dbstore1002 is OK: OK slave_io_state not a slave [01:09:43] PROBLEM - MariaDB Slave Lag: s4 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 4099.77 seconds [01:09:43] RECOVERY - MariaDB Slave SQL: m3 on dbstore1002 is OK: OK slave_sql_state not a slave [01:09:49] PROBLEM - MariaDB Slave Lag: s1 on 
dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 172782.19 seconds [01:09:49] RECOVERY - MariaDB Slave Lag: m2 on dbstore1002 is OK: OK slave_sql_lag not a slave [01:09:51] PROBLEM - MariaDB Slave Lag: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 4035.91 seconds [01:09:51] PROBLEM - MariaDB Slave Lag: s5 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 3991.91 seconds [01:30:29] 10Operations, 10Icinga, 10monitoring, 10cloud-services-team (Kanban): cloudvirt1021/Disk space is CRITICAL - https://phabricator.wikimedia.org/T214325 (10GTirloni) This is where check_disk dies: https://github.com/nagios-plugins/nagios-plugins/blob/master/plugins/check_disk.c#L997 The error happens becaus... [01:30:42] 10Operations, 10Icinga, 10monitoring, 10cloud-services-team (Kanban): cloudvirt1021/Disk space is CRITICAL - https://phabricator.wikimedia.org/T214325 (10GTirloni) 05Open→03Resolved p:05Triage→03Normal a:03GTirloni [02:15:58] looking at maps [02:17:32] !log restarting tilerator on maps100[1-2] [02:17:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:21:17] RECOVERY - tilerator on maps1002 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.038 second response time [02:21:45] RECOVERY - tilerator on maps1001 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.048 second response time [02:52:06] 10Operations, 10Elasticsearch, 10Maps, 10Discovery-Search (Current work): Review Elastic/maps Grafana dashboards - https://phabricator.wikimedia.org/T209812 (10Mathew.onipe) [02:53:54] 10Operations, 10Elasticsearch, 10Maps, 10Discovery-Search (Current work): Review Elastic/maps Grafana dashboards - https://phabricator.wikimedia.org/T209812 (10Mathew.onipe) [03:00:04] kart_: That opportune time is upon us again. Time for a ContentTranslation Draft Purge Script Run deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190122T0300). 
[03:26:23] PROBLEM - puppet last run on mwlog1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:57:45] RECOVERY - puppet last run on mwlog1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [04:01:47] 10Operations, 10Elasticsearch, 10Maps, 10Discovery-Search (Current work): Review Elastic/maps Grafana dashboards - https://phabricator.wikimedia.org/T209812 (10Mathew.onipe) [04:23:45] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 46 probes of 375 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [04:23:59] PROBLEM - HHVM rendering on mw1345 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:25:01] (03PS2) 10Krinkle: [labs] Remove GuidedTour config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483892 (owner: 10MaxSem) [04:25:03] RECOVERY - HHVM rendering on mw1345 is OK: HTTP OK: HTTP/1.1 200 OK - 76100 bytes in 0.520 second response time [04:25:05] (03CR) 10Krinkle: [C: 03+2] [labs] Remove GuidedTour config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483892 (owner: 10MaxSem) [04:25:20] (03PS2) 10Krinkle: [labs] Remove $wgKartographerUsePageLanguage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483893 (owner: 10MaxSem) [04:25:23] (03CR) 10Krinkle: [C: 03+2] [labs] Remove $wgKartographerUsePageLanguage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483893 (owner: 10MaxSem) [04:26:04] (03PS2) 10Krinkle: [labs] Remove $wmgVisualEditorUseSingleEditTab [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483894 (owner: 10MaxSem) [04:26:07] (03CR) 10Krinkle: [C: 03+2] [labs] Remove $wmgVisualEditorUseSingleEditTab [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483894 (owner: 10MaxSem) [04:26:25] (03Merged) 10jenkins-bot: [labs] Remove GuidedTour config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483892 
(owner: 10MaxSem) [04:26:28] (03PS2) 10Krinkle: [labs] Remove $wmgVisualEditorTransitionDefault [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483895 (owner: 10MaxSem) [04:26:32] (03CR) 10Krinkle: [C: 03+2] [labs] Remove $wmgVisualEditorTransitionDefault [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483895 (owner: 10MaxSem) [04:26:44] (03Merged) 10jenkins-bot: [labs] Remove $wgKartographerUsePageLanguage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483893 (owner: 10MaxSem) [04:26:53] (03PS2) 10Krinkle: [labs] Remove $wmgULSCompactLanguageLinksBetaFeature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483896 (owner: 10MaxSem) [04:26:58] (03CR) 10Krinkle: [C: 03+2] "Confirmed." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483896 (owner: 10MaxSem) [04:27:21] (03Merged) 10jenkins-bot: [labs] Remove $wmgVisualEditorUseSingleEditTab [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483894 (owner: 10MaxSem) [04:27:40] (03Merged) 10jenkins-bot: [labs] Remove $wmgVisualEditorTransitionDefault [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483895 (owner: 10MaxSem) [04:28:12] (03Merged) 10jenkins-bot: [labs] Remove $wmgULSCompactLanguageLinksBetaFeature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483896 (owner: 10MaxSem) [04:29:58] (03CR) 10Krinkle: [C: 03+2] "Confirmed. 
For any post-merge reviewers searching for "GettingStartedRunTest" without the "wg" prefix, note that this isn't for "wg", it's" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483897 (owner: 10MaxSem) [04:31:27] (03PS2) 10Krinkle: [labs] Remove $wmgGettingStartedRunTest [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483897 (owner: 10MaxSem) [04:31:36] (03PS2) 10Krinkle: [labs] Remove $wmgUseQuickSurveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483898 (owner: 10MaxSem) [04:31:39] (03CR) 10Krinkle: [C: 03+2] [labs] Remove $wmgUseQuickSurveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483898 (owner: 10MaxSem) [04:32:03] (03PS2) 10Krinkle: [labs] Remove $wmgUseElectronPdfService [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483899 (owner: 10MaxSem) [04:32:08] (03CR) 10Krinkle: [C: 03+2] [labs] Remove $wmgUseElectronPdfService [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483899 (owner: 10MaxSem) [04:32:10] (03CR) 10jenkins-bot: [labs] Remove GuidedTour config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483892 (owner: 10MaxSem) [04:32:12] (03CR) 10jenkins-bot: [labs] Remove $wgKartographerUsePageLanguage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483893 (owner: 10MaxSem) [04:32:14] (03PS2) 10Krinkle: [labs] Remove $wmgUseTemplateWizard [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483900 (owner: 10MaxSem) [04:32:16] (03CR) 10jenkins-bot: [labs] Remove $wmgVisualEditorUseSingleEditTab [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483894 (owner: 10MaxSem) [04:32:18] (03CR) 10jenkins-bot: [labs] Remove $wmgVisualEditorTransitionDefault [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483895 (owner: 10MaxSem) [04:32:20] (03CR) 10jenkins-bot: [labs] Remove $wmgULSCompactLanguageLinksBetaFeature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483896 (owner: 10MaxSem) [04:32:22] (03CR) 10Krinkle: [C: 03+2] [labs] Remove $wmgUseTemplateWizard [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/483900 (owner: 10MaxSem) [04:32:24] (03PS2) 10Krinkle: [labs] Remove $wmgUseLoginNotify [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483901 (owner: 10MaxSem) [04:32:39] (03CR) 10Krinkle: [C: 03+2] "Confirmed subset of prod." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483901 (owner: 10MaxSem) [04:32:49] (03Merged) 10jenkins-bot: [labs] Remove $wmgGettingStartedRunTest [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483897 (owner: 10MaxSem) [04:32:55] (03Merged) 10jenkins-bot: [labs] Remove $wmgUseQuickSurveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483898 (owner: 10MaxSem) [04:33:13] (03Merged) 10jenkins-bot: [labs] Remove $wmgUseElectronPdfService [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483899 (owner: 10MaxSem) [04:33:38] (03Merged) 10jenkins-bot: [labs] Remove $wmgUseTemplateWizard [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483900 (owner: 10MaxSem) [04:34:06] (03Merged) 10jenkins-bot: [labs] Remove $wmgUseLoginNotify [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483901 (owner: 10MaxSem) [04:34:13] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 28 probes of 375 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [04:37:22] (03PS2) 10Krinkle: [labs] Remove $wmgUseCodeMirror [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483904 (owner: 10MaxSem) [04:37:37] (03CR) 10Krinkle: [C: 03+2] "Rebased to resolve merge conflict." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/483904 (owner: 10MaxSem) [04:37:47] (03PS2) 10Krinkle: [labs] Remove $wmgUseAdvancedSearch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483905 (owner: 10MaxSem) [04:38:01] (03CR) 10Krinkle: [C: 03+2] [labs] Remove $wmgUseAdvancedSearch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483905 (owner: 10MaxSem) [04:38:05] (03PS2) 10Krinkle: [labs] Remove $wgAdvancedSearchBetaFeature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483906 (owner: 10MaxSem) [04:38:20] (03CR) 10Krinkle: [C: 03+2] [labs] Remove $wgAdvancedSearchBetaFeature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483906 (owner: 10MaxSem) [04:38:24] (03PS2) 10Krinkle: [labs] Remove $wmgUsePageViewInfo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483907 (owner: 10MaxSem) [04:38:43] (03CR) 10Krinkle: [C: 03+2] [labs] Remove $wmgUsePageViewInfo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483907 (owner: 10MaxSem) [04:38:46] (03Merged) 10jenkins-bot: [labs] Remove $wmgUseCodeMirror [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483904 (owner: 10MaxSem) [04:38:48] (03PS2) 10Krinkle: [labs] Remove $wmgUseLinter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483908 (owner: 10MaxSem) [04:39:02] (03CR) 10Krinkle: [C: 03+2] [labs] Remove $wmgUseLinter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483908 (owner: 10MaxSem) [04:39:07] (03Merged) 10jenkins-bot: [labs] Remove $wmgUseAdvancedSearch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483905 (owner: 10MaxSem) [04:39:09] (03PS2) 10Krinkle: [labs] Remove $wgPopupsOptInStateForNewAccounts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483909 (owner: 10MaxSem) [04:39:27] (03Merged) 10jenkins-bot: [labs] Remove $wgAdvancedSearchBetaFeature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483906 (owner: 10MaxSem) [04:39:34] (03CR) 10Krinkle: [C: 03+2] "Confirmed subset of prod and comment is preserved there." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/483909 (owner: 10MaxSem) [04:39:38] (03PS2) 10Krinkle: [labs] Remove $wmgUseTemplateStyles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483910 (owner: 10MaxSem) [04:39:58] (03Merged) 10jenkins-bot: [labs] Remove $wmgUsePageViewInfo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483907 (owner: 10MaxSem) [04:40:07] (03CR) 10Krinkle: [C: 03+2] [labs] Remove $wmgUseTemplateStyles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483910 (owner: 10MaxSem) [04:40:12] (03PS2) 10Krinkle: [labs] Remove $wgEchoMaxMentionsInEditSummary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483911 (owner: 10MaxSem) [04:40:15] (03Merged) 10jenkins-bot: [labs] Remove $wmgUseLinter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483908 (owner: 10MaxSem) [04:40:51] (03Merged) 10jenkins-bot: [labs] Remove $wgPopupsOptInStateForNewAccounts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483909 (owner: 10MaxSem) [04:41:10] (03CR) 10Krinkle: [C: 03+2] [labs] Remove $wgEchoMaxMentionsInEditSummary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483911 (owner: 10MaxSem) [04:41:14] (03PS2) 10Krinkle: [labs] Remove $wmgUseNewWikiDiff2Extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483912 (owner: 10MaxSem) [04:41:17] (03Merged) 10jenkins-bot: [labs] Remove $wmgUseTemplateStyles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483910 (owner: 10MaxSem) [04:41:32] (03CR) 10Krinkle: [C: 03+2] [labs] Remove $wmgUseNewWikiDiff2Extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483912 (owner: 10MaxSem) [04:42:26] (03Merged) 10jenkins-bot: [labs] Remove $wgEchoMaxMentionsInEditSummary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483911 (owner: 10MaxSem) [04:42:40] (03Merged) 10jenkins-bot: [labs] Remove $wmgUseNewWikiDiff2Extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483912 (owner: 10MaxSem) [04:45:36] (03CR) 10jenkins-bot: [labs] Remove 
$wmgGettingStartedRunTest [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483897 (owner: 10MaxSem) [04:45:38] (03CR) 10jenkins-bot: [labs] Remove $wmgUseQuickSurveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483898 (owner: 10MaxSem) [04:45:40] (03CR) 10jenkins-bot: [labs] Remove $wmgUseElectronPdfService [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483899 (owner: 10MaxSem) [04:45:42] (03CR) 10jenkins-bot: [labs] Remove $wmgUseTemplateWizard [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483900 (owner: 10MaxSem) [04:45:44] (03CR) 10jenkins-bot: [labs] Remove $wmgUseLoginNotify [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483901 (owner: 10MaxSem) [04:45:46] (03CR) 10jenkins-bot: [labs] Remove $wmgUseCodeMirror [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483904 (owner: 10MaxSem) [04:45:48] (03CR) 10jenkins-bot: [labs] Remove $wmgUseAdvancedSearch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483905 (owner: 10MaxSem) [04:45:50] (03CR) 10jenkins-bot: [labs] Remove $wgAdvancedSearchBetaFeature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483906 (owner: 10MaxSem) [04:45:52] (03CR) 10jenkins-bot: [labs] Remove $wmgUsePageViewInfo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483907 (owner: 10MaxSem) [04:45:54] (03CR) 10jenkins-bot: [labs] Remove $wmgUseLinter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483908 (owner: 10MaxSem) [04:45:56] (03CR) 10jenkins-bot: [labs] Remove $wgPopupsOptInStateForNewAccounts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483909 (owner: 10MaxSem) [04:45:58] (03CR) 10jenkins-bot: [labs] Remove $wmgUseTemplateStyles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483910 (owner: 10MaxSem) [04:46:00] (03CR) 10jenkins-bot: [labs] Remove $wgEchoMaxMentionsInEditSummary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483911 (owner: 10MaxSem) [04:46:02] (03CR) 10jenkins-bot: [labs] Remove $wmgUseNewWikiDiff2Extension [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/483912 (owner: 10MaxSem) [04:53:47] (03CR) 10Krinkle: [C: 03+1] Class wrapper for ProductionServices.php etc. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477956 (owner: 10Tim Starling) [04:54:16] (03CR) 10Krinkle: [C: 03+1] "LGTM, fine to land as-is, but a pending question for later," [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477956 (owner: 10Tim Starling) [04:54:47] (03CR) 10Krinkle: "Does this depend on https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/480695/ ?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480419 (owner: 10Tim Starling) [05:40:51] !log kartik@deploy1001 Started deploy [cxserver/deploy@e0ca16b]: Update cxserver to c5ff0bf [05:40:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:45:06] !log kartik@deploy1001 Finished deploy [cxserver/deploy@e0ca16b]: Update cxserver to c5ff0bf (duration: 04m 15s) [05:45:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:46:02] 10Operations, 10Gerrit, 10Icinga, 10monitoring, 10Release-Engineering-Team (Backlog): Install "healthcheck" plugin on gerrit - https://phabricator.wikimedia.org/T214326 (10greg) p:05Triage→03Normal The beneficial information, it reports specific status of these sub-parts of Gerrit (from the plugin re... [06:12:51] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS1299/IPv4: Connect, AS1299/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [06:13:14] 10Operations, 10Analytics, 10Analytics-Kanban, 10Product-Analytics: dbstore1002 Mysql errors - https://phabricator.wikimedia.org/T213670 (10Marostegui) Another crash happened last night ` Thread pointer: 0x0x0 Attempting backtrace. You can use the following information to find out where mysqld died. If you... 
[06:13:43] RECOVERY - MariaDB Slave IO: s5 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [06:13:53] RECOVERY - MariaDB Slave SQL: s7 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [06:13:58] (03PS7) 10Marostegui: dbstore_multiinstance: Add staging db [puppet] - 10https://gerrit.wikimedia.org/r/485367 (https://phabricator.wikimedia.org/T210478) [06:13:59] RECOVERY - MariaDB Slave IO: s1 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [06:13:59] RECOVERY - MariaDB Slave SQL: s1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [06:13:59] PROBLEM - MariaDB Slave SQL: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Update_rows_v1 event on table wikidatawiki.echo_notification: Cant find record in echo_notification, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1069-bin.000317, end_log_pos 979910253 [06:13:59] RECOVERY - MariaDB Slave IO: s4 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [06:13:59] RECOVERY - MariaDB Slave IO: s7 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [06:14:00] RECOVERY - MariaDB Slave SQL: s6 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [06:14:01] RECOVERY - MariaDB Slave SQL: s4 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [06:14:13] RECOVERY - MariaDB Slave IO: s2 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [06:14:17] RECOVERY - MariaDB Slave SQL: s3 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [06:14:23] RECOVERY - MariaDB Slave SQL: s2 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [06:14:25] RECOVERY - MariaDB Slave IO: s8 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [06:14:35] RECOVERY - MariaDB Slave SQL: s5 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [06:14:39] RECOVERY - MariaDB Slave IO: x1 on dbstore1002 is OK: OK 
slave_io_state Slave_IO_Running: Yes [06:14:39] RECOVERY - MariaDB Slave IO: s6 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [06:14:39] RECOVERY - MariaDB Slave IO: s3 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [06:14:39] RECOVERY - MariaDB Slave SQL: s8 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [06:15:35] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 40 probes of 375 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [06:20:47] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 21 probes of 375 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [06:22:33] RECOVERY - MariaDB Slave SQL: x1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [06:29:31] PROBLEM - puppet last run on wdqs1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/rsyslog.d/10-puppet-agent.conf] [06:29:39] PROBLEM - puppet last run on phab1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/mysql-ps1.sh] [06:31:35] PROBLEM - puppet last run on mw1278 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/puppet-enabled] [06:32:42] (03PS1) 10Marostegui: db-eqiad.php: Depool db1090:3317 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485739 (https://phabricator.wikimedia.org/T210478) [06:32:47] PROBLEM - puppet last run on mw1300 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/ferm/conf.d/00_main] [06:34:07] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1090:3317 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485739 (https://phabricator.wikimedia.org/T210478) (owner: 10Marostegui) [06:34:17] RECOVERY - MariaDB Slave Lag: x1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 97.30 seconds [06:34:32] 10Operations, 10Analytics, 10Analytics-Kanban, 10Product-Analytics: dbstore1002 Mysql errors - https://phabricator.wikimedia.org/T213670 (10Marostegui) All the replication threads but x1 started fine. I have fixed all the x1 rows that failed and it has now caught up [06:35:26] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1090:3317 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485739 (https://phabricator.wikimedia.org/T210478) (owner: 10Marostegui) [06:36:40] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1090:3317 T210478 (duration: 00m 49s) [06:36:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:36:43] T210478: Migrate dbstore1002 to a multi instance setup on dbstore100[3-5] - https://phabricator.wikimedia.org/T210478 [06:36:44] !log Stop MySQL on db1090:3317 to clone dbstore1003 - T210478 [06:36:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:42:56] !log Deploy schema change on db1078 (s3 master) - T85757 [06:43:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:43:01] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [06:43:54] (03CR) 10Krinkle: [C: 03+1] InitialiseSettings.php: Increase parsercache TTL (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485582 (https://phabricator.wikimedia.org/T210992) (owner: 10Marostegui) [06:45:00] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1090:3317 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485739 
(https://phabricator.wikimedia.org/T210478) (owner: 10Marostegui) [06:47:07] (03PS2) 10Marostegui: InitialiseSettings.php: Increase parsercache TTL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485582 (https://phabricator.wikimedia.org/T210992) [06:48:23] (03CR) 10Marostegui: InitialiseSettings.php: Increase parsercache TTL (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485582 (https://phabricator.wikimedia.org/T210992) (owner: 10Marostegui) [06:55:39] RECOVERY - puppet last run on wdqs1006 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [06:55:45] RECOVERY - puppet last run on phab1002 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [06:57:41] RECOVERY - puppet last run on mw1278 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:58:55] RECOVERY - puppet last run on mw1300 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [07:05:56] (03PS2) 10Marostegui: parsercachepurging: Increase keys TTL [puppet] - 10https://gerrit.wikimedia.org/r/485583 (https://phabricator.wikimedia.org/T210992) [07:11:31] RECOVERY - MariaDB Slave Lag: s6 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [07:16:51] (03CR) 10Elukey: [C: 03+1] dbstore_multiinstance: Add staging db [puppet] - 10https://gerrit.wikimedia.org/r/485367 (https://phabricator.wikimedia.org/T210478) (owner: 10Marostegui) [07:33:59] RECOVERY - MariaDB Slave Lag: s2 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [07:35:23] RECOVERY - MariaDB Slave Lag: s5 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 84.53 seconds [07:41:09] (03PS2) 10Muehlenhoff: Remove Diamond from Kafka hosts [puppet] - 10https://gerrit.wikimedia.org/r/485004 (https://phabricator.wikimedia.org/T212231) [07:45:29] (03CR) 10Muehlenhoff: [C: 03+2] Remove Diamond from Kafka hosts [puppet] - 10https://gerrit.wikimedia.org/r/485004 
(https://phabricator.wikimedia.org/T212231) (owner: 10Muehlenhoff) [07:52:50] (03PS1) 10Elukey: profile::refinery::job::refine: move all crons to timers [puppet] - 10https://gerrit.wikimedia.org/r/485744 (https://phabricator.wikimedia.org/T172532) [07:56:21] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/14413/an-coord1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/485744 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [08:01:38] (03CR) 10Mathew.onipe: maps: migrate maps1002 to stretch (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/485584 (https://phabricator.wikimedia.org/T198622) (owner: 10Mathew.onipe) [08:02:51] (03PS2) 10Muehlenhoff: Remove obsolete Hadoop netboot entries and obsolete analytics-dell recipe [puppet] - 10https://gerrit.wikimedia.org/r/485667 (https://phabricator.wikimedia.org/T156955) [08:05:53] (03PS3) 10Mathew.onipe: maps: migrate maps1002 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/485584 (https://phabricator.wikimedia.org/T198622) [08:05:59] (03CR) 10Muehlenhoff: [C: 03+2] Remove obsolete Hadoop netboot entries and obsolete analytics-dell recipe [puppet] - 10https://gerrit.wikimedia.org/r/485667 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [08:11:05] RECOVERY - MariaDB Slave Lag: s3 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 228.41 seconds [08:14:09] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1090:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485749 [08:14:11] 10Operations, 10Elasticsearch, 10Maps, 10Discovery-Search (Current work): Review Elastic/maps Grafana dashboards - https://phabricator.wikimedia.org/T209812 (10dcausse) Thanks! Few comments on elasticsearch-memory: - "Completion Indices Top 5" graphs have not been converted to the topK, they show more t... 
[08:14:54] !log Compress s7 on dbstore1003 - T210478 [08:14:55] (03PS9) 10Giuseppe Lavagetto: profile::services_proxy: simple local proxying for remote services [puppet] - 10https://gerrit.wikimedia.org/r/483788 (https://phabricator.wikimedia.org/T210717) [08:14:57] (03PS10) 10Giuseppe Lavagetto: mediawiki::common: add proxy for services [puppet] - 10https://gerrit.wikimedia.org/r/483789 (https://phabricator.wikimedia.org/T210717) [08:14:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:58] T210478: Migrate dbstore1002 to a multi instance setup on dbstore100[3-5] - https://phabricator.wikimedia.org/T210478 [08:15:17] RECOVERY - MariaDB Slave Lag: s7 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 240.49 seconds [08:15:46] (03CR) 10jerkins-bot: [V: 04-1] profile::services_proxy: simple local proxying for remote services [puppet] - 10https://gerrit.wikimedia.org/r/483788 (https://phabricator.wikimedia.org/T210717) (owner: 10Giuseppe Lavagetto) [08:18:15] (03PS4) 10Mathew.onipe: maps: migrate maps1002 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/485584 (https://phabricator.wikimedia.org/T198622) [08:18:37] (03PS1) 10Elukey: profile::analytics::refinery::job::eventlogging_to_druid_job: move to timers [puppet] - 10https://gerrit.wikimedia.org/r/485750 (https://phabricator.wikimedia.org/T172532) [08:19:08] (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Depool db1090:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485749 (owner: 10Marostegui) [08:20:36] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/14416/an-coord1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/485750 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [08:20:40] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1090:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485749 (owner: 10Marostegui) [08:21:38] !log marostegui@deploy1001 Synchronized 
wmf-config/db-eqiad.php: Repool db1090:3317 T210478 (duration: 00m 48s) [08:21:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:41] T210478: Migrate dbstore1002 to a multi instance setup on dbstore100[3-5] - https://phabricator.wikimedia.org/T210478 [08:26:21] !log Deploy schema change on dbstore1001:3316 - T210713 [08:26:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:24] T210713: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713 [08:29:30] (03PS1) 10Jcrespo: Depool db1097 (on both sections) for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485752 [08:30:59] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1090:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485749 (owner: 10Marostegui) [08:32:36] (03CR) 10Marostegui: [C: 03+1] Depool db1097 (on both sections) for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485752 (owner: 10Jcrespo) [08:33:40] (03CR) 10Mathew.onipe: "PCC output is expected: https://puppet-compiler.wmflabs.org/compiler1002/14417/" [puppet] - 10https://gerrit.wikimedia.org/r/485584 (https://phabricator.wikimedia.org/T198622) (owner: 10Mathew.onipe) [08:37:09] (03PS2) 10Jcrespo: Depool db1097 (on both sections) for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485752 [08:37:41] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 24, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:37:51] jouncebot: now [08:37:51] No deployments scheduled for the next 3 hour(s) and 22 minute(s) [08:37:53] jouncebot: next [08:37:53] In 3 hour(s) and 22 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190122T1200) [08:38:05] I was about to deploy some maintenance [08:38:08] ^ [08:38:21] (03CR) 10Jcrespo: [C: 03+2] Depool db1097 (on both sections) for maintenance 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/485752 (owner: 10Jcrespo) [08:39:30] (03Merged) 10jenkins-bot: Depool db1097 (on both sections) for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485752 (owner: 10Jcrespo) [08:40:05] jynus: that is okay with me :) [08:40:12] jynus: I'm scheduling some stuff for in 1.5 hours [08:40:17] jouncebot: refresh [08:40:18] I refreshed my knowledge about deployments. [08:40:22] jouncebot: next [08:40:22] In 1 hour(s) and 19 minute(s): WikibaseQualityConstraints configuration changes (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190122T1000) [08:42:40] !log installing policykit-1 security updates on trusty [08:42:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:25] (03PS1) 10Elukey: profile::analytics::refinery::job::project_namespace_map: move to timers [puppet] - 10https://gerrit.wikimedia.org/r/485755 (https://phabricator.wikimedia.org/T172532) [08:44:40] (03CR) 10jenkins-bot: Depool db1097 (on both sections) for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485752 (owner: 10Jcrespo) [08:46:23] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/14418/an-coord1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/485755 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [08:53:05] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1097 (duration: 00m 45s) [08:53:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:45] (03PS1) 10Elukey: profile::statitics::private: remove stat1007 specific bits [puppet] - 10https://gerrit.wikimedia.org/r/485757 [08:55:36] !log elasticsearch: closing indices in search-chi@(eqiad|codfw) moved to other elastic instances (T214052) [08:55:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:39] T214052: Delete indices moved from chi to psi/omega - 
https://phabricator.wikimedia.org/T214052 [09:09:38] 10Operations, 10MediaWiki-General-or-Unknown, 10Multimedia, 10media-storage: Lost file Juan_Guaidó.jpg - https://phabricator.wikimedia.org/T213655 (10jcrespo) a:05jcrespo→03None I am blocked on someone with Mediawiki file metadata workflow knowledge to guide me on what to do here. [09:11:41] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/14419/" [puppet] - 10https://gerrit.wikimedia.org/r/485757 (owner: 10Elukey) [09:14:05] PROBLEM - MariaDB Slave Lag: s6 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 994.28 seconds [09:15:23] (03PS1) 10Marostegui: mariadb: Provision dbstore1004 [puppet] - 10https://gerrit.wikimedia.org/r/485758 (https://phabricator.wikimedia.org/T210478) [09:15:52] marostegui: \o/ [09:15:56] (03PS2) 10Marostegui: mariadb: Provision dbstore1004 [puppet] - 10https://gerrit.wikimedia.org/r/485758 (https://phabricator.wikimedia.org/T210478) [09:16:05] elukey: can you give it a look? 
[09:17:54] (03CR) 10Marostegui: "https://puppet-compiler.wmflabs.org/compiler1002/14420/" [puppet] - 10https://gerrit.wikimedia.org/r/485758 (https://phabricator.wikimedia.org/T210478) (owner: 10Marostegui) [09:18:28] (03PS1) 10Jcrespo: Revert "Depool db1097 (on both sections) for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485760 [09:19:18] (03PS3) 10Marostegui: mariadb: Provision dbstore1004 [puppet] - 10https://gerrit.wikimedia.org/r/485758 (https://phabricator.wikimedia.org/T210478) [09:20:53] (03PS4) 10Marostegui: mariadb: Provision dbstore1004 [puppet] - 10https://gerrit.wikimedia.org/r/485758 (https://phabricator.wikimedia.org/T210478) [09:22:15] 10Operations, 10MediaWiki-Cache, 10Performance-Team, 10User-Elukey: Consider removing the last traces of nutcracker in Mediawiki configs - https://phabricator.wikimedia.org/T214275 (10elukey) This morning while looking at logstash I found something that seems related to enwiki mentioning the nutcracker's u... [09:23:50] !log stop upgrade and restart db1097 [09:23:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:41] (03PS1) 10Marostegui: db-eqiad.php: Depool db1096:3316 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485765 (https://phabricator.wikimedia.org/T210713) [09:37:01] RECOVERY - MariaDB Slave Lag: s6 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 0.93 seconds [09:37:48] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1096:3316 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485765 (https://phabricator.wikimedia.org/T210713) (owner: 10Marostegui) [09:38:56] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1096:3316 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485765 (https://phabricator.wikimedia.org/T210713) (owner: 10Marostegui) [09:39:11] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1096:3316 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485765 (https://phabricator.wikimedia.org/T210713) (owner: 
10Marostegui) [09:40:09] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1096:3316 T210713 (duration: 00m 48s) [09:40:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:12] T210713: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713 [09:40:15] !log Deploy schema change on db1096:3316 - T210713 [09:40:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:23] jynus re: T213655 I think it should be mentioned in SoS, though no pad this week to add outbound SoS notes (cc akosiaris) [09:44:24] T213655: Lost file Juan_Guaidó.jpg - https://phabricator.wikimedia.org/T213655 [09:48:15] godog: hm good point. I'll add it to this week's SoS [09:48:58] (03PS2) 10Jcrespo: Revert "Depool db1097 (on both sections) for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485760 [09:53:02] akosiaris: thanks a lot! [09:55:04] !log repooling maps1003 after upgrade to stretch - T198622 [09:55:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:07] T198622: migrate maps servers to stretch with the current style - https://phabricator.wikimedia.org/T198622 [09:56:16] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1096:3316" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485769 [09:56:28] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=maps,name=maps1003.eqiad.wmnet [09:56:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:41] (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Depool db1096:3316" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485769 (owner: 10Marostegui) [09:58:13] (03PS30) 10DCausse: [cirrus] Cleanup transitional states [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476273 (https://phabricator.wikimedia.org/T210381) [09:58:44] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1096:3316" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/485769 (owner: 10Marostegui) [09:59:40] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1096:3316 T210713 (duration: 00m 47s) [09:59:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:43] T210713: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713 [10:00:04] addshore: Dear deployers, time to do the WikibaseQualityConstraints configuration changes deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190122T1000). [10:04:25] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1096:3316" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485769 (owner: 10Marostegui) [10:05:13] \o [10:05:35] (03PS2) 10Addshore: wikidata: post edit constraint jobs on 10% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484629 (https://phabricator.wikimedia.org/T204031) [10:05:52] marostegui: am I okay to proceed? :) [10:05:58] yeah! [10:06:00] (2x config changes) [10:06:01] ty [10:06:06] (03CR) 10Addshore: [C: 03+2] wikidata: post edit constraint jobs on 10% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484629 (https://phabricator.wikimedia.org/T204031) (owner: 10Addshore) [10:06:06] yw! 
[10:07:13] (03Merged) 10jenkins-bot: wikidata: post edit constraint jobs on 10% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484629 (https://phabricator.wikimedia.org/T204031) (owner: 10Addshore) [10:08:02] (03PS2) 10Addshore: Decrease WBQualityConstraintsTypeCheckMaxEntities [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485628 (https://phabricator.wikimedia.org/T209504) (owner: 10Lucas Werkmeister (WMDE)) [10:08:08] (03PS2) 10DCausse: [elasticsearch] Mark production plugins as mandatory [puppet] - 10https://gerrit.wikimedia.org/r/383345 [10:08:42] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T204031 wikidata: post edit constraint jobs on 10% of edits (duration: 00m 47s) [10:08:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:45] T204031: Deploy regular running of wikidata constraint checks using the job queue - https://phabricator.wikimedia.org/T204031 [10:09:27] marostegui: I have to wait a little bit between the 2 patches, so if you want to do anything now feel free to [10:09:52] addshore: no worries, not going to merge MW config soon :) [10:11:34] (03CR) 10DCausse: [C: 03+1] "These are the plugins we use now minus surrogate-merger which will disappear as we move to 6.5.x" [puppet] - 10https://gerrit.wikimedia.org/r/383345 (owner: 10DCausse) [10:11:41] (03PS5) 10Marostegui: mariadb: Provision dbstore1004 [puppet] - 10https://gerrit.wikimedia.org/r/485758 (https://phabricator.wikimedia.org/T210478) [10:13:19] (03CR) 10Addshore: [C: 03+2] Decrease WBQualityConstraintsTypeCheckMaxEntities [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485628 (https://phabricator.wikimedia.org/T209504) (owner: 10Lucas Werkmeister (WMDE)) [10:14:24] (03Merged) 10jenkins-bot: Decrease WBQualityConstraintsTypeCheckMaxEntities [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485628 (https://phabricator.wikimedia.org/T209504) (owner: 10Lucas Werkmeister (WMDE)) [10:15:17] actually going 
to do 3 changes :) [10:15:37] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T209504 Decrease WBQualityConstraintsTypeCheckMaxEntities from 300 to 150 (duration: 00m 47s) [10:15:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:41] T209504: Perform more constraint type checks in PHP before falling back to SPARQL - https://phabricator.wikimedia.org/T209504 [10:15:46] "WBQualityConstraintsTypeCheckMaxEntities" amazing [10:16:08] wikibase got shortened [10:16:37] (03PS3) 10Addshore: wikidata: post edit constraint jobs on 25% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484630 (https://phabricator.wikimedia.org/T204031) [10:17:05] jynus: got to cut down on those characters! [10:17:25] next it will be WBQCTypeCheckMaxEntities [10:17:37] (03CR) 10jenkins-bot: wikidata: post edit constraint jobs on 10% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484629 (https://phabricator.wikimedia.org/T204031) (owner: 10Addshore) [10:17:39] (03CR) 10jenkins-bot: Decrease WBQualityConstraintsTypeCheckMaxEntities [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485628 (https://phabricator.wikimedia.org/T209504) (owner: 10Lucas Werkmeister (WMDE)) [10:18:46] (03CR) 10Addshore: [C: 03+2] wikidata: post edit constraint jobs on 25% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484630 (https://phabricator.wikimedia.org/T204031) (owner: 10Addshore) [10:19:51] (03Merged) 10jenkins-bot: wikidata: post edit constraint jobs on 25% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484630 (https://phabricator.wikimedia.org/T204031) (owner: 10Addshore) [10:20:51] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T204031 wikidata: post edit constraint jobs on 25% of edits (duration: 00m 45s) [10:20:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:55] T204031: Deploy regular running of wikidata constraint checks 
using the job queue - https://phabricator.wikimedia.org/T204031 [10:23:42] * addshore watches graphs [10:28:34] (03PS3) 10Jcrespo: Revert "Depool db1097 (on both sections) for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485760 [10:30:51] (03CR) 10jenkins-bot: wikidata: post edit constraint jobs on 25% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484630 (https://phabricator.wikimedia.org/T204031) (owner: 10Addshore) [10:35:15] (03CR) 10Jcrespo: [C: 03+2] Revert "Depool db1097 (on both sections) for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485760 (owner: 10Jcrespo) [10:36:21] (03Merged) 10jenkins-bot: Revert "Depool db1097 (on both sections) for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485760 (owner: 10Jcrespo) [10:38:01] RECOVERY - MariaDB Slave Lag: s8 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 174.18 seconds [10:44:02] (03CR) 10jenkins-bot: Revert "Depool db1097 (on both sections) for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485760 (owner: 10Jcrespo) [10:50:56] (03PS1) 10Marostegui: db-eqiad.php: Depool db1098:3316 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485784 (https://phabricator.wikimedia.org/T210713) [10:52:25] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1098:3316 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485784 (https://phabricator.wikimedia.org/T210713) (owner: 10Marostegui) [10:54:24] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1098:3316 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485784 (https://phabricator.wikimedia.org/T210713) (owner: 10Marostegui) [10:54:52] (03PS1) 10Pmiazga: Enable page issues improvements to english wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485787 (https://phabricator.wikimedia.org/T210554) [10:55:39] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1098:3316 T210713 (duration: 00m 45s) [10:55:42] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:43] T210713: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713 [10:55:54] !log Deploy schema change on db1098:3316 - T210713 [10:55:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:42] (03PS2) 10Pmiazga: Enable page issues improvements on English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485787 (https://phabricator.wikimedia.org/T210554) [10:56:53] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1098:3316 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485784 (https://phabricator.wikimedia.org/T210713) (owner: 10Marostegui) [11:23:11] 10Operations, 10MediaWiki Language Extension Bundle, 10MediaWiki-Cache, 10Language-Team (Language-2019-January-March), and 5 others: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10elukey) @Nikerabbit another... 
[11:26:28] 10Operations, 10Patch-For-Review: Reimage analytics1001 to stretch (as an exercise) - https://phabricator.wikimedia.org/T214294 (10jbond) 05Open→03Resolved [11:27:09] (03PS2) 10Jbond: Small change to test merge permissions (now with a different account) [puppet] - 10https://gerrit.wikimedia.org/r/483427 (https://phabricator.wikimedia.org/T213079) [11:27:23] 10Operations, 10Cassandra, 10Dependency-Tracking, 10Wikibase-Quality, and 7 others: Store WikibaseQualityConstraint check data in persistent storage instead of in the cache - https://phabricator.wikimedia.org/T204024 (10Addshore) RFC created as T214362 [11:27:44] (03CR) 10KartikMistry: [C: 03+1] Define ImportSources for nywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485151 (owner: 10Amire80) [11:27:59] (03CR) 10Jbond: [C: 03+2] Small change to test merge permissions (now with a different account) [puppet] - 10https://gerrit.wikimedia.org/r/483427 (https://phabricator.wikimedia.org/T213079) (owner: 10Jbond) [11:31:17] 10Operations, 10MediaWiki Language Extension Bundle, 10MediaWiki-Cache, 10Language-Team (Language-2019-January-March), and 5 others: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10Nikerabbit) There is no sing... [11:39:56] 10Operations, 10Analytics, 10User-Elukey: Archiva relies on a tmpfs directory that is wiped after each reboot - https://phabricator.wikimedia.org/T214366 (10elukey) p:05Triage→03Normal [12:00:04] (03PS5) 10Mathew.onipe: wdqs: convert prom exporter script tp py3 [puppet] - 10https://gerrit.wikimedia.org/r/484974 (https://phabricator.wikimedia.org/T213305) [12:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor I � Unicode. All rise for European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190122T1200). 
[12:00:04] TBhagat, Urbanecm, Amir1, and raynor: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:10] Here [12:00:38] oh, it's swat time! [12:01:11] (03CR) 10Mathew.onipe: wdqs: convert prom exporter script tp py3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/484974 (https://phabricator.wikimedia.org/T213305) (owner: 10Mathew.onipe) [12:01:12] Hi zeljkof, [12:02:10] !log tried and failed to deploy patch for T212118 [12:02:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:14] I can swat today [12:02:18] I think I cleaned up everything, but if something’s funny on the deployment host it may be my fault, sorry [12:02:30] Lucas_WMDE: I'll scream ;) [12:02:30] but I’ll let you SWAT now [12:02:38] :) [12:03:24] Hey, please do mine last. On my way to the office [12:03:57] TIL, always check your phone alarm carefully before sleep [12:04:07] Amir1: :D you still can't deploy? [12:04:30] (03CR) 10Alexandros Kosiaris: Remove externalIP settings (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/484670 (owner: 10Alexandros Kosiaris) [12:04:34] zeljkof: I can [12:04:39] oh, cool [12:04:41] (03PS3) 10Alexandros Kosiaris: Remove externalIP settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/484670 [12:04:48] But commuting atm [12:04:52] 😁 [12:04:57] Amir1: sure, we'll wait for you [12:05:03] TBhagat: ready? [12:05:08] Yup! 
[12:05:16] (03PS3) 10Zfilipin: Enable transwiki user group on ne.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485043 (https://phabricator.wikimedia.org/T214036) (owner: 10Tulsi Bhagat) [12:07:27] (03CR) 10Zfilipin: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485043 (https://phabricator.wikimedia.org/T214036) (owner: 10Tulsi Bhagat) [12:08:33] (03Merged) 10jenkins-bot: Enable transwiki user group on ne.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485043 (https://phabricator.wikimedia.org/T214036) (owner: 10Tulsi Bhagat) [12:08:40] Urbanecm: please stand by, you're next [12:08:43] ack [12:08:54] raynor: around for swat? [12:09:14] !log running mariabackup on dbstore1001:s1 [12:09:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:47] (03CR) 10Zfilipin: "This is scheduled for EU SWAT but I don't see Pmiazga/raynor in #wikimedia-operations." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485787 (https://phabricator.wikimedia.org/T210554) (owner: 10Pmiazga) [12:11:12] TBhagat: the patch is at mwdebug1002, please test and let me know if I can deploy [12:11:55] Seems fine to me. Please deploy! 
[12:12:48] ok, deploying [12:13:51] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:485043|Enable transwiki user group on ne.wikipedia (T214036)]] (duration: 00m 47s) [12:13:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:54] T214036: Enable Transwiki-importer add or remove by sysop on newiki - https://phabricator.wikimedia.org/T214036 [12:14:54] TBhagat: it's deployed, please test and thanks for deploying with #releng ;) [12:15:08] (03PS2) 10Zfilipin: Create extra namespace in kawiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484021 (https://phabricator.wikimedia.org/T212956) (owner: 10Urbanecm) [12:15:58] (03CR) 10Zfilipin: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484021 (https://phabricator.wikimedia.org/T212956) (owner: 10Urbanecm) [12:16:04] Thank you for deploying! zeljkof ;) [12:16:17] Urbanecm: do I need to run a script after deploying 484021? [12:16:23] TBhagat: you're welcome! 
:) [12:16:39] zeljkof, yes, run namespaceDupes.php, please [12:16:48] ok [12:17:03] (03Merged) 10jenkins-bot: Create extra namespace in kawiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484021 (https://phabricator.wikimedia.org/T212956) (owner: 10Urbanecm) [12:17:05] (03CR) 10jenkins-bot: Enable transwiki user group on ne.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485043 (https://phabricator.wikimedia.org/T214036) (owner: 10Tulsi Bhagat) [12:17:19] (03CR) 10jenkins-bot: Create extra namespace in kawiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484021 (https://phabricator.wikimedia.org/T212956) (owner: 10Urbanecm) [12:17:59] Urbanecm: 484021 is at mwdebug [12:18:03] thx, looking [12:18:58] zeljkof, working, please deploy [12:19:03] ok [12:19:45] 10Operations: Rebuild installer images for CVE-2019-3462 - https://phabricator.wikimedia.org/T214368 (10MoritzMuehlenhoff) [12:19:58] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:484021|Create extra namespace in kawiktionary (T212956)]] (duration: 00m 46s) [12:20:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:01] T212956: New namespace "ფორმაწარმოება" for ka.wiktionary - https://phabricator.wikimedia.org/T212956 [12:20:42] Urbanecm: deployed, running scripts [12:20:45] thx [12:20:55] 10Operations, 10Cloud-Services: Update OpenStack images for jessie/stretch for CVE-2019-3462 - https://phabricator.wikimedia.org/T214369 (10MoritzMuehlenhoff) [12:21:11] !log installing apt security updates for stretch [12:21:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:01] Urbanecm: no problems with scripts https://phabricator.wikimedia.org/T212956#4898414 [12:23:06] ack [12:24:05] (03PS2) 10Zfilipin: Remove ability for bureaucrats on outreachwiki to remove bureaucrat flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485493 
(https://phabricator.wikimedia.org/T214133) (owner: 10Urbanecm) [12:24:14] (03CR) 10Zfilipin: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485493 (https://phabricator.wikimedia.org/T214133) (owner: 10Urbanecm) [12:25:18] (03Merged) 10jenkins-bot: Remove ability for bureaucrats on outreachwiki to remove bureaucrat flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485493 (https://phabricator.wikimedia.org/T214133) (owner: 10Urbanecm) [12:27:12] Urbanecm: 485493 is at mwdebug [12:27:18] looking [12:28:15] 10Operations, 10Cloud-Services, 10cloud-services-team (Kanban): Update OpenStack images for jessie/stretch for CVE-2019-3462 - https://phabricator.wikimedia.org/T214369 (10aborrero) p:05Triage→03High a:03aborrero [12:28:17] zeljkof, working, please deploy [12:28:38] ok [12:29:10] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Remove externalIP settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/484670 (owner: 10Alexandros Kosiaris) [12:29:43] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:485493|Remove ability for bureaucrats on outreachwiki to remove bureaucrat flag (T214133)]] (duration: 00m 46s) [12:29:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:46] T214133: Remove ability for bureaucrats on outreachwiki to remove bureaucrat flag - https://phabricator.wikimedia.org/T214133 [12:30:10] (03CR) 10jenkins-bot: Remove ability for bureaucrats on outreachwiki to remove bureaucrat flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485493 (https://phabricator.wikimedia.org/T214133) (owner: 10Urbanecm) [12:30:27] Urbanecm: deployed, please test [12:30:48] working, thanks zeljkof [12:31:36] (03PS1) 10Arturo Borrero Gonzalez: cloudnet2002-dev: cleanup old labtestneutron2002 FQDNs [dns] - 10https://gerrit.wikimedia.org/r/485799 (https://phabricator.wikimedia.org/T214303) [12:32:12] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] 
cloudnet2002-dev: cleanup old labtestneutron2002 FQDNs [dns] - 10https://gerrit.wikimedia.org/r/485799 (https://phabricator.wikimedia.org/T214303) (owner: 10Arturo Borrero Gonzalez) [12:32:32] (03CR) 10Alexandros Kosiaris: [C: 03+1] jenkins: add data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/485094 (owner: 10Dzahn) [12:32:37] ok, moving on [12:33:23] (03CR) 10Zfilipin: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485494 (https://phabricator.wikimedia.org/T150618) (owner: 10Urbanecm) [12:34:28] (03Merged) 10jenkins-bot: Upload HD logos for several projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485494 (https://phabricator.wikimedia.org/T150618) (owner: 10Urbanecm) [12:34:54] Urbanecm: should I deploy 485494 without mwdebug? [12:35:07] yes, please [12:36:25] ok [12:36:43] !log zfilipin@deploy1001 Synchronized static/images/project-logos/: SWAT: [[gerrit:485494|Upload HD logos for several projects (T150618)]] (duration: 00m 46s) [12:36:43] 10Operations, 10ops-codfw, 10DC-Ops: labtestneutron2002: refresh/rename to cloudnet2002-dev - https://phabricator.wikimedia.org/T214370 (10aborrero) p:05Triage→03Normal [12:36:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:46] T150618: Provide HD logos for all projects - https://phabricator.wikimedia.org/T150618 [12:37:18] Just arrived at the office [12:37:32] Urbanecm: it's deployed, please test logos, let me know if I need to purge any [12:37:46] Amir1: just in time, one more patch and swat is yours [12:37:59] looking [12:38:19] (03PS2) 10Zfilipin: Use new logos in IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485495 (https://phabricator.wikimedia.org/T150618) (owner: 10Urbanecm) [12:38:48] zeljkof, lgtm [12:38:54] great [12:39:09] waiting to see if 485495 will pass CI now [12:39:13] (03CR) 10jerkins-bot: [V: 04-1] Use new logos in IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485495 
(https://phabricator.wikimedia.org/T150618) (owner: 10Urbanecm) [12:39:35] thanks [12:39:38] No rush [12:39:38] Urbanecm: 485495 fails CI with https://integration.wikimedia.org/ci/job/operations-mw-config-composer-test-docker/11600/console [12:39:42] looking [12:39:51] Failed asserting that file "/src/tests/../static/images/project-logos/testwiki-1.5x" exists. [12:39:58] !log start stretch upgrade for maps1002 - T198622 [12:40:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:03] T198622: migrate maps servers to stretch with the current style - https://phabricator.wikimedia.org/T198622 [12:40:05] hmm, my mistake [12:40:07] going to fix that [12:40:42] (03PS1) 10Arturo Borrero Gonzalez: labtestn: neutron: refresh hiera settings for interface names [puppet] - 10https://gerrit.wikimedia.org/r/485800 (https://phabricator.wikimedia.org/T214299) [12:41:28] (03PS3) 10Urbanecm: Use new logos in IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485495 (https://phabricator.wikimedia.org/T150618) [12:41:34] zeljkof, now it should pass [12:41:45] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] labtestn: neutron: refresh hiera settings for interface names [puppet] - 10https://gerrit.wikimedia.org/r/485800 (https://phabricator.wikimedia.org/T214299) (owner: 10Arturo Borrero Gonzalez) [12:42:13] PROBLEM - DPKG on db2042 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:42:27] ^ that's me [12:42:39] Urbanecm: ok, I see the fix, missing png :) [12:42:45] yeah :) [12:42:54] 10Operations, 10Cloud-Services, 10cloud-services-team (Kanban): Update OpenStack images for jessie/stretch for CVE-2019-3462 - https://phabricator.wikimedia.org/T214369 (10GTirloni) a:05aborrero→03GTirloni [12:43:03] PROBLEM - DPKG on puppetdb2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:43:18] (03CR) 10jenkins-bot: Upload HD logos for several projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485494 
(https://phabricator.wikimedia.org/T150618) (owner: 10Urbanecm) [12:43:25] PROBLEM - DPKG on elastic2027 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:43:27] RECOVERY - DPKG on db2042 is OK: All packages OK [12:43:45] btw, zeljkof, can you abandon all changes here https://gerrit.wikimedia.org/r/q/project:operations%252Fmediawiki-config+owner:gzt11111%2540gmail.com+status:open ? A GCI student was trying to accomplish this same task, but as there were a lot of mistakes, I uploaded those two patch you just deployed instead of amending theirs. Those patch won't be useful. Thanks! [12:43:51] PROBLEM - DPKG on elastic2043 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:44:15] Urbanecm: sure, but why can't you abandon them? [12:44:19] RECOVERY - DPKG on puppetdb2001 is OK: All packages OK [12:44:37] Urbanecm: please just leave a comment which patch replaces the one that's abandoned [12:44:41] RECOVERY - DPKG on elastic2027 is OK: All packages OK [12:44:42] for reference [12:45:01] because I don't have +2 rights on operations/mediawiki-config [12:45:07] RECOVERY - DPKG on elastic2043 is OK: All packages OK [12:45:19] ah, it makes sense, can't abandon other people's changes [12:45:20] only uploaders and those with +2 access can abandon a change [12:45:26] I'll see if I can do it [12:45:44] (03CR) 10Zfilipin: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485495 (https://phabricator.wikimedia.org/T150618) (owner: 10Urbanecm) [12:46:04] btw, I've already added a -1 review with a link to the patch that is the replacement [12:46:14] ah, cool, abandoning [12:46:15] PROBLEM - Host cloudnet2002-dev is DOWN: PING CRITICAL - Packet loss = 100% [12:46:47] (03Merged) 10jenkins-bot: Use new logos in IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485495 (https://phabricator.wikimedia.org/T150618) (owner: 10Urbanecm) [12:46:58] (03Abandoned) 10Zfilipin: Upload HD logos for several projects [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/478498 (owner: 10Robingan7) [12:47:18] (03Abandoned) 10Zfilipin: Use uploaded logos in InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478570 (owner: 10Robingan7) [12:47:35] Urbanecm: abandoned [12:47:42] I see, thanks zeljkof [12:47:47] RECOVERY - Host cloudnet2002-dev is UP: PING OK - Packet loss = 0%, RTA = 36.37 ms [12:48:34] Urbanecm: 485495 is at mwdebug [12:48:41] PROBLEM - DPKG on an-worker1081 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:48:47] PROBLEM - DPKG on an-worker1082 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:49:03] PROBLEM - DPKG on analytics1070 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:49:48] 10Operations, 10ops-codfw, 10cloud-services-team (Kanban): cloudnet2002-dev: ACPI error - https://phabricator.wikimedia.org/T214322 (10aborrero) 05Open→03Resolved >>! In T214322#4897072, @fgiunchedi wrote: > This is known/expected, it is due to the `acpi_power_meter` kernel module which we are blacklisti... [12:50:11] PROBLEM - DPKG on db1106 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:50:20] ^ all dpkg issues are on me, fixing them up [12:50:21] (03PS5) 10Gehel: maps: migrate maps1002 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/485584 (https://phabricator.wikimedia.org/T198622) (owner: 10Mathew.onipe) [12:50:31] PROBLEM - Check systemd state on cloudnet2002-dev is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:50:47] hey all, sorry for my absence, just noticed that my IRC client crashed :/ [12:51:03] Urbanecm: still around [12:51:04] ? 
[12:51:07] yes [12:51:12] Urbanecm: 485495 is at mwdebug [12:51:15] zeljkof, I assume SWAT window is over [12:51:16] k [12:51:23] RECOVERY - DPKG on db1106 is OK: All packages OK [12:51:32] (03PS10) 10Giuseppe Lavagetto: profile::services_proxy: simple local proxying for remote services [puppet] - 10https://gerrit.wikimedia.org/r/483788 (https://phabricator.wikimedia.org/T210717) [12:51:34] (03PS11) 10Giuseppe Lavagetto: mediawiki::common: add proxy for services [puppet] - 10https://gerrit.wikimedia.org/r/483789 (https://phabricator.wikimedia.org/T210717) [12:51:37] raynor: not over, but Amir1 is next with two patches... so there isn't much time left [12:51:43] zeljkof, working, please deploy [12:51:50] raynor: feel free to extend the window after Amir1 if there's nothing else then [12:51:54] Urbanecm: ok [12:52:10] sure, no problem. I assumed there will be no time for my patch, by I added it anyway, just in case there is a free time [12:52:23] RECOVERY - DPKG on an-worker1081 is OK: All packages OK [12:52:33] sure, if I can deploy my patch fter Amir1 is done, they it makes me super happy [12:52:51] After SWAT is the sanity break but the train is going in US time I think [12:52:52] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:485495|Use new logos in IS.php (T150618)]] (duration: 00m 47s) [12:52:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:55] T150618: Provide HD logos for all projects - https://phabricator.wikimedia.org/T150618 [12:53:27] Urbanecm: 485495 deployed! please test and thanks for deploying with #releng ;) [12:53:34] thanks [12:53:43] RECOVERY - DPKG on an-worker1082 is OK: All packages OK [12:53:51] Amir1, raynor: swat is yours! 
correct, train is US time, to it should be fine to extend swat [12:53:56] (03PS2) 10Alexandros Kosiaris: mathoid: Remove the logging sidecar container [deployment-charts] - 10https://gerrit.wikimedia.org/r/483385 [12:53:58] (03PS1) 10Alexandros Kosiaris: blubberoid/zotero: Remove the logging sidecar container [deployment-charts] - 10https://gerrit.wikimedia.org/r/485801 (https://phabricator.wikimedia.org/T207200) [12:54:26] Thanks. I have three patches, the third one is security [12:54:41] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] mathoid: Remove the logging sidecar container [deployment-charts] - 10https://gerrit.wikimedia.org/r/483385 (owner: 10Alexandros Kosiaris) [12:54:46] Amir1: want to let raynor go first? he has one? [12:54:52] I can wait [12:54:54] Amir1, please go [12:54:55] but I'll let you self-organize :) [12:54:56] I'll wait [12:55:16] I need to punish myself for not spotting that my hexchat is dead [12:55:23] (hexchat -> irc client) [12:55:41] raynor: can you deploy yourself? [12:55:44] :) [12:55:50] lol, sure, I can [12:55:56] please go first [12:56:12] (03PS6) 10Gehel: maps: migrate maps1002 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/485584 (https://phabricator.wikimedia.org/T198622) (owner: 10Mathew.onipe) [12:56:24] akosiaris: hey, I can't login to logstash, Have I been added back to ldap groups? [12:56:40] ok, on it [12:56:45] (03CR) 10jenkins-bot: Use new logos in IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485495 (https://phabricator.wikimedia.org/T150618) (owner: 10Urbanecm) [12:56:53] (03PS3) 10Pmiazga: Enable page issues improvements on English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485787 (https://phabricator.wikimedia.org/T210554) [12:57:11] (03CR) 10Pmiazga: [C: 03+2] Enable page issues improvements on English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485787 (https://phabricator.wikimedia.org/T210554) (owner: 10Pmiazga) [12:57:13] any SRE around? 
[12:57:36] Amir1: L( [12:57:38] :( [12:57:50] (03CR) 10Gehel: [C: 03+2] maps: migrate maps1002 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/485584 (https://phabricator.wikimedia.org/T198622) (owner: 10Mathew.onipe) [12:58:07] At middle of SWAT :( [12:58:55] (03Merged) 10jenkins-bot: Enable page issues improvements on English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485787 (https://phabricator.wikimedia.org/T210554) (owner: 10Pmiazga) [12:58:57] RECOVERY - DPKG on analytics1070 is OK: All packages OK [12:59:10] Amir1: probably not [12:59:15] Amir1: lemme have a look [12:59:53] testing my change on mwdebug1002 [13:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190122T1300) [13:01:35] PROBLEM - DPKG on notebook1004 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [13:01:42] Amir1: try again [13:01:45] PROBLEM - DPKG on analytics1032 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [13:01:47] PROBLEM - DPKG on snapshot1008 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [13:01:51] PROBLEM - DPKG on matomo1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [13:02:09] LDAP access should be fixed now [13:02:11] PROBLEM - DPKG on neon is CRITICAL: DPKG CRITICAL dpkg reports broken packages [13:06:47] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] "Tested locally, seems to have the desired outcome, merging" [deployment-charts] - 10https://gerrit.wikimedia.org/r/485801 (https://phabricator.wikimedia.org/T207200) (owner: 10Alexandros Kosiaris) [13:06:49] 10Operations, 10Maps, 10Reading-Infrastructure-Team-Backlog, 10Patch-For-Review: migrate maps servers to stretch with the current style - https://phabricator.wikimedia.org/T198622 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on cumin1001.eqiad.wmnet for hosts: ` ['maps1002.eqiad.wmn... [13:06:58] merging [13:07:04] Amir1 - are you ready? 
[13:07:12] yup [13:07:17] it's merging [13:07:26] sorry - syncing* [13:07:40] !log pmiazga@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:485787|Enable page issues improvements on English Wikipedia ([T210554])]] (duration: 00m 46s) [13:07:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:43] T210554: Deploy page issues to enwiki and all remaining projects - https://phabricator.wikimedia.org/T210554 [13:07:48] and done - Amir1 thank you so much [13:07:49] akosiaris: yes, thanks [13:07:52] please proceed with your patches [13:07:55] RECOVERY - DPKG on analytics1032 is OK: All packages OK [13:09:32] (03CR) 10jenkins-bot: Enable page issues improvements on English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485787 (https://phabricator.wikimedia.org/T210554) (owner: 10Pmiazga) [13:10:09] Taking over SWAT [13:10:11] RECOVERY - DPKG on notebook1004 is OK: All packages OK [13:10:29] RECOVERY - DPKG on matomo1001 is OK: All packages OK [13:10:42] (03PS5) 10Ladsgroup: Add new synonyms for namespaces in Persian (fa) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484256 (https://phabricator.wikimedia.org/T213733) (owner: 10Huji) [13:10:47] RECOVERY - DPKG on neon is OK: All packages OK [13:10:50] (03CR) 10Ladsgroup: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484256 (https://phabricator.wikimedia.org/T213733) (owner: 10Huji) [13:11:37] RECOVERY - DPKG on snapshot1008 is OK: All packages OK [13:11:54] (03Merged) 10jenkins-bot: Add new synonyms for namespaces in Persian (fa) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484256 (https://phabricator.wikimedia.org/T213733) (owner: 10Huji) [13:13:34] !log installing apt security updates for trusty [13:13:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:32] Testing 484256 on mwdebug1002 [13:16:54] works fine [13:16:57] syncing [13:18:45] !log ladsgroup@deploy1001 
Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:484256|Add new synonyms for namespaces in Persian (fa) (T213733)]] (duration: 00m 47s) [13:18:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:48] T213733: Add new synonyms for namespaces in Persian (fa) - https://phabricator.wikimedia.org/T213733 [13:19:33] !log ladsgroup@mwmaint1002:~$ mwscript namespaceDupes.php fawiki --fix (T213733) [13:19:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:29] (03CR) 10jenkins-bot: Add new synonyms for namespaces in Persian (fa) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484256 (https://phabricator.wikimedia.org/T213733) (owner: 10Huji) [13:22:31] (03CR) 10Ladsgroup: [C: 03+2] Add 'yue' to langlist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485015 (https://phabricator.wikimedia.org/T211530) (owner: 10Ladsgroup) [13:22:49] (03PS11) 10Giuseppe Lavagetto: profile::services_proxy: simple local proxying for remote services [puppet] - 10https://gerrit.wikimedia.org/r/483788 (https://phabricator.wikimedia.org/T210717) [13:22:51] (03PS12) 10Giuseppe Lavagetto: mediawiki::common: add proxy for services [puppet] - 10https://gerrit.wikimedia.org/r/483789 (https://phabricator.wikimedia.org/T210717) [13:23:38] (03Merged) 10jenkins-bot: Add 'yue' to langlist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485015 (https://phabricator.wikimedia.org/T211530) (owner: 10Ladsgroup) [13:24:50] Testing 485015 on mwdebug1002 [13:25:37] RECOVERY - MariaDB Slave Lag: s4 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 262.12 seconds [13:25:39] (03PS1) 10Alexandros Kosiaris: Remove chartid from deployments/services [deployment-charts] - 10https://gerrit.wikimedia.org/r/485806 [13:26:50] !log installing apt security updates for jessie [13:26:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:11] !log ladsgroup@deploy1001 Synchronized langlist: SWAT: 
[[gerrit:485015|Add yue to langlist (T211530)]] (duration: 00m 46s) [13:29:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:14] T211530: Cannot add yue.wt sitelinks onto Wikidata items - https://phabricator.wikimedia.org/T211530 [13:29:32] Amir1: is https://phabricator.wikimedia.org/T211530 done now then? [13:29:46] (03PS6) 10Marostegui: mariadb: Provision dbstore1004 [puppet] - 10https://gerrit.wikimedia.org/r/485758 (https://phabricator.wikimedia.org/T210478) [13:29:53] addshore: I need to rebuild the sites table everywhere [13:29:59] ack [13:30:00] that might take hours to finish [13:30:16] !log EU SWAT is finished [13:30:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:02] (03PS1) 10Marostegui: db-eqiad.php: Depool db1088 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485808 [13:33:22] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1088 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485808 (owner: 10Marostegui) [13:34:26] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1088 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485808 (owner: 10Marostegui) [13:35:02] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1088" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485809 [13:35:21] !log running extensions/Wikibase/lib/maintenance/populateSitesTable.php on all.dblist (T211530 ) [13:35:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:24] T211530: Cannot add yue.wt sitelinks onto Wikidata items - https://phabricator.wikimedia.org/T211530 [13:35:41] (03CR) 10jenkins-bot: Add 'yue' to langlist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485015 (https://phabricator.wikimedia.org/T211530) (owner: 10Ladsgroup) [13:35:43] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1088 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485808 (owner: 10Marostegui) [13:35:45] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] 
"Tested locally, worked fine" [deployment-charts] - 10https://gerrit.wikimedia.org/r/485806 (owner: 10Alexandros Kosiaris) [13:36:22] (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Depool db1088" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485809 (owner: 10Marostegui) [13:37:28] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1088" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485809 (owner: 10Marostegui) [13:40:12] (03PS1) 10Thcipriani: Remove reviewers-by-blame from deployment [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/485810 [13:41:10] !log Stop replication in sync on dbstore1001:3316 and db1098:3316 [13:41:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:41] 10Operations, 10Maps, 10Reading-Infrastructure-Team-Backlog, 10Patch-For-Review: migrate maps servers to stretch with the current style - https://phabricator.wikimedia.org/T198622 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['maps1002.eqiad.wmnet'] ` and were **ALL** successful. 
[13:47:51] (03PS1) 10Filippo Giunchedi: logstash: set consumer_threads for kafka input [puppet] - 10https://gerrit.wikimedia.org/r/485812 (https://phabricator.wikimedia.org/T214309) [13:48:47] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1088" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485809 (owner: 10Marostegui) [13:51:25] (03PS2) 10Filippo Giunchedi: logstash: set consumer_threads for kafka input [puppet] - 10https://gerrit.wikimedia.org/r/485812 (https://phabricator.wikimedia.org/T214309) [13:52:21] (03CR) 10Filippo Giunchedi: [C: 03+2] logstash: set consumer_threads for kafka input [puppet] - 10https://gerrit.wikimedia.org/r/485812 (https://phabricator.wikimedia.org/T214309) (owner: 10Filippo Giunchedi) [13:53:28] (03CR) 10Filippo Giunchedi: [C: 03+2] "PCC https://puppet-compiler.wmflabs.org/compiler1002/14423/" [puppet] - 10https://gerrit.wikimedia.org/r/485812 (https://phabricator.wikimedia.org/T214309) (owner: 10Filippo Giunchedi) [13:53:40] (03PS3) 10Filippo Giunchedi: logstash: set consumer_threads for kafka input [puppet] - 10https://gerrit.wikimedia.org/r/485812 (https://phabricator.wikimedia.org/T214309) [13:55:09] !log bump logstash kafka consumer threads - T214309 [13:55:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:12] T214309: logstash / elasticsearch indexing lag - https://phabricator.wikimedia.org/T214309 [13:55:48] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1098:3316" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485813 [13:56:56] (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Depool db1098:3316" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485813 (owner: 10Marostegui) [13:57:57] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1098:3316" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485813 (owner: 10Marostegui) [13:59:01] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1098:3316 T210713 (duration: 00m 45s) 
[13:59:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:04] T210713: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713 [14:00:04] Deploy window MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190122T1400) [14:01:48] (03PS2) 10Hashar: Remove reviewers-by-blame from deployment [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/485810 (https://phabricator.wikimedia.org/T101131) (owner: 10Thcipriani) [14:02:06] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1098:3316" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485813 (owner: 10Marostegui) [14:03:05] (03CR) 10Hashar: [V: 03+2 C: 03+2] "Attached to T101131 and we would want to remove the submodule in wmf/stable-2.15." [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/485810 (https://phabricator.wikimedia.org/T101131) (owner: 10Thcipriani) [14:04:12] !log akosiaris@deploy1001 scap-helm blubberoid upgrade -f blubberoid-values.yaml production stable/blubberoid [namespace: blubberoid, clusters: eqiad,codfw] [14:04:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:46] !log akosiaris@deploy1001 scap-helm blubberoid install -n production -f blubberoid-values.yaml stable/blubberoid [namespace: blubberoid, clusters: eqiad,codfw] [14:04:47] !log akosiaris@deploy1001 scap-helm blubberoid cluster eqiad completed [14:04:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:48] !log akosiaris@deploy1001 scap-helm blubberoid cluster codfw completed [14:04:48] !log akosiaris@deploy1001 scap-helm blubberoid finished [14:04:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[14:06:39] (03CR) 10Marostegui: [C: 03+2] mariadb: Provision dbstore1004 [puppet] - 10https://gerrit.wikimedia.org/r/485758 (https://phabricator.wikimedia.org/T210478) (owner: 10Marostegui) [14:06:47] (03PS7) 10Marostegui: mariadb: Provision dbstore1004 [puppet] - 10https://gerrit.wikimedia.org/r/485758 (https://phabricator.wikimedia.org/T210478) [14:08:26] (03PS1) 10Alexandros Kosiaris: Introduce blubberoid.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/485814 (https://phabricator.wikimedia.org/T212251) [14:10:38] !log akosiaris@deploy1001 scap-helm blubberoid install -n staging -f blubberoid-values.yaml stable/blubberoid [namespace: blubberoid, clusters: staging] [14:10:39] !log akosiaris@deploy1001 scap-helm blubberoid cluster staging completed [14:10:39] !log akosiaris@deploy1001 scap-helm blubberoid finished [14:10:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:48] (03PS1) 10Marostegui: mariadb: Fix prometheus config dbstore1004 [puppet] - 10https://gerrit.wikimedia.org/r/485815 (https://phabricator.wikimedia.org/T210478) [14:14:14] !log akosiaris@deploy1001 scap-helm mathoid upgrade -f mathoid-values.yaml staging stable/mathoid [namespace: mathoid, clusters: staging] [14:14:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:07] !log akosiaris@deploy1001 scap-helm mathoid install -n staging -f mathoid-values.yaml --version=0.0.12 stable/mathoid [namespace: mathoid, clusters: staging] [14:15:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:32] !log akosiaris@deploy1001 scap-helm mathoid install -n staging -f mathoid-values.yaml --version=0.0.12 stable/mathoid [namespace: mathoid, clusters: staging] [14:15:33] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:34] !log akosiaris@deploy1001 scap-helm mathoid cluster staging completed [14:15:34] !log akosiaris@deploy1001 scap-helm mathoid finished [14:15:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:24] !log akosiaris@deploy1001 scap-helm mathoid upgrade -f mathoid-values.yaml --set resources.replicas=1 staging stable/mathoid [namespace: mathoid, clusters: staging] [14:17:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:31] !log akosiaris@deploy1001 scap-helm mathoid cluster staging completed [14:17:31] !log akosiaris@deploy1001 scap-helm mathoid finished [14:17:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:49] !log upgrade blubberoid to the latest chart version (0.0.5) [14:17:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:58] !log upgrade mathoid to the latest chart version (0.0.15) [14:17:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:22] (03CR) 10Elukey: [C: 03+1] mariadb: Fix prometheus config dbstore1004 [puppet] - 10https://gerrit.wikimedia.org/r/485815 (https://phabricator.wikimedia.org/T210478) (owner: 10Marostegui) [14:18:41] (03CR) 10Marostegui: [C: 03+2] mariadb: Fix prometheus config dbstore1004 [puppet] - 10https://gerrit.wikimedia.org/r/485815 (https://phabricator.wikimedia.org/T210478) (owner: 10Marostegui) [14:18:51] !log akosiaris@deploy1001 scap-helm mathoid upgrade -f mathoid-values.yaml production stable/mathoid [namespace: mathoid, clusters: eqiad,codfw] [14:18:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:52] !log akosiaris@deploy1001 scap-helm mathoid cluster eqiad completed [14:18:54] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:54] !log akosiaris@deploy1001 scap-helm mathoid cluster codfw completed [14:18:54] !log akosiaris@deploy1001 scap-helm mathoid finished [14:18:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:18] PROBLEM - nova instance creation test on cloudcontrol1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [14:26:56] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: /api (bad URL) timed out before a response was received [14:27:30] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [14:27:30] ACKNOWLEDGEMENT - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. andrew bogott Looking... [14:27:31] ACKNOWLEDGEMENT - nova instance creation test on cloudcontrol1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack andrew bogott Looking... 
[14:28:00] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy [14:29:46] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (bad URL) timed out before a response was received [14:31:54] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: /api (bad URL) timed out before a response was received [14:32:02] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [14:33:04] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [14:33:23] RECOVERY - nova instance creation test on cloudcontrol1003 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [14:34:35] !log monkey patch kartotherian configuration to re-add proxy on maps100[34] - T214350 [14:34:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:38] T214350: mapframe dynamic maps (maplink) don't always work - https://phabricator.wikimedia.org/T214350 [14:35:56] (03CR) 10Hashar: [C: 04-1] "Watch out, some parameters have been made optional when they are in fact required." (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/485094 (owner: 10Dzahn) [14:37:44] 10Operations, 10Analytics, 10Research, 10Article-Recommendation, 10User-Marostegui: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (10bmansurov) [14:38:51] 10Operations, 10Analytics, 10Research, 10Services, and 2 others: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (10bmansurov) [14:42:06] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [14:42:29] 10Operations, 10Analytics, 10Research, 10Services, and 2 others: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (10bmansurov) Thanks everyone for the discussion. I've added a summary to the task description. 
@Pchelolo @Marostegui @Dzahn @Nuria @Ottom... [14:42:33] 10Operations, 10Cloud-Services, 10cloud-services-team (Kanban): Update OpenStack images for jessie/stretch for CVE-2019-3462 - https://phabricator.wikimedia.org/T214369 (10GTirloni) 05Open→03Resolved [14:43:15] 10Operations, 10Analytics, 10Research, 10Services, and 2 others: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (10bmansurov) [14:45:14] (03PS1) 10Alexandros Kosiaris: Introduce blubberoid.wikimedia.org in varnish [puppet] - 10https://gerrit.wikimedia.org/r/485823 (https://phabricator.wikimedia.org/T212251) [14:45:48] !log starting init of postgres replication on maps1002 - T198622 [14:45:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:51] T198622: migrate maps servers to stretch with the current style - https://phabricator.wikimedia.org/T198622 [14:54:25] !log thcipriani@deploy1001 Started deploy [gerrit/gerrit@6cdece9]: [[gerrit:485810|Remove reviewers-by-blame from deployment]] gerrit2001 no restart required [14:54:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:36] !log thcipriani@deploy1001 Finished deploy [gerrit/gerrit@6cdece9]: [[gerrit:485810|Remove reviewers-by-blame from deployment]] gerrit2001 no restart required (duration: 00m 10s) [14:54:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:09] !log thcipriani@deploy1001 Started deploy [gerrit/gerrit@6cdece9]: [[gerrit:485810|Remove reviewers-by-blame from deployment]] cobalt no restart required [14:56:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:14] !log anomie@deploy1001 Synchronized php-1.33.0-wmf.13/includes/page/WikiPage.php: Add more temporary logging for T210739 (duration: 00m 47s) [14:56:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:20] !log thcipriani@deploy1001 Finished deploy 
[gerrit/gerrit@6cdece9]: [[gerrit:485810|Remove reviewers-by-blame from deployment]] cobalt no restart required (duration: 00m 11s) [14:56:21] T210739: Target deletion during page move fails - https://phabricator.wikimedia.org/T210739 [14:56:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:48] PROBLEM - MariaDB Slave Lag: s4 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 616.88 seconds [15:02:10] (03PS1) 10Alexandros Kosiaris: Remove statsd-proxy image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/485831 [15:02:12] (03PS1) 10Alexandros Kosiaris: Remove fluentd images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/485832 [15:02:15] (03PS1) 10Filippo Giunchedi: rsyslog: enable auto partitions when producing to kafka [puppet] - 10https://gerrit.wikimedia.org/r/485833 (https://phabricator.wikimedia.org/T214309) [15:02:20] PROBLEM - MariaDB Slave Lag: s8 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1036.11 seconds [15:03:09] (03CR) 10Ottomata: [C: 03+1] "Huh!" 
[puppet] - 10https://gerrit.wikimedia.org/r/485833 (https://phabricator.wikimedia.org/T214309) (owner: 10Filippo Giunchedi) [15:03:14] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Remove statsd-proxy image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/485831 (owner: 10Alexandros Kosiaris) [15:03:28] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Remove fluentd images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/485832 (owner: 10Alexandros Kosiaris) [15:03:55] 10Operations, 10Wikimedia-Logstash, 10User-fgiunchedi, 10User-herron: Increase utilization of application logging pipeline (FY2018-2019 Q3 TEC6) - https://phabricator.wikimedia.org/T213157 (10herron) [15:05:23] (03CR) 10Filippo Giunchedi: [C: 03+1] "PCC https://puppet-compiler.wmflabs.org/compiler1002/14424/" [puppet] - 10https://gerrit.wikimedia.org/r/485833 (https://phabricator.wikimedia.org/T214309) (owner: 10Filippo Giunchedi) [15:05:42] !log akosiaris@puppetmaster1001 conftool action : set/pooled=false; selector: name=codfw,dnsdisc=zotero [15:05:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:27] (03CR) 10Elukey: [C: 03+1] rsyslog: enable auto partitions when producing to kafka [puppet] - 10https://gerrit.wikimedia.org/r/485833 (https://phabricator.wikimedia.org/T214309) (owner: 10Filippo Giunchedi) [15:08:04] (03CR) 10Filippo Giunchedi: [C: 03+2] rsyslog: enable auto partitions when producing to kafka [puppet] - 10https://gerrit.wikimedia.org/r/485833 (https://phabricator.wikimedia.org/T214309) (owner: 10Filippo Giunchedi) [15:08:10] (03PS2) 10Filippo Giunchedi: rsyslog: enable auto partitions when producing to kafka [puppet] - 10https://gerrit.wikimedia.org/r/485833 (https://phabricator.wikimedia.org/T214309) [15:08:20] (03PS1) 10Andrew Bogott: nova scheduler: depool all Jessie cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/485835 [15:08:51] !log akosiaris@deploy1001 scap-helm zotero install -n production -f 
zotero-values-codfw.yaml stable/zotero [namespace: zotero, clusters: codfw] [15:08:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:53] !log akosiaris@deploy1001 scap-helm zotero cluster codfw completed [15:08:53] !log akosiaris@deploy1001 scap-helm zotero finished [15:08:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:04] 10Operations, 10Core Platform Team, 10Performance-Team, 10serviceops, and 4 others: Set up a beta feature offering the use of PHP7 - https://phabricator.wikimedia.org/T213934 (10CCicalese_WMF) [15:10:27] 10Operations, 10Performance-Team, 10serviceops, 10Core Platform Team (PHP7 (TEC4)), and 4 others: Set up a beta feature offering the use of PHP7 - https://phabricator.wikimedia.org/T213934 (10CCicalese_WMF) [15:11:05] PROBLEM - LVS HTTP IPv4 on zotero.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.16 and port 1969: Connection refused [15:11:06] !log akosiaris@puppetmaster1001 conftool action : set/pooled=true; selector: name=codfw,dnsdisc=zotero [15:11:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:20] !log akosiaris@puppetmaster1001 conftool action : set/pooled=false; selector: name=codfw,dnsdisc=zotero [15:11:22] !log mbsantos@deploy1001 Started deploy [kartotherian/deploy@bb30697] (stretch): monkey patching geoshapes service for maps100[3-4] [15:11:26] dammit [15:11:31] :-) [15:11:34] * akosiaris looking. 
[15:11:40] this is self inflicted btw [15:11:44] nothing to do with the software [15:11:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:58] PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - CRITICAL - zotero_1969: Servers kubernetes2002.codfw.wmnet, kubernetes2003.codfw.wmnet are marked down but pooled [15:11:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:40] PROBLEM - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - CRITICAL - zotero_1969: Servers kubernetes2004.codfw.wmnet, kubernetes2001.codfw.wmnet are marked down but pooled [15:13:07] !log mbsantos@deploy1001 Finished deploy [kartotherian/deploy@bb30697] (stretch): monkey patching geoshapes service for maps100[3-4] (duration: 01m 45s) [15:13:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:34] PROBLEM - puppet last run on mw1266 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:13:43] 10Operations, 10Performance-Team, 10serviceops, 10Core Platform Team (PHP7 (TEC4)), and 4 others: Set up a beta feature offering the use of PHP7 - https://phabricator.wikimedia.org/T213934 (10Zppix) Is the end goal to switch support to php7 by default? If so, what is blocking that from being set as default... 
[15:14:50] !log Add dbstore1003:3317 to tendril - T210478 [15:14:52] !log turn on partitions.auto for rsyslog output to kafka - T214309 [15:14:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:53] T210478: Migrate dbstore1002 to a multi instance setup on dbstore100[3-5] - https://phabricator.wikimedia.org/T210478 [15:14:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:56] T214309: logstash / elasticsearch indexing lag - https://phabricator.wikimedia.org/T214309 [15:18:22] 10Operations, 10Performance-Team, 10serviceops, 10Core Platform Team (PHP7 (TEC4)), and 4 others: Set up a beta feature offering the use of PHP7 - https://phabricator.wikimedia.org/T213934 (10Joe) >>! In T213934#4899142, @Zppix wrote: > Is the end goal to switch support to php7 by default? If so, what is b... [15:18:45] (03PS1) 10Alexandros Kosiaris: zotero: Remove chartid from service as well [deployment-charts] - 10https://gerrit.wikimedia.org/r/485839 [15:18:46] ok, found. fixing [15:18:46] jynus, marostegui: Is there a way to check if a DELETE was executed and then rolled back? I'm guessing "no", but if I'm wrong it would be helpful for something I'm trying to debug. Example: fawiki, deletion from `page` of the row with page_id=946128, at 2019-01-22 15:14:50 UTC (give or take a few seconds, maybe). [15:19:06] (03PS6) 10Andrew Bogott: cloud: rewrite spreadcheck.py NPRE check [puppet] - 10https://gerrit.wikimedia.org/r/483606 (owner: 10BryanDavis) [15:19:24] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] zotero: Remove chartid from service as well [deployment-charts] - 10https://gerrit.wikimedia.org/r/485839 (owner: 10Alexandros Kosiaris) [15:22:07] Reedy: was addwiki run for https://phabricator.wikimedia.org/T205546 ? 
[15:22:08] (03CR) 10Andrew Bogott: [C: 03+2] cloud: rewrite spreadcheck.py NPRE check [puppet] - 10https://gerrit.wikimedia.org/r/483606 (owner: 10BryanDavis) [15:22:39] <_joe_> akosiaris: oh wow the service wouldn't select the pods, right? [15:23:01] yeah, my mistake [15:24:00] RECOVERY - puppet last run on mw1266 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [15:24:45] jouncebot: now [15:24:45] For the next 0 hour(s) and 35 minute(s): MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190122T1400) [15:27:26] PROBLEM - puppet last run on cloudcontrol1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/etc/spreadcheck-tools.yaml] [15:27:57] (03PS1) 10Andrew Bogott: spreadcheck: fix path to spreadcheck-tools.yaml [puppet] - 10https://gerrit.wikimedia.org/r/485840 [15:28:14] <_joe_> akosiaris: we need to have a way to verify such things [15:28:29] <_joe_> not today, but we need to start thinking about that [15:29:05] (03CR) 10Andrew Bogott: [C: 03+2] spreadcheck: fix path to spreadcheck-tools.yaml [puppet] - 10https://gerrit.wikimedia.org/r/485840 (owner: 10Andrew Bogott) [15:29:42] !log addshore@mwmaint1002:~$ mwscript extensions/Cognate/maintenance/populateCognateSites.php --wiki yuewiktionary --site-group wiktionary [15:29:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:10] bah, let me tag the task... 
[15:30:10] !log akosiaris@deploy1001 scap-helm zotero upgrade -f zotero-values-codfw.yaml production stable/zotero [namespace: zotero, clusters: codfw] [15:30:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:11] !log akosiaris@deploy1001 scap-helm zotero cluster codfw completed [15:30:11] !log akosiaris@deploy1001 scap-helm zotero finished [15:30:12] PROBLEM - Tool Labs instance distribution on cloudcontrol1003 is CRITICAL: CRITICAL: worker class instances not spread out enough [15:30:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:14] !log addshore@mwmaint1002:~$ mwscript extensions/Cognate/maintenance/populateCognateSites.php --wiki yuewiktionary --site-group wiktionary // T214400 [15:30:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:20] T214400: Add yue.wikt to Cognate - https://phabricator.wikimedia.org/T214400 [15:30:48] RECOVERY - LVS HTTP IPv4 on zotero.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 138 bytes in 0.159 second response time [15:30:59] fixed [15:31:18] !log akosiaris@puppetmaster1001 conftool action : set/pooled=true; selector: name=codfw,dnsdisc=zotero [15:31:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:31] thankfully no impact to users [15:32:30] RECOVERY - PyBal backends health check on lvs2006 is OK: PYBAL OK - All pools are healthy [15:32:33] (03CR) 10Ottomata: "NICE!" 
[puppet] - 10https://gerrit.wikimedia.org/r/485640 (https://phabricator.wikimedia.org/T212949) (owner: 10Elukey) [15:32:40] RECOVERY - puppet last run on cloudcontrol1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:32:41] !log akosiaris@puppetmaster1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=zotero [15:32:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:06] RECOVERY - PyBal backends health check on lvs2003 is OK: PYBAL OK - All pools are healthy [15:34:23] !log addshore@mwmaint1002:~$ mwscript extensions/Cognate/maintenance/populateCognatePages.php --wiki yuewiktionary // T214400 (1 row) [15:34:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:16] PROBLEM - puppet last run on cloudcontrol1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/etc/spreadcheck-tools.yaml] [15:39:25] (03PS1) 10BryanDavis: Fix typo in 172.16.0.0/12 block [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485843 [15:41:11] (03PS2) 10Andrew Bogott: nova scheduler: depool all Jessie cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/485835 [15:42:00] (03CR) 10Andrew Bogott: [C: 03+2] nova scheduler: depool all Jessie cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/485835 (owner: 10Andrew Bogott) [15:42:25] !log akosiaris@deploy1001 scap-helm zotero upgrade -f zotero-values-eqiad.yaml production stable/zotero [namespace: zotero, clusters: eqiad] [15:42:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:00] !log akosiaris@deploy1001 scap-helm zotero install -f zotero-values-eqiad.yaml -n production stable/zotero [namespace: zotero, clusters: eqiad] [15:43:01] !log akosiaris@deploy1001 scap-helm zotero cluster eqiad completed [15:43:01] !log akosiaris@deploy1001 scap-helm zotero finished [15:43:01] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:54] !log akosiaris@puppetmaster1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=zotero [15:44:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:02] (03CR) 10Zppix: [C: 03+1] "LGTM, thanks for catching this typo!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485843 (owner: 10BryanDavis) [15:45:04] !log upgrade zotero to latest chart version [15:45:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:42] RECOVERY - puppet last run on cloudcontrol1004 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [15:45:56] (03CR) 10Alexandros Kosiaris: [V: 03+2] Remove statsd-proxy image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/485831 (owner: 10Alexandros Kosiaris) [15:46:06] (03CR) 10Alexandros Kosiaris: [V: 03+2] Remove fluentd images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/485832 (owner: 10Alexandros Kosiaris) [15:46:40] PROBLEM - nova-compute proc minimum on cloudvirt1021 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute [15:47:40] RECOVERY - nova-compute proc minimum on cloudvirt1021 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute [15:50:48] addshore: Eh? [15:52:03] Reedy: Did you make it? :) I noticed the site was missing from the cognate table, so cognate hasn't been running there, populateCognateSites is part of addWiki though :/ [15:52:22] addwiki.php is notoriously icky [15:52:26] Reedy: indeed [15:52:45] Amir1: actually, what you were fixing today was adding it to the sites table right? 
[15:53:03] Which is another thing that is often broken [15:53:11] And no one in Wikidata seems to care when I file issues about it :) [15:53:11] maybe I should just add the cognate bit to https://wikitech.wikimedia.org/wiki/Add_a_wiki#Post-install instead... [15:53:30] I'm still waiting for someone to look at that bit for wikidata [15:53:38] Reedy: which ticket? [15:53:55] Not sure offhand [15:53:56] addWiki.php is notorious for getting broken in every possible way [15:54:12] https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/WikimediaMaintenance/+/339144/ [15:54:25] if the sites table population fails, then the cognate one will be empty too, so perhaps that is what happened [15:54:28] https://phabricator.wikimedia.org/T171013#4146170 [15:54:33] https://phabricator.wikimedia.org/T158751 [15:55:26] I do wonder if we should have it like a two stage thing [15:55:30] Create the wiki [15:55:36] addshore: it was there, it rebuilt it [15:55:38] Run a second script passing the correct --wiki [15:55:49] because it was mentioned as a "special" site not a wiktionary [15:56:15] anomie: we can check changes that were committed [15:56:16] Amir1: does that mean it was in some "special" site group instead of wiktionary? [15:56:36] if they were never committed, they will not persist anywhere [15:56:44] addshore: yup [15:56:45] Amir1: if so that also explains why populateCognateSites would not have worked :P [15:56:51] Amir1: ack! [16:03:37] 10Operations, 10netops, 10Performance-Team (Radar): Stop prioritizing peering over transit - https://phabricator.wikimedia.org/T204281 (10ayounsi) For reference, here are all the Amsterdam transit ports stacked: https://librenms.wikimedia.org/graphs/id=3944,11510,4109,4110,4151,6862/type=multiport_bits_separ... 
[16:07:16] I'm sure I created it with a wiktionary flag though [16:08:32] PROBLEM - Toolforge instance distribution on cloudcontrol1003 is CRITICAL: CRITICAL: worker class instances not spread out enough [16:12:31] !log deactivate local pref for peering sessions in es/knams - T204281 [16:12:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:37] T204281: Stop prioritizing peering over transit - https://phabricator.wikimedia.org/T204281 [16:14:38] RECOVERY - MariaDB Slave Lag: s8 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 248.98 seconds [16:29:02] (03CR) 10Bstorm: toolforge: kube2proxy: validate requests library version (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/484609 (https://phabricator.wikimedia.org/T213711) (owner: 10BryanDavis) [16:29:11] (03CR) 10Alexandros Kosiaris: [C: 04-1] "A first round of comments. Overall I see an approach that seems to be dictated by some limitations that are not clear." (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/483788 (https://phabricator.wikimedia.org/T210717) (owner: 10Giuseppe Lavagetto) [16:29:24] 10Operations: Rebuild installer images for CVE-2019-3462 - https://phabricator.wikimedia.org/T214368 (10MoritzMuehlenhoff) 05Open→03Resolved We've looked into this; our netboot images don't need an update: In the initrd anna is used instead of apt and it's not affected by CVE-2019-3462. 
[16:34:16] RECOVERY - MariaDB Slave Lag: s4 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 288.45 seconds [16:35:56] (03PS1) 10Ladsgroup: Fix project talk namespace alias of Persian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485847 (https://phabricator.wikimedia.org/T213733) [16:57:42] 10Operations, 10Analytics, 10Research, 10Services, and 2 others: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (10Nuria) @bmansurov I am guessing that option 2 is the most likely one, in any case I want to stress that we should really be working on... [16:59:13] (03PS1) 10Elukey: profile::analytics::packages::statistics: deploy git-lfs [puppet] - 10https://gerrit.wikimedia.org/r/485852 (https://phabricator.wikimedia.org/T214089) [17:00:04] godog and _joe_: I, the Bot under the Fountain, allow thee, The Deployer, to do Puppet SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190122T1700). [17:00:04] No GERRIT patches in the queue for this window AFAICS. [17:00:56] (03CR) 10Huji: [C: 04-1] Fix project talk namespace alias of Persian Wikipedia (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485847 (https://phabricator.wikimedia.org/T213733) (owner: 10Ladsgroup) [17:01:17] (03PS2) 10Huji: Fix project talk namespace alias of Persian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485847 (https://phabricator.wikimedia.org/T213733) (owner: 10Ladsgroup) [17:01:48] (03CR) 10Huji: "Since the original error was mine, I submitted the correct patch here and I think it would be okay for you to +2 it yourself." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/485847 (https://phabricator.wikimedia.org/T213733) (owner: 10Ladsgroup) [17:04:19] 10Operations, 10Analytics, 10Research, 10Services, and 2 others: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (10bmansurov) @Nuria, thanks for the input. I suppose you mean the option 2 of the first point. > Has that work started? I'm currently wor... [17:13:16] 10Operations, 10SRE-Access-Requests: Requesting access to production for dsharpe - https://phabricator.wikimedia.org/T214130 (10RobH) I've emailed both @faidon and @mark to make them aware of this request: > Faidon & Mark, > > Normally this is reviewed in the SRE meeting, but we won't be having one for the... [17:14:09] (03PS1) 10Elukey: profile::analytics::refinery::job::project_namespace_map: fix timer's script [puppet] - 10https://gerrit.wikimedia.org/r/485856 (https://phabricator.wikimedia.org/T172532) [17:15:43] (03CR) 10Elukey: [C: 03+2] profile::analytics::refinery::job::project_namespace_map: fix timer's script [puppet] - 10https://gerrit.wikimedia.org/r/485856 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [17:20:01] 10Operations, 10netops, 10Performance-Team (Radar): Stop prioritizing peering over transit - https://phabricator.wikimedia.org/T204281 (10ayounsi) First observation shows a 600Mbps traffic shift from peering to transit, which is small and within expected range. 
[17:23:53] (03PS1) 10Elukey: profile::analytics::refinery::job::project_namespace_map: fix variables in erb [puppet] - 10https://gerrit.wikimedia.org/r/485858 (https://phabricator.wikimedia.org/T172532) [17:26:00] (03CR) 10Elukey: [C: 03+2] profile::analytics::refinery::job::project_namespace_map: fix variables in erb [puppet] - 10https://gerrit.wikimedia.org/r/485858 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [17:26:21] (03PS3) 10DCausse: [WIP] Upgrade to 6.5.4 [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/446869 (https://phabricator.wikimedia.org/T199791) [17:29:29] !log ppchelko@deploy1001 Started deploy [cpjobqueue/deploy@afca813]: Add the constraintsRunCheck job definition T204031 [17:29:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:33] T204031: Deploy regular running of wikidata constraint checks using the job queue - https://phabricator.wikimedia.org/T204031 [17:30:25] !log ppchelko@deploy1001 Finished deploy [cpjobqueue/deploy@afca813]: Add the constraintsRunCheck job definition T204031 (duration: 00m 55s) [17:30:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:48] (03PS4) 10DCausse: [WIP] Upgrade to 6.5.4 [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/446869 (https://phabricator.wikimedia.org/T199791) [17:39:50] (03PS1) 10Elukey: refinery-download-project-namespace-map.sh.erb: remove unnecessary escapes [puppet] - 10https://gerrit.wikimedia.org/r/485860 (https://phabricator.wikimedia.org/T172532) [17:40:00] !log milimetric@deploy1001 Started deploy [analytics/refinery@372c0b6]: Denormalized job updates for actor/comment refactor [17:40:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:15] (03CR) 10Elukey: [C: 03+2] refinery-download-project-namespace-map.sh.erb: remove unnecessary escapes [puppet] - 10https://gerrit.wikimedia.org/r/485860 (https://phabricator.wikimedia.org/T172532) 
(owner: 10Elukey) [17:42:11] !log milimetric@deploy1001 Finished deploy [analytics/refinery@372c0b6]: Denormalized job updates for actor/comment refactor (duration: 02m 11s) [17:42:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:13] !log milimetric@deploy1001 Started deploy [analytics/refinery@b07451e]: Denormalized job updates for actor/comment refactor [17:43:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:45] (03PS2) 10Hashar: scan and process templates in parallel [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/484578 [17:52:58] (03PS11) 10Dzahn: webperf: add data types, split statsd host/port params [puppet] - 10https://gerrit.wikimedia.org/r/485106 [17:54:59] (03CR) 10Hashar: "Rebased due to the merge of the pull and cache options." [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/484578 (owner: 10Hashar) [18:00:04] cscott, arlolra, subbu, halfak, and Amir1: #bothumor I � Unicode. All rise for Services – Graphoid / Parsoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190122T1800). 
[18:00:17] no parsoid deploy today [18:00:18] (03PS2) 10Volans: dns: fix logging message [software/spicerack] - 10https://gerrit.wikimedia.org/r/484524 [18:00:20] (03CR) 10Dzahn: [C: 03+2] webperf: add data types, split statsd host/port params [puppet] - 10https://gerrit.wikimedia.org/r/485106 (owner: 10Dzahn) [18:00:38] !log milimetric@deploy1001 Finished deploy [analytics/refinery@b07451e]: Denormalized job updates for actor/comment refactor (duration: 17m 24s) [18:00:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:25] (03PS2) 10Volans: sre.switchdc.mediawiki: fix update tendril [cookbooks] - 10https://gerrit.wikimedia.org/r/484255 [18:04:25] (03CR) 10Dzahn: [C: 03+2] "noop on webperf1001" [puppet] - 10https://gerrit.wikimedia.org/r/485106 (owner: 10Dzahn) [18:07:21] (03CR) 10Volans: [C: 03+2] sre.switchdc.mediawiki: fix update tendril [cookbooks] - 10https://gerrit.wikimedia.org/r/484255 (owner: 10Volans) [18:07:50] PROBLEM - MariaDB Slave Lag: s4 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 786.12 seconds [18:09:08] (03Merged) 10jenkins-bot: sre.switchdc.mediawiki: fix update tendril [cookbooks] - 10https://gerrit.wikimedia.org/r/484255 (owner: 10Volans) [18:09:38] (03CR) 10Volans: [C: 03+2] dns: fix logging message [software/spicerack] - 10https://gerrit.wikimedia.org/r/484524 (owner: 10Volans) [18:15:46] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:16:00] (03Merged) 10jenkins-bot: dns: fix logging message [software/spicerack] - 10https://gerrit.wikimedia.org/r/484524 (owner: 10Volans) [18:16:27] (03PS2) 10Volans: spicerack: fix version [software/spicerack] - 10https://gerrit.wikimedia.org/r/484239 (https://phabricator.wikimedia.org/T205884) [18:17:06] (03CR) 10jenkins-bot: dns: fix logging message [software/spicerack] - 10https://gerrit.wikimedia.org/r/484524 (owner: 10Volans) [18:20:48] PROBLEM - 
Request latencies on acrab is CRITICAL: instance=10.192.16.26:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:22:32] (03PS3) 10Dzahn: package_builder: add data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/485099 [18:22:53] (03PS4) 10Dzahn: package_builder: add data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/485099 [18:23:08] (03CR) 10Dzahn: package_builder: add data types to parameters (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/485099 (owner: 10Dzahn) [18:23:12] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:23:23] (03CR) 10Volans: [C: 03+2] "> Patch Set 1: Code-Review+1" [software/spicerack] - 10https://gerrit.wikimedia.org/r/484239 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [18:25:06] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:25:52] PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:26:20] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:27:11] (03CR) 10Dzahn: [C: 03+2] "thx for reviews. fixed and noop in compiler https://puppet-compiler.wmflabs.org/compiler1002/14426/boron.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/485099 (owner: 10Dzahn) [18:28:10] RECOVERY - Request latencies on acrab is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:28:20] RECOVERY - Request latencies on argon is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:29:06] (03Merged) 10jenkins-bot: spicerack: fix version [software/spicerack] - 10https://gerrit.wikimedia.org/r/484239 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [18:29:11] (03CR) 10Dzahn: [C: 03+2] "no change on boron.eqiad.wmnet" [puppet] - 10https://gerrit.wikimedia.org/r/485099 (owner: 10Dzahn) [18:30:06] (03CR) 10jenkins-bot: spicerack: fix version [software/spicerack] - 10https://gerrit.wikimedia.org/r/484239 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [18:31:06] (03PS1) 10Arturo Borrero Gonzalez: cloudnet1003: reimage to stretch [puppet] - 10https://gerrit.wikimedia.org/r/485869 (https://phabricator.wikimedia.org/T214299) [18:32:17] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudnet1003: reimage to stretch [puppet] - 10https://gerrit.wikimedia.org/r/485869 (https://phabricator.wikimedia.org/T214299) (owner: 10Arturo Borrero Gonzalez) [18:36:56] !log T214299 reimaging cloudnet1003 as debian stretch [18:36:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:59] T214299: cloudvps: neutron: upgrade jessie -> stretch - https://phabricator.wikimedia.org/T214299 [18:42:05] (03CR) 10Dzahn: jenkins: add data types to parameters (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/485094 (owner: 10Dzahn) [18:43:03] (03PS5) 10Dzahn: jenkins: add data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/485094 [18:44:19] (03CR) 10Dzahn: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/485094 (owner: 10Dzahn) [18:47:15] 10Operations, 10Analytics, 10Research, 10Article-Recommendation, and 2 others: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (10Pchelolo) [18:49:51] 10Operations, 10Analytics, 10User-Elukey: notebook/stat server(s) running out of memory - https://phabricator.wikimedia.org/T212824 (10crusnov) Just as an extra data point, early morning 
2019-01-22 nagios-nrpe-server crashed on stat1007 from a cannot allocate error. [18:51:59] !log bsitzmann@deploy1001 Started deploy [mobileapps/deploy@0bcdd3f]: Update mobileapps to 0aac268 (fix pronunciation detection in mobile-sections T214338) [18:52:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:02] T214338: Pronunciations are not detected in mobile-section anymore - https://phabricator.wikimedia.org/T214338 [18:52:55] 10Operations, 10Elasticsearch, 10Discovery-Search (Current work): Create Icinga check for failed shard allocation - https://phabricator.wikimedia.org/T212850 (10Mathew.onipe) [18:55:58] !log bsitzmann@deploy1001 Finished deploy [mobileapps/deploy@0bcdd3f]: Update mobileapps to 0aac268 (fix pronunciation detection in mobile-sections T214338) (duration: 04m 00s) [18:56:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:11] (03CR) 10Dzahn: "@Hashar amended to address all your comments. experimental build works with Hosts: with new line but does not list the hosts on second line " [puppet] - 10https://gerrit.wikimedia.org/r/485094 (owner: 10Dzahn) [19:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190122T1900) [19:00:11] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/compiler1002/14427/" [puppet] - 10https://gerrit.wikimedia.org/r/485094 (owner: 10Dzahn) [19:02:51] !log onimisionipe@deploy1001 Started deploy [kartotherian/deploy@e847e7b] (stretch): Updating maps1002 to reflect latest changes [19:02:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:53] !log onimisionipe@deploy1001 Finished deploy [kartotherian/deploy@e847e7b] (stretch): Updating maps1002 to reflect latest changes (duration: 01m 02s) [19:03:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:04] 10Operations, 10Patch-For-Review: upgrade krypton 
(webserver_misc_apps) to stretch - https://phabricator.wikimedia.org/T210008 (10Dzahn) a:03Dzahn [19:08:21] 10Operations, 10serviceops: upgrade krypton (webserver_misc_apps) to stretch - https://phabricator.wikimedia.org/T210008 (10Dzahn) [19:12:02] (03PS2) 10Dzahn: wikistats: add data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/485101 [19:13:24] PROBLEM - Request latencies on acrux is CRITICAL: instance=10.192.0.93:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:14:30] PROBLEM - Request latencies on acrab is CRITICAL: instance=10.192.16.26:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:15:48] RECOVERY - Request latencies on acrux is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:16:52] RECOVERY - Request latencies on acrab is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:19:17] (03CR) 10Dzahn: [C: 03+2] wikistats: add data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/485101 (owner: 10Dzahn) [19:19:29] (03PS3) 10Dzahn: wikistats: add data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/485101 [19:20:08] (03CR) 10Dzahn: [C: 03+2] "wikistats-cloud-vps, not analytics-wikistats. 
not related to stats.wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/485101 (owner: 10Dzahn) [19:20:15] (03PS1) 10Legoktm: Use production wikimedia-jessie base image [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/485876 [19:21:18] PROBLEM - kartotherian endpoints health on maps1002 is CRITICAL: /osm-intl/11/828/655.png (get a tile in the middle of the ocean, with overzoom) timed out before a response was received [19:21:52] PROBLEM - kartotherian endpoints health on maps1004 is CRITICAL: /osm-intl/11/828/655.png (get a tile in the middle of the ocean, with overzoom) timed out before a response was received [19:21:54] (03PS2) 10Dzahn: contint: add data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/485096 [19:22:13] (03CR) 10Dzahn: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/485096 (owner: 10Dzahn) [19:22:46] PROBLEM - kartotherian endpoints health on maps1003 is CRITICAL: /osm-intl/11/828/655.png (get a tile in the middle of the ocean, with overzoom) timed out before a response was received [19:22:58] (03CR) 10Dzahn: [V: 03+1] contint: add data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/485096 (owner: 10Dzahn) [19:23:54] checking maps [19:24:11] thanks Matt [19:27:42] (03PS1) 10Arturo Borrero Gonzalez: cloudnet1003: hiera: refresh interface names [puppet] - 10https://gerrit.wikimedia.org/r/485878 (https://phabricator.wikimedia.org/T214299) [19:28:22] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudnet1003: hiera: refresh interface names [puppet] - 10https://gerrit.wikimedia.org/r/485878 (https://phabricator.wikimedia.org/T214299) (owner: 10Arturo Borrero Gonzalez) [19:28:29] (03PS1) 10Dzahn: wikistats: cron job minutes are integers, not strings now [puppet] - 10https://gerrit.wikimedia.org/r/485879 [19:29:08] (03CR) 10jerkins-bot: [V: 04-1] wikistats: cron job minutes are integers, not strings now [puppet] - 10https://gerrit.wikimedia.org/r/485879 (owner: 10Dzahn) [19:29:57] 
(03PS1) 10Hashar: Warn about lack of changelog or Dockerfile.template [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/485880 [19:30:41] !log T214299 additional reboot for cloudnet1003 [19:30:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:44] T214299: cloudvps: neutron: upgrade jessie -> stretch - https://phabricator.wikimedia.org/T214299 [19:30:47] (03CR) 10Dzahn: [V: 03+1] "experimental build added and is success" [puppet] - 10https://gerrit.wikimedia.org/r/485096 (owner: 10Dzahn) [19:31:00] (03PS2) 10Dzahn: wikistats: cron job minutes are integers, not strings now [puppet] - 10https://gerrit.wikimedia.org/r/485879 [19:31:16] (03CR) 10jerkins-bot: [V: 04-1] Warn about lack of changelog or Dockerfile.template [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/485880 (owner: 10Hashar) [19:31:53] (03CR) 10Dzahn: [C: 03+2] wikistats: cron job minutes are integers, not strings now [puppet] - 10https://gerrit.wikimedia.org/r/485879 (owner: 10Dzahn) [19:32:07] (03PS3) 10Dzahn: wikistats: cron job minutes are integers, not strings now [puppet] - 10https://gerrit.wikimedia.org/r/485879 [19:32:28] (03PS2) 10Dzahn: ifft: add data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/485098 [19:33:29] (03CR) 1020after4: [C: 03+2] Fix typo in 172.16.0.0/12 block [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485843 (owner: 10BryanDavis) [19:34:22] (03PS3) 10Dzahn: ifttt: add data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/485098 [19:34:41] (03Merged) 10jenkins-bot: Fix typo in 172.16.0.0/12 block [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485843 (owner: 10BryanDavis) [19:36:22] PROBLEM - Maps HTTPS on maps1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:36:30] RECOVERY - MariaDB Slave Lag: s4 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 272.11 seconds [19:37:26] RECOVERY - Maps HTTPS on maps1003 is OK: HTTP OK: HTTP/1.1 200 OK - 1323 bytes 
in 0.457 second response time [19:38:34] (03CR) 10Krinkle: [C: 03+1] InitialiseSettings.php: Increase parsercache TTL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485582 (https://phabricator.wikimedia.org/T210992) (owner: 10Marostegui) [19:38:44] ACKNOWLEDGEMENT - kartotherian endpoints health on maps1002 is CRITICAL: /osm-intl/11/828/655.png (get a tile in the middle of the ocean, with overzoom) timed out before a response was received Mathew.onipe working on this [19:39:08] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-upload site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [19:39:14] ACKNOWLEDGEMENT - kartotherian endpoints health on maps1003 is CRITICAL: /osm-intl/9/207/163@1.5x.png (default scaled tile) timed out before a response was received: /osm-intl/11/828/655.png (get a tile in the middle of the ocean, with overzoom) timed out before a response was received: /v4/marker/pin-m+ffffff@2x.png (Untitled test) timed out before a response was received Mathew.onipe working on this [19:39:20] (03CR) 10GTirloni: [C: 03+2] Use production wikimedia-jessie base image [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/485876 (owner: 10Legoktm) [19:39:39] (03CR) 10BryanDavis: [C: 03+1] "As far as I can tell the prod base images are built using the exact same system from modules/docker/templates/images." 
[docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/485876 (owner: 10Legoktm) [19:39:42] (03Merged) 10jenkins-bot: Use production wikimedia-jessie base image [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/485876 (owner: 10Legoktm) [19:39:46] (03CR) 10GTirloni: [V: 03+2 C: 03+2] Use production wikimedia-jessie base image [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/485876 (owner: 10Legoktm) [19:41:58] PROBLEM - LVS HTTP IPv4 on kartotherian.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:42:37] <_joe_> uh [19:42:50] onimisionipe: you need a hand with anything? [19:43:04] RECOVERY - LVS HTTP IPv4 on kartotherian.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1323 bytes in 0.043 second response time [19:43:06] did this page? [19:43:06] <_joe_> onimisionipe: should I depool eqiad? [19:43:18] * volans still waiting for the page... [19:43:19] it paged [19:43:25] <_joe_> oh just recovered [19:43:55] oh now i get the pages. yay [19:44:06] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-upload site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [19:44:11] * twentyafterfour waits for things to settle before deploying [19:44:29] Its coming back [19:45:00] RECOVERY - kartotherian endpoints health on maps1003 is OK: All endpoints are healthy [19:45:03] _joe_: please do for now [19:45:10] * akosiaris around as well [19:45:35] <_joe_> onimisionipe: should we depool eqiad even if it recovered? 
[19:45:42] _joe_: wait please [19:45:47] <_joe_> ok [19:45:57] checking to see if it will actually come back (I mean the issues) [19:46:45] (03CR) 10jenkins-bot: Fix typo in 172.16.0.0/12 block [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485843 (owner: 10BryanDavis) [19:48:33] PROBLEM - kartotherian endpoints health on maps1003 is CRITICAL: /osm-intl/9/207/163@1.5x.png (default scaled tile) timed out before a response was received: /osm-intl/11/828/655.png (get a tile in the middle of the ocean, with overzoom) timed out before a response was received [19:48:58] * gehel is looking at kartotherian as well [19:49:29] PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:49:43] (03CR) 1020after4: [C: 03+2] "deploying this as soon as the currently ongoing varnish issue settles... don't want to create a distraction in #wikimedia-operations" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485843 (owner: 10BryanDavis) [19:50:31] RECOVERY - Request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:50:33] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-upload site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [19:51:53] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-upload site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [19:52:25] _joe_: could you depool eqiad while we find out what's going on? Please? [19:52:34] <_joe_> gehel: sure [19:52:40] thanks! [19:52:50] btw, where is the doc to do that depool? 
[19:53:59] <_joe_> gehel: nevermind, we don't have a discovery record for maps [19:54:14] <_joe_> to depool eqiad, we need to modify the cache configurations [19:54:16] ouch, something more we need to do! [19:54:57] let's see if we can fix that quickly [19:55:03] gehel: I think this has something to do with changing the replication factor for cassandra [19:55:11] (03CR) 10Jforrester: [C: 04-2] "> Patch Set 2:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482101 (https://phabricator.wikimedia.org/T212865) (owner: 10Jforrester) [19:55:16] looking [19:55:34] <_joe_> I'm not at my main computer, so if I need to do the puppet patch, you'll have to wait a bit [19:55:43] RECOVERY - kartotherian endpoints health on maps1003 is OK: All endpoints are healthy [19:56:24] (03PS8) 10Smalyshev: Add allocator metrics export for Blazegraph [puppet] - 10https://gerrit.wikimedia.org/r/485135 (https://phabricator.wikimedia.org/T213372) [19:56:59] (03PS2) 10Jforrester: Re-do "Disable ZeroBanner and ZeroPortal on all wikis"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483193 [19:57:02] I'm here, so just tell me if I can help [19:57:03] (03PS3) 10Jforrester: Drop the Wikipedia Zero debug log channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482099 (https://phabricator.wikimedia.org/T212865) [19:57:05] (03PS3) 10Jforrester: robots.php: Drop the special treatment for Wikipedia Zero [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482100 (https://phabricator.wikimedia.org/T212865) [19:57:07] (03PS3) 10Jforrester: zerowiki: Stop whitelisting ZeroPortal to logged out users, no longer available [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482101 (https://phabricator.wikimedia.org/T212865) [19:57:08] * volans still waiting for the page btw [19:57:09] (03PS3) 10Jforrester: Drop ZeroBanner and ZeroPortal from production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482102 (https://phabricator.wikimedia.org/T212865) [19:57:11] (03PS3) 10Jforrester: 
Stop configuring ZeroBanner and ZeroPortal, unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482103 (https://phabricator.wikimedia.org/T212865) [19:57:13] (03PS3) 10Jforrester: Stop loading i18n for ZeroBanner and ZeroPortal, unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482104 (https://phabricator.wikimedia.org/T212865) [19:57:36] (03CR) 10Jforrester: [C: 04-2] "> the entire stack is C-2'ed at this point waiting on SRE anyway (for different stuff, the VCL rules)." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482101 (https://phabricator.wikimedia.org/T212865) (owner: 10Jforrester) [19:58:38] gehel: I'm reverting the replication factor back to 1 for system_auth [19:59:07] PROBLEM - kartotherian endpoints health on maps1003 is CRITICAL: /osm-intl/9/207/163@1.5x.png (default scaled tile) timed out before a response was received: /osm-intl/11/828/655.png (get a tile in the middle of the ocean, with overzoom) timed out before a response was received [19:59:20] onimisionipe: don't touch it yet pls [19:59:23] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [19:59:47] gehel: Ok [20:00:04] twentyafterfour: Your horoscope predicts another unfortunate MediaWiki train - Americas version deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190122T2000). [20:00:31] <_joe_> bblack: can you take a look at my patch once git review complies? 
[20:00:35] <_joe_> or akosiaris [20:00:35] (03PS1) 10Giuseppe Lavagetto: cache::upload: depool kartotherian in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/485884 [20:00:38] <_joe_> just to be sure [20:00:43] <_joe_> here it is ^^ [20:00:55] RECOVERY - kartotherian endpoints health on maps1002 is OK: All endpoints are healthy [20:01:15] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [20:01:41] <_joe_> it should be correct according to https://wikitech.wikimedia.org/wiki/Global_traffic_routing#Cache-to-application_routing [20:01:53] <_joe_> gehel: or you if no one is around :/ [20:01:59] <_joe_> volans maybe? [20:02:04] looking [20:02:34] according to the docs, lgtm [20:02:40] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/485884 (owner: 10Giuseppe Lavagetto) [20:02:42] LGTM _joe_ [20:02:53] at least we agree :) [20:03:04] (03CR) 10Effie Mouzeli: [C: 03+1] cache::upload: depool kartotherian in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/485884 (owner: 10Giuseppe Lavagetto) [20:03:11] <_joe_> volans: since it's morning for you, would you care to deploy it? [20:03:13] (03CR) 10Volans: [C: 03+1] "LGTM, unless there are side-effects I'm not aware of" [puppet] - 10https://gerrit.wikimedia.org/r/485884 (owner: 10Giuseppe Lavagetto) [20:03:15] * akosiaris looking [20:03:16] sure [20:03:22] <_joe_> unless you're on vacation :) [20:03:27] !log running nodetool repair on system_auth for maps / eqiad servers [20:03:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:33] gehel: ok to depool? [20:03:37] (03CR) 10Alexandros Kosiaris: [C: 03+1] cache::upload: depool kartotherian in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/485884 (owner: 10Giuseppe Lavagetto) [20:03:42] volans: please do! 
[20:03:49] (03CR) 10Volans: [C: 03+2] cache::upload: depool kartotherian in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/485884 (owner: 10Giuseppe Lavagetto) [20:03:59] <_joe_> volans: it just needs to run on role::cache::upload in eqiad AIUI [20:04:42] _joe_: I'm wondering for the intra-cache routing [20:04:52] yes, just upload@eqiad [20:04:55] <_joe_> volans: nothing changes [20:05:00] yeah sounds good to me [20:05:13] <_joe_> oh there is a dnsdisc entry, I just misspelled kartotherian [20:05:15] <_joe_> GRRRRRRR [20:05:26] <_joe_> MaxSem: THANK YOU :D [20:05:29] (and yes, intra-cache can continue flowing through eqiad, and also isn't per-service but per whole cluster (upload or text)) [20:05:36] _joe_: should I stop? [20:05:45] already puppet-merge was about to hit enter on the puppet run [20:05:46] varnish stuff doesn't use dnsdisc [20:05:51] don't stop [20:06:07] gehel: not seeing errors on maps1002 again [20:06:08] !log running cumin 'P{O:cache::upload} and A:eqiad' 'run-puppet-agent' [20:06:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:40] !log oblivian@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=kartotherian,name=eqiad [20:06:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:54] <_joe_> volans: don't stop :) [20:07:05] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - kartotherian-ssl_443: Servers maps1004.eqiad.wmnet are marked down but pooled [20:07:15] puppet run completed [20:07:18] !log onimisionipe@deploy1001 Started deploy [kartotherian/deploy@e847e7b] (stretch): Updating maps1002 to reflect latest changes [20:07:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:19] !log onimisionipe@deploy1001 deploy aborted: Updating maps1002 to reflect latest changes (duration: 00m 01s) [20:07:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:24] 
volans: thanks! [20:07:34] diff looks sane [20:07:34] - set req.backend_hint = kartotherian.backend(); [20:07:35] <_joe_> ok, I disabled kartotherian in eqiad for discovery, volans depooled it in varnish [20:07:36] + set req.backend_hint = cache_codfw.backend(); set req.http.X-Next-Is-Cache = 1; [20:07:38] Oops! mistaken deploy! [20:07:53] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - CRITICAL - kartotherian-ssl_443: Servers maps1004.eqiad.wmnet are marked down but pooled [20:07:53] <_joe_> you should be ok to test and fix things [20:08:02] looks like we might be hitting https://issues.apache.org/jira/browse/CASSANDRA-8120 [20:08:04] * _joe_ off again [20:08:11] RECOVERY - kartotherian endpoints health on maps1004 is OK: All endpoints are healthy [20:08:15] _joe_, volans and others: thanks a bunch! [20:08:19] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy [20:08:29] RECOVERY - kartotherian endpoints health on maps1003 is OK: All endpoints are healthy [20:08:38] gehel: I'll be around so if you want to tell me what to do in case anything happens please do ;) [20:08:59] as long as eqiad is depooled, not much to do [20:09:04] ack [20:09:07] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [20:09:24] gehel: what did you do? not seeing error in maps1003 again [20:09:27] the issue is most probably related to the work we're doing on upgrading to stretch (no idea how, but that's the obvious candidate) [20:09:54] onimisionipe: so far, only `nodetool repair system_auth`, but I'm seeing errors in its output [20:10:45] <_joe_> onimisionipe: you don't see errors because it's depooled [20:11:15] _joe_: true! 
[20:11:26] _joe_: the health check that was failing is an active check [20:11:41] and it is now green, which is somewhat suspicious :/ [20:11:45] (03CR) 10Alexandros Kosiaris: mathoid: Move config.yaml into a template (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/483184 (owner: 10Alexandros Kosiaris) [20:13:19] gehel: this started immediately I bumped replication factor for both system_auth and v4 [20:13:38] so probably related [20:14:03] argh, I got the page with ~30m of delay, I'll go check the pager logs, that's pretty worrisome [20:14:07] let's move this conversation in #wikimedia-interactive for the moment, it isn't burning anymore [20:14:15] Ok [20:17:08] (03PS2) 10Alexandros Kosiaris: mathoid: Move config.yaml into a template [deployment-charts] - 10https://gerrit.wikimedia.org/r/483184 [20:17:10] (03PS3) 10Alexandros Kosiaris: Add an stdout log stanza to config [deployment-charts] - 10https://gerrit.wikimedia.org/r/483227 [20:17:41] (03PS1) 10Cwhite: role: add forwards-compatibility rules to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/485889 (https://phabricator.wikimedia.org/T213708) [20:19:02] !log milimetric@deploy1001 Started deploy [analytics/refinery@d806b62]: Update jar versions on modified jobs [20:19:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:10] 10Operations, 10Analytics, 10Research, 10Article-Recommendation, and 2 others: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (10mobrovac) I agree that the most likely solution to work here is option (2), i.e. getting a host to execute it from. Perhap... 
[20:22:30] 10Operations, 10Analytics, 10Research, 10Article-Recommendation, and 3 others: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (10mobrovac) [20:23:44] (03Abandoned) 10Dzahn: ifttt: add data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/485098 (owner: 10Dzahn) [20:25:49] !log milimetric@deploy1001 Finished deploy [analytics/refinery@d806b62]: Update jar versions on modified jobs (duration: 06m 48s) [20:25:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:32] !log resetting cassandra authentication on maps / eqiad [20:26:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:54] (03CR) 10Dzahn: "i have not used this method last time i upgraded jenkins. it showed also a lot of warnings related to other packages. i have download the " [puppet] - 10https://gerrit.wikimedia.org/r/485685 (owner: 10Hashar) [20:29:57] (03CR) 10Dzahn: "why do files need to be writable by wikidev?" 
[puppet] - 10https://gerrit.wikimedia.org/r/484308 (https://phabricator.wikimedia.org/T137890) (owner: 10Hashar) [20:30:54] (03PS2) 10Hashar: Warn about lack of changelog or Dockerfile.template [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/485880 [20:30:56] (03PS3) 10Hashar: scan and process templates in parallel [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/484578 [20:31:29] (03PS9) 10Smalyshev: Add allocator metrics export for Blazegraph [puppet] - 10https://gerrit.wikimedia.org/r/485135 (https://phabricator.wikimedia.org/T213372) [20:32:20] short update on the maps incident just before: looks like we are now OK, but I'll wait until tomorrow to repool eqiad, I don't understand all the details yet (cc: volans) [20:32:39] gehel: ack, leaving it as is then [20:33:21] (03CR) 10Hashar: Warn about lack of changelog or Dockerfile.template (031 comment) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/485880 (owner: 10Hashar) [20:36:38] so I'm clear to deploy the train? [20:37:11] (03CR) 10Hashar: "I count 192 queries for integration/config.git. Half of them to my local docker, the other half to Wikimedia registry. With the paralle" (032 comments) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/484578 (owner: 10Hashar) [20:38:25] twentyafterfour: I think so as things are stable now and karthoterian is depooled in eqiad [20:40:26] 10Operations, 10Maps: Kartotherian service on maps100[2-4] timed out on when trying to get tiles. 
- https://phabricator.wikimedia.org/T214434 (10Mathew.onipe) [20:50:52] (03PS2) 10Dzahn: site: add mw2151 as another jobrunner host [puppet] - 10https://gerrit.wikimedia.org/r/483476 (https://phabricator.wikimedia.org/T192457) [20:51:31] 10Operations, 10Performance-Team, 10VisualEditor, 10Software-Licensing: New MongoDB version is not DFSG-compatible, dropped by Debian - https://phabricator.wikimedia.org/T213996 (10Krinkle) [20:52:06] 10Operations, 10Performance-Team, 10VisualEditor, 10Software-Licensing: New MongoDB version is not DFSG-compatible, dropped by Debian - https://phabricator.wikimedia.org/T213996 (10Krinkle) [20:54:19] (03CR) 10Krinkle: site: add mw2151 as another jobrunner host (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/483476 (https://phabricator.wikimedia.org/T192457) (owner: 10Dzahn) [20:57:30] (03CR) 10Dzahn: site: add mw2151 as another jobrunner host (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/483476 (https://phabricator.wikimedia.org/T192457) (owner: 10Dzahn) [20:59:07] (03CR) 10Dzahn: [C: 04-2] "it's also wrong because mw2251 != mw2151 ..." 
[puppet] - 10https://gerrit.wikimedia.org/r/483476 (https://phabricator.wikimedia.org/T192457) (owner: 10Dzahn) [21:01:21] (03PS3) 10Dzahn: site: add mw2151 as another jobrunner host [puppet] - 10https://gerrit.wikimedia.org/r/483476 (https://phabricator.wikimedia.org/T192457) [21:03:11] 10Operations, 10Performance-Team, 10monitoring, 10Patch-For-Review: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837 (10Krinkle) a:05Imarlier→03Krinkle [21:04:33] (03PS1) 10Zoranzoki21: Set wgRestrictionLevels for srwiki to autoconfirmed, autopatrol and sysop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485903 [21:08:33] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485903 (owner: 10Zoranzoki21) [21:11:14] 10Operations, 10MediaWiki-Cache, 10Performance-Team, 10User-Elukey: Consider removing the last traces of nutcracker in Mediawiki configs - https://phabricator.wikimedia.org/T214275 (10Krinkle) a:03aaron [21:14:34] (03PS3) 10Cwhite: role: add prometheus2 backwards-compatibility rules [puppet] - 10https://gerrit.wikimedia.org/r/484793 (https://phabricator.wikimedia.org/T213708) [21:19:26] 10Operations, 10VisualEditor, 10Performance-Team (Radar), 10Software-Licensing: New MongoDB version is not DFSG-compatible, dropped by Debian - https://phabricator.wikimedia.org/T213996 (10Krinkle) [21:20:20] (03PS4) 10CRusnov: Upgrade netbox to v2.5.2 [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/485142 (https://phabricator.wikimedia.org/T212524) [21:21:20] (03PS5) 10CRusnov: Upgrade netbox to v2.5.3 [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/485142 (https://phabricator.wikimedia.org/T212524) [21:23:25] 10Operations, 10serviceops, 10Core Platform Team (PHP7 (TEC4)), 10Core Platform Team Backlog (Watching / External), and 4 others: Set up a beta feature offering the use of PHP7 - https://phabricator.wikimedia.org/T213934 (10kchapman) [21:26:31] 10Operations, 
10VisualEditor, 10Performance-Team (Radar), 10Software-Licensing: New MongoDB version is not DFSG-compatible, dropped by Debian - https://phabricator.wikimedia.org/T213996 (10MaxSem) [21:30:42] (03CR) 10Volans: [C: 04-1] Upgrade netbox to v2.5.3 (031 comment) [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/485142 (https://phabricator.wikimedia.org/T212524) (owner: 10CRusnov) [21:31:20] !log twentyafterfour@deploy1001 Synchronized wmf-config/CommonSettings.php: deploy I91e9028d70d5a6f109f341bad553816e617b2416 (duration: 01m 39s) [21:31:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:26] (03CR) 10Hashar: [C: 04-1] "I can not test this change properly. There is an issue somewhere that causes docker-pkg to fail to notice an image has been published. Th" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/484578 (owner: 10Hashar) [21:34:40] (03CR) 10Dr0ptp4kt: WIP: Add a Google Translate-specific redirect-to-mobile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/485171 (https://phabricator.wikimedia.org/T212197) (owner: 10Dr0ptp4kt) [21:35:18] ^ bblack review request [21:35:53] (03CR) 10Hashar: [C: 04-1] "Could not load image in /home/hashar/projects/integration/config/dockerfiles/./java8-xgboost: (405, )" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/484578 (owner: 10Hashar) [21:37:46] (03PS1) 1020after4: testwikis wikis to 1.33.0-wmf.14 refs T206668 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485958 [21:37:48] (03CR) 1020after4: [C: 03+2] testwikis wikis to 1.33.0-wmf.14 refs T206668 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485958 (owner: 1020after4) [21:38:17] (03PS6) 10CRusnov: Upgrade netbox to v2.5.3 [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/485142 (https://phabricator.wikimedia.org/T212524) [21:39:01] (03Merged) 10jenkins-bot: testwikis wikis to 1.33.0-wmf.14 refs T206668 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485958 
(owner: 1020after4) [21:39:42] !log twentyafterfour@deploy1001 Started scap: testwikis wikis to 1.33.0-wmf.14 refs T206668 [21:39:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:45] T206668: 1.33.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T206668 [21:43:28] (03CR) 10Framawiki: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482261 (https://phabricator.wikimedia.org/T212919) (owner: 10Wangql) [21:44:55] (03CR) 10jenkins-bot: testwikis wikis to 1.33.0-wmf.14 refs T206668 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485958 (owner: 1020after4) [21:48:43] (03PS1) 10Dzahn: admins: disable Seddon's SSH key [puppet] - 10https://gerrit.wikimedia.org/r/485962 [21:50:02] (03CR) 10Dzahn: [C: 03+2] site: add mw2151 as another jobrunner host [puppet] - 10https://gerrit.wikimedia.org/r/483476 (https://phabricator.wikimedia.org/T192457) (owner: 10Dzahn) [22:00:17] oh mutante Seddon's leaving or... ? :( [22:00:31] PROBLEM - HHVM rendering on mwdebug1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:01:00] Hauskatze: no, it's just an issue with his key, he needs to make a new one [22:01:17] mutante: aha, so no 'absent' needed here [22:01:26] no [22:02:05] kk [22:02:18] (03PS2) 10Dzahn: admins: disable Seddon's SSH key [puppet] - 10https://gerrit.wikimedia.org/r/485962 [22:03:08] (03CR) 10BryanDavis: toolforge: kube2proxy: validate requests library version (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/484609 (https://phabricator.wikimedia.org/T213711) (owner: 10BryanDavis) [22:03:13] (03CR) 10Dzahn: [C: 03+2] admins: disable Seddon's SSH key [puppet] - 10https://gerrit.wikimedia.org/r/485962 (owner: 10Dzahn) [22:04:03] RECOVERY - HHVM rendering on mwdebug1002 is OK: HTTP OK: HTTP/1.1 200 OK - 76093 bytes in 0.199 second response time [22:05:05] Hauskatze: can i ask you a random other thing? 
doesn't this look like it should affect "mw2151" https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/483476/3/manifests/site.pp [22:22:39] (03CR) 10Hashar: [C: 04-1] "For the Docker registry, I have filled T214441" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/484578 (owner: 10Hashar) [22:22:41] !log twentyafterfour@deploy1001 Finished scap: testwikis wikis to 1.33.0-wmf.14 refs T206668 (duration: 43m 00s) [22:22:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:22:45] T206668: 1.33.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T206668 [22:34:51] (03CR) 10Krinkle: [C: 03+1] parsercachepurging: Increase keys TTL [puppet] - 10https://gerrit.wikimedia.org/r/485583 (https://phabricator.wikimedia.org/T210992) (owner: 10Marostegui) [22:39:36] (03CR) 10Bstorm: toolforge: kube2proxy: validate requests library version (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/484609 (https://phabricator.wikimedia.org/T213711) (owner: 10BryanDavis) [22:42:43] 10Operations, 10Fundraising-Backlog, 10Mail, 10fundraising-tech-ops, 10Patch-For-Review: Stronger DKIM key for fundraising emails? - https://phabricator.wikimedia.org/T210445 (10bsisolak) IBM validated the DNS setting, and is using the new key: DKIM-Signature: v=1; a=rsa-sha1; c=relaxed/relaxed; s=spop2... [22:43:32] mutante: sorry I was on the phone, looking into that now [22:46:49] mutante: not sure why I'm looking at re. https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/483476/ [22:46:57] 10Operations, 10Proton, 10Reading-Infrastructure-Team-Backlog, 10Traffic, and 3 others: Document and possibly fine-tune how Proton interacts with Varnish - https://phabricator.wikimedia.org/T213371 (10Jdlrobson) [22:50:36] Hauskatze: i added mw2151, before it started at mw2152. does it look like i did that? [22:51:06] it does to me.. i just wanted a random pair of eyes to confirm.. because .. 
nothing happens [22:52:26] mutante: perhaps because of the # at the start? [22:52:33] what does puppet-compiler say? [22:53:19] scratch that, the # is a comment like all others [22:54:09] yea, the change is only in 1789.. compiling [22:54:53] mutante: and you want it on mw2151 right? [22:54:59] yes [22:55:38] /^mw21(5[1-9]|6[0-2])\.codfw\.wmnet$/ looks good to me? [22:55:56] Compilation results for mw2151.codfw.wmnet: no change [22:56:04] it should cover mw215[1-9] and mw216[012] [22:56:05] well.. the change is only in site.pp [22:56:21] (03CR) 10Bstorm: toolforge: kube2proxy: validate requests library version (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/484609 (https://phabricator.wikimedia.org/T213711) (owner: 10BryanDavis) [22:56:27] yes, thanks for confirming..though ..it does not [22:56:36] hmmm [22:57:41] duh:) it should help to remove it from the other place it's used in site [22:58:34] mutante: maybe https://codesearch.wmflabs.org/operations/?q=mediawiki%3A%3Ajobrunner&i=nope&files=&repos= helps [22:58:39] PROBLEM - Check systemd state on cloudnet1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [22:59:03] I'm just checking the diff, let me see the file entirely :) [22:59:06] Hauskatze: it appears in site.pp a second time.. [22:59:11] that will be it [22:59:12] ah [22:59:20] so puppet says 'f. u.' 
silently [22:59:44] no "warning danger red lights duplicate" [23:00:21] it is just like "ok, this matches, using this and from now on ignoring it" [23:00:34] first one wins [23:02:08] (PS1) Dzahn: site: mw2151 should not use spare role anymore [puppet] - https://gerrit.wikimedia.org/r/485968 (https://phabricator.wikimedia.org/T192457) [23:03:00] (PS2) Dzahn: site: mw2151 should not use spare role anymore [puppet] - https://gerrit.wikimedia.org/r/485968 (https://phabricator.wikimedia.org/T192457) [23:03:16] -1 left ] [23:05:19] (PS3) MarcoAurelio: site: mw2151 should not use spare role anymore [puppet] - https://gerrit.wikimedia.org/r/485968 (https://phabricator.wikimedia.org/T192457) (owner: Dzahn) [23:05:28] (CR) MarcoAurelio: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/485968 (https://phabricator.wikimedia.org/T192457) (owner: Dzahn) [23:07:18] mutante: ^ puppet compiler says fail :S [23:07:35] https://puppet-compiler.wmflabs.org/compiler1001/101/mw2150.codfw.wmnet/ [23:09:10] Warning: You cannot collect exported resources without storeconfigs being set; the export is ignored at /srv/jenkins-workspace/puppet-compiler/101/production/src/modules/monitoring/manifests/exported_nagios_service.pp:26:5 [23:10:28] (PS4) Cwhite: role: add prometheus2 backwards-compatibility rules [puppet] - https://gerrit.wikimedia.org/r/484793 (https://phabricator.wikimedia.org/T213708) [23:12:37] Hauskatze: that's not a fail if you can link to the "changes" like above.
that warning is known background noise [23:13:19] (PS5) Cwhite: role: add prometheus2 backwards-compatibility rules [puppet] - https://gerrit.wikimedia.org/r/484793 (https://phabricator.wikimedia.org/T213708) [23:13:57] mutante: ok then, sorry for polluting the gerrit change then [23:16:22] (PS2) Cwhite: role: add backwards-compatibility rules to prometheus [puppet] - https://gerrit.wikimedia.org/r/485889 (https://phabricator.wikimedia.org/T213708) [23:17:00] Hauskatze: i think the "check experimental" fail comes more from "[ 2019-01-22T23:06:19 ] ERROR: Unable to find facts for host , skipping" [23:17:42] mutante: worked w/o issues for me on other patch but maybe. Hashar would know better perhaps. [23:17:51] or whoever is in charge of puppet-compiler [23:17:56] I guess contint-admins [23:18:52] (PS6) Cwhite: role: add prometheus2 backwards-compatibility rules [puppet] - https://gerrit.wikimedia.org/r/484793 (https://phabricator.wikimedia.org/T213708) [23:20:24] (PS3) Cwhite: role: add backwards-compatibility rules to prometheus [puppet] - https://gerrit.wikimedia.org/r/485889 (https://phabricator.wikimedia.org/T213708) [23:22:11] Hauskatze: on mw2151 it fails to compile because "Error while evaluating a Function Call, OS debian >= stretch required" sigh :) [23:22:30] don't worry about it though and thanks [23:22:44] mutante: duh, I guess you need to reimage that server with strech? [23:22:50] *stretch? [23:22:58] (PS3) BryanDavis: toolforge: kube2proxy: validate requests library version [puppet] - https://gerrit.wikimedia.org/r/484609 (https://phabricator.wikimedia.org/T213711) [23:23:11] [mw2151:~] $ lsb_release -c [23:23:11] Codename: stretch [23:23:20] lol? [23:23:36] men, puppet is so 'funny' [23:23:45] it must be about the compiler hosts?
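Editor's note on the node-matching exchange above (22:55 to 23:00): a minimal Python sketch of what "first one wins" would look like when a host matches two node definitions. The regex is the one quoted in the chat; everything else is an assumption made for illustration: the `NODE_DEFINITIONS` list, its ordering, the second pattern, the role labels `spare` and `jobrunner`, and the helper names `first_match` and `all_matches`. It models the behaviour as described by the participants, not Puppet's actual node-selection rules, which may differ.

```python
import re

# Hypothetical stand-in for node blocks in site.pp. Order and the "spare"
# pattern are assumptions; only the jobrunner regex comes from the chat.
NODE_DEFINITIONS = [
    (re.compile(r"^mw2151\.codfw\.wmnet$"), "spare"),
    (re.compile(r"^mw21(5[1-9]|6[0-2])\.codfw\.wmnet$"), "jobrunner"),
]


def first_match(hostname, definitions):
    """Return the label of the first definition matching hostname, or None."""
    for pattern, label in definitions:
        if pattern.search(hostname):
            return label
    return None


def all_matches(hostname, definitions):
    """Return the labels of every matching definition, to expose duplicates."""
    return [label for pattern, label in definitions if pattern.search(hostname)]


if __name__ == "__main__":
    for host in ("mw2150.codfw.wmnet", "mw2151.codfw.wmnet", "mw2162.codfw.wmnet"):
        matches = all_matches(host, NODE_DEFINITIONS)
        note = " (duplicate definitions!)" if len(matches) > 1 else ""
        print(host, "->", first_match(host, NODE_DEFINITIONS), matches, note)
```

A check like `all_matches` is the "warning danger red lights duplicate" the chat wished for: it flags mw2151 as matching twice, whereas a first-match lookup silently keeps the earlier definition.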
[23:23:55] or the facts need to be synced [23:24:02] and it doesn't know yet it got upgraded [23:24:14] been a while though [23:24:45] this server has been working so much for others [23:24:55] needs to stop and know itself a bit [23:27:14] Platonides: medici, cura te ipsum [23:27:21] ;-) [23:27:30] if the servers could do that it'd be awesome [23:32:01] * Hauskatze good night [23:32:40] Hauskatze: bonum nocte [23:34:23] Vale :) [23:34:42] et gratias tibi ago [23:35:20] https://de.wiktionary.org/wiki/nachts_sind_alle_Katzen_grau [23:38:41] lol [23:42:32] (CR) Dzahn: [C: +2] "in compiler it fails with "Error while evaluating a Function Call, OS debian >= stretch required" but mw2151 is on stretch" [puppet] - https://gerrit.wikimedia.org/r/485968 (https://phabricator.wikimedia.org/T192457) (owner: Dzahn) [23:42:43] (PS4) Dzahn: site: mw2151 should not use spare role anymore [puppet] - https://gerrit.wikimedia.org/r/485968 (https://phabricator.wikimedia.org/T192457) [23:52:42] ACKNOWLEDGEMENT - Check systemd state on cloudnet2001-dev is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://phabricator.wikimedia.org/T214167 [23:53:01] (PS2) Volans: decorators: make retry() DRY-RUN aware [software/spicerack] - https://gerrit.wikimedia.org/r/484582 [23:53:03] (PS1) Volans: decorators: improve tests [software/spicerack] - https://gerrit.wikimedia.org/r/485976 [23:54:33] (CR) Volans: "FYI I've refactored a bit the tests in Ie8c8e9c55eaf11e5e4e4e178661fb0b284645267" (3 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/484582 (owner: Volans) [23:54:37] PROBLEM - puppet last run on analytics1057 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:56:22] ACKNOWLEDGEMENT - Check systemd state on cloudnet2002-dev is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
daniel_zahn https://phabricator.wikimedia.org/T214303 [23:57:22] ACKNOWLEDGEMENT - Memory correctable errors -EDAC- on thumbor1004 is CRITICAL: 27 ge 4 daniel_zahn https://phabricator.wikimedia.org/T207721 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor1004&var-datasource=eqiad+prometheus/ops [23:58:52] ACKNOWLEDGEMENT - High load average on cloudstore1008 is CRITICAL: (null) daniel_zahn https://phabricator.wikimedia.org/T209527 https://grafana.wikimedia.org/dashboard/db/labs-monitoring [23:58:52] ACKNOWLEDGEMENT - High load average on cloudstore1009 is CRITICAL: (null) daniel_zahn https://phabricator.wikimedia.org/T209527 https://grafana.wikimedia.org/dashboard/db/labs-monitoring