[00:08:57] RECOVERY - MariaDB Slave Lag: s4 on db2090 is OK: OK slave_sql_lag Replication lag: 58.21 seconds [00:09:29] RECOVERY - MariaDB Slave Lag: s4 on db2095 is OK: OK slave_sql_lag Replication lag: 55.54 seconds [00:09:49] RECOVERY - MariaDB Slave Lag: s4 on db2051 is OK: OK slave_sql_lag Replication lag: 52.53 seconds [00:09:51] RECOVERY - MariaDB Slave Lag: s4 on db2091 is OK: OK slave_sql_lag Replication lag: 52.38 seconds [00:09:55] RECOVERY - MariaDB Slave Lag: s4 on db2058 is OK: OK slave_sql_lag Replication lag: 50.01 seconds [00:10:01] RECOVERY - MariaDB Slave Lag: s4 on db2073 is OK: OK slave_sql_lag Replication lag: 45.17 seconds [00:10:15] RECOVERY - MariaDB Slave Lag: s4 on db2065 is OK: OK slave_sql_lag Replication lag: 41.12 seconds [00:10:15] RECOVERY - MariaDB Slave Lag: s4 on db2084 is OK: OK slave_sql_lag Replication lag: 41.24 seconds [00:31:41] PROBLEM - MariaDB Slave Lag: s8 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 672.62 seconds [00:36:35] RECOVERY - MariaDB Slave Lag: s8 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 109.95 seconds [00:55:01] 10Operations, 10Performance-Team, 10Traffic: Determine cause of upload.wikimedia.org requests routed to text-lb (404 Not Found) - https://phabricator.wikimedia.org/T207340 (10Krinkle) [01:42:19] PROBLEM - MD RAID on cp5010 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 0, Spare: 0 [01:42:21] ACKNOWLEDGEMENT - MD RAID on cp5010 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T214274 [01:42:25] 10Operations, 10ops-eqsin: Degraded RAID on cp5010 - https://phabricator.wikimedia.org/T214274 (10ops-monitoring-bot) [02:39:17] PROBLEM - MariaDB Slave Lag: s2 on db2091 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.63 seconds [02:39:19] PROBLEM - MariaDB Slave Lag: s2 on db2056 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.26 
seconds [02:39:25] PROBLEM - MariaDB Slave Lag: s2 on db2041 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.42 seconds [02:39:27] PROBLEM - MariaDB Slave Lag: s2 on db2063 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 304.09 seconds [02:39:37] PROBLEM - MariaDB Slave Lag: s2 on db2049 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 305.17 seconds [02:39:53] PROBLEM - MariaDB Slave Lag: s2 on db2035 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 308.41 seconds [02:40:01] PROBLEM - MariaDB Slave Lag: s2 on db2088 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 309.37 seconds [02:40:15] PROBLEM - MariaDB Slave Lag: s2 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 312.42 seconds [03:04:27] RECOVERY - MariaDB Slave Lag: s2 on db2049 is OK: OK slave_sql_lag Replication lag: 59.70 seconds [03:04:41] RECOVERY - MariaDB Slave Lag: s2 on db2035 is OK: OK slave_sql_lag Replication lag: 18.79 seconds [03:04:49] RECOVERY - MariaDB Slave Lag: s2 on db2088 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [03:05:03] RECOVERY - MariaDB Slave Lag: s2 on db2095 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [03:05:19] RECOVERY - MariaDB Slave Lag: s2 on db2091 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [03:05:23] RECOVERY - MariaDB Slave Lag: s2 on db2056 is OK: OK slave_sql_lag Replication lag: 0.48 seconds [03:05:29] RECOVERY - MariaDB Slave Lag: s2 on db2041 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [03:05:31] RECOVERY - MariaDB Slave Lag: s2 on db2063 is OK: OK slave_sql_lag Replication lag: 0.47 seconds [03:32:15] PROBLEM - MariaDB Slave Lag: s8 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1030.54 seconds [03:35:09] PROBLEM - MariaDB Slave Lag: s2 on db2063 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.51 seconds [03:35:17] PROBLEM - MariaDB Slave Lag: s2 on db2049 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.98 seconds [03:35:33] PROBLEM - MariaDB 
Slave Lag: s2 on db2035 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 305.10 seconds [03:35:39] PROBLEM - MariaDB Slave Lag: s2 on db2088 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 306.54 seconds [03:35:53] PROBLEM - MariaDB Slave Lag: s2 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 307.06 seconds [03:36:07] PROBLEM - MariaDB Slave Lag: s2 on db2091 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 308.94 seconds [03:36:11] PROBLEM - MariaDB Slave Lag: s2 on db2056 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 308.89 seconds [03:36:19] PROBLEM - MariaDB Slave Lag: s2 on db2041 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 309.61 seconds [03:49:31] RECOVERY - MariaDB Slave Lag: s8 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 267.27 seconds [04:08:33] PROBLEM - MariaDB Slave Lag: s4 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 601.87 seconds [04:46:37] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [05:20:57] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [05:34:31] PROBLEM - MariaDB Slave Lag: s4 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 664.70 seconds [06:06:13] ACKNOWLEDGEMENT - Memory correctable errors -EDAC- on db1068 is CRITICAL: 7.002 ge 4 Marostegui known - The acknowledgement expires at: 2019-01-25 06:05:36. 
https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=db1068&var-datasource=eqiad+prometheus/ops [06:08:59] !log tag_summary table from s8 - T212255 [06:09:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:09:02] T212255: Drop tag_summary table - https://phabricator.wikimedia.org/T212255 [06:12:26] !log Drop tag_summary table from s3 codfw - T212255 [06:12:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:15:13] RECOVERY - MariaDB Slave Lag: s4 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 296.79 seconds [06:27:39] !log Drop tag_summary table from dbstore1002:s3 - T212255 [06:27:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:27:42] T212255: Drop tag_summary table - https://phabricator.wikimedia.org/T212255 [06:28:11] PROBLEM - Check systemd state on netmon2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [06:28:39] running puppet --^ [06:31:53] RECOVERY - Check systemd state on netmon2001 is OK: OK - running: The system is fully operational [06:32:35] !log Drop tag_summary table from db1095:3313 - T212255 [06:32:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:33:03] PROBLEM - puppet last run on dbmonitor2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/apache2/conf-available/00-defaults.conf] [06:33:05] PROBLEM - puppet last run on cp1090 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/varnishmtail-backend/varnishbackend.mtail] [06:42:34] 10Operations, 10ops-eqsin: Degraded RAID on cp5010 - https://phabricator.wikimedia.org/T214274 (10Vgutierrez) kern.log is reporting multiple failures in /dev/sdb3 as well ` Jan 21 06:41:47 cp5010 kernel: [7490330.204759] EXT4-fs error (device sdb3) in ext4_reserve_inode_write:5448: IO failure Jan 21 06:41:48 c... [06:45:01] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5010.eqsin.wmnet [06:45:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:47:43] !log Drop tag_summary table from db1023, db1077, db1075 and db1078 T212255 [06:47:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:47:48] T212255: Drop tag_summary table - https://phabricator.wikimedia.org/T212255 [06:51:07] (03PS1) 10Marostegui: db-eqiad.php: Depool db1123 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485580 (https://phabricator.wikimedia.org/T85757) [06:52:15] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1123 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485580 (https://phabricator.wikimedia.org/T85757) (owner: 10Marostegui) [06:52:19] RECOVERY - MariaDB Slave Lag: s2 on db2091 is OK: OK slave_sql_lag Replication lag: 50.74 seconds [06:52:25] RECOVERY - MariaDB Slave Lag: s2 on db2056 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [06:52:35] RECOVERY - MariaDB Slave Lag: s2 on db2041 is OK: OK slave_sql_lag Replication lag: 0.44 seconds [06:52:37] RECOVERY - MariaDB Slave Lag: s2 on db2063 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [06:52:49] RECOVERY - MariaDB Slave Lag: s2 on db2049 is OK: OK slave_sql_lag Replication lag: 0.18 seconds [06:53:03] RECOVERY - MariaDB Slave Lag: s2 on db2088 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [06:53:05] RECOVERY - MariaDB Slave Lag: s2 on db2035 is OK: OK slave_sql_lag Replication lag: 0.17 seconds [06:53:17] 
RECOVERY - MariaDB Slave Lag: s2 on db2095 is OK: OK slave_sql_lag Replication lag: 0.45 seconds [06:53:21] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1123 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485580 (https://phabricator.wikimedia.org/T85757) (owner: 10Marostegui) [06:54:30] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1123 - T85757 (duration: 00m 50s) [06:54:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:54:33] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [06:54:35] !log Deploy schema change on db1123 - T85757 [06:54:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:56:23] PROBLEM - Check systemd state on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused [06:56:29] PROBLEM - Disk space on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused [06:56:33] PROBLEM - dhclient process on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused [06:56:46] (03PS1) 10Marostegui: db-eqiad.php: Depool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485581 (https://phabricator.wikimedia.org/T210478) [06:56:57] PROBLEM - configured eth on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused [06:56:57] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1123 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485580 (https://phabricator.wikimedia.org/T85757) (owner: 10Marostegui) [06:57:03] PROBLEM - DPKG on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused [06:57:05] PROBLEM - MD RAID on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused [06:57:11] checking --^ [06:57:31] PROBLEM - puppet last run on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused 
[06:58:01] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485581 (https://phabricator.wikimedia.org/T210478) (owner: 10Marostegui) [06:59:02] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485581 (https://phabricator.wikimedia.org/T210478) (owner: 10Marostegui) [06:59:09] RECOVERY - puppet last run on dbmonitor2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:59:11] RECOVERY - puppet last run on cp1090 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:00:05] RECOVERY - Check systemd state on notebook1003 is OK: OK - running: The system is fully operational [07:00:11] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1089 T210478 (duration: 00m 47s) [07:00:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:15] T210478: Migrate dbstore1002 to a multi instance setup on dbstore100[3-5] - https://phabricator.wikimedia.org/T210478 [07:00:15] RECOVERY - dhclient process on notebook1003 is OK: PROCS OK: 0 processes with command name dhclient [07:00:23] 10Operations, 10ops-eqsin, 10Traffic: Degraded RAID on cp5010 - https://phabricator.wikimedia.org/T214274 (10Vgutierrez) initial failure at 01:39: ` vgutierrez@cp5010:~$ grep sdb /var/log/kern.log |grep -v "__ext4_get_inode_loc" |grep -v "IO failure" Jan 21 01:39:17 cp5010 kernel: [7472180.491194] blk_update... 
[07:00:31] !log Stop MySQL on db1089 to clone dbstore1003 - T210478 [07:00:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:41] RECOVERY - configured eth on notebook1003 is OK: OK - interfaces up [07:00:47] RECOVERY - DPKG on notebook1003 is OK: All packages OK [07:00:49] RECOVERY - MD RAID on notebook1003 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [07:05:05] PROBLEM - Check systemd state on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused [07:05:15] PROBLEM - dhclient process on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused [07:05:39] PROBLEM - configured eth on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused [07:05:45] PROBLEM - DPKG on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused [07:05:49] PROBLEM - MD RAID on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused [07:08:01] RECOVERY - puppet last run on notebook1003 is OK: OK: Puppet is currently enabled, last run 21 minutes ago with 0 failures [07:08:07] RECOVERY - configured eth on notebook1003 is OK: OK - interfaces up [07:08:13] RECOVERY - DPKG on notebook1003 is OK: All packages OK [07:08:15] RECOVERY - MD RAID on notebook1003 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [07:08:47] RECOVERY - Check systemd state on notebook1003 is OK: OK - running: The system is fully operational [07:08:57] RECOVERY - dhclient process on notebook1003 is OK: PROCS OK: 0 processes with command name dhclient [07:10:15] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485581 (https://phabricator.wikimedia.org/T210478) (owner: 10Marostegui) [07:28:21] (03PS1) 10Marostegui: InitialiseSettings.php: Increase parsercache TTL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485582 (https://phabricator.wikimedia.org/T210992) 
[07:29:59] (03PS1) 10Marostegui: parsercachepurging: Increase keys TTL [puppet] - 10https://gerrit.wikimedia.org/r/485583 (https://phabricator.wikimedia.org/T210992) [07:32:11] 10Operations, 10DBA, 10Performance-Team, 10Patch-For-Review: Increase parsercache keys TTL from 22 days back to 30 days - https://phabricator.wikimedia.org/T210992 (10Marostegui) @aaron @Joe @jcrespo I have made the first small change, to go from 22 to 24 days: https://gerrit.wikimedia.org/r/#/c/operati... [07:36:43] (03PS4) 10Mathew.onipe: wdqs: convert prom exporter script tp py3 [puppet] - 10https://gerrit.wikimedia.org/r/484974 (https://phabricator.wikimedia.org/T213305) [07:36:57] (03CR) 10Mathew.onipe: wdqs: convert prom exporter script tp py3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/484974 (https://phabricator.wikimedia.org/T213305) (owner: 10Mathew.onipe) [07:39:39] !log Stop replication on db1124:3313 to fix triggers - T85757 [07:39:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:39:42] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [07:40:18] 10Operations, 10MediaWiki-Cache, 10User-Elukey: Consider removing the last traces of nutcracker in Mediawiki configs - https://phabricator.wikimedia.org/T214275 (10elukey) p:05Triage→03Normal [07:54:37] (03PS1) 10Mathew.onipe: maps: migrate maps1002 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/485584 (https://phabricator.wikimedia.org/T198622) [07:55:07] (03CR) 10jerkins-bot: [V: 04-1] maps: migrate maps1002 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/485584 (https://phabricator.wikimedia.org/T198622) (owner: 10Mathew.onipe) [07:56:34] (03PS2) 10Mathew.onipe: maps: migrate maps1002 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/485584 (https://phabricator.wikimedia.org/T198622) [08:10:19] !log installing OpenSSL security updates [08:10:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[08:20:48] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1123" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485587 [08:20:50] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1123" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485587 [08:21:58] (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Depool db1123" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485587 (owner: 10Marostegui) [08:23:03] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1123" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485587 (owner: 10Marostegui) [08:24:03] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1123 - T85757 (duration: 00m 48s) [08:24:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:06] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [08:25:27] 10Operations, 10ops-eqiad, 10RESTBase, 10RESTBase-Cassandra, and 3 others: Memory error on restbase1016 - https://phabricator.wikimedia.org/T212418 (10fgiunchedi) a:05Cmjohnson→03None >>! In T212418#4893779, @mobrovac wrote: > All of the instances have joined the ring (thnx @fgiunchedi!) and the latest... 
[08:27:26] (03PS1) 10Marostegui: db-eqiad.php: Depool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485588 (https://phabricator.wikimedia.org/T85757) [08:29:19] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485588 (https://phabricator.wikimedia.org/T85757) (owner: 10Marostegui) [08:30:24] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485588 (https://phabricator.wikimedia.org/T85757) (owner: 10Marostegui) [08:31:59] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1077 - T85757 (duration: 00m 46s) [08:32:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:03] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [08:35:51] !log Stop replication db1077 to deploy schema change - T85757 [08:35:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:57] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1123" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485587 (owner: 10Marostegui) [08:35:59] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485588 (https://phabricator.wikimedia.org/T85757) (owner: 10Marostegui) [08:45:50] (03PS4) 10Elukey: profile::reportupdater::jobs::hadoop: move jobs to systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/485210 (https://phabricator.wikimedia.org/T172532) [08:48:42] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/14402/" [puppet] - 10https://gerrit.wikimedia.org/r/485210 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [08:54:35] PROBLEM - puppet last run on stat1007 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 1 minute ago with 2 failures. 
Failed resources (up to 3 shown): Service[reportupdater-browser.timer],Service[reportupdater-interlanguage.timer] [08:56:09] this is me --^ [08:56:20] (03PS1) 10Elukey: reportupdater::job: use absolute paths in timer's definition [puppet] - 10https://gerrit.wikimedia.org/r/485591 (https://phabricator.wikimedia.org/T172532) [08:57:04] (03CR) 10Elukey: [C: 03+2] reportupdater::job: use absolute paths in timer's definition [puppet] - 10https://gerrit.wikimedia.org/r/485591 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [08:59:32] jouncebot: next [08:59:32] In 18 hour(s) and 0 minute(s): ContentTranslation Draft Purge Script Run (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190122T0300) [08:59:47] RECOVERY - puppet last run on stat1007 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [09:00:07] aah, monday is a no deploy day, heh [09:01:25] * addshore will be backporting a one-line fix for ArticlePlaceholder which has been broken since last week :( [09:01:30] once it is merged on the branch [09:03:07] PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:06:07] (03PS1) 10Vgutierrez: certcentral: Implement staging time [software/certcentral] - 10https://gerrit.wikimedia.org/r/485594 (https://phabricator.wikimedia.org/T213737) [09:06:25] PROBLEM - DPKG on cumin2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [09:07:55] (03CR) 10jerkins-bot: [V: 04-1] certcentral: Implement staging time [software/certcentral] - 10https://gerrit.wikimedia.org/r/485594 (https://phabricator.wikimedia.org/T213737) (owner: 10Vgutierrez) [09:15:29] PROBLEM - puppet last run on cumin2001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Package[tshark],Exec[set debconf flag seen for wireshark-common/install-setuid] [09:15:48] (03PS2) 10Vgutierrez: certcentral: Implement staging time [software/certcentral] - 10https://gerrit.wikimedia.org/r/485594 (https://phabricator.wikimedia.org/T213737) [09:17:34] (03CR) 10jerkins-bot: [V: 04-1] certcentral: Implement staging time [software/certcentral] - 10https://gerrit.wikimedia.org/r/485594 (https://phabricator.wikimedia.org/T213737) (owner: 10Vgutierrez) [09:19:10] (03CR) 10Filippo Giunchedi: [C: 03+1] Remove Diamond from Kafka hosts [puppet] - 10https://gerrit.wikimedia.org/r/485004 (https://phabricator.wikimedia.org/T212231) (owner: 10Muehlenhoff) [09:27:15] RECOVERY - DPKG on cumin2001 is OK: All packages OK [09:30:06] !log Compress a few tables on dbstore1003:3315 - T210478 [09:30:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:10] T210478: Migrate dbstore1002 to a multi instance setup on dbstore100[3-5] - https://phabricator.wikimedia.org/T210478 [09:31:57] 10Operations, 10DC-Ops, 10Discovery: Memory correctable errors -EDAC- elastic1029 - https://phabricator.wikimedia.org/T214283 (10Mathew.onipe) [09:33:14] (03PS2) 10Filippo Giunchedi: prometheus: update prometheus-labs-targets to use keystone/nova clients [puppet] - 10https://gerrit.wikimedia.org/r/485193 (https://phabricator.wikimedia.org/T214058) [09:34:31] 10Operations, 10ops-eqiad, 10DC-Ops, 10Discovery: Memory correctable errors -EDAC- elastic1029 - https://phabricator.wikimedia.org/T214283 (10Mathew.onipe) [09:34:39] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational [09:35:45] (03PS1) 10Marostegui: db-eqiad.php: Slowly repool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485607 [09:39:52] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Slowly repool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485607 (owner: 10Marostegui) [09:40:07] (03PS1) 
10DCausse: [cirrus] autocomplete: enable subphrase matching for wikitech and mw.org (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485608 (https://phabricator.wikimedia.org/T212788) [09:40:10] (03PS1) 10DCausse: [cirrus] autocomplete: enable subphrase matching for wikitech and mw.org (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485609 (https://phabricator.wikimedia.org/T212788) [09:41:08] (03Merged) 10jenkins-bot: db-eqiad.php: Slowly repool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485607 (owner: 10Marostegui) [09:42:10] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly Repool db1089 T210478 (duration: 00m 45s) [09:42:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:13] T210478: Migrate dbstore1002 to a multi instance setup on dbstore100[3-5] - https://phabricator.wikimedia.org/T210478 [09:44:14] (03PS3) 10Vgutierrez: certcentral: Implement staging time [software/certcentral] - 10https://gerrit.wikimedia.org/r/485594 (https://phabricator.wikimedia.org/T213737) [09:44:46] (03CR) 10jenkins-bot: db-eqiad.php: Slowly repool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485607 (owner: 10Marostegui) [09:44:50] (03CR) 10Filippo Giunchedi: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/485193 (https://phabricator.wikimedia.org/T214058) (owner: 10Filippo Giunchedi) [09:45:58] (03CR) 10jerkins-bot: [V: 04-1] certcentral: Implement staging time [software/certcentral] - 10https://gerrit.wikimedia.org/r/485594 (https://phabricator.wikimedia.org/T213737) (owner: 10Vgutierrez) [09:46:13] RECOVERY - puppet last run on cumin2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:48:26] (03CR) 10Arturo Borrero Gonzalez: "> > Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/485193 (https://phabricator.wikimedia.org/T214058) (owner: 10Filippo Giunchedi) [09:52:01] (03PS1) 10Arturo Borrero Gonzalez: 
labtestneutron2001: cleanup [dns] - 10https://gerrit.wikimedia.org/r/485613 (https://phabricator.wikimedia.org/T214167) [09:52:21] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] labtestneutron2001: cleanup [dns] - 10https://gerrit.wikimedia.org/r/485613 (https://phabricator.wikimedia.org/T214167) (owner: 10Arturo Borrero Gonzalez) [09:53:52] (03PS1) 10Marostegui: db-eqiad.php: Increase traffic for db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485614 [09:55:09] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Increase traffic for db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485614 (owner: 10Marostegui) [09:56:14] (03Merged) 10jenkins-bot: db-eqiad.php: Increase traffic for db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485614 (owner: 10Marostegui) [09:57:16] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Give more traffic to db1089 (duration: 00m 45s) [09:57:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:09] (03CR) 10jenkins-bot: db-eqiad.php: Increase traffic for db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485614 (owner: 10Marostegui) [10:00:48] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485615 [10:01:31] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485615 [10:01:32] 10Operations, 10ops-codfw, 10cloud-services-team (Kanban): labstore2004 - memory error on DIMM A2 - https://phabricator.wikimedia.org/T214262 (10faidon) This is a super old server; it just crossed its 7-year mark (we typically refresh servers at 4.5-5 years), so we're way past its warranty and shelf life and... 
[10:15:45] (03PS1) 10Marostegui: db-eqiad.php: Fully repool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485616 [10:17:04] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Fully repool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485616 (owner: 10Marostegui) [10:18:09] (03Merged) 10jenkins-bot: db-eqiad.php: Fully repool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485616 (owner: 10Marostegui) [10:19:06] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Fully repool db1089 (duration: 00m 45s) [10:19:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:05] marostegui: o/ I have a backport to deploy to fix an UBN :) just want to make sure we don't step on each other [10:21:17] addshore: go for it! [10:21:23] marostegui: thanks, will ping when done too! [10:21:30] excellent! [10:23:45] (03PS1) 10Jcrespo: mariadb: Promote db2047 to master on configuration management [puppet] - 10https://gerrit.wikimedia.org/r/485617 (https://phabricator.wikimedia.org/T214264) [10:25:01] (03CR) 10jenkins-bot: db-eqiad.php: Fully repool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485616 (owner: 10Marostegui) [10:25:21] syncing [10:25:40] (03PS2) 10Jcrespo: mariadb: Promote db2047 to master on configuration management [puppet] - 10https://gerrit.wikimedia.org/r/485617 (https://phabricator.wikimedia.org/T214264) [10:25:43] (03CR) 10Marostegui: "Are you doing the db2047.yaml file in a separate commit?" 
[puppet] - 10https://gerrit.wikimedia.org/r/485617 (https://phabricator.wikimedia.org/T214264) (owner: 10Jcrespo) [10:25:57] hehe [10:26:03] !log addshore@deploy1001 Synchronized php-1.33.0-wmf.13/extensions/ArticlePlaceholder/includes/AboutTopicRenderer.php: T213739 Pass a usageAccumulator to SidebarGenerator (duration: 00m 47s) [10:26:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:06] T213739: AboutTopicRenderer, OtherProjectsSidebarGeneratorFactory::getOtherProjectsSidebarGenerator() must be an instance of UsageAccumulator, undefined variable given - https://phabricator.wikimedia.org/T213739 [10:26:12] marostegui: all done! [10:26:16] (03CR) 10Jcrespo: "^" [puppet] - 10https://gerrit.wikimedia.org/r/485617 (https://phabricator.wikimedia.org/T214264) (owner: 10Jcrespo) [10:26:17] addshore: thanks! [10:26:32] (03CR) 10Marostegui: [C: 03+1] mariadb: Promote db2047 to master on configuration management [puppet] - 10https://gerrit.wikimedia.org/r/485617 (https://phabricator.wikimedia.org/T214264) (owner: 10Jcrespo) [10:27:37] PROBLEM - HHVM rendering on mwdebug1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:28:43] RECOVERY - HHVM rendering on mwdebug1002 is OK: HTTP OK: HTTP/1.1 200 OK - 79912 bytes in 0.210 second response time [10:30:58] (03CR) 10Gehel: [C: 04-1] wdqs: convert prom exporter script tp py3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/484974 (https://phabricator.wikimedia.org/T213305) (owner: 10Mathew.onipe) [10:32:46] (03PS3) 10Marostegui: Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485615 [10:32:48] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, 10Patch-For-Review: improve docker registry architecture - https://phabricator.wikimedia.org/T209271 (10fselles) [10:32:51] 10Operations, 10monitoring, 10Kubernetes: debianize docker-registry 2.7.0-rc0 and upload in stretch-wikimedia - https://phabricator.wikimedia.org/T210071 
(10fselles) 05Open→03Resolved [10:33:24] !log upgrade and restart db2047 T214264 [10:33:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:27] T214264: BBU issues on codfw - https://phabricator.wikimedia.org/T214264 [10:34:01] (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485615 (owner: 10Marostegui) [10:35:12] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485615 (owner: 10Marostegui) [10:35:23] (03PS37) 10Elukey: admin: allow users to be deployed without ssh keys configured [puppet] - 10https://gerrit.wikimedia.org/r/482275 (https://phabricator.wikimedia.org/T212949) [10:36:13] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1077 - T85757 (duration: 00m 44s) [10:36:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:17] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [10:38:30] (03CR) 10Gehel: [C: 04-1] "There is also a missing change to DHCP configuration to migrate maps1002 to stretch." 
(034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/485584 (https://phabricator.wikimedia.org/T198622) (owner: 10Mathew.onipe) [10:38:49] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485615 (owner: 10Marostegui) [10:42:25] RECOVERY - Disk space on notebook1003 is OK: DISK OK [10:43:17] (forced the remount) [10:51:32] !log disable puppet fleetwide to ease the merge/deploy of a puppet admin module change - T212949 [10:51:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:35] T212949: Allow the deployment of users without SSH access - https://phabricator.wikimedia.org/T212949 [10:52:01] RECOVERY - Host labstore2004 is UP: PING OK - Packet loss = 0%, RTA = 36.16 ms [10:55:02] 10Operations, 10ops-codfw, 10cloud-services-team (Kanban): labstore2004 - memory error on DIMM A2 - https://phabricator.wikimedia.org/T214262 (10GTirloni) @faidon thanks! After another reboot the system was able to get past that error and boot successfully. Just for reference, it's DDR3 1333MHz memory type. [10:55:08] 10Operations, 10ops-codfw, 10cloud-services-team (Kanban): labstore2004 - memory error on DIMM A2 - https://phabricator.wikimedia.org/T214262 (10GTirloni) 05Open→03Resolved p:05Triage→03Low [10:56:32] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Make swift containers for docker registry cross replicated. 
- https://phabricator.wikimedia.org/T214289 (10fselles) p:05Triage→03Normal [11:02:43] (03PS1) 10Filippo Giunchedi: prometheus: move ::scripts to ::wmcs_scripts [puppet] - 10https://gerrit.wikimedia.org/r/485620 (https://phabricator.wikimedia.org/T214058) [11:02:45] (03PS1) 10Filippo Giunchedi: prometheus: fetch WMCS targets from all regions [puppet] - 10https://gerrit.wikimedia.org/r/485621 (https://phabricator.wikimedia.org/T214058) [11:02:54] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Make swift containers for docker registry cross replicated. - https://phabricator.wikimedia.org/T214289 (10fselles) I've replicated this using a local SAIO setup and it seems to work, however obviously we are avoiding network latency here hence t... [11:03:43] (03CR) 10Elukey: [C: 03+2] admin: allow users to be deployed without ssh keys configured [puppet] - 10https://gerrit.wikimedia.org/r/482275 (https://phabricator.wikimedia.org/T212949) (owner: 10Elukey) [11:04:54] all right merged my change, running puppet now [11:05:06] (on a few hosts to verify that it is a no op) [11:05:25] if anybody wants to verify their area of competence it would help a ton :) [11:05:30] 10Operations, 10Thumbor, 10serviceops, 10Patch-For-Review, and 3 others: Upgrade Thumbor servers to Stretch - https://phabricator.wikimedia.org/T170817 (10Gilles) [11:07:52] so far all good, ran puppet on some analytics nodes, didn't see anything strange [11:12:17] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "I didn't tested the code. My eyeball didn't detect any important issue, but you should make sure every code branch works as expected. 
Prob" [puppet] - 10https://gerrit.wikimedia.org/r/485621 (https://phabricator.wikimedia.org/T214058) (owner: 10Filippo Giunchedi) [11:13:13] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "Thanks for not introducing the 'labs' keyword again :-)" [puppet] - 10https://gerrit.wikimedia.org/r/485620 (https://phabricator.wikimedia.org/T214058) (owner: 10Filippo Giunchedi) [11:13:34] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] prometheus: update prometheus-labs-targets to use keystone/nova clients [puppet] - 10https://gerrit.wikimedia.org/r/485193 (https://phabricator.wikimedia.org/T214058) (owner: 10Filippo Giunchedi) [11:13:37] (03PS4) 10Giuseppe Lavagetto: profile::services_proxy: simple local proxying for remote services [puppet] - 10https://gerrit.wikimedia.org/r/483788 (https://phabricator.wikimedia.org/T210717) [11:14:58] (03PS1) 10Filippo Giunchedi: prometheus: use .format instead of % in prometheus-labs-targets [puppet] - 10https://gerrit.wikimedia.org/r/485623 (https://phabricator.wikimedia.org/T214058) [11:19:18] (03PS3) 10Filippo Giunchedi: prometheus: update prometheus-labs-targets to use keystone/nova clients [puppet] - 10https://gerrit.wikimedia.org/r/485193 (https://phabricator.wikimedia.org/T214058) [11:19:20] (03PS2) 10Filippo Giunchedi: prometheus: move ::scripts to ::wmcs_scripts [puppet] - 10https://gerrit.wikimedia.org/r/485620 (https://phabricator.wikimedia.org/T214058) [11:19:22] (03PS2) 10Filippo Giunchedi: prometheus: fetch WMCS targets from all regions [puppet] - 10https://gerrit.wikimedia.org/r/485621 (https://phabricator.wikimedia.org/T214058) [11:19:24] (03PS2) 10Filippo Giunchedi: prometheus: use .format instead of % in prometheus-labs-targets [puppet] - 10https://gerrit.wikimedia.org/r/485623 (https://phabricator.wikimedia.org/T214058) [11:20:53] (03PS4) 10Giuseppe Lavagetto: mediawiki::common: add proxy for services [puppet] - 10https://gerrit.wikimedia.org/r/483789 (https://phabricator.wikimedia.org/T210717) [11:25:10] !log depool maps1003 
to fix replication lag issues [11:25:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:10] (03PS1) 10Hashar: doc: minor tweaks [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/485626 [11:31:12] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: update prometheus-labs-targets to use keystone/nova clients [puppet] - 10https://gerrit.wikimedia.org/r/485193 (https://phabricator.wikimedia.org/T214058) (owner: 10Filippo Giunchedi) [11:31:26] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: move ::scripts to ::wmcs_scripts [puppet] - 10https://gerrit.wikimedia.org/r/485620 (https://phabricator.wikimedia.org/T214058) (owner: 10Filippo Giunchedi) [11:31:36] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: fetch WMCS targets from all regions [puppet] - 10https://gerrit.wikimedia.org/r/485621 (https://phabricator.wikimedia.org/T214058) (owner: 10Filippo Giunchedi) [11:31:38] (03PS5) 10Giuseppe Lavagetto: profile::services_proxy: simple local proxying for remote services [puppet] - 10https://gerrit.wikimedia.org/r/483788 (https://phabricator.wikimedia.org/T210717) [11:31:40] (03PS5) 10Giuseppe Lavagetto: mediawiki::common: add proxy for services [puppet] - 10https://gerrit.wikimedia.org/r/483789 (https://phabricator.wikimedia.org/T210717) [11:31:43] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: use .format instead of % in prometheus-labs-targets [puppet] - 10https://gerrit.wikimedia.org/r/485623 (https://phabricator.wikimedia.org/T214058) (owner: 10Filippo Giunchedi) [11:32:06] (03PS1) 10Lucas Werkmeister (WMDE): Decrease WBQualityConstraintsTypeCheckMaxEntities [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485628 (https://phabricator.wikimedia.org/T209504) [11:36:06] (03PS6) 10Giuseppe Lavagetto: profile::services_proxy: simple local proxying for remote services [puppet] - 10https://gerrit.wikimedia.org/r/483788 (https://phabricator.wikimedia.org/T210717) [11:36:08] (03PS6) 10Giuseppe Lavagetto: mediawiki::common: 
add proxy for services [puppet] - 10https://gerrit.wikimedia.org/r/483789 (https://phabricator.wikimedia.org/T210717) [11:50:41] (03PS7) 10Giuseppe Lavagetto: profile::services_proxy: simple local proxying for remote services [puppet] - 10https://gerrit.wikimedia.org/r/483788 (https://phabricator.wikimedia.org/T210717) [11:50:43] (03PS7) 10Giuseppe Lavagetto: mediawiki::common: add proxy for services [puppet] - 10https://gerrit.wikimedia.org/r/483789 (https://phabricator.wikimedia.org/T210717) [11:54:25] (03PS1) 10Filippo Giunchedi: prometheus: install wmcs_scripts dependencies from jessie-backports [puppet] - 10https://gerrit.wikimedia.org/r/485635 (https://phabricator.wikimedia.org/T214058) [11:56:56] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: install wmcs_scripts dependencies from jessie-backports [puppet] - 10https://gerrit.wikimedia.org/r/485635 (https://phabricator.wikimedia.org/T214058) (owner: 10Filippo Giunchedi) [12:01:05] (03PS4) 10Elukey: role::analytics_cluster::hadoop: add groups without ssh access [puppet] - 10https://gerrit.wikimedia.org/r/484165 (https://phabricator.wikimedia.org/T212949) [12:05:48] mobrovac, akosiaris: looks like the update on thurs got rid of the worst of the memory spikes https://grafana.wikimedia.org/d/000000620/xxxx-zotero-debugging-kubernetes?orgId=1&from=now-7d&to=now [12:06:33] mvolz: it does look like it indeed. Nice! [12:07:10] how are we on other stuff, i.e. segfaults? I'm not sure how to look for those? 
[12:08:21] no pod restarts, that is good also :) [12:19:08] (03CR) 10Hashar: Add a prune action (036 comments) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/485499 (https://phabricator.wikimedia.org/T207703) (owner: 10Giuseppe Lavagetto) [12:28:45] (03CR) 10Alexandros Kosiaris: [C: 04-1] "minor nitpick, rest LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/485099 (owner: 10Dzahn) [12:31:11] 10Operations, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Allow the deployment of users without SSH access - https://phabricator.wikimedia.org/T212949 (10elukey) [12:31:48] (03PS2) 10Alexandros Kosiaris: package_builder: Remove support for trusty [puppet] - 10https://gerrit.wikimedia.org/r/485191 (owner: 10Muehlenhoff) [12:31:53] (03CR) 10Alexandros Kosiaris: [C: 03+2] package_builder: Remove support for trusty [puppet] - 10https://gerrit.wikimedia.org/r/485191 (owner: 10Muehlenhoff) [12:32:22] (03CR) 10Effie Mouzeli: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/14407/mc1025.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/485172 (https://phabricator.wikimedia.org/T208844) (owner: 10Effie Mouzeli) [12:32:40] (03PS2) 10Effie Mouzeli: Apply -R 200 to memcached on mc1025 [puppet] - 10https://gerrit.wikimedia.org/r/485172 (https://phabricator.wikimedia.org/T208844) [12:32:43] (03PS1) 10Elukey: Remove unnecessary SSH keys from Hadoop masters (testing cluster) [puppet] - 10https://gerrit.wikimedia.org/r/485640 (https://phabricator.wikimedia.org/T212949) [12:33:31] (03PS2) 10Elukey: Remove unnecessary SSH keys from Hadoop masters (testing cluster) [puppet] - 10https://gerrit.wikimedia.org/r/485640 (https://phabricator.wikimedia.org/T212949) [12:34:58] (03PS3) 10Effie Mouzeli: Apply -R 200 to memcached on mc1025 [puppet] - 10https://gerrit.wikimedia.org/r/485172 (https://phabricator.wikimedia.org/T208844) [12:35:57] (03CR) 10Elukey: [C: 03+1] Apply -R 200 to memcached on mc1025 [puppet] - 
10https://gerrit.wikimedia.org/r/485172 (https://phabricator.wikimedia.org/T208844) (owner: 10Effie Mouzeli) [12:36:05] !log Restarting memcached on mc1025 to apply '-R 200' - T208844 [12:36:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:08] T208844: Apply -R 200 to all the memcached mw object cache instances running in eqiad/codfw - https://phabricator.wikimedia.org/T208844 [12:36:17] (03CR) 10Elukey: [C: 03+2] Remove unnecessary SSH keys from Hadoop masters (testing cluster) [puppet] - 10https://gerrit.wikimedia.org/r/485640 (https://phabricator.wikimedia.org/T212949) (owner: 10Elukey) [12:36:24] (03PS3) 10Elukey: Remove unnecessary SSH keys from Hadoop masters (testing cluster) [puppet] - 10https://gerrit.wikimedia.org/r/485640 (https://phabricator.wikimedia.org/T212949) [12:36:27] (03CR) 10Elukey: [V: 03+2 C: 03+2] Remove unnecessary SSH keys from Hadoop masters (testing cluster) [puppet] - 10https://gerrit.wikimedia.org/r/485640 (https://phabricator.wikimedia.org/T212949) (owner: 10Elukey) [12:37:44] 10Operations, 10MediaWiki-Cache, 10serviceops, 10Patch-For-Review, and 3 others: Apply -R 200 to all the memcached mw object cache instances running in eqiad/codfw - https://phabricator.wikimedia.org/T208844 (10jijiki) [12:42:08] (03PS5) 10Elukey: role::analytics_cluster::hadoop: add groups without ssh access [puppet] - 10https://gerrit.wikimedia.org/r/484165 (https://phabricator.wikimedia.org/T212949) [12:42:42] (03CR) 10Elukey: [C: 03+2] role::analytics_cluster::hadoop: add groups without ssh access [puppet] - 10https://gerrit.wikimedia.org/r/484165 (https://phabricator.wikimedia.org/T212949) (owner: 10Elukey) [12:42:50] (03PS3) 10Daimona Eaytoy: Enable $wgAbuseFilterRuntimeProfile on every wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423945 (https://phabricator.wikimedia.org/T191039) [12:43:01] (03CR) 10Daimona Eaytoy: [C: 04-1] "Per commit message." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/423945 (https://phabricator.wikimedia.org/T191039) (owner: 10Daimona Eaytoy) [12:55:24] (03PS7) 10Daimona Eaytoy: Enable $wgAbuseFilterProfile on every wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423660 (https://phabricator.wikimedia.org/T191039) [12:57:23] (03PS1) 10Jbond: Reimage analytics1001 to stretch (as an exercise) [puppet] - 10https://gerrit.wikimedia.org/r/485647 (https://phabricator.wikimedia.org/T214294) [13:11:40] marostegui: Can you check https://phabricator.wikimedia.org/P8014 [13:11:52] marostegui: related to, https://phabricator.wikimedia.org/T203059 [13:12:08] marostegui: also, it runs fine now, but those errors happened in anwiki. [13:13:46] OK. Not specific it seems. [13:16:08] (03CR) 10Muehlenhoff: [C: 03+1] Reimage analytics1001 to stretch (as an exercise) [puppet] - 10https://gerrit.wikimedia.org/r/485647 (https://phabricator.wikimedia.org/T214294) (owner: 10Jbond) [13:16:29] (03CR) 10Jbond: [C: 03+2] Reimage analytics1001 to stretch (as an exercise) [puppet] - 10https://gerrit.wikimedia.org/r/485647 (https://phabricator.wikimedia.org/T214294) (owner: 10Jbond) [13:25:59] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [13:27:11] there was a spike at 13:17 [13:28:16] however, I don't see anything on the logs [13:29:41] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [13:30:56] jynus: requesting your comment at, https://phabricator.wikimedia.org/P8014 - when you've time. 
[13:32:07] kart_: talk to AaronSchulz I already said to him I think our gtid implementation is not correct [13:32:43] Nikerabbit: ^ [13:32:52] jynus: Thanks. [13:36:50] PHP Fatal Error from line 131 of /srv/mediawiki/php-1.33.0-wmf.13/extensions/TemplateData/includes/api/ApiTemplateData.php: Argument 1 passed to ApiResult::setIndexedTagName() must be an instance of array, null given [13:37:02] ^this may be the issue (of current high mediawiki fatals) [13:38:51] that's https://phabricator.wikimedia.org/T213953 [13:42:19] kart_: one quick solution would be to GTID wait only on --, but that is not for me to decide [13:43:14] That's Nikerabbit :) [13:43:44] oh, sorry [13:44:12] but you are KartikMistry on phab, right? [13:45:27] so it is more of a "there wasn't any good solution and the impact was low at the time, so the decision was postponed" [13:47:35] Nikerabbit: there is some context at https://phabricator.wikimedia.org/T172497#4309959 but it gets mixed with other issues, so it is not really a ticket about the issue as much as a brainstorming mixing some architectural problems [13:48:03] (03PS1) 10Elukey: reportupdate: move all jobs to systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/485656 (https://phabricator.wikimedia.org/T172532) [13:48:22] (03PS1) 10Arturo Borrero Gonzalez: cloudnet2001-dev: hiera cleanup for stretch/mitaka [puppet] - 10https://gerrit.wikimedia.org/r/485657 (https://phabricator.wikimedia.org/T214299) [13:48:45] (03CR) 10Muehlenhoff: package_builder: add data types to parameters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/485099 (owner: 10Dzahn) [13:48:51] (03CR) 10Muehlenhoff: [C: 04-1] package_builder: add data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/485099 (owner: 10Dzahn) [13:49:20] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudnet2001-dev: hiera cleanup for stretch/mitaka [puppet] - 10https://gerrit.wikimedia.org/r/485657 (https://phabricator.wikimedia.org/T214299) (owner: 10Arturo 
Borrero Gonzalez) [13:50:34] jynus: is there something that triggers this "issue" (which we could try changing) or is it just "tough luck"? [13:51:11] (03PS2) 10Elukey: reportupdater: move all jobs to systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/485656 (https://phabricator.wikimedia.org/T172532) [13:51:19] for example: calling waitForReplication without performing writes first [13:51:23] certainly a master switch can make it more prevalent [13:51:34] which happened as an emergency last week [13:51:40] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/14408/" [puppet] - 10https://gerrit.wikimedia.org/r/485656 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [13:51:48] (03PS3) 10Elukey: reportupdater: move all jobs to systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/485656 (https://phabricator.wikimedia.org/T172532) [13:51:51] (03CR) 10Elukey: [V: 03+2 C: 03+2] reportupdater: move all jobs to systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/485656 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [13:51:55] but in reality it should happen all the time, the surprising part it works at times [13:52:03] *is [13:52:19] whenever chronology protector gets executed [13:52:40] Nikerabbit: o/ [13:53:53] elukey: o hi [13:54:43] PROBLEM - Host cloudnet2001-dev is DOWN: PING CRITICAL - Packet loss = 100% [13:57:03] RECOVERY - Host cloudnet2001-dev is UP: PING OK - Packet loss = 0%, RTA = 36.13 ms [13:58:21] !log Compress enwiki on dbstore1003:3311 - T210478 [13:58:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:25] T210478: Migrate dbstore1002 to a multi instance setup on dbstore100[3-5] - https://phabricator.wikimedia.org/T210478 [13:59:57] elukey: I saw your comment. 
It's okay for me if someone adds the debug logging (I can review and +2) [14:01:08] (03PS6) 10Daimona Eaytoy: Move all AbuseFilter config to abusefilter.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477063 (https://phabricator.wikimedia.org/T145931) [14:01:47] Nikerabbit: ah yes nice! I only wanted to say hi, no hidden pings! :) [14:02:18] I have no idea where/how to add the logging, I'll wait for AaronSchulz's option! [14:02:54] sure [14:03:13] It would be close to the code that he already modified [14:05:23] Nikerabbit: very long rabbit hole, thanks a lot for all the help! [14:06:25] PROBLEM - Check systemd state on cloudnet2001-dev is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:07:45] 10Operations, 10monitoring: Serve >= 50% of production Prometheus systems with Prometheus v2 - https://phabricator.wikimedia.org/T187987 (10fgiunchedi) [14:07:50] 10Operations, 10monitoring, 10Patch-For-Review: Convert prometheus-labs-targets to use nova API instead of wikitech's api.php - https://phabricator.wikimedia.org/T214058 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi This is completed! Checked in tools targets are getting updated as expected: ` root... 
[14:09:08] 10Operations, 10monitoring, 10Patch-For-Review: Convert prometheus-labs-targets to use nova API instead of wikitech's api.php - https://phabricator.wikimedia.org/T214058 (10aborrero) Congratulations for managing your way out of the rabbit hole :-) [14:17:28] (03CR) 10Gehel: puppet: add is_disabled() method (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/485066 (owner: 10Volans) [14:17:34] (03CR) 10Gehel: [C: 04-1] puppet: add is_disabled() method [software/spicerack] - 10https://gerrit.wikimedia.org/r/485066 (owner: 10Volans) [14:18:47] (03PS7) 10Gehel: Add allocator metrics export for Blazegraph [puppet] - 10https://gerrit.wikimedia.org/r/485135 (https://phabricator.wikimedia.org/T213372) (owner: 10Smalyshev) [14:23:24] (03CR) 10Gehel: [C: 03+1] "This looks trivial enough, but I'm not entirely sure about the implications. Take my +1 as "LGTM, but feel free to recheck with someone mo" [software/spicerack] - 10https://gerrit.wikimedia.org/r/484239 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [14:24:31] (03CR) 10Gehel: [C: 03+1] "LGTM, trivial enough" [cookbooks] - 10https://gerrit.wikimedia.org/r/484255 (owner: 10Volans) [14:31:02] (03CR) 10Gehel: [C: 04-1] puppet: add is_disabled() method (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/485066 (owner: 10Volans) [14:32:56] (03CR) 10Gehel: "minor comments inline, otherwise lgtm" (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/484432 (https://phabricator.wikimedia.org/T205886) (owner: 10Volans) [14:35:48] 10Operations, 10Citoid, 10SRE-Access-Requests: Requesting access to Citoid/Zotero production servers for MVOLZ - https://phabricator.wikimedia.org/T213269 (10Daimona) (Excuse me, typo upon committing) [14:37:05] (03CR) 10Gehel: [C: 03+1] dns: fix logging message [software/spicerack] - 10https://gerrit.wikimedia.org/r/484524 (owner: 10Volans) [14:40:44] (03PS1) 10Muehlenhoff: Remove obsolete Hadoop netboot entries and obsolete 
analytics-dell recipe [puppet] - 10https://gerrit.wikimedia.org/r/485667 (https://phabricator.wikimedia.org/T156955) [14:42:22] (03CR) 10Gehel: [C: 04-1] "minor comments inline" (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/484582 (owner: 10Volans) [14:43:43] (03PS1) 10Jbond: Add partman config for analytics1001 back [puppet] - 10https://gerrit.wikimedia.org/r/485668 (https://phabricator.wikimedia.org/T214294) [14:43:57] (03CR) 10Elukey: [C: 03+1] Remove obsolete Hadoop netboot entries and obsolete analytics-dell recipe [puppet] - 10https://gerrit.wikimedia.org/r/485667 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [14:44:48] (03CR) 10Jbond: [C: 03+2] Add partman config for analytics1001 back [puppet] - 10https://gerrit.wikimedia.org/r/485668 (https://phabricator.wikimedia.org/T214294) (owner: 10Jbond) [15:02:22] (03CR) 10Filippo Giunchedi: [C: 03+1] Remove obsolete Hadoop netboot entries and obsolete analytics-dell recipe [puppet] - 10https://gerrit.wikimedia.org/r/485667 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [15:11:29] !log closing frwikiquote_* indices on elasticsearch search-chi@eqiad (T214052) [15:11:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:33] T214052: Delete indices moved from chi to psi/omega - https://phabricator.wikimedia.org/T214052 [15:19:41] !log closing frwikiquote_* indices on elasticsearch search-chi@codfw (T214052) [15:19:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:44] T214052: Delete indices moved from chi to psi/omega - https://phabricator.wikimedia.org/T214052 [15:24:55] PROBLEM - Check systemd state on mendelevium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[15:25:01] PROBLEM - clamd running on mendelevium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (clamav), command name clamd [15:27:21] RECOVERY - Check systemd state on mendelevium is OK: OK - running: The system is fully operational [15:27:27] RECOVERY - clamd running on mendelevium is OK: PROCS OK: 1 process with UID = 111 (clamav), command name clamd [15:29:25] (03PS5) 10Elukey: profile::analytics::refinery: move sanitize_eventlogging_analytics to timer [puppet] - 10https://gerrit.wikimedia.org/r/483426 (https://phabricator.wikimedia.org/T172532) [15:31:04] (03CR) 10Elukey: "Marcel: do you think that we could deploy this?" [puppet] - 10https://gerrit.wikimedia.org/r/483426 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [15:34:17] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [15:37:20] single big spike afaics [15:37:22] already recovered [15:37:57] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [15:38:10] (03CR) 10Lucas Werkmeister (WMDE): "If no one objects, I’ll deploy this tomorrow (today is a US holiday so no deploys)." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485628 (https://phabricator.wikimedia.org/T209504) (owner: 10Lucas Werkmeister (WMDE)) [15:44:21] (03CR) 10Mforns: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/483426 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [15:48:53] (03PS3) 10Hashar: doc: make published files group writable [puppet] - 10https://gerrit.wikimedia.org/r/484308 (https://phabricator.wikimedia.org/T137890) [15:48:58] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/484308 (https://phabricator.wikimedia.org/T137890) (owner: 10Hashar) [15:49:12] (03PS1) 10Arturo Borrero Gonzalez: labtestneutron2002: reimage in stretch + rename to cloudnet2002-dev [puppet] - 10https://gerrit.wikimedia.org/r/485679 (https://phabricator.wikimedia.org/T214303) [15:49:43] (03PS6) 10Elukey: profile::analytics::refinery: move sanitize_eventlogging_analytics to timer [puppet] - 10https://gerrit.wikimedia.org/r/483426 (https://phabricator.wikimedia.org/T172532) [15:52:45] !log stop and upgrade db2061 [15:52:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:33] (03PS1) 10Arturo Borrero Gonzalez: labtestneutron2002: rename to cloudnet2002-dev [dns] - 10https://gerrit.wikimedia.org/r/485680 (https://phabricator.wikimedia.org/T214303) [15:55:35] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] labtestneutron2002: reimage in stretch + rename to cloudnet2002-dev [puppet] - 10https://gerrit.wikimedia.org/r/485679 (https://phabricator.wikimedia.org/T214303) (owner: 10Arturo Borrero Gonzalez) [15:55:47] (03CR) 10jerkins-bot: [V: 04-1] labtestneutron2002: rename to cloudnet2002-dev [dns] - 10https://gerrit.wikimedia.org/r/485680 (https://phabricator.wikimedia.org/T214303) (owner: 10Arturo Borrero Gonzalez) [15:57:38] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] "Ignoring jenkins failed verification." 
[dns] - 10https://gerrit.wikimedia.org/r/485680 (https://phabricator.wikimedia.org/T214303) (owner: 10Arturo Borrero Gonzalez) [15:58:47] !log reinitializing slave replication(postgres) on maps1003 [15:58:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:47] !log T214303 reimaging/renaming labtestneutron2002.codfw.wmnet (jessie) to cloudnet2002-dev.codfw.wmnet (stretch) [16:03:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:50] T214303: labtestneutron2002: reimage to stretch & rename to cloudnet2002-dev - https://phabricator.wikimedia.org/T214303 [16:13:48] (03PS1) 10Hashar: aptrepo: change Jenkins upstream URL [puppet] - 10https://gerrit.wikimedia.org/r/485685 [16:14:48] (03CR) 10Hashar: "I am pretty sure reprepro is still affected by this. That is a regular complain when having to update the package on apt.wikimedia.org whi" [puppet] - 10https://gerrit.wikimedia.org/r/485685 (owner: 10Hashar) [16:15:47] 10Operations, 10Toolforge, 10Traffic, 10Wikimedia-Apache-configuration: Add new Tool Labs IPs to Varnish rate limit whitelist - https://phabricator.wikimedia.org/T214313 (10Nemo_bis) [16:17:35] 10Operations, 10Toolforge, 10Traffic, 10Wikimedia-Apache-configuration: Add new Tool Labs IPs to Varnish rate limit whitelist - https://phabricator.wikimedia.org/T214313 (10Nemo_bis) I created this more specific task for Tools as requested, but there is a (more general?) Labs task at T213475 [16:21:22] 10Operations, 10Cloud-VPS, 10Traffic, 10serviceops: Difficulties to create offline version of Wikipedia because of HTTP 429 response - https://phabricator.wikimedia.org/T213475 (10akosiaris) >>! In T213475#4883423, @Kelson wrote: > I'm not sure to fully understand the technical explanation. Is the problem... 
[16:24:51] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/14409/an-coord1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/483426 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [16:24:59] (03PS7) 10Elukey: profile::analytics::refinery: move sanitize_eventlogging_analytics to timer [puppet] - 10https://gerrit.wikimedia.org/r/483426 (https://phabricator.wikimedia.org/T172532) [16:25:02] (03CR) 10Elukey: [V: 03+2 C: 03+2] profile::analytics::refinery: move sanitize_eventlogging_analytics to timer [puppet] - 10https://gerrit.wikimedia.org/r/483426 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [16:26:16] 10Operations: wmf-auto-reimage-host: icinga downtime error - https://phabricator.wikimedia.org/T214314 (10aborrero) [16:26:35] 10Operations, 10Toolforge, 10Traffic, 10Wikimedia-Apache-configuration: Add new Tool Labs IPs to Varnish rate limit whitelist - https://phabricator.wikimedia.org/T214313 (10Nemo_bis) p:05Triage→03High [16:26:55] volans: opened T214314 you may be interested [16:26:56] T214314: wmf-auto-reimage-host: icinga downtime error - https://phabricator.wikimedia.org/T214314 [16:32:12] (03PS2) 10Alexandros Kosiaris: helm: add data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/485110 (owner: 10Dzahn) [16:42:55] 10Operations: wmf-auto-reimage-host: icinga downtime error - https://phabricator.wikimedia.org/T214314 (10MoritzMuehlenhoff) The FQDN where that server is being renamed to doesn't exist here yet, so it should simply skipped when setting downtime? [16:45:55] 10Operations, 10Toolforge, 10Traffic, 10Wikimedia-Apache-configuration: Add new Tool Labs IPs to Varnish rate limit whitelist - https://phabricator.wikimedia.org/T214313 (10Krenair) Tools cannot be done separately, it does not have an IP space of it's own, tools instances are scattered around the same netw... 
[16:46:43] 10Operations, 10monitoring, 10Patch-For-Review, 10User-fgiunchedi: Replace Torrus with Prometheus snmp_exporter for PDUs monitoring - https://phabricator.wikimedia.org/T148541 (10fgiunchedi) [16:54:40] 10Operations, 10DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (10Marostegui) [16:55:24] 10Operations, 10DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (10Marostegui) [16:56:53] 10Operations: wmf-auto-reimage-host: icinga downtime error - https://phabricator.wikimedia.org/T214314 (10aborrero) [16:59:09] 10Operations: wmf-auto-reimage-host: icinga downtime error - https://phabricator.wikimedia.org/T214314 (10aborrero) >>! In T214314#4896937, @MoritzMuehlenhoff wrote: > The FQDN where that server is being renamed to doesn't exist here yet, so it should simply skipped when setting downtime? Then perhaps this can... [16:59:55] 10Operations, 10Continuous-Integration-Infrastructure, 10Packaging: Upgrade jenkins-debian-glue to v0.20.0 - https://phabricator.wikimedia.org/T212774 (10hashar) I have managed to build the package for both jessie and stretch without any issues! :) To clarify from discussions I had: * the packages are not... 
[16:59:58] PROBLEM - MariaDB Slave Lag: s6 on db2076 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.02 seconds [17:00:02] PROBLEM - MariaDB Slave Lag: s6 on db2046 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.06 seconds [17:00:12] PROBLEM - MariaDB Slave Lag: s6 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.05 seconds [17:00:20] PROBLEM - MariaDB Slave Lag: s6 on db2089 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.39 seconds [17:00:20] PROBLEM - MariaDB Slave Lag: s6 on db2039 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.55 seconds [17:00:20] PROBLEM - MariaDB Slave Lag: s6 on db2067 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.55 seconds [17:00:54] PROBLEM - MariaDB Slave Lag: s6 on db2060 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 307.45 seconds [17:01:02] PROBLEM - MariaDB Slave Lag: s6 on db2053 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 308.57 seconds [17:01:10] PROBLEM - MariaDB Slave Lag: s6 on db2087 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 310.43 seconds [17:01:52] (03CR) 10Alexandros Kosiaris: [C: 03+2] helm: add data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/485110 (owner: 10Dzahn) [17:03:22] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[17:04:59] checking --^ [17:09:20] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational [17:10:50] 10Operations, 10ops-codfw, 10cloud-services-team (Kanban): cloudnet2002-dev: ACPI error - https://phabricator.wikimedia.org/T214322 (10aborrero) p:05Triage→03Normal [17:11:08] (03PS1) 10Elukey: profile::refinery::job::spark_job: add shebang to sh template [puppet] - 10https://gerrit.wikimedia.org/r/485689 (https://phabricator.wikimedia.org/T172532) [17:14:32] (03CR) 10Elukey: [C: 03+2] profile::refinery::job::spark_job: add shebang to sh template [puppet] - 10https://gerrit.wikimedia.org/r/485689 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [17:14:43] (03CR) 10Mforns: [C: 03+1] profile::refinery::job::spark_job: add shebang to sh template [puppet] - 10https://gerrit.wikimedia.org/r/485689 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [17:14:51] (03PS8) 10Giuseppe Lavagetto: profile::services_proxy: simple local proxying for remote services [puppet] - 10https://gerrit.wikimedia.org/r/483788 (https://phabricator.wikimedia.org/T210717) [17:14:53] (03PS8) 10Giuseppe Lavagetto: mediawiki::common: add proxy for services [puppet] - 10https://gerrit.wikimedia.org/r/483789 (https://phabricator.wikimedia.org/T210717) [17:15:44] (03CR) 10jerkins-bot: [V: 04-1] profile::services_proxy: simple local proxying for remote services [puppet] - 10https://gerrit.wikimedia.org/r/483788 (https://phabricator.wikimedia.org/T210717) (owner: 10Giuseppe Lavagetto) [17:16:54] !log stop and upgrade db2054 [17:16:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:38] (03PS9) 10Giuseppe Lavagetto: mediawiki::common: add proxy for services [puppet] - 10https://gerrit.wikimedia.org/r/483789 (https://phabricator.wikimedia.org/T210717) [17:30:53] 10Operations, 10ops-codfw, 10cloud-services-team (Kanban): cloudnet2002-dev: ACPI error - https://phabricator.wikimedia.org/T214322 
(10fgiunchedi) This is known/expected, it is due to the `acpi_power_meter` kernel module which we are blacklisting, a reboot or manually unloading the module stops the messages [17:42:31] (03PS3) 10Jcrespo: mariadb: Promote db2047 to master on configuration management [puppet] - 10https://gerrit.wikimedia.org/r/485617 (https://phabricator.wikimedia.org/T214264) [17:44:33] !log stop replication on db2040 for master switch T214264 [17:44:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:36] T214264: BBU issues on codfw - https://phabricator.wikimedia.org/T214264 [17:44:45] (03PS1) 10Jbond: use gpt schema [puppet] - 10https://gerrit.wikimedia.org/r/485693 [17:45:41] (03CR) 10Jbond: [C: 03+2] use gpt schema [puppet] - 10https://gerrit.wikimedia.org/r/485693 (owner: 10Jbond) [17:47:38] (03PS4) 10Jcrespo: mariadb: Promote db2047 to master on configuration management [puppet] - 10https://gerrit.wikimedia.org/r/485617 (https://phabricator.wikimedia.org/T214264) [17:49:40] (03CR) 10Jcrespo: "Suggestion:" [puppet] - 10https://gerrit.wikimedia.org/r/485693 (owner: 10Jbond) [17:50:14] (03CR) 10Jcrespo: [C: 03+2] mariadb: Promote db2047 to master on configuration management [puppet] - 10https://gerrit.wikimedia.org/r/485617 (https://phabricator.wikimedia.org/T214264) (owner: 10Jcrespo) [17:51:33] !log stop and apply puppet changes to db2047 T214264 [17:51:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:36] T214264: BBU issues on codfw - https://phabricator.wikimedia.org/T214264 [18:06:10] we may have some low noise on mw logs from mw2- this is expexted for a few minutes, as I am double checking the topology changes [18:08:04] (I am leaving things unconfigured properly until I am sure the 2 migrated hosts are in a good state) [18:10:04] PROBLEM - MariaDB Slave Lag: s8 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 600.14 seconds [18:14:41] that is not me, but dbstore1002 is not 
preciselly too reliable [18:17:46] (03PS1) 10Jcrespo: mariadb: Depool db2040, promote db2047 to master of s7 section [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485696 (https://phabricator.wikimedia.org/T214264) [18:21:11] (03CR) 10Jcrespo: [C: 03+2] mariadb: Depool db2040, promote db2047 to master of s7 section [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485696 (https://phabricator.wikimedia.org/T214264) (owner: 10Jcrespo) [18:22:22] (03Merged) 10jenkins-bot: mariadb: Depool db2040, promote db2047 to master of s7 section [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485696 (https://phabricator.wikimedia.org/T214264) (owner: 10Jcrespo) [18:24:10] !log jynus@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2040, promote db2047 to s7 master (duration: 00m 46s) [18:24:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:27] (03PS1) 10BryanDavis: toolforge: Rotate SGE accounting file from NFS master [puppet] - 10https://gerrit.wikimedia.org/r/485697 (https://phabricator.wikimedia.org/T168701) [18:25:56] (03CR) 10jerkins-bot: [V: 04-1] toolforge: Rotate SGE accounting file from NFS master [puppet] - 10https://gerrit.wikimedia.org/r/485697 (https://phabricator.wikimedia.org/T168701) (owner: 10BryanDavis) [18:26:00] (03PS1) 10Jcrespo: mariadb: Depool db2040 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485698 (https://phabricator.wikimedia.org/T214264) [18:26:00] PROBLEM - tilerator on maps1003 is CRITICAL: connect to address 10.64.32.117 and port 6534: Connection refused [18:26:04] PROBLEM - Maps HTTPS on maps1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.009 second response time [18:27:14] RECOVERY - tilerator on maps1003 is OK: HTTP OK: HTTP/1.1 200 OK - 304 bytes in 0.027 second response time [18:27:18] RECOVERY - Maps HTTPS on maps1003 is OK: HTTP OK: HTTP/1.1 200 OK - 1323 bytes in 0.090 second response time [18:29:32] (03PS2) 10BryanDavis: 
toolforge: Rotate SGE accounting file from NFS master [puppet] - 10https://gerrit.wikimedia.org/r/485697 (https://phabricator.wikimedia.org/T168701) [18:32:21] (03CR) 10jenkins-bot: mariadb: Depool db2040, promote db2047 to master of s7 section [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485696 (https://phabricator.wikimedia.org/T214264) (owner: 10Jcrespo) [18:34:06] (03PS1) 10Jcrespo: mariadb: Demote db2040 from being an s7 master to just a replica [puppet] - 10https://gerrit.wikimedia.org/r/485701 (https://phabricator.wikimedia.org/T214264) [18:34:09] PROBLEM - Disk space on cloudvirt1021 is CRITICAL: DISK CRITICAL - /tmp/builder01 is not accessible: Permission denied [18:34:58] who may be around, arturo maybe? [18:35:13] RECOVERY - Disk space on cloudvirt1021 is OK: DISK OK [18:35:25] yes [18:35:43] but apparently ongoing operation by fsero and gtirloni [18:35:45] that /tmp check is weird [18:35:56] ok sorry for pinging you [18:36:29] it's ok jynus, it was the page what pinged me :-P [18:37:08] Well we were debugging an issue on an horizon VM it seems weird that action create a page [18:37:31] ah so that's the page [18:38:03] I got the problem page (delayed) but not the recovery (thanks, my provider) [18:40:49] I'll make a t-shirt "I created a directory in /tmp and woke up half my coworkers" ;) [18:40:51] sorry about the noise [18:42:01] and there's the recovery page at last [18:47:23] PROBLEM - Disk space on cloudvirt1021 is CRITICAL: DISK CRITICAL - /root/builder01 is not accessible: Permission denied [18:47:43] here you go again [18:47:50] are you kidding [18:47:53] wth [18:47:53] ho hum [18:48:16] i'll ack it and leave it like that [18:48:31] and open a task for investigate it later [18:48:39] gtirloni: did you or fsero create a dir there? 
[18:48:50] arturo: yes, I created it [18:48:58] this is absurd [18:49:22] oh this time is under /root/ [18:49:34] arturo: gtirloni is investigating an issue i reported check #cloud-operations for more info, in any case this is NOT urgent so it doesn't merit paging people at all [18:49:55] ACKNOWLEDGEMENT - Disk space on cloudvirt1021 is CRITICAL: DISK CRITICAL - /root/builder01 is not accessible: Permission denied GTirloni Expected. [18:50:03] gtirloni: thanks for the help but we should ack this and wait until tomorrow to diagnose it [18:50:08] :) [18:50:31] ok :-) [18:51:29] RECOVERY - Disk space on cloudvirt1021 is OK: DISK OK [18:52:21] !log pool maps1003 - postgresql sql lag issues has been fixed [18:52:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:32] (03CR) 10Jcrespo: [C: 03+2] mariadb: Demote db2040 from being an s7 master to just a replica [puppet] - 10https://gerrit.wikimedia.org/r/485701 (https://phabricator.wikimedia.org/T214264) (owner: 10Jcrespo) [18:55:56] !log stop and upgrade db2040 T214264 [18:55:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:59] T214264: BBU issues on codfw - https://phabricator.wikimedia.org/T214264 [19:02:44] 10Operations, 10Icinga, 10monitoring, 10cloud-services-team (Kanban): cloudvirt1021/Disk space is CRITICAL - https://phabricator.wikimedia.org/T214325 (10GTirloni) [19:03:29] 10Operations, 10Gerrit, 10Release-Engineering-Team: Install "healthcheck" plugin - https://phabricator.wikimedia.org/T214326 (10Paladox) [19:08:11] 10Operations, 10Gerrit, 10Release-Engineering-Team: Install "healthcheck" plugin on gerrit - https://phabricator.wikimedia.org/T214326 (10jcrespo) [19:23:23] !log mysql.py -h db1115 zarcillo -e "UPDATE masters SET instance = 'db2047' WHERE section = 's7' and dc = 'codfw'" T214264 [19:23:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:26] T214264: BBU issues on codfw - 
https://phabricator.wikimedia.org/T214264 [19:24:23] RECOVERY - MariaDB Slave Lag: s6 on db2039 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [19:24:27] RECOVERY - MariaDB Slave Lag: s6 on db2095 is OK: OK slave_sql_lag Replication lag: 0.08 seconds [19:24:33] RECOVERY - MariaDB Slave Lag: s6 on db2076 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [19:24:39] RECOVERY - MariaDB Slave Lag: s6 on db2087 is OK: OK slave_sql_lag Replication lag: 0.03 seconds [19:24:51] RECOVERY - MariaDB Slave Lag: s6 on db2046 is OK: OK slave_sql_lag Replication lag: 0.21 seconds [19:24:55] RECOVERY - MariaDB Slave Lag: s6 on db2089 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [19:27:02] RECOVERY - MariaDB Slave Lag: s6 on db2060 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [19:28:05] (03PS2) 10Jcrespo: mariadb: Repool db2040 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485698 (https://phabricator.wikimedia.org/T214264) [19:32:20] (03CR) 10Jcrespo: [C: 03+2] mariadb: Repool db2040 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485698 (https://phabricator.wikimedia.org/T214264) (owner: 10Jcrespo) [19:33:25] (03Merged) 10jenkins-bot: mariadb: Repool db2040 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485698 (https://phabricator.wikimedia.org/T214264) (owner: 10Jcrespo) [19:34:00] RECOVERY - MariaDB Slave Lag: s8 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 288.80 seconds [19:34:14] RECOVERY - MariaDB Slave Lag: s6 on db2067 is OK: OK slave_sql_lag Replication lag: 0.22 seconds [19:35:53] !log jynus@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2040 (duration: 00m 45s) [19:35:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:04] RECOVERY - MariaDB Slave Lag: s6 on db2053 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [19:39:00] (03CR) 10jenkins-bot: mariadb: Repool db2040 after maintenance [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/485698 (https://phabricator.wikimedia.org/T214264) (owner: 10Jcrespo) [20:07:38] 10Operations, 10Toolforge, 10Traffic, 10Wikimedia-Apache-configuration: Add new Tool Labs IPs to Varnish rate limit whitelist - https://phabricator.wikimedia.org/T214313 (10faidon) Per our earlier conversations (T208986, T174596, T209011), I think we should just use the WMCS public IP space to make these k... [20:10:17] 10Operations, 10Toolforge, 10Traffic, 10Wikimedia-Apache-configuration: Add new Tool Labs IPs to Varnish rate limit whitelist - https://phabricator.wikimedia.org/T214313 (10Cyberpower678) >>! In T214313#4897303, @faidon wrote: > Per our earlier conversations (T208986, T174596, T209011), I think we should j... [20:12:34] 10Operations, 10Cloud-VPS, 10Traffic, 10serviceops: Wikimedia varnish rules no longer exempt all Cloud VPS/Toolforge IPs from rate limits (HTTP 429 response) - https://phabricator.wikimedia.org/T213475 (10bd808) [20:12:58] 10Operations, 10Cloud-VPS, 10Traffic, 10serviceops: Wikimedia varnish rules no longer exempt all Cloud VPS/Toolforge IPs from rate limits (HTTP 429 response) - https://phabricator.wikimedia.org/T213475 (10bd808) [20:23:42] 10Operations, 10Cloud-VPS, 10Traffic, 10serviceops: Wikimedia varnish rules no longer exempt all Cloud VPS/Toolforge IPs from rate limits (HTTP 429 response) - https://phabricator.wikimedia.org/T213475 (10Cyberpower678) @bd808 just invited me here. Ever since the Cloud VPS migration, Cyberbot has been hit... [20:25:16] 10Operations, 10Cloud-VPS, 10Traffic, 10serviceops: Wikimedia varnish rules no longer exempt all Cloud VPS/Toolforge IPs from rate limits (HTTP 429 response) - https://phabricator.wikimedia.org/T213475 (10Cyberpower678) p:05Normal→03High I'm also boldly raising the priority as from what I gather I'm li... 
[20:34:04] (03PS1) 10Ammarpad: Add new namespace abbreviation for Swedish (sv) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485704 (https://phabricator.wikimedia.org/T214329) [20:35:38] (03PS1) 10Faidon Liambotis: protocol.compat: disable a couple of pylint errors [software/keyholder] - 10https://gerrit.wikimedia.org/r/485705 [20:35:40] (03PS1) 10Faidon Liambotis: Bump minimum Python to 3.5; also test with 3.7 [software/keyholder] - 10https://gerrit.wikimedia.org/r/485706 [20:35:42] (03PS1) 10Faidon Liambotis: Add a pylint tox environment [software/keyholder] - 10https://gerrit.wikimedia.org/r/485707 [20:35:44] (03PS1) 10Faidon Liambotis: Add a tox environment for Construct 2.8.16 [software/keyholder] - 10https://gerrit.wikimedia.org/r/485708 [20:35:46] (03PS1) 10Faidon Liambotis: Update tox.ini to facilitate parallel builds [software/keyholder] - 10https://gerrit.wikimedia.org/r/485709 [20:49:08] (03PS2) 10Ammarpad: Add new namespace abbreviation for Swedish (sv) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485704 (https://phabricator.wikimedia.org/T214329) [21:46:17] (03CR) 10Bstorm: "Since I once manually truncated the log and found that it basically broke the grid for a bit, my first thought is to use the system script" [puppet] - 10https://gerrit.wikimedia.org/r/485697 (https://phabricator.wikimedia.org/T168701) (owner: 10BryanDavis) [21:47:42] (03PS1) 10Faidon Liambotis: Move tests/unit -> tests [software/keyholder] - 10https://gerrit.wikimedia.org/r/485714 [21:47:44] (03PS1) 10Faidon Liambotis: Add a bunch more tests [software/keyholder] - 10https://gerrit.wikimedia.org/r/485715 [21:47:46] (03PS1) 10Faidon Liambotis: Test key and config file parsing using test data [software/keyholder] - 10https://gerrit.wikimedia.org/r/485716 [21:47:48] (03PS1) 10Faidon Liambotis: Add a (very basic) test using OpenSSH's ssh-add [software/keyholder] - 10https://gerrit.wikimedia.org/r/485717 [21:47:50] (03PS1) 10Faidon Liambotis: Add tests for OSError when 
loading config files [software/keyholder] - 10https://gerrit.wikimedia.org/r/485718 [21:47:52] (03PS1) 10Faidon Liambotis: Make all SshAgentConfig's methods instance methods [software/keyholder] - 10https://gerrit.wikimedia.org/r/485719 [21:47:54] (03PS1) 10Faidon Liambotis: Add SshKeyBlob per RFC 4253 [software/keyholder] - 10https://gerrit.wikimedia.org/r/485720 [21:51:24] (03CR) 10jerkins-bot: [V: 04-1] Add a bunch more tests [software/keyholder] - 10https://gerrit.wikimedia.org/r/485715 (owner: 10Faidon Liambotis) [21:51:28] (03CR) 10jerkins-bot: [V: 04-1] Test key and config file parsing using test data [software/keyholder] - 10https://gerrit.wikimedia.org/r/485716 (owner: 10Faidon Liambotis) [21:52:11] (03CR) 10jerkins-bot: [V: 04-1] Add tests for OSError when loading config files [software/keyholder] - 10https://gerrit.wikimedia.org/r/485718 (owner: 10Faidon Liambotis) [21:52:16] (03CR) 10jerkins-bot: [V: 04-1] Add a (very basic) test using OpenSSH's ssh-add [software/keyholder] - 10https://gerrit.wikimedia.org/r/485717 (owner: 10Faidon Liambotis) [21:52:19] (03CR) 10jerkins-bot: [V: 04-1] Make all SshAgentConfig's methods instance methods [software/keyholder] - 10https://gerrit.wikimedia.org/r/485719 (owner: 10Faidon Liambotis) [21:52:29] (03CR) 10jerkins-bot: [V: 04-1] Add SshKeyBlob per RFC 4253 [software/keyholder] - 10https://gerrit.wikimedia.org/r/485720 (owner: 10Faidon Liambotis) [21:55:02] 10Operations, 10MediaWiki-Cache, 10Performance-Team, 10User-Elukey: Consider removing the last traces of nutcracker in Mediawiki configs - https://phabricator.wikimedia.org/T214275 (10Krinkle) [21:58:04] 10Operations, 10MediaWiki-Cache, 10Performance-Team, 10User-Elukey: Consider removing the last traces of nutcracker in Mediawiki configs - https://phabricator.wikimedia.org/T214275 (10Krinkle) (Tagging Performance-Team to track Aaron's implicit involvement through CC, as this appears implicitly blocked on... 
[21:59:09] * Krinkle is considering to deploy a UBN fix [21:59:15] https://phabricator.wikimedia.org/T213953#4897402 [22:14:47] 10Operations, 10DBA, 10Patch-For-Review, 10Performance-Team (Radar): Increase parsercache keys TTL from 22 days back to 30 days - https://phabricator.wikimedia.org/T210992 (10Krinkle) [22:29:24] * Krinkle staging on mwdebug1002 [22:33:56] !log krinkle@deploy1001 Synchronized php-1.33.0-wmf.13/extensions/TemplateData/includes/api/ApiTemplateData.php: I7647ddfc47 - T213953 (duration: 00m 47s) [22:33:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:34:00] T213953: $data->paramOrder is null on pages edited since MediaWiki 1.33/wmf.13 was deployed - https://phabricator.wikimedia.org/T213953 [22:35:54] (03PS2) 10Faidon Liambotis: Add a bunch more tests [software/keyholder] - 10https://gerrit.wikimedia.org/r/485715 [22:35:56] (03PS1) 10Faidon Liambotis: Properly setup logging when /dev/log doesn't exist [software/keyholder] - 10https://gerrit.wikimedia.org/r/485724 [22:41:25] (03CR) 10Krinkle: "In particular, zero.wikimedia.org (the internal Zero wiki, as opposed to Wikipedia Zero traffic itself) should probably be made inaccessib" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483193 (owner: 10Jforrester) [22:44:23] (03CR) 10Alex Monk: "If we're talking about preventing information leaks by shutting down the wiki, removing the domain from DNS isn't enough. You'd need to ac" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483193 (owner: 10Jforrester) [22:45:15] (03CR) 10Krinkle: "Ah, you mean from apache config, given Host headers. Good point." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483193 (owner: 10Jforrester) [22:45:29] (03CR) 10Krinkle: "so -dns, and -main.conf" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483193 (owner: 10Jforrester) [22:46:40] (03CR) 10Alex Monk: "I wouldn't want to trust VCL with that, so yes, Apache or a MediaWiki config change." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/483193 (owner: 10Jforrester) [22:53:06] (03PS2) 10Faidon Liambotis: Test key and config file parsing using test data [software/keyholder] - 10https://gerrit.wikimedia.org/r/485716 [22:53:08] (03PS2) 10Faidon Liambotis: Add a (very basic) test using OpenSSH's ssh-add [software/keyholder] - 10https://gerrit.wikimedia.org/r/485717 [22:53:10] (03PS2) 10Faidon Liambotis: Add tests for OSError when loading config files [software/keyholder] - 10https://gerrit.wikimedia.org/r/485718 [22:53:12] (03PS2) 10Faidon Liambotis: Make all SshAgentConfig's methods instance methods [software/keyholder] - 10https://gerrit.wikimedia.org/r/485719 [22:53:14] (03PS2) 10Faidon Liambotis: Add SshKeyBlob per RFC 4253 [software/keyholder] - 10https://gerrit.wikimedia.org/r/485720 [23:26:48] 10Operations, 10Continuous-Integration-Infrastructure (shipyard), 10Release-Engineering-Team (Kanban), 10cloud-services-team (Kanban): Phase out Nodepool from production - https://phabricator.wikimedia.org/T209361 (10Krinkle) [23:36:53] 10Operations, 10Patch-For-Review: Reimage analytics1001 to stretch (as an exercise) - https://phabricator.wikimedia.org/T214294 (10Peachey88) [23:38:45] 10Operations, 10Gerrit, 10Icinga, 10Release-Engineering-Team, 10monitoring: Install "healthcheck" plugin on gerrit - https://phabricator.wikimedia.org/T214326 (10Peachey88) [23:41:03] (03CR) 10BryanDavis: "> Since I once manually truncated the log and found that it basically" [puppet] - 10https://gerrit.wikimedia.org/r/485697 (https://phabricator.wikimedia.org/T168701) (owner: 10BryanDavis)