[00:08:57] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s4 on db2090 is OK: OK slave_sql_lag Replication lag: 58.21 seconds
[00:09:29] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s4 on db2095 is OK: OK slave_sql_lag Replication lag: 55.54 seconds
[00:09:49] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s4 on db2051 is OK: OK slave_sql_lag Replication lag: 52.53 seconds
[00:09:51] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s4 on db2091 is OK: OK slave_sql_lag Replication lag: 52.38 seconds
[00:09:55] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s4 on db2058 is OK: OK slave_sql_lag Replication lag: 50.01 seconds
[00:10:01] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s4 on db2073 is OK: OK slave_sql_lag Replication lag: 45.17 seconds
[00:10:15] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s4 on db2065 is OK: OK slave_sql_lag Replication lag: 41.12 seconds
[00:10:15] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s4 on db2084 is OK: OK slave_sql_lag Replication lag: 41.24 seconds
[00:31:41] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s8 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 672.62 seconds
[00:36:35] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s8 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 109.95 seconds
[00:55:01] <wikibugs>	 Operations, Performance-Team, Traffic: Determine cause of upload.wikimedia.org requests routed to text-lb (404 Not Found) - https://phabricator.wikimedia.org/T207340 (Krinkle)
[01:42:19] <icinga-wm>	 PROBLEM - MD RAID on cp5010 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 0, Spare: 0
[01:42:21] <icinga-wm>	 ACKNOWLEDGEMENT - MD RAID on cp5010 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T214274
[01:42:25] <wikibugs>	 Operations, ops-eqsin: Degraded RAID on cp5010 - https://phabricator.wikimedia.org/T214274 (ops-monitoring-bot)
[02:39:17] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s2 on db2091 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.63 seconds
[02:39:19] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s2 on db2056 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.26 seconds
[02:39:25] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s2 on db2041 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.42 seconds
[02:39:27] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s2 on db2063 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 304.09 seconds
[02:39:37] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s2 on db2049 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 305.17 seconds
[02:39:53] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s2 on db2035 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 308.41 seconds
[02:40:01] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s2 on db2088 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 309.37 seconds
[02:40:15] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s2 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 312.42 seconds
[03:04:27] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s2 on db2049 is OK: OK slave_sql_lag Replication lag: 59.70 seconds
[03:04:41] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s2 on db2035 is OK: OK slave_sql_lag Replication lag: 18.79 seconds
[03:04:49] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s2 on db2088 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[03:05:03] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s2 on db2095 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[03:05:19] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s2 on db2091 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[03:05:23] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s2 on db2056 is OK: OK slave_sql_lag Replication lag: 0.48 seconds
[03:05:29] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s2 on db2041 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[03:05:31] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s2 on db2063 is OK: OK slave_sql_lag Replication lag: 0.47 seconds
[03:32:15] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s8 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1030.54 seconds
[03:35:09] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s2 on db2063 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.51 seconds
[03:35:17] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s2 on db2049 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.98 seconds
[03:35:33] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s2 on db2035 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 305.10 seconds
[03:35:39] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s2 on db2088 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 306.54 seconds
[03:35:53] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s2 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 307.06 seconds
[03:36:07] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s2 on db2091 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 308.94 seconds
[03:36:11] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s2 on db2056 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 308.89 seconds
[03:36:19] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s2 on db2041 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 309.61 seconds
[03:49:31] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s8 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 267.27 seconds
[04:08:33] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s4 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 601.87 seconds
[04:46:37] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[05:20:57] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[05:34:31] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s4 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 664.70 seconds
[06:06:13] <icinga-wm>	 ACKNOWLEDGEMENT - Memory correctable errors -EDAC- on db1068 is CRITICAL: 7.002 ge 4 Marostegui known - The acknowledgement expires at: 2019-01-25 06:05:36. https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=db1068&var-datasource=eqiad+prometheus/ops
[06:08:59] <marostegui>	 !log Drop tag_summary table from s8 - T212255
[06:09:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:09:02] <stashbot>	 T212255: Drop tag_summary table - https://phabricator.wikimedia.org/T212255
[06:12:26] <marostegui>	 !log Drop tag_summary table from s3 codfw - T212255
[06:12:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:15:13] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s4 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 296.79 seconds
[06:27:39] <marostegui>	 !log Drop tag_summary table from dbstore1002:s3 - T212255
[06:27:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:27:42] <stashbot>	 T212255: Drop tag_summary table - https://phabricator.wikimedia.org/T212255
[06:28:11] <icinga-wm>	 PROBLEM - Check systemd state on netmon2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:28:39] <elukey>	 running puppet --^
[06:31:53] <icinga-wm>	 RECOVERY - Check systemd state on netmon2001 is OK: OK - running: The system is fully operational
[06:32:35] <marostegui>	 !log Drop tag_summary table from db1095:3313 - T212255
[06:32:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:33:03] <icinga-wm>	 PROBLEM - puppet last run on dbmonitor2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/apache2/conf-available/00-defaults.conf]
[06:33:05] <icinga-wm>	 PROBLEM - puppet last run on cp1090 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/varnishmtail-backend/varnishbackend.mtail]
[06:42:34] <wikibugs>	 Operations, ops-eqsin: Degraded RAID on cp5010 - https://phabricator.wikimedia.org/T214274 (Vgutierrez) kern.log is reporting multiple failures in /dev/sdb3 as well ` Jan 21 06:41:47 cp5010 kernel: [7490330.204759] EXT4-fs error (device sdb3) in ext4_reserve_inode_write:5448: IO failure Jan 21 06:41:48 c...
[06:45:01] <logmsgbot>	 !log vgutierrez@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5010.eqsin.wmnet
[06:45:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:47:43] <marostegui>	 !log Drop tag_summary table from db1023, db1077, db1075 and db1078 T212255
[06:47:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:47:48] <stashbot>	 T212255: Drop tag_summary table - https://phabricator.wikimedia.org/T212255
[06:51:07] <wikibugs>	 (PS1) Marostegui: db-eqiad.php: Depool db1123 [mediawiki-config] - https://gerrit.wikimedia.org/r/485580 (https://phabricator.wikimedia.org/T85757)
[06:52:15] <wikibugs>	 (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1123 [mediawiki-config] - https://gerrit.wikimedia.org/r/485580 (https://phabricator.wikimedia.org/T85757) (owner: Marostegui)
[06:52:19] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s2 on db2091 is OK: OK slave_sql_lag Replication lag: 50.74 seconds
[06:52:25] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s2 on db2056 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[06:52:35] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s2 on db2041 is OK: OK slave_sql_lag Replication lag: 0.44 seconds
[06:52:37] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s2 on db2063 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[06:52:49] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s2 on db2049 is OK: OK slave_sql_lag Replication lag: 0.18 seconds
[06:53:03] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s2 on db2088 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[06:53:05] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s2 on db2035 is OK: OK slave_sql_lag Replication lag: 0.17 seconds
[06:53:17] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s2 on db2095 is OK: OK slave_sql_lag Replication lag: 0.45 seconds
[06:53:21] <wikibugs>	 (Merged) jenkins-bot: db-eqiad.php: Depool db1123 [mediawiki-config] - https://gerrit.wikimedia.org/r/485580 (https://phabricator.wikimedia.org/T85757) (owner: Marostegui)
[06:54:30] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1123 - T85757 (duration: 00m 50s)
[06:54:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:54:33] <stashbot>	 T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757
[06:54:35] <marostegui>	 !log Deploy schema change on db1123 - T85757
[06:54:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:56:23] <icinga-wm>	 PROBLEM - Check systemd state on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused
[06:56:29] <icinga-wm>	 PROBLEM - Disk space on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused
[06:56:33] <icinga-wm>	 PROBLEM - dhclient process on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused
[06:56:46] <wikibugs>	 (PS1) Marostegui: db-eqiad.php: Depool db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/485581 (https://phabricator.wikimedia.org/T210478)
[06:56:57] <icinga-wm>	 PROBLEM - configured eth on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused
[06:56:57] <wikibugs>	 (CR) jenkins-bot: db-eqiad.php: Depool db1123 [mediawiki-config] - https://gerrit.wikimedia.org/r/485580 (https://phabricator.wikimedia.org/T85757) (owner: Marostegui)
[06:57:03] <icinga-wm>	 PROBLEM - DPKG on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused
[06:57:05] <icinga-wm>	 PROBLEM - MD RAID on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused
[06:57:11] <elukey>	 checking --^
[06:57:31] <icinga-wm>	 PROBLEM - puppet last run on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused
[06:58:01] <wikibugs>	 (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/485581 (https://phabricator.wikimedia.org/T210478) (owner: Marostegui)
[06:59:02] <wikibugs>	 (Merged) jenkins-bot: db-eqiad.php: Depool db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/485581 (https://phabricator.wikimedia.org/T210478) (owner: Marostegui)
[06:59:09] <icinga-wm>	 RECOVERY - puppet last run on dbmonitor2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:59:11] <icinga-wm>	 RECOVERY - puppet last run on cp1090 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:00:05] <icinga-wm>	 RECOVERY - Check systemd state on notebook1003 is OK: OK - running: The system is fully operational
[07:00:11] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1089 T210478 (duration: 00m 47s)
[07:00:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:15] <stashbot>	 T210478: Migrate dbstore1002 to a multi instance setup on dbstore100[3-5] - https://phabricator.wikimedia.org/T210478
[07:00:15] <icinga-wm>	 RECOVERY - dhclient process on notebook1003 is OK: PROCS OK: 0 processes with command name dhclient
[07:00:23] <wikibugs>	 Operations, ops-eqsin, Traffic: Degraded RAID on cp5010 - https://phabricator.wikimedia.org/T214274 (Vgutierrez) initial failure at 01:39: ` vgutierrez@cp5010:~$ grep sdb /var/log/kern.log |grep -v "__ext4_get_inode_loc" |grep -v "IO failure" Jan 21 01:39:17 cp5010 kernel: [7472180.491194] blk_update...
[07:00:31] <marostegui>	 !log Stop MySQL on db1089 to clone dbstore1003 - T210478
[07:00:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:41] <icinga-wm>	 RECOVERY - configured eth on notebook1003 is OK: OK - interfaces up
[07:00:47] <icinga-wm>	 RECOVERY - DPKG on notebook1003 is OK: All packages OK
[07:00:49] <icinga-wm>	 RECOVERY - MD RAID on notebook1003 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[07:05:05] <icinga-wm>	 PROBLEM - Check systemd state on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused
[07:05:15] <icinga-wm>	 PROBLEM - dhclient process on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused
[07:05:39] <icinga-wm>	 PROBLEM - configured eth on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused
[07:05:45] <icinga-wm>	 PROBLEM - DPKG on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused
[07:05:49] <icinga-wm>	 PROBLEM - MD RAID on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused
[07:08:01] <icinga-wm>	 RECOVERY - puppet last run on notebook1003 is OK: OK: Puppet is currently enabled, last run 21 minutes ago with 0 failures
[07:08:07] <icinga-wm>	 RECOVERY - configured eth on notebook1003 is OK: OK - interfaces up
[07:08:13] <icinga-wm>	 RECOVERY - DPKG on notebook1003 is OK: All packages OK
[07:08:15] <icinga-wm>	 RECOVERY - MD RAID on notebook1003 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[07:08:47] <icinga-wm>	 RECOVERY - Check systemd state on notebook1003 is OK: OK - running: The system is fully operational
[07:08:57] <icinga-wm>	 RECOVERY - dhclient process on notebook1003 is OK: PROCS OK: 0 processes with command name dhclient
[07:10:15] <wikibugs>	 (CR) jenkins-bot: db-eqiad.php: Depool db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/485581 (https://phabricator.wikimedia.org/T210478) (owner: Marostegui)
[07:28:21] <wikibugs>	 (PS1) Marostegui: InitialiseSettings.php: Increase parsercache TTL [mediawiki-config] - https://gerrit.wikimedia.org/r/485582 (https://phabricator.wikimedia.org/T210992)
[07:29:59] <wikibugs>	 (PS1) Marostegui: parsercachepurging: Increase keys TTL [puppet] - https://gerrit.wikimedia.org/r/485583 (https://phabricator.wikimedia.org/T210992)
[07:32:11] <wikibugs>	 Operations, DBA, Performance-Team, Patch-For-Review: Increase parsercache keys TTL  from 22 days back to 30 days - https://phabricator.wikimedia.org/T210992 (Marostegui) @aaron @Joe @jcrespo  I have made the first small change, to go from 22 to 24 days:  https://gerrit.wikimedia.org/r/#/c/operati...
[07:36:43] <wikibugs>	 (PS4) Mathew.onipe: wdqs: convert prom exporter script tp py3 [puppet] - https://gerrit.wikimedia.org/r/484974 (https://phabricator.wikimedia.org/T213305)
[07:36:57] <wikibugs>	 (CR) Mathew.onipe: wdqs: convert prom exporter script tp py3 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/484974 (https://phabricator.wikimedia.org/T213305) (owner: Mathew.onipe)
[07:39:39] <marostegui>	 !log Stop replication on db1124:3313 to fix triggers - T85757
[07:39:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:39:42] <stashbot>	 T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757
[07:40:18] <wikibugs>	 Operations, MediaWiki-Cache, User-Elukey: Consider removing the last traces of nutcracker in Mediawiki configs - https://phabricator.wikimedia.org/T214275 (elukey) p:Triage→Normal
[07:54:37] <wikibugs>	 (PS1) Mathew.onipe: maps: migrate maps1002 to stretch [puppet] - https://gerrit.wikimedia.org/r/485584 (https://phabricator.wikimedia.org/T198622)
[07:55:07] <wikibugs>	 (CR) jerkins-bot: [V: -1] maps: migrate maps1002 to stretch [puppet] - https://gerrit.wikimedia.org/r/485584 (https://phabricator.wikimedia.org/T198622) (owner: Mathew.onipe)
[07:56:34] <wikibugs>	 (PS2) Mathew.onipe: maps: migrate maps1002 to stretch [puppet] - https://gerrit.wikimedia.org/r/485584 (https://phabricator.wikimedia.org/T198622)
[08:10:19] <moritzm>	 !log installing OpenSSL security updates
[08:10:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:48] <wikibugs>	 (PS1) Marostegui: Revert "db-eqiad.php: Depool db1123" [mediawiki-config] - https://gerrit.wikimedia.org/r/485587
[08:20:50] <wikibugs>	 (PS2) Marostegui: Revert "db-eqiad.php: Depool db1123" [mediawiki-config] - https://gerrit.wikimedia.org/r/485587
[08:21:58] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "db-eqiad.php: Depool db1123" [mediawiki-config] - https://gerrit.wikimedia.org/r/485587 (owner: Marostegui)
[08:23:03] <wikibugs>	 (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1123" [mediawiki-config] - https://gerrit.wikimedia.org/r/485587 (owner: Marostegui)
[08:24:03] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1123 - T85757 (duration: 00m 48s)
[08:24:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:06] <stashbot>	 T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757
[08:25:27] <wikibugs>	 Operations, ops-eqiad, RESTBase, RESTBase-Cassandra, and 3 others: Memory error on restbase1016 - https://phabricator.wikimedia.org/T212418 (fgiunchedi) a:Cmjohnson→None >>! In T212418#4893779, @mobrovac wrote: > All of the instances have joined the ring (thnx @fgiunchedi!) and the latest...
[08:27:26] <wikibugs>	 (PS1) Marostegui: db-eqiad.php: Depool db1077 [mediawiki-config] - https://gerrit.wikimedia.org/r/485588 (https://phabricator.wikimedia.org/T85757)
[08:29:19] <wikibugs>	 (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1077 [mediawiki-config] - https://gerrit.wikimedia.org/r/485588 (https://phabricator.wikimedia.org/T85757) (owner: Marostegui)
[08:30:24] <wikibugs>	 (Merged) jenkins-bot: db-eqiad.php: Depool db1077 [mediawiki-config] - https://gerrit.wikimedia.org/r/485588 (https://phabricator.wikimedia.org/T85757) (owner: Marostegui)
[08:31:59] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1077 - T85757 (duration: 00m 46s)
[08:32:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:03] <stashbot>	 T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757
[08:35:51] <marostegui>	 !log Stop replication db1077 to deploy schema change - T85757
[08:35:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:57] <wikibugs>	 (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1123" [mediawiki-config] - https://gerrit.wikimedia.org/r/485587 (owner: Marostegui)
[08:35:59] <wikibugs>	 (CR) jenkins-bot: db-eqiad.php: Depool db1077 [mediawiki-config] - https://gerrit.wikimedia.org/r/485588 (https://phabricator.wikimedia.org/T85757) (owner: Marostegui)
[08:45:50] <wikibugs>	 (PS4) Elukey: profile::reportupdater::jobs::hadoop: move jobs to systemd timers [puppet] - https://gerrit.wikimedia.org/r/485210 (https://phabricator.wikimedia.org/T172532)
[08:48:42] <wikibugs>	 (CR) Elukey: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/14402/" [puppet] - https://gerrit.wikimedia.org/r/485210 (https://phabricator.wikimedia.org/T172532) (owner: Elukey)
[08:54:35] <icinga-wm>	 PROBLEM - puppet last run on stat1007 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 1 minute ago with 2 failures. Failed resources (up to 3 shown): Service[reportupdater-browser.timer],Service[reportupdater-interlanguage.timer]
[08:56:09] <elukey>	 this is me --^
[08:56:20] <wikibugs>	 (PS1) Elukey: reportupdater::job: use absolute paths in timer's definition [puppet] - https://gerrit.wikimedia.org/r/485591 (https://phabricator.wikimedia.org/T172532)
[08:57:04] <wikibugs>	 (CR) Elukey: [C: +2] reportupdater::job: use absolute paths in timer's definition [puppet] - https://gerrit.wikimedia.org/r/485591 (https://phabricator.wikimedia.org/T172532) (owner: Elukey)
[08:59:32] <addshore>	 jouncebot: next
[08:59:32] <jouncebot>	 In 18 hour(s) and 0 minute(s): ContentTranslation Draft Purge Script Run (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190122T0300)
[08:59:47] <icinga-wm>	 RECOVERY - puppet last run on stat1007 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[09:00:07] <addshore>	 aah, monday is a no deploy day, heh
[09:01:25] * addshore will be backporting a one-line fix for ArticlePlaceholder which has been broken since last week :(
[09:01:30] <addshore>	 once it is merged on the branch
[09:03:07] <icinga-wm>	 PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:06:07] <wikibugs>	 (PS1) Vgutierrez: certcentral: Implement staging time [software/certcentral] - https://gerrit.wikimedia.org/r/485594 (https://phabricator.wikimedia.org/T213737)
[09:06:25] <icinga-wm>	 PROBLEM - DPKG on cumin2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[09:07:55] <wikibugs>	 (CR) jerkins-bot: [V: -1] certcentral: Implement staging time [software/certcentral] - https://gerrit.wikimedia.org/r/485594 (https://phabricator.wikimedia.org/T213737) (owner: Vgutierrez)
[09:15:29] <icinga-wm>	 PROBLEM - puppet last run on cumin2001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[tshark],Exec[set debconf flag seen for wireshark-common/install-setuid]
[09:15:48] <wikibugs>	 (PS2) Vgutierrez: certcentral: Implement staging time [software/certcentral] - https://gerrit.wikimedia.org/r/485594 (https://phabricator.wikimedia.org/T213737)
[09:17:34] <wikibugs>	 (CR) jerkins-bot: [V: -1] certcentral: Implement staging time [software/certcentral] - https://gerrit.wikimedia.org/r/485594 (https://phabricator.wikimedia.org/T213737) (owner: Vgutierrez)
[09:19:10] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] Remove Diamond from Kafka hosts [puppet] - https://gerrit.wikimedia.org/r/485004 (https://phabricator.wikimedia.org/T212231) (owner: Muehlenhoff)
[09:27:15] <icinga-wm>	 RECOVERY - DPKG on cumin2001 is OK: All packages OK
[09:30:06] <marostegui>	 !log Compress a few tables on dbstore1003:3315 - T210478
[09:30:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:30:10] <stashbot>	 T210478: Migrate dbstore1002 to a multi instance setup on dbstore100[3-5] - https://phabricator.wikimedia.org/T210478
[09:31:57] <wikibugs>	 Operations, DC-Ops, Discovery: Memory correctable errors -EDAC- elastic1029 - https://phabricator.wikimedia.org/T214283 (Mathew.onipe)
[09:33:14] <wikibugs>	 (PS2) Filippo Giunchedi: prometheus: update prometheus-labs-targets to use keystone/nova clients [puppet] - https://gerrit.wikimedia.org/r/485193 (https://phabricator.wikimedia.org/T214058)
[09:34:31] <wikibugs>	 Operations, ops-eqiad, DC-Ops, Discovery: Memory correctable errors -EDAC- elastic1029 - https://phabricator.wikimedia.org/T214283 (Mathew.onipe)
[09:34:39] <icinga-wm>	 RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational
[09:35:45] <wikibugs>	 (PS1) Marostegui: db-eqiad.php: Slowly repool db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/485607
[09:39:52] <wikibugs>	 (CR) Marostegui: [C: +2] db-eqiad.php: Slowly repool db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/485607 (owner: Marostegui)
[09:40:07] <wikibugs>	 (PS1) DCausse: [cirrus] autocomplete: enable subphrase matching for wikitech and mw.org (1/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/485608 (https://phabricator.wikimedia.org/T212788)
[09:40:10] <wikibugs>	 (PS1) DCausse: [cirrus] autocomplete: enable subphrase matching for wikitech and mw.org (2/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/485609 (https://phabricator.wikimedia.org/T212788)
[09:41:08] <wikibugs>	 (Merged) jenkins-bot: db-eqiad.php: Slowly repool db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/485607 (owner: Marostegui)
[09:42:10] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly Repool db1089 T210478 (duration: 00m 45s)
[09:42:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:42:13] <stashbot>	 T210478: Migrate dbstore1002 to a multi instance setup on dbstore100[3-5] - https://phabricator.wikimedia.org/T210478
[09:44:14] <wikibugs>	 (PS3) Vgutierrez: certcentral: Implement staging time [software/certcentral] - https://gerrit.wikimedia.org/r/485594 (https://phabricator.wikimedia.org/T213737)
[09:44:46] <wikibugs>	 (CR) jenkins-bot: db-eqiad.php: Slowly repool db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/485607 (owner: Marostegui)
[09:44:50] <wikibugs>	 (CR) Filippo Giunchedi: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/485193 (https://phabricator.wikimedia.org/T214058) (owner: Filippo Giunchedi)
[09:45:58] <wikibugs>	 (CR) jerkins-bot: [V: -1] certcentral: Implement staging time [software/certcentral] - https://gerrit.wikimedia.org/r/485594 (https://phabricator.wikimedia.org/T213737) (owner: Vgutierrez)
[09:46:13] <icinga-wm>	 RECOVERY - puppet last run on cumin2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[09:48:26] <wikibugs>	 (CR) Arturo Borrero Gonzalez: "> > Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/485193 (https://phabricator.wikimedia.org/T214058) (owner: Filippo Giunchedi)
[09:52:01] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: labtestneutron2001: cleanup [dns] - https://gerrit.wikimedia.org/r/485613 (https://phabricator.wikimedia.org/T214167)
[09:52:21] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] labtestneutron2001: cleanup [dns] - https://gerrit.wikimedia.org/r/485613 (https://phabricator.wikimedia.org/T214167) (owner: Arturo Borrero Gonzalez)
[09:53:52] <wikibugs>	 (PS1) Marostegui: db-eqiad.php: Increase traffic for db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/485614
[09:55:09] <wikibugs>	 (CR) Marostegui: [C: +2] db-eqiad.php: Increase traffic for db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/485614 (owner: Marostegui)
[09:56:14] <wikibugs>	 (Merged) jenkins-bot: db-eqiad.php: Increase traffic for db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/485614 (owner: Marostegui)
[09:57:16] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Give more traffic to db1089 (duration: 00m 45s)
[09:57:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:58:09] <wikibugs>	 (CR) jenkins-bot: db-eqiad.php: Increase traffic for db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/485614 (owner: Marostegui)
[10:00:48] <wikibugs>	 (PS1) Marostegui: Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - https://gerrit.wikimedia.org/r/485615
[10:01:31] <wikibugs>	 (PS2) Marostegui: Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - https://gerrit.wikimedia.org/r/485615
[10:01:32] <wikibugs>	 Operations, ops-codfw, cloud-services-team (Kanban): labstore2004 - memory error on DIMM A2 - https://phabricator.wikimedia.org/T214262 (faidon) This is a super old server; it just crossed its 7-year mark (we typically refresh servers at 4.5-5 years), so we're way past its warranty and shelf life and...
[10:15:45] <wikibugs>	 (PS1) Marostegui: db-eqiad.php: Fully repool db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/485616
[10:17:04] <wikibugs>	 (CR) Marostegui: [C: +2] db-eqiad.php: Fully repool db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/485616 (owner: Marostegui)
[10:18:09] <wikibugs>	 (Merged) jenkins-bot: db-eqiad.php: Fully repool db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/485616 (owner: Marostegui)
[10:19:06] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Fully repool db1089 (duration: 00m 45s)
[10:19:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:21:05] <addshore>	 marostegui: o/ I have a backport to deploy to fix an UBN :) just want to make sure we don't step on each other
[10:21:17] <marostegui>	 addshore: go for it!
[10:21:23] <addshore>	 marostegui: thanks, will ping when done too!
[10:21:30] <marostegui>	 excellent!
[10:23:45] <wikibugs>	 (PS1) Jcrespo: mariadb: Promote db2047 to master on configuration management [puppet] - https://gerrit.wikimedia.org/r/485617 (https://phabricator.wikimedia.org/T214264)
[10:25:01] <wikibugs>	 (CR) jenkins-bot: db-eqiad.php: Fully repool db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/485616 (owner: Marostegui)
[10:25:21] <addshore>	 syncing
[10:25:40] <wikibugs>	 (PS2) Jcrespo: mariadb: Promote db2047 to master on configuration management [puppet] - https://gerrit.wikimedia.org/r/485617 (https://phabricator.wikimedia.org/T214264)
[10:25:43] <wikibugs>	 (CR) Marostegui: "Are you doing the db2047.yaml file in a separate commit?" [puppet] - https://gerrit.wikimedia.org/r/485617 (https://phabricator.wikimedia.org/T214264) (owner: Jcrespo)
[10:25:57] <marostegui>	 hehe
[10:26:03] <logmsgbot>	 !log addshore@deploy1001 Synchronized php-1.33.0-wmf.13/extensions/ArticlePlaceholder/includes/AboutTopicRenderer.php: T213739 Pass a usageAccumulator to SidebarGenerator (duration: 00m 47s)
[10:26:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:26:06] <stashbot>	 T213739: AboutTopicRenderer, OtherProjectsSidebarGeneratorFactory::getOtherProjectsSidebarGenerator() must be an instance of UsageAccumulator, undefined variable given - https://phabricator.wikimedia.org/T213739
[10:26:12] <addshore>	 marostegui: all done!
[10:26:16] <wikibugs>	 (03CR) 10Jcrespo: "^" [puppet] - 10https://gerrit.wikimedia.org/r/485617 (https://phabricator.wikimedia.org/T214264) (owner: 10Jcrespo)
[10:26:17] <marostegui>	 addshore: thanks!
[10:26:32] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] mariadb: Promote db2047 to master on configuration management [puppet] - 10https://gerrit.wikimedia.org/r/485617 (https://phabricator.wikimedia.org/T214264) (owner: 10Jcrespo)
[10:27:37] <icinga-wm>	 PROBLEM - HHVM rendering on mwdebug1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:28:43] <icinga-wm>	 RECOVERY - HHVM rendering on mwdebug1002 is OK: HTTP OK: HTTP/1.1 200 OK - 79912 bytes in 0.210 second response time
[10:30:58] <wikibugs>	 (03CR) 10Gehel: [C: 04-1] wdqs: convert prom exporter script to py3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/484974 (https://phabricator.wikimedia.org/T213305) (owner: 10Mathew.onipe)
[10:32:46] <wikibugs>	 (03PS3) 10Marostegui: Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485615
[10:32:48] <wikibugs>	 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, 10Patch-For-Review: improve docker registry architecture - https://phabricator.wikimedia.org/T209271 (10fselles)
[10:32:51] <wikibugs>	 10Operations, 10monitoring, 10Kubernetes: debianize docker-registry 2.7.0-rc0 and upload in stretch-wikimedia - https://phabricator.wikimedia.org/T210071 (10fselles) 05Open→03Resolved
[10:33:24] <jynus>	 !log upgrade and restart db2047 T214264
[10:33:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:33:27] <stashbot>	 T214264: BBU issues on codfw - https://phabricator.wikimedia.org/T214264
[10:34:01] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485615 (owner: 10Marostegui)
[10:35:12] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485615 (owner: 10Marostegui)
[10:35:23] <wikibugs>	 (03PS37) 10Elukey: admin: allow users to be deployed without ssh keys configured [puppet] - 10https://gerrit.wikimedia.org/r/482275 (https://phabricator.wikimedia.org/T212949)
[10:36:13] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1077 - T85757 (duration: 00m 44s)
[10:36:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:36:17] <stashbot>	 T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757
[10:38:30] <wikibugs>	 (03CR) 10Gehel: [C: 04-1] "There is also a missing change to DHCP configuration to migrate maps1002 to stretch." (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/485584 (https://phabricator.wikimedia.org/T198622) (owner: 10Mathew.onipe)
[10:38:49] <wikibugs>	 (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485615 (owner: 10Marostegui)
[10:42:25] <icinga-wm>	 RECOVERY - Disk space on notebook1003 is OK: DISK OK
[10:43:17] <elukey>	 (forced the remount)
[10:51:32] <elukey>	 !log disable puppet fleetwide to ease the merge/deploy of a puppet admin module change - T212949
[10:51:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:51:35] <stashbot>	 T212949: Allow the deployment of users without SSH access - https://phabricator.wikimedia.org/T212949
[10:52:01] <icinga-wm>	 RECOVERY - Host labstore2004 is UP: PING OK - Packet loss = 0%, RTA = 36.16 ms
[10:55:02] <wikibugs>	 10Operations, 10ops-codfw, 10cloud-services-team (Kanban): labstore2004 - memory error on DIMM A2 - https://phabricator.wikimedia.org/T214262 (10GTirloni) @faidon thanks! After another reboot the system was able to get past that error and boot successfully. Just for reference, it's DDR3 1333MHz memory type.
[10:55:08] <wikibugs>	 10Operations, 10ops-codfw, 10cloud-services-team (Kanban): labstore2004 - memory error on DIMM A2 - https://phabricator.wikimedia.org/T214262 (10GTirloni) 05Open→03Resolved p:05Triage→03Low
[10:56:32] <wikibugs>	 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Make swift containers for docker registry cross replicated. - https://phabricator.wikimedia.org/T214289 (10fselles) p:05Triage→03Normal
[11:02:43] <wikibugs>	 (03PS1) 10Filippo Giunchedi: prometheus: move ::scripts to ::wmcs_scripts [puppet] - 10https://gerrit.wikimedia.org/r/485620 (https://phabricator.wikimedia.org/T214058)
[11:02:45] <wikibugs>	 (03PS1) 10Filippo Giunchedi: prometheus: fetch WMCS targets from all regions [puppet] - 10https://gerrit.wikimedia.org/r/485621 (https://phabricator.wikimedia.org/T214058)
[11:02:54] <wikibugs>	 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Make swift containers for docker registry cross replicated. - https://phabricator.wikimedia.org/T214289 (10fselles) I've replicated this using a local SAIO setup and it seems to work, however obviously we are avoiding network latency here hence t...
[11:03:43] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] admin: allow users to be deployed without ssh keys configured [puppet] - 10https://gerrit.wikimedia.org/r/482275 (https://phabricator.wikimedia.org/T212949) (owner: 10Elukey)
[11:04:54] <elukey>	 all right merged my change, running puppet now
[11:05:06] <elukey>	 (on a few hosts to verify that it is a no op)
[11:05:25] <elukey>	 if anybody wants to verify their area of competence it would help a ton :)
[11:05:30] <wikibugs>	 10Operations, 10Thumbor, 10serviceops, 10Patch-For-Review, and 3 others: Upgrade Thumbor servers to Stretch - https://phabricator.wikimedia.org/T170817 (10Gilles)
[11:07:52] <elukey>	 so far all good, ran puppet on some analytics nodes, didn't see anything strange
[11:12:17] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "I didn't test the code. My eyeball didn't detect any important issue, but you should make sure every code branch works as expected. Prob" [puppet] - 10https://gerrit.wikimedia.org/r/485621 (https://phabricator.wikimedia.org/T214058) (owner: 10Filippo Giunchedi)
[11:13:13] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "Thanks for not introducing the 'labs' keyword again :-)" [puppet] - 10https://gerrit.wikimedia.org/r/485620 (https://phabricator.wikimedia.org/T214058) (owner: 10Filippo Giunchedi)
[11:13:34] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] prometheus: update prometheus-labs-targets to use keystone/nova clients [puppet] - 10https://gerrit.wikimedia.org/r/485193 (https://phabricator.wikimedia.org/T214058) (owner: 10Filippo Giunchedi)
[11:13:37] <wikibugs>	 (03PS4) 10Giuseppe Lavagetto: profile::services_proxy: simple local proxying for remote services [puppet] - 10https://gerrit.wikimedia.org/r/483788 (https://phabricator.wikimedia.org/T210717)
[11:14:58] <wikibugs>	 (03PS1) 10Filippo Giunchedi: prometheus: use .format instead of % in prometheus-labs-targets [puppet] - 10https://gerrit.wikimedia.org/r/485623 (https://phabricator.wikimedia.org/T214058)
[11:19:18] <wikibugs>	 (03PS3) 10Filippo Giunchedi: prometheus: update prometheus-labs-targets to use keystone/nova clients [puppet] - 10https://gerrit.wikimedia.org/r/485193 (https://phabricator.wikimedia.org/T214058)
[11:19:20] <wikibugs>	 (03PS2) 10Filippo Giunchedi: prometheus: move ::scripts to ::wmcs_scripts [puppet] - 10https://gerrit.wikimedia.org/r/485620 (https://phabricator.wikimedia.org/T214058)
[11:19:22] <wikibugs>	 (03PS2) 10Filippo Giunchedi: prometheus: fetch WMCS targets from all regions [puppet] - 10https://gerrit.wikimedia.org/r/485621 (https://phabricator.wikimedia.org/T214058)
[11:19:24] <wikibugs>	 (03PS2) 10Filippo Giunchedi: prometheus: use .format instead of % in prometheus-labs-targets [puppet] - 10https://gerrit.wikimedia.org/r/485623 (https://phabricator.wikimedia.org/T214058)
[11:20:53] <wikibugs>	 (03PS4) 10Giuseppe Lavagetto: mediawiki::common: add proxy for services [puppet] - 10https://gerrit.wikimedia.org/r/483789 (https://phabricator.wikimedia.org/T210717)
[11:25:10] <onimisionipe>	 !log depool maps1003 to fix replication lag issues
[11:25:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:30:10] <wikibugs>	 (03PS1) 10Hashar: doc: minor tweaks [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/485626
[11:31:12] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: update prometheus-labs-targets to use keystone/nova clients [puppet] - 10https://gerrit.wikimedia.org/r/485193 (https://phabricator.wikimedia.org/T214058) (owner: 10Filippo Giunchedi)
[11:31:26] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: move ::scripts to ::wmcs_scripts [puppet] - 10https://gerrit.wikimedia.org/r/485620 (https://phabricator.wikimedia.org/T214058) (owner: 10Filippo Giunchedi)
[11:31:36] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: fetch WMCS targets from all regions [puppet] - 10https://gerrit.wikimedia.org/r/485621 (https://phabricator.wikimedia.org/T214058) (owner: 10Filippo Giunchedi)
[11:31:38] <wikibugs>	 (03PS5) 10Giuseppe Lavagetto: profile::services_proxy: simple local proxying for remote services [puppet] - 10https://gerrit.wikimedia.org/r/483788 (https://phabricator.wikimedia.org/T210717)
[11:31:40] <wikibugs>	 (03PS5) 10Giuseppe Lavagetto: mediawiki::common: add proxy for services [puppet] - 10https://gerrit.wikimedia.org/r/483789 (https://phabricator.wikimedia.org/T210717)
[11:31:43] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: use .format instead of % in prometheus-labs-targets [puppet] - 10https://gerrit.wikimedia.org/r/485623 (https://phabricator.wikimedia.org/T214058) (owner: 10Filippo Giunchedi)
[11:32:06] <wikibugs>	 (03PS1) 10Lucas Werkmeister (WMDE): Decrease WBQualityConstraintsTypeCheckMaxEntities [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485628 (https://phabricator.wikimedia.org/T209504)
[11:36:06] <wikibugs>	 (03PS6) 10Giuseppe Lavagetto: profile::services_proxy: simple local proxying for remote services [puppet] - 10https://gerrit.wikimedia.org/r/483788 (https://phabricator.wikimedia.org/T210717)
[11:36:08] <wikibugs>	 (03PS6) 10Giuseppe Lavagetto: mediawiki::common: add proxy for services [puppet] - 10https://gerrit.wikimedia.org/r/483789 (https://phabricator.wikimedia.org/T210717)
[11:50:41] <wikibugs>	 (03PS7) 10Giuseppe Lavagetto: profile::services_proxy: simple local proxying for remote services [puppet] - 10https://gerrit.wikimedia.org/r/483788 (https://phabricator.wikimedia.org/T210717)
[11:50:43] <wikibugs>	 (03PS7) 10Giuseppe Lavagetto: mediawiki::common: add proxy for services [puppet] - 10https://gerrit.wikimedia.org/r/483789 (https://phabricator.wikimedia.org/T210717)
[11:54:25] <wikibugs>	 (03PS1) 10Filippo Giunchedi: prometheus: install wmcs_scripts dependencies from jessie-backports [puppet] - 10https://gerrit.wikimedia.org/r/485635 (https://phabricator.wikimedia.org/T214058)
[11:56:56] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: install wmcs_scripts dependencies from jessie-backports [puppet] - 10https://gerrit.wikimedia.org/r/485635 (https://phabricator.wikimedia.org/T214058) (owner: 10Filippo Giunchedi)
[12:01:05] <wikibugs>	 (03PS4) 10Elukey: role::analytics_cluster::hadoop: add groups without ssh access [puppet] - 10https://gerrit.wikimedia.org/r/484165 (https://phabricator.wikimedia.org/T212949)
[12:05:48] <mvolz>	 mobrovac, akosiaris: looks like the update on thurs got rid of the worst of the memory spikes https://grafana.wikimedia.org/d/000000620/xxxx-zotero-debugging-kubernetes?orgId=1&from=now-7d&to=now
[12:06:33] <akosiaris>	 mvolz: it does look like it indeed. Nice!
[12:07:10] <mvolz>	 how are we on other stuff, i.e. segfaults? I'm not sure how to look for those? 
[12:08:21] <fsero>	 no pod restarts, that is good also :)
[12:19:08] <wikibugs>	 (03CR) 10Hashar: Add a prune action (036 comments) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/485499 (https://phabricator.wikimedia.org/T207703) (owner: 10Giuseppe Lavagetto)
[12:28:45] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] "minor nitpick, rest LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/485099 (owner: 10Dzahn)
[12:31:11] <wikibugs>	 10Operations, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Allow the deployment of users without SSH access - https://phabricator.wikimedia.org/T212949 (10elukey)
[12:31:48] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: package_builder: Remove support for trusty [puppet] - 10https://gerrit.wikimedia.org/r/485191 (owner: 10Muehlenhoff)
[12:31:53] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] package_builder: Remove support for trusty [puppet] - 10https://gerrit.wikimedia.org/r/485191 (owner: 10Muehlenhoff)
[12:32:22] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/14407/mc1025.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/485172 (https://phabricator.wikimedia.org/T208844) (owner: 10Effie Mouzeli)
[12:32:40] <wikibugs>	 (03PS2) 10Effie Mouzeli: Apply -R 200 to memcached on mc1025 [puppet] - 10https://gerrit.wikimedia.org/r/485172 (https://phabricator.wikimedia.org/T208844)
[12:32:43] <wikibugs>	 (03PS1) 10Elukey: Remove unnecessary SSH keys from Hadoop masters (testing cluster) [puppet] - 10https://gerrit.wikimedia.org/r/485640 (https://phabricator.wikimedia.org/T212949)
[12:33:31] <wikibugs>	 (03PS2) 10Elukey: Remove unnecessary SSH keys from Hadoop masters (testing cluster) [puppet] - 10https://gerrit.wikimedia.org/r/485640 (https://phabricator.wikimedia.org/T212949)
[12:34:58] <wikibugs>	 (03PS3) 10Effie Mouzeli: Apply -R 200 to memcached on mc1025 [puppet] - 10https://gerrit.wikimedia.org/r/485172 (https://phabricator.wikimedia.org/T208844)
[12:35:57] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] Apply -R 200 to memcached on mc1025 [puppet] - 10https://gerrit.wikimedia.org/r/485172 (https://phabricator.wikimedia.org/T208844) (owner: 10Effie Mouzeli)
[12:36:05] <jijiki>	 !log Restarting memcached on mc1025 to apply '-R 200' - T208844
[12:36:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:36:08] <stashbot>	 T208844: Apply -R 200 to all the memcached mw object cache instances running in eqiad/codfw - https://phabricator.wikimedia.org/T208844
[12:36:17] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Remove unnecessary SSH keys from Hadoop masters (testing cluster) [puppet] - 10https://gerrit.wikimedia.org/r/485640 (https://phabricator.wikimedia.org/T212949) (owner: 10Elukey)
[12:36:24] <wikibugs>	 (03PS3) 10Elukey: Remove unnecessary SSH keys from Hadoop masters (testing cluster) [puppet] - 10https://gerrit.wikimedia.org/r/485640 (https://phabricator.wikimedia.org/T212949)
[12:36:27] <wikibugs>	 (03CR) 10Elukey: [V: 03+2 C: 03+2] Remove unnecessary SSH keys from Hadoop masters (testing cluster) [puppet] - 10https://gerrit.wikimedia.org/r/485640 (https://phabricator.wikimedia.org/T212949) (owner: 10Elukey)
[12:37:44] <wikibugs>	 10Operations, 10MediaWiki-Cache, 10serviceops, 10Patch-For-Review, and 3 others: Apply -R 200 to all the memcached mw object cache instances running in eqiad/codfw - https://phabricator.wikimedia.org/T208844 (10jijiki)
[12:42:08] <wikibugs>	 (03PS5) 10Elukey: role::analytics_cluster::hadoop: add groups without ssh access [puppet] - 10https://gerrit.wikimedia.org/r/484165 (https://phabricator.wikimedia.org/T212949)
[12:42:42] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] role::analytics_cluster::hadoop: add groups without ssh access [puppet] - 10https://gerrit.wikimedia.org/r/484165 (https://phabricator.wikimedia.org/T212949) (owner: 10Elukey)
[12:42:50] <wikibugs>	 (03PS3) 10Daimona Eaytoy: Enable $wgAbuseFilterRuntimeProfile on every wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423945 (https://phabricator.wikimedia.org/T191039)
[12:43:01] <wikibugs>	 (03CR) 10Daimona Eaytoy: [C: 04-1] "Per commit message." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423945 (https://phabricator.wikimedia.org/T191039) (owner: 10Daimona Eaytoy)
[12:55:24] <wikibugs>	 (03PS7) 10Daimona Eaytoy: Enable $wgAbuseFilterProfile on every wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423660 (https://phabricator.wikimedia.org/T191039)
[12:57:23] <wikibugs>	 (03PS1) 10Jbond: Reimage analytics1001 to stretch (as an exercise) [puppet] - 10https://gerrit.wikimedia.org/r/485647 (https://phabricator.wikimedia.org/T214294)
[13:11:40] <kart_>	 marostegui: Can you check https://phabricator.wikimedia.org/P8014
[13:11:52] <kart_>	 marostegui: related to, https://phabricator.wikimedia.org/T203059
[13:12:08] <kart_>	 marostegui: also, it runs fine now, but those errors happened in anwiki.
[13:13:46] <kart_>	 OK. Not specific it seems.
[13:16:08] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] Reimage analytics1001 to stretch (as an exercise) [puppet] - 10https://gerrit.wikimedia.org/r/485647 (https://phabricator.wikimedia.org/T214294) (owner: 10Jbond)
[13:16:29] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] Reimage analytics1001 to stretch (as an exercise) [puppet] - 10https://gerrit.wikimedia.org/r/485647 (https://phabricator.wikimedia.org/T214294) (owner: 10Jbond)
[13:25:59] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[13:27:11] <jynus>	 there was a spike at 13:17
[13:28:16] <jynus>	 however, I don't see anything on the logs
[13:29:41] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[13:30:56] <kart_>	 jynus: requesting your comment at, https://phabricator.wikimedia.org/P8014 - when you've time.
[13:32:07] <jynus>	 kart_: talk to AaronSchulz; I already told him I think our GTID implementation is not correct
[13:32:43] <kart_>	 Nikerabbit: ^
[13:32:52] <kart_>	 jynus: Thanks.
[13:36:50] <jynus>	 PHP Fatal Error from line 131 of /srv/mediawiki/php-1.33.0-wmf.13/extensions/TemplateData/includes/api/ApiTemplateData.php: Argument 1 passed to ApiResult::setIndexedTagName() must be an instance of array, null given
[13:37:02] <jynus>	 ^this may be the issue (of current high mediawiki fatals)
[13:38:51] <Nikerabbit>	 that's https://phabricator.wikimedia.org/T213953
[13:42:19] <jynus>	 kart_: one quick solution would be to GTID wait only on <ipserver of master>-<ipserver of master>-<current transaction id>, but that is not for me to decide
[13:43:14] <kart_>	 That's Nikerabbit :)
[13:43:44] <jynus>	 oh, sorry
[13:44:12] <jynus>	 but you are KartikMistry on phab, right?
[13:45:27] <jynus>	 so it is more of a "there wasn't any good solution and the impact was low at the time, so the decision was postponed"
[13:47:35] <jynus>	 Nikerabbit: there is some context at https://phabricator.wikimedia.org/T172497#4309959 but it gets mixed with other issues, so it is not really a ticket about the issue as much as a brainstorming mixing some architectural problems
[13:48:03] <wikibugs>	 (03PS1) 10Elukey: reportupdate: move all jobs to systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/485656 (https://phabricator.wikimedia.org/T172532)
[13:48:22] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloudnet2001-dev: hiera cleanup for stretch/mitaka [puppet] - 10https://gerrit.wikimedia.org/r/485657 (https://phabricator.wikimedia.org/T214299)
[13:48:45] <wikibugs>	 (03CR) 10Muehlenhoff: package_builder: add data types to parameters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/485099 (owner: 10Dzahn)
[13:48:51] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 04-1] package_builder: add data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/485099 (owner: 10Dzahn)
[13:49:20] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudnet2001-dev: hiera cleanup for stretch/mitaka [puppet] - 10https://gerrit.wikimedia.org/r/485657 (https://phabricator.wikimedia.org/T214299) (owner: 10Arturo Borrero Gonzalez)
[13:50:34] <Nikerabbit>	 jynus: is there something that triggers this "issue" (which we could try changing) or is it just "tough luck"?
[13:51:11] <wikibugs>	 (03PS2) 10Elukey: reportupdater: move all jobs to systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/485656 (https://phabricator.wikimedia.org/T172532)
[13:51:19] <Nikerabbit>	 for example: calling waitForReplication without performing writes first
[13:51:23] <jynus>	 certainly a master switch can make it more prevalent
[13:51:34] <jynus>	 which happened as an emergency last week
[13:51:40] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/14408/" [puppet] - 10https://gerrit.wikimedia.org/r/485656 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey)
[13:51:48] <wikibugs>	 (03PS3) 10Elukey: reportupdater: move all jobs to systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/485656 (https://phabricator.wikimedia.org/T172532)
[13:51:51] <wikibugs>	 (03CR) 10Elukey: [V: 03+2 C: 03+2] reportupdater: move all jobs to systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/485656 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey)
[13:51:55] <jynus>	 but in reality it should happen all the time, the surprising part it works at times
[13:52:03] <jynus>	 *is
[13:52:19] <jynus>	 whenever chronology protector gets executed
[13:52:40] <elukey>	 Nikerabbit: o/
[13:53:53] <Nikerabbit>	 elukey: o hi
[13:54:43] <icinga-wm>	 PROBLEM - Host cloudnet2001-dev is DOWN: PING CRITICAL - Packet loss = 100%
[13:57:03] <icinga-wm>	 RECOVERY - Host cloudnet2001-dev is UP: PING OK - Packet loss = 0%, RTA = 36.13 ms
[13:58:21] <marostegui>	 !log Compress enwiki on dbstore1003:3311 - T210478
[13:58:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:58:25] <stashbot>	 T210478: Migrate dbstore1002 to a multi instance setup on dbstore100[3-5] - https://phabricator.wikimedia.org/T210478
[13:59:57] <Nikerabbit>	 elukey: I saw your comment. It's okay for me if someone adds the debug logging (I can review and +2)
[14:01:08] <wikibugs>	 (03PS6) 10Daimona Eaytoy: Move all AbuseFilter config to abusefilter.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477063 (https://phabricator.wikimedia.org/T145931)
[14:01:47] <elukey>	 Nikerabbit: ah yes nice! I only wanted to say hi, no hidden pings! :)
[14:02:18] <elukey>	 I have no idea where/how to add the logging, I'll wait for AaronSchulz's opinion!
[14:02:54] <Nikerabbit>	 sure
[14:03:13] <Nikerabbit>	 It would be close to the code that he already modified
[14:05:23] <elukey>	 Nikerabbit: very long rabbit hole, thanks a lot for all the help!
[14:06:25] <icinga-wm>	 PROBLEM - Check systemd state on cloudnet2001-dev is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[14:07:45] <wikibugs>	 10Operations, 10monitoring: Serve >= 50% of production Prometheus systems with Prometheus v2 - https://phabricator.wikimedia.org/T187987 (10fgiunchedi)
[14:07:50] <wikibugs>	 10Operations, 10monitoring, 10Patch-For-Review: Convert prometheus-labs-targets to use nova API instead of wikitech's api.php - https://phabricator.wikimedia.org/T214058 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi This is completed! Checked in tools targets are getting updated as expected:   ` root...
[14:09:08] <wikibugs>	 10Operations, 10monitoring, 10Patch-For-Review: Convert prometheus-labs-targets to use nova API instead of wikitech's api.php - https://phabricator.wikimedia.org/T214058 (10aborrero) Congratulations for managing your way out of the rabbit hole :-)
[14:17:28] <wikibugs>	 (03CR) 10Gehel: puppet: add is_disabled() method (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/485066 (owner: 10Volans)
[14:17:34] <wikibugs>	 (03CR) 10Gehel: [C: 04-1] puppet: add is_disabled() method [software/spicerack] - 10https://gerrit.wikimedia.org/r/485066 (owner: 10Volans)
[14:18:47] <wikibugs>	 (03PS7) 10Gehel: Add allocator metrics export for Blazegraph [puppet] - 10https://gerrit.wikimedia.org/r/485135 (https://phabricator.wikimedia.org/T213372) (owner: 10Smalyshev)
[14:23:24] <wikibugs>	 (03CR) 10Gehel: [C: 03+1] "This looks trivial enough, but I'm not entirely sure about the implications. Take my +1 as "LGTM, but feel free to recheck with someone mo" [software/spicerack] - 10https://gerrit.wikimedia.org/r/484239 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans)
[14:24:31] <wikibugs>	 (03CR) 10Gehel: [C: 03+1] "LGTM, trivial enough" [cookbooks] - 10https://gerrit.wikimedia.org/r/484255 (owner: 10Volans)
[14:31:02] <wikibugs>	 (03CR) 10Gehel: [C: 04-1] puppet: add is_disabled() method (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/485066 (owner: 10Volans)
[14:32:56] <wikibugs>	 (03CR) 10Gehel: "minor comments inline, otherwise lgtm" (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/484432 (https://phabricator.wikimedia.org/T205886) (owner: 10Volans)
[14:35:48] <wikibugs>	 10Operations, 10Citoid, 10SRE-Access-Requests: Requesting access to Citoid/Zotero production servers for MVOLZ - https://phabricator.wikimedia.org/T213269 (10Daimona) (Excuse me, typo upon committing)
[14:37:05] <wikibugs>	 (03CR) 10Gehel: [C: 03+1] dns: fix logging message [software/spicerack] - 10https://gerrit.wikimedia.org/r/484524 (owner: 10Volans)
[14:40:44] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove obsolete Hadoop netboot entries and obsolete analytics-dell recipe [puppet] - 10https://gerrit.wikimedia.org/r/485667 (https://phabricator.wikimedia.org/T156955)
[14:42:22] <wikibugs>	 (03CR) 10Gehel: [C: 04-1] "minor comments inline" (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/484582 (owner: 10Volans)
[14:43:43] <wikibugs>	 (03PS1) 10Jbond: Add partman config for analytics1001 back [puppet] - 10https://gerrit.wikimedia.org/r/485668 (https://phabricator.wikimedia.org/T214294)
[14:43:57] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] Remove obsolete Hadoop netboot entries and obsolete analytics-dell recipe [puppet] - 10https://gerrit.wikimedia.org/r/485667 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff)
[14:44:48] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] Add partman config for analytics1001 back [puppet] - 10https://gerrit.wikimedia.org/r/485668 (https://phabricator.wikimedia.org/T214294) (owner: 10Jbond)
[15:02:22] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] Remove obsolete Hadoop netboot entries and obsolete analytics-dell recipe [puppet] - 10https://gerrit.wikimedia.org/r/485667 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff)
[15:11:29] <dcausse>	 !log closing frwikiquote_* indices on elasticsearch search-chi@eqiad (T214052)
[15:11:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:11:33] <stashbot>	 T214052: Delete indices moved from chi to psi/omega - https://phabricator.wikimedia.org/T214052
[15:19:41] <dcausse>	 !log closing frwikiquote_* indices on elasticsearch search-chi@codfw (T214052)
[15:19:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:19:44] <stashbot>	 T214052: Delete indices moved from chi to psi/omega - https://phabricator.wikimedia.org/T214052
[15:24:55] <icinga-wm>	 PROBLEM - Check systemd state on mendelevium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[15:25:01] <icinga-wm>	 PROBLEM - clamd running on mendelevium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (clamav), command name clamd
[15:27:21] <icinga-wm>	 RECOVERY - Check systemd state on mendelevium is OK: OK - running: The system is fully operational
[15:27:27] <icinga-wm>	 RECOVERY - clamd running on mendelevium is OK: PROCS OK: 1 process with UID = 111 (clamav), command name clamd
[15:29:25] <wikibugs>	 (03PS5) 10Elukey: profile::analytics::refinery: move sanitize_eventlogging_analytics to timer [puppet] - 10https://gerrit.wikimedia.org/r/483426 (https://phabricator.wikimedia.org/T172532)
[15:31:04] <wikibugs>	 (03CR) 10Elukey: "Marcel: do you think that we could deploy this?" [puppet] - 10https://gerrit.wikimedia.org/r/483426 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey)
[15:34:17] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[15:37:20] <elukey>	 single big spike afaics
[15:37:22] <elukey>	 already recovered
[15:37:57] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[15:38:10] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): "If no one objects, I’ll deploy this tomorrow (today is a US holiday so no deploys)." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485628 (https://phabricator.wikimedia.org/T209504) (owner: 10Lucas Werkmeister (WMDE))
[15:44:21] <wikibugs>	 (03CR) 10Mforns: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/483426 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey)
[15:48:53] <wikibugs>	 (03PS3) 10Hashar: doc: make published files group writable [puppet] - 10https://gerrit.wikimedia.org/r/484308 (https://phabricator.wikimedia.org/T137890)
[15:48:58] <wikibugs>	 (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/484308 (https://phabricator.wikimedia.org/T137890) (owner: 10Hashar)
[15:49:12] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: labtestneutron2002: reimage in stretch + rename to cloudnet2002-dev [puppet] - 10https://gerrit.wikimedia.org/r/485679 (https://phabricator.wikimedia.org/T214303)
[15:49:43] <wikibugs>	 (03PS6) 10Elukey: profile::analytics::refinery: move sanitize_eventlogging_analytics to timer [puppet] - 10https://gerrit.wikimedia.org/r/483426 (https://phabricator.wikimedia.org/T172532)
[15:52:45] <jynus>	 !log stop and upgrade db2061
[15:52:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:55:33] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: labtestneutron2002: rename to cloudnet2002-dev [dns] - 10https://gerrit.wikimedia.org/r/485680 (https://phabricator.wikimedia.org/T214303)
[15:55:35] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] labtestneutron2002: reimage in stretch + rename to cloudnet2002-dev [puppet] - 10https://gerrit.wikimedia.org/r/485679 (https://phabricator.wikimedia.org/T214303) (owner: 10Arturo Borrero Gonzalez)
[15:55:47] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] labtestneutron2002: rename to cloudnet2002-dev [dns] - 10https://gerrit.wikimedia.org/r/485680 (https://phabricator.wikimedia.org/T214303) (owner: 10Arturo Borrero Gonzalez)
[15:57:38] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] "Ignoring jenkins failed verification." [dns] - 10https://gerrit.wikimedia.org/r/485680 (https://phabricator.wikimedia.org/T214303) (owner: 10Arturo Borrero Gonzalez)
[15:58:47] <onimisionipe>	 !log reinitializing slave replication(postgres) on maps1003 
[15:58:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:03:47] <arturo>	 !log T214303 reimaging/renaming labtestneutron2002.codfw.wmnet (jessie) to cloudnet2002-dev.codfw.wmnet (stretch)
[16:03:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:03:50] <stashbot>	 T214303: labtestneutron2002: reimage to stretch & rename to cloudnet2002-dev - https://phabricator.wikimedia.org/T214303
[16:13:48] <wikibugs>	 (03PS1) 10Hashar: aptrepo: change Jenkins upstream URL [puppet] - 10https://gerrit.wikimedia.org/r/485685
[16:14:48] <wikibugs>	 (03CR) 10Hashar: "I am pretty sure reprepro is still affected by this. That is a regular complain when having to update the package on apt.wikimedia.org whi" [puppet] - 10https://gerrit.wikimedia.org/r/485685 (owner: 10Hashar)
[16:15:47] <wikibugs>	 10Operations, 10Toolforge, 10Traffic, 10Wikimedia-Apache-configuration: Add new Tool Labs IPs to Varnish rate limit whitelist - https://phabricator.wikimedia.org/T214313 (10Nemo_bis)
[16:17:35] <wikibugs>	 10Operations, 10Toolforge, 10Traffic, 10Wikimedia-Apache-configuration: Add new Tool Labs IPs to Varnish rate limit whitelist - https://phabricator.wikimedia.org/T214313 (10Nemo_bis) I created this more specific task for Tools as requested, but there is a (more general?) Labs task at T213475
[16:21:22] <wikibugs>	 10Operations, 10Cloud-VPS, 10Traffic, 10serviceops: Difficulties to create offline version of Wikipedia because of HTTP 429 response - https://phabricator.wikimedia.org/T213475 (10akosiaris) >>! In T213475#4883423, @Kelson wrote: > I'm not sure to fully understand the technical explanation. Is the problem...
[16:24:51] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/14409/an-coord1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/483426 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey)
[16:24:59] <wikibugs>	 (03PS7) 10Elukey: profile::analytics::refinery: move sanitize_eventlogging_analytics to timer [puppet] - 10https://gerrit.wikimedia.org/r/483426 (https://phabricator.wikimedia.org/T172532)
[16:25:02] <wikibugs>	 (03CR) 10Elukey: [V: 03+2 C: 03+2] profile::analytics::refinery: move sanitize_eventlogging_analytics to timer [puppet] - 10https://gerrit.wikimedia.org/r/483426 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey)
[16:26:16] <wikibugs>	 10Operations: wmf-auto-reimage-host: icinga downtime error - https://phabricator.wikimedia.org/T214314 (10aborrero)
[16:26:35] <wikibugs>	 10Operations, 10Toolforge, 10Traffic, 10Wikimedia-Apache-configuration: Add new Tool Labs IPs to Varnish rate limit whitelist - https://phabricator.wikimedia.org/T214313 (10Nemo_bis) p:05Triage→03High
[16:26:55] <arturo>	 volans: opened T214314 you may be interested
[16:26:56] <stashbot>	 T214314: wmf-auto-reimage-host: icinga downtime error - https://phabricator.wikimedia.org/T214314
[16:32:12] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: helm: add data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/485110 (owner: 10Dzahn)
[16:42:55] <wikibugs>	 10Operations: wmf-auto-reimage-host: icinga downtime error - https://phabricator.wikimedia.org/T214314 (10MoritzMuehlenhoff) The FQDN where that server is being renamed to doesn't exist here yet, so it should simply skipped when setting downtime?
[16:45:55] <wikibugs>	 10Operations, 10Toolforge, 10Traffic, 10Wikimedia-Apache-configuration: Add new Tool Labs IPs to Varnish rate limit whitelist - https://phabricator.wikimedia.org/T214313 (10Krenair) Tools cannot be done separately, it does not have an IP space of it's own, tools instances are scattered around the same netw...
[16:46:43] <wikibugs>	 10Operations, 10monitoring, 10Patch-For-Review, 10User-fgiunchedi: Replace Torrus with Prometheus snmp_exporter for PDUs monitoring - https://phabricator.wikimedia.org/T148541 (10fgiunchedi)
[16:54:40] <wikibugs>	 10Operations, 10DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (10Marostegui)
[16:55:24] <wikibugs>	 10Operations, 10DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (10Marostegui)
[16:56:53] <wikibugs>	 10Operations: wmf-auto-reimage-host: icinga downtime error - https://phabricator.wikimedia.org/T214314 (10aborrero)
[16:59:09] <wikibugs>	 10Operations: wmf-auto-reimage-host: icinga downtime error - https://phabricator.wikimedia.org/T214314 (10aborrero) >>! In T214314#4896937, @MoritzMuehlenhoff wrote: > The FQDN where that server is being renamed to doesn't exist here yet, so it should simply skipped when setting downtime?  Then perhaps this can...
[16:59:55] <wikibugs>	 10Operations, 10Continuous-Integration-Infrastructure, 10Packaging: Upgrade jenkins-debian-glue to v0.20.0 - https://phabricator.wikimedia.org/T212774 (10hashar) I have managed to build the package for both jessie and stretch without any issues! :)  To clarify from discussions I had:  * the packages are not...
[16:59:58] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s6 on db2076 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.02 seconds
[17:00:02] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s6 on db2046 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.06 seconds
[17:00:12] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s6 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.05 seconds
[17:00:20] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s6 on db2089 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.39 seconds
[17:00:20] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s6 on db2039 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.55 seconds
[17:00:20] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s6 on db2067 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.55 seconds
[17:00:54] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s6 on db2060 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 307.45 seconds
[17:01:02] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s6 on db2053 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 308.57 seconds
[17:01:10] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s6 on db2087 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 310.43 seconds
[17:01:52] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] helm: add data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/485110 (owner: 10Dzahn)
[17:03:22] <icinga-wm>	 PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[17:04:59] <elukey>	 checking --^
[17:09:20] <icinga-wm>	 RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational
[17:10:50] <wikibugs>	 10Operations, 10ops-codfw, 10cloud-services-team (Kanban): cloudnet2002-dev: ACPI error - https://phabricator.wikimedia.org/T214322 (10aborrero) p:05Triage→03Normal
[17:11:08] <wikibugs>	 (03PS1) 10Elukey: profile::refinery::job::spark_job: add shebang to sh template [puppet] - 10https://gerrit.wikimedia.org/r/485689 (https://phabricator.wikimedia.org/T172532)
[17:14:32] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] profile::refinery::job::spark_job: add shebang to sh template [puppet] - 10https://gerrit.wikimedia.org/r/485689 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey)
[17:14:43] <wikibugs>	 (03CR) 10Mforns: [C: 03+1] profile::refinery::job::spark_job: add shebang to sh template [puppet] - 10https://gerrit.wikimedia.org/r/485689 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey)
[17:14:51] <wikibugs>	 (03PS8) 10Giuseppe Lavagetto: profile::services_proxy: simple local proxying for remote services [puppet] - 10https://gerrit.wikimedia.org/r/483788 (https://phabricator.wikimedia.org/T210717)
[17:14:53] <wikibugs>	 (03PS8) 10Giuseppe Lavagetto: mediawiki::common: add proxy for services [puppet] - 10https://gerrit.wikimedia.org/r/483789 (https://phabricator.wikimedia.org/T210717)
[17:15:44] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] profile::services_proxy: simple local proxying for remote services [puppet] - 10https://gerrit.wikimedia.org/r/483788 (https://phabricator.wikimedia.org/T210717) (owner: 10Giuseppe Lavagetto)
[17:16:54] <jynus>	 !log stop and upgrade db2054
[17:16:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:25:38] <wikibugs>	 (03PS9) 10Giuseppe Lavagetto: mediawiki::common: add proxy for services [puppet] - 10https://gerrit.wikimedia.org/r/483789 (https://phabricator.wikimedia.org/T210717)
[17:30:53] <wikibugs>	 10Operations, 10ops-codfw, 10cloud-services-team (Kanban): cloudnet2002-dev: ACPI error - https://phabricator.wikimedia.org/T214322 (10fgiunchedi) This is known/expected, it is due to the `acpi_power_meter` kernel module which we are blacklisting, a reboot or manually unloading the module stops the messages
[17:42:31] <wikibugs>	 (03PS3) 10Jcrespo: mariadb: Promote db2047 to master on configuration management [puppet] - 10https://gerrit.wikimedia.org/r/485617 (https://phabricator.wikimedia.org/T214264)
[17:44:33] <jynus>	 !log stop replication on db2040 for master switch T214264
[17:44:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:44:36] <stashbot>	 T214264: BBU issues on codfw - https://phabricator.wikimedia.org/T214264
[17:44:45] <wikibugs>	 (03PS1) 10Jbond: use gpt schema [puppet] - 10https://gerrit.wikimedia.org/r/485693
[17:45:41] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] use gpt schema [puppet] - 10https://gerrit.wikimedia.org/r/485693 (owner: 10Jbond)
[17:47:38] <wikibugs>	 (03PS4) 10Jcrespo: mariadb: Promote db2047 to master on configuration management [puppet] - 10https://gerrit.wikimedia.org/r/485617 (https://phabricator.wikimedia.org/T214264)
[17:49:40] <wikibugs>	 (03CR) 10Jcrespo: "Suggestion:" [puppet] - 10https://gerrit.wikimedia.org/r/485693 (owner: 10Jbond)
[17:50:14] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] mariadb: Promote db2047 to master on configuration management [puppet] - 10https://gerrit.wikimedia.org/r/485617 (https://phabricator.wikimedia.org/T214264) (owner: 10Jcrespo)
[17:51:33] <jynus>	 !log stop and apply puppet changes to db2047 T214264
[17:51:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:51:36] <stashbot>	 T214264: BBU issues on codfw - https://phabricator.wikimedia.org/T214264
[18:06:10] <jynus>	 we may have some low noise on mw logs from mw2- this is expected for a few minutes, as I am double checking the topology changes
[18:08:04] <jynus>	 (I am leaving things not properly configured until I am sure the 2 migrated hosts are in a good state)
[18:10:04] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s8 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 600.14 seconds
[18:14:41] <jynus>	 that is not me, but dbstore1002 is not precisely too reliable
[18:17:46] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Depool db2040, promote db2047 to master of s7 section [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485696 (https://phabricator.wikimedia.org/T214264)
[18:21:11] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] mariadb: Depool db2040, promote db2047 to master of s7 section [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485696 (https://phabricator.wikimedia.org/T214264) (owner: 10Jcrespo)
[18:22:22] <wikibugs>	 (03Merged) 10jenkins-bot: mariadb: Depool db2040, promote db2047 to master of s7 section [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485696 (https://phabricator.wikimedia.org/T214264) (owner: 10Jcrespo)
[18:24:10] <logmsgbot>	 !log jynus@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2040, promote db2047 to s7 master (duration: 00m 46s)
[18:24:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:25:27] <wikibugs>	 (03PS1) 10BryanDavis: toolforge: Rotate SGE accounting file from NFS master [puppet] - 10https://gerrit.wikimedia.org/r/485697 (https://phabricator.wikimedia.org/T168701)
[18:25:56] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] toolforge: Rotate SGE accounting file from NFS master [puppet] - 10https://gerrit.wikimedia.org/r/485697 (https://phabricator.wikimedia.org/T168701) (owner: 10BryanDavis)
[18:26:00] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Depool db2040 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485698 (https://phabricator.wikimedia.org/T214264)
[18:26:00] <icinga-wm>	 PROBLEM - tilerator on maps1003 is CRITICAL: connect to address 10.64.32.117 and port 6534: Connection refused
[18:26:04] <icinga-wm>	 PROBLEM - Maps HTTPS on maps1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.009 second response time
[18:27:14] <icinga-wm>	 RECOVERY - tilerator on maps1003 is OK: HTTP OK: HTTP/1.1 200 OK - 304 bytes in 0.027 second response time
[18:27:18] <icinga-wm>	 RECOVERY - Maps HTTPS on maps1003 is OK: HTTP OK: HTTP/1.1 200 OK - 1323 bytes in 0.090 second response time
[18:29:32] <wikibugs>	 (03PS2) 10BryanDavis: toolforge: Rotate SGE accounting file from NFS master [puppet] - 10https://gerrit.wikimedia.org/r/485697 (https://phabricator.wikimedia.org/T168701)
[18:32:21] <wikibugs>	 (03CR) 10jenkins-bot: mariadb: Depool db2040, promote db2047 to master of s7 section [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485696 (https://phabricator.wikimedia.org/T214264) (owner: 10Jcrespo)
[18:34:06] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Demote db2040 from being an s7 master to just a replica [puppet] - 10https://gerrit.wikimedia.org/r/485701 (https://phabricator.wikimedia.org/T214264)
[18:34:09] <icinga-wm>	 PROBLEM - Disk space on cloudvirt1021 is CRITICAL: DISK CRITICAL - /tmp/builder01 is not accessible: Permission denied
[18:34:58] <jynus>	 who may be around, arturo maybe?
[18:35:13] <icinga-wm>	 RECOVERY - Disk space on cloudvirt1021 is OK: DISK OK
[18:35:25] <arturo>	 yes
[18:35:43] <arturo>	 but apparently ongoing operation by fsero and gtirloni 
[18:35:45] <jynus>	 that /tmp check is weird
[18:35:56] <jynus>	 ok sorry for pinging you
[18:36:29] <arturo>	 it's ok jynus, it was the page that pinged me :-P
[18:37:08] <fsero>	 Well, we were debugging an issue on a horizon VM; it seems weird that that action would create a page
[18:37:31] <apergos>	 ah so that's the page
[18:38:03] <apergos>	 I got the problem page (delayed) but not the recovery (thanks, my provider)
[18:40:49] <gtirloni>	 I'll make a t-shirt "I created a directory in /tmp and woke up half my coworkers" ;)
[18:40:51] <gtirloni>	 sorry about the noise
[18:42:01] <apergos>	 and there's the recovery page at last
[18:47:23] <icinga-wm>	 PROBLEM - Disk space on cloudvirt1021 is CRITICAL: DISK CRITICAL - /root/builder01 is not accessible: Permission denied
[18:47:43] <vgutierrez>	 here you go again
[18:47:50] <gtirloni>	 are you kidding
[18:47:53] <gtirloni>	 wth
[18:47:53] <apergos>	 ho hum
[18:48:16] <gtirloni>	 i'll ack it and leave it like that
[18:48:31] <gtirloni>	 and open a task to investigate it later
[18:48:39] <arturo>	 gtirloni: did you or fsero create a dir there?
[18:48:50] <gtirloni>	 arturo: yes, I created it
[18:48:58] <gtirloni>	 this is absurd
[18:49:22] <arturo>	 oh this time it is under /root/
[18:49:34] <fsero>	 arturo: gtirloni is investigating an issue I reported, check #cloud-operations for more info; in any case this is NOT urgent, so it doesn't merit paging people at all
[18:49:55] <icinga-wm>	 ACKNOWLEDGEMENT - Disk space on cloudvirt1021 is CRITICAL: DISK CRITICAL - /root/builder01 is not accessible: Permission denied GTirloni Expected.
[18:50:03] <fsero>	 gtirloni: thanks for the help but we should ack this and wait until tomorrow to diagnose it
[18:50:08] <fsero>	 :)
[18:50:31] <arturo>	 ok :-)
[18:51:29] <icinga-wm>	 RECOVERY - Disk space on cloudvirt1021 is OK: DISK OK
[18:52:21] <onimisionipe>	 !log pool maps1003 - postgresql sql lag issues have been fixed
[18:52:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:52:32] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] mariadb: Demote db2040 from being an s7 master to just a replica [puppet] - 10https://gerrit.wikimedia.org/r/485701 (https://phabricator.wikimedia.org/T214264) (owner: 10Jcrespo)
[18:55:56] <jynus>	 !log stop and upgrade db2040 T214264
[18:55:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:55:59] <stashbot>	 T214264: BBU issues on codfw - https://phabricator.wikimedia.org/T214264
[19:02:44] <wikibugs>	 10Operations, 10Icinga, 10monitoring, 10cloud-services-team (Kanban): cloudvirt1021/Disk space is CRITICAL - https://phabricator.wikimedia.org/T214325 (10GTirloni)
[19:03:29] <wikibugs>	 10Operations, 10Gerrit, 10Release-Engineering-Team: Install "healthcheck" plugin - https://phabricator.wikimedia.org/T214326 (10Paladox)
[19:08:11] <wikibugs>	 10Operations, 10Gerrit, 10Release-Engineering-Team: Install "healthcheck" plugin on gerrit - https://phabricator.wikimedia.org/T214326 (10jcrespo)
[19:23:23] <jynus>	 !log mysql.py -h db1115 zarcillo -e "UPDATE masters SET instance = 'db2047' WHERE section = 's7' and dc = 'codfw'" T214264
[19:23:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:23:26] <stashbot>	 T214264: BBU issues on codfw - https://phabricator.wikimedia.org/T214264
[19:24:23] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s6 on db2039 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[19:24:27] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s6 on db2095 is OK: OK slave_sql_lag Replication lag: 0.08 seconds
[19:24:33] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s6 on db2076 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[19:24:39] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s6 on db2087 is OK: OK slave_sql_lag Replication lag: 0.03 seconds
[19:24:51] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s6 on db2046 is OK: OK slave_sql_lag Replication lag: 0.21 seconds
[19:24:55] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s6 on db2089 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[19:27:02] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s6 on db2060 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[19:28:05] <wikibugs>	 (03PS2) 10Jcrespo: mariadb: Repool db2040 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485698 (https://phabricator.wikimedia.org/T214264)
[19:32:20] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] mariadb: Repool db2040 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485698 (https://phabricator.wikimedia.org/T214264) (owner: 10Jcrespo)
[19:33:25] <wikibugs>	 (03Merged) 10jenkins-bot: mariadb: Repool db2040 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485698 (https://phabricator.wikimedia.org/T214264) (owner: 10Jcrespo)
[19:34:00] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s8 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 288.80 seconds
[19:34:14] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s6 on db2067 is OK: OK slave_sql_lag Replication lag: 0.22 seconds
[19:35:53] <logmsgbot>	 !log jynus@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2040 (duration: 00m 45s)
[19:35:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:36:04] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s6 on db2053 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[19:39:00] <wikibugs>	 (03CR) 10jenkins-bot: mariadb: Repool db2040 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485698 (https://phabricator.wikimedia.org/T214264) (owner: 10Jcrespo)
[20:07:38] <wikibugs>	 10Operations, 10Toolforge, 10Traffic, 10Wikimedia-Apache-configuration: Add new Tool Labs IPs to Varnish rate limit whitelist - https://phabricator.wikimedia.org/T214313 (10faidon) Per our earlier conversations (T208986, T174596, T209011), I think we should just use the WMCS public IP space to make these k...
[20:10:17] <wikibugs>	 10Operations, 10Toolforge, 10Traffic, 10Wikimedia-Apache-configuration: Add new Tool Labs IPs to Varnish rate limit whitelist - https://phabricator.wikimedia.org/T214313 (10Cyberpower678) >>! In T214313#4897303, @faidon wrote: > Per our earlier conversations (T208986, T174596, T209011), I think we should j...
[20:12:34] <wikibugs>	 10Operations, 10Cloud-VPS, 10Traffic, 10serviceops: Wikimedia varnish rules no longer exempt all Cloud VPS/Toolforge IPs from rate limits (HTTP 429 response) - https://phabricator.wikimedia.org/T213475 (10bd808)
[20:12:58] <wikibugs>	 10Operations, 10Cloud-VPS, 10Traffic, 10serviceops: Wikimedia varnish rules no longer exempt all Cloud VPS/Toolforge IPs from rate limits (HTTP 429 response) - https://phabricator.wikimedia.org/T213475 (10bd808)
[20:23:42] <wikibugs>	 10Operations, 10Cloud-VPS, 10Traffic, 10serviceops: Wikimedia varnish rules no longer exempt all Cloud VPS/Toolforge IPs from rate limits (HTTP 429 response) - https://phabricator.wikimedia.org/T213475 (10Cyberpower678) @bd808 just invited me here.  Ever since the Cloud VPS migration, Cyberbot has been hit...
[20:25:16] <wikibugs>	 10Operations, 10Cloud-VPS, 10Traffic, 10serviceops: Wikimedia varnish rules no longer exempt all Cloud VPS/Toolforge IPs from rate limits (HTTP 429 response) - https://phabricator.wikimedia.org/T213475 (10Cyberpower678) p:05Normal→03High I'm also boldly raising the priority as from what I gather I'm li...
[20:34:04] <wikibugs>	 (03PS1) 10Ammarpad: Add new namespace abbreviation for Swedish (sv) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485704 (https://phabricator.wikimedia.org/T214329)
[20:35:38] <wikibugs>	 (03PS1) 10Faidon Liambotis: protocol.compat: disable a couple of pylint errors [software/keyholder] - 10https://gerrit.wikimedia.org/r/485705
[20:35:40] <wikibugs>	 (03PS1) 10Faidon Liambotis: Bump minimum Python to 3.5; also test with 3.7 [software/keyholder] - 10https://gerrit.wikimedia.org/r/485706
[20:35:42] <wikibugs>	 (03PS1) 10Faidon Liambotis: Add a pylint tox environment [software/keyholder] - 10https://gerrit.wikimedia.org/r/485707
[20:35:44] <wikibugs>	 (03PS1) 10Faidon Liambotis: Add a tox environment for Construct 2.8.16 [software/keyholder] - 10https://gerrit.wikimedia.org/r/485708
[20:35:46] <wikibugs>	 (03PS1) 10Faidon Liambotis: Update tox.ini to facilitate parallel builds [software/keyholder] - 10https://gerrit.wikimedia.org/r/485709
[20:49:08] <wikibugs>	 (03PS2) 10Ammarpad: Add new namespace abbreviation for Swedish (sv) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485704 (https://phabricator.wikimedia.org/T214329)
[21:46:17] <wikibugs>	 (03CR) 10Bstorm: "Since I once manually truncated the log and found that it basically broke the grid for a bit, my first thought is to use the system script" [puppet] - 10https://gerrit.wikimedia.org/r/485697 (https://phabricator.wikimedia.org/T168701) (owner: 10BryanDavis)
[21:47:42] <wikibugs>	 (03PS1) 10Faidon Liambotis: Move tests/unit -> tests [software/keyholder] - 10https://gerrit.wikimedia.org/r/485714
[21:47:44] <wikibugs>	 (03PS1) 10Faidon Liambotis: Add a bunch more tests [software/keyholder] - 10https://gerrit.wikimedia.org/r/485715
[21:47:46] <wikibugs>	 (03PS1) 10Faidon Liambotis: Test key and config file parsing using test data [software/keyholder] - 10https://gerrit.wikimedia.org/r/485716
[21:47:48] <wikibugs>	 (03PS1) 10Faidon Liambotis: Add a (very basic) test using OpenSSH's ssh-add [software/keyholder] - 10https://gerrit.wikimedia.org/r/485717
[21:47:50] <wikibugs>	 (03PS1) 10Faidon Liambotis: Add tests for OSError when loading config files [software/keyholder] - 10https://gerrit.wikimedia.org/r/485718
[21:47:52] <wikibugs>	 (03PS1) 10Faidon Liambotis: Make all SshAgentConfig's methods instance methods [software/keyholder] - 10https://gerrit.wikimedia.org/r/485719
[21:47:54] <wikibugs>	 (03PS1) 10Faidon Liambotis: Add SshKeyBlob per RFC 4253 [software/keyholder] - 10https://gerrit.wikimedia.org/r/485720
[21:51:24] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add a bunch more tests [software/keyholder] - 10https://gerrit.wikimedia.org/r/485715 (owner: 10Faidon Liambotis)
[21:51:28] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Test key and config file parsing using test data [software/keyholder] - 10https://gerrit.wikimedia.org/r/485716 (owner: 10Faidon Liambotis)
[21:52:11] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add tests for OSError when loading config files [software/keyholder] - 10https://gerrit.wikimedia.org/r/485718 (owner: 10Faidon Liambotis)
[21:52:16] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add a (very basic) test using OpenSSH's ssh-add [software/keyholder] - 10https://gerrit.wikimedia.org/r/485717 (owner: 10Faidon Liambotis)
[21:52:19] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Make all SshAgentConfig's methods instance methods [software/keyholder] - 10https://gerrit.wikimedia.org/r/485719 (owner: 10Faidon Liambotis)
[21:52:29] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add SshKeyBlob per RFC 4253 [software/keyholder] - 10https://gerrit.wikimedia.org/r/485720 (owner: 10Faidon Liambotis)
[21:55:02] <wikibugs>	 10Operations, 10MediaWiki-Cache, 10Performance-Team, 10User-Elukey: Consider removing the last traces of nutcracker in Mediawiki configs - https://phabricator.wikimedia.org/T214275 (10Krinkle)
[21:58:04] <wikibugs>	 10Operations, 10MediaWiki-Cache, 10Performance-Team, 10User-Elukey: Consider removing the last traces of nutcracker in Mediawiki configs - https://phabricator.wikimedia.org/T214275 (10Krinkle) (Tagging Performance-Team to track Aaron's implicit involvement through CC, as this appears implicitly blocked on...
[21:59:09] * Krinkle is considering deploying a UBN fix
[21:59:15] <Krinkle>	 https://phabricator.wikimedia.org/T213953#4897402
[22:14:47] <wikibugs>	 10Operations, 10DBA, 10Patch-For-Review, 10Performance-Team (Radar): Increase parsercache keys TTL  from 22 days back to 30 days - https://phabricator.wikimedia.org/T210992 (10Krinkle)
[22:29:24] * Krinkle staging on mwdebug1002
[22:33:56] <logmsgbot>	 !log krinkle@deploy1001 Synchronized php-1.33.0-wmf.13/extensions/TemplateData/includes/api/ApiTemplateData.php: I7647ddfc47 - T213953 (duration: 00m 47s)
[22:33:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:34:00] <stashbot>	 T213953: $data->paramOrder is null on pages edited since MediaWiki 1.33/wmf.13 was deployed - https://phabricator.wikimedia.org/T213953
[22:35:54] <wikibugs>	 (03PS2) 10Faidon Liambotis: Add a bunch more tests [software/keyholder] - 10https://gerrit.wikimedia.org/r/485715
[22:35:56] <wikibugs>	 (03PS1) 10Faidon Liambotis: Properly setup logging when /dev/log doesn't exist [software/keyholder] - 10https://gerrit.wikimedia.org/r/485724
[22:41:25] <wikibugs>	 (03CR) 10Krinkle: "In particular, zero.wikimedia.org (the internal Zero wiki, as opposed to Wikipedia Zero traffic itself) should probably be made inaccessib" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483193 (owner: 10Jforrester)
[22:44:23] <wikibugs>	 (03CR) 10Alex Monk: "If we're talking about preventing information leaks by shutting down the wiki, removing the domain from DNS isn't enough. You'd need to ac" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483193 (owner: 10Jforrester)
[22:45:15] <wikibugs>	 (03CR) 10Krinkle: "Ah, you mean from apache config, given Host headers. Good point." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483193 (owner: 10Jforrester)
[22:45:29] <wikibugs>	 (03CR) 10Krinkle: "so -dns, and -main.conf" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483193 (owner: 10Jforrester)
[22:46:40] <wikibugs>	 (03CR) 10Alex Monk: "I wouldn't want to trust VCL with that, so yes, Apache or a MediaWiki config change." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483193 (owner: 10Jforrester)
[22:53:06] <wikibugs>	 (03PS2) 10Faidon Liambotis: Test key and config file parsing using test data [software/keyholder] - 10https://gerrit.wikimedia.org/r/485716
[22:53:08] <wikibugs>	 (03PS2) 10Faidon Liambotis: Add a (very basic) test using OpenSSH's ssh-add [software/keyholder] - 10https://gerrit.wikimedia.org/r/485717
[22:53:10] <wikibugs>	 (03PS2) 10Faidon Liambotis: Add tests for OSError when loading config files [software/keyholder] - 10https://gerrit.wikimedia.org/r/485718
[22:53:12] <wikibugs>	 (03PS2) 10Faidon Liambotis: Make all SshAgentConfig's methods instance methods [software/keyholder] - 10https://gerrit.wikimedia.org/r/485719
[22:53:14] <wikibugs>	 (03PS2) 10Faidon Liambotis: Add SshKeyBlob per RFC 4253 [software/keyholder] - 10https://gerrit.wikimedia.org/r/485720
[23:26:48] <wikibugs>	 10Operations, 10Continuous-Integration-Infrastructure (shipyard), 10Release-Engineering-Team (Kanban), 10cloud-services-team (Kanban): Phase out Nodepool from production - https://phabricator.wikimedia.org/T209361 (10Krinkle)
[23:36:53] <wikibugs>	 10Operations, 10Patch-For-Review: Reimage analytics1001 to stretch (as an exercise) - https://phabricator.wikimedia.org/T214294 (10Peachey88)
[23:38:45] <wikibugs>	 10Operations, 10Gerrit, 10Icinga, 10Release-Engineering-Team, 10monitoring: Install "healthcheck" plugin on gerrit - https://phabricator.wikimedia.org/T214326 (10Peachey88)
[23:41:03] <wikibugs>	 (03CR) 10BryanDavis: "> Since I once manually truncated the log and found that it basically" [puppet] - 10https://gerrit.wikimedia.org/r/485697 (https://phabricator.wikimedia.org/T168701) (owner: 10BryanDavis)