[00:30:36] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw2220 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:32:16] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw2221 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:32:37] <icinga-wm>	 PROBLEM - puppet last run on mw2219 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:32:37] <icinga-wm>	 PROBLEM - nutcracker process on mw2220 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:32:37] <icinga-wm>	 PROBLEM - nutcracker port on mw2221 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:34:16] <icinga-wm>	 PROBLEM - Check size of conntrack table on mw2219 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:34:16] <icinga-wm>	 PROBLEM - nutcracker process on mw2221 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:34:16] <icinga-wm>	 PROBLEM - puppet last run on mw2220 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:35:56] <icinga-wm>	 PROBLEM - Check systemd state on mw2219 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:35:56] <icinga-wm>	 PROBLEM - MD RAID on mw2219 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:35:56] <icinga-wm>	 PROBLEM - puppet last run on mw2221 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:35:56] <icinga-wm>	 PROBLEM - Check size of conntrack table on mw2220 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:35:57] <icinga-wm>	 RECOVERY - MegaRAID on analytics1032 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy
[00:37:36] <icinga-wm>	 PROBLEM - Check the NTP synchronisation status of timesyncd on mw2219 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:37:36] <icinga-wm>	 PROBLEM - Check systemd state on mw2220 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:37:36] <icinga-wm>	 PROBLEM - MD RAID on mw2220 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:37:37] <icinga-wm>	 PROBLEM - Check size of conntrack table on mw2221 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:38:56] <icinga-wm>	 PROBLEM - HHVM rendering on mw2221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:39:26] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw2219 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:41:06] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw2220 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:41:16] <icinga-wm>	 PROBLEM - Apache HTTP on mw2221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:41:36] <icinga-wm>	 PROBLEM - Apache HTTP on mw2220 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:42:06] <icinga-wm>	 PROBLEM - Apache HTTP on mw2219 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:42:46] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw2221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:45:46] <icinga-wm>	 PROBLEM - Check systemd state on mw2221 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:45:46] <icinga-wm>	 PROBLEM - MD RAID on mw2221 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:45:47] <icinga-wm>	 PROBLEM - nutcracker process on mw2221 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:45:47] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw2221 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:45:56] <icinga-wm>	 PROBLEM - configured eth on mw2219 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:45:57] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw2219 is CRITICAL: Host mw2219 is not in mediawiki-installation dsh group
[00:45:57] <icinga-wm>	 PROBLEM - HHVM processes on mw2219 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:46:07] <icinga-wm>	 PROBLEM - Check systemd state on mw2220 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:46:07] <icinga-wm>	 PROBLEM - MD RAID on mw2220 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:46:07] <icinga-wm>	 RECOVERY - Check size of conntrack table on mw2221 is OK: OK: nf_conntrack is 0 % full
[00:46:16] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw2220 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:46:16] <icinga-wm>	 PROBLEM - nutcracker process on mw2220 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:46:17] <icinga-wm>	 PROBLEM - nutcracker port on mw2221 is CRITICAL: connect to address 127.0.0.1 and port 11212: Connection refused
[00:46:36] <icinga-wm>	 RECOVERY - Check size of conntrack table on mw2220 is OK: OK: nf_conntrack is 0 % full
[00:46:36] <icinga-wm>	 PROBLEM - Check systemd state on mw2219 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[00:46:36] <icinga-wm>	 RECOVERY - MD RAID on mw2219 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0
[00:46:56] <icinga-wm>	 RECOVERY - MD RAID on mw2221 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0
[00:46:56] <icinga-wm>	 RECOVERY - Check size of conntrack table on mw2219 is OK: OK: nf_conntrack is 1 % full
[00:46:56] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw2221 is OK: OK - load average: 4.89, 5.37, 3.57
[00:46:57] <icinga-wm>	 RECOVERY - configured eth on mw2219 is OK: OK - interfaces up
[00:47:06] <icinga-wm>	 RECOVERY - HHVM processes on mw2219 is OK: PROCS OK: 6 processes with command name hhvm
[00:47:16] <icinga-wm>	 RECOVERY - MD RAID on mw2220 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0
[00:47:17] <icinga-wm>	 RECOVERY - Apache HTTP on mw2219 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 3.908 second response time
[00:47:26] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw2220 is OK: OK - load average: 4.68, 5.49, 3.64
[00:47:46] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw2220 is CRITICAL: Host mw2220 is not in mediawiki-installation dsh group
[00:47:46] <icinga-wm>	 PROBLEM - nutcracker port on mw2219 is CRITICAL: connect to address 127.0.0.1 and port 11212: Connection refused
[00:49:06] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw2219 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 3.070 second response time
[00:49:16] <icinga-wm>	 PROBLEM - HHVM rendering on mw2220 is CRITICAL: connect to address 10.192.0.45 and port 80: Connection refused
[00:49:17] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw2221 is CRITICAL: Host mw2221 is not in mediawiki-installation dsh group
[00:49:17] <icinga-wm>	 PROBLEM - nutcracker port on mw2220 is CRITICAL: connect to address 127.0.0.1 and port 11212: Connection refused
[00:49:17] <icinga-wm>	 PROBLEM - nutcracker process on mw2219 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (nutcracker), command name nutcracker
[00:50:57] <icinga-wm>	 RECOVERY - Check systemd state on mw2219 is OK: OK - unknown: The operational state could not be determined, due to lack of resources or another error cause.
[00:53:27] <icinga-wm>	 RECOVERY - nutcracker process on mw2221 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker
[00:53:47] <icinga-wm>	 RECOVERY - nutcracker port on mw2221 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212
[00:53:57] <icinga-wm>	 RECOVERY - Apache HTTP on mw2221 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.291 second response time
[00:54:08] <icinga-wm>	 RECOVERY - nutcracker port on mw2219 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212
[00:54:18] <icinga-wm>	 RECOVERY - Check systemd state on mw2221 is OK: OK - running: The system is fully operational
[00:54:27] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw2221 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.247 second response time
[00:54:38] <icinga-wm>	 RECOVERY - Check systemd state on mw2220 is OK: OK - running: The system is fully operational
[00:54:47] <icinga-wm>	 RECOVERY - nutcracker process on mw2220 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker
[00:55:18] <icinga-wm>	 RECOVERY - Apache HTTP on mw2220 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.144 second response time
[00:55:47] <icinga-wm>	 RECOVERY - HHVM rendering on mw2221 is OK: HTTP OK: HTTP/1.1 200 OK - 74933 bytes in 2.153 second response time
[00:56:48] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw2220 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 0.366 second response time
[00:56:58] <icinga-wm>	 RECOVERY - HHVM rendering on mw2220 is OK: HTTP OK: HTTP/1.1 200 OK - 74933 bytes in 3.597 second response time
[00:57:47] <icinga-wm>	 RECOVERY - puppet last run on mw2219 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[00:59:31] <icinga-wm>	 RECOVERY - puppet last run on mw2220 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[01:00:00] <icinga-wm>	 PROBLEM - HP RAID on db2067 is CRITICAL: CRITICAL: Slot 0: Failed: 1I:1:10 - OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:11, 1I:1:12 - Controller: OK - Battery/Capacitor: OK
[01:00:01] <icinga-wm>	 ACKNOWLEDGEMENT - HP RAID on db2067 is CRITICAL: CRITICAL: Slot 0: Failed: 1I:1:10 - OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:11, 1I:1:12 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T194103
[01:00:29] <wikibugs_>	 Operations, ops-codfw: Degraded RAID on db2067 - https://phabricator.wikimedia.org/T194103#4188843 (ops-monitoring-bot)
[01:00:31] <icinga-wm>	 RECOVERY - nutcracker port on mw2220 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212
[01:01:10] <icinga-wm>	 RECOVERY - puppet last run on mw2221 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[01:03:50] <icinga-wm>	 RECOVERY - nutcracker process on mw2219 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker
[01:06:01] <icinga-wm>	 PROBLEM - MegaRAID on analytics1032 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough
[01:06:40] <icinga-wm>	 PROBLEM - Device not healthy -SMART- on db2067 is CRITICAL: cluster=mysql device=cciss,11 instance=db2067:9100 job=node site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2067&var-datasource=codfw%2520prometheus%252Fops
[01:07:40] <icinga-wm>	 RECOVERY - Check the NTP synchronisation status of timesyncd on mw2219 is OK: OK: synced at Tue 2018-05-08 01:07:33 UTC.
[01:16:10] <icinga-wm>	 RECOVERY - MegaRAID on analytics1032 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy
[01:46:21] <icinga-wm>	 PROBLEM - MegaRAID on analytics1032 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough
[02:33:09] <logmsgbot>	 !log l10nupdate@tin scap sync-l10n completed (1.32.0-wmf.2) (duration: 05m 45s)
[02:33:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:46:51] <icinga-wm>	 RECOVERY - MegaRAID on analytics1032 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy
[03:28:00] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 700.40 seconds
[03:47:20] <icinga-wm>	 PROBLEM - MegaRAID on analytics1032 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough
[04:15:00] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 200.10 seconds
[04:24:42] <logmsgbot>	 !log dzahn@neodymium conftool action : set/pooled=yes; selector: name=mw2219.codfw.wmnet
[04:24:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:26:20] <logmsgbot>	 !log dzahn@neodymium conftool action : set/pooled=yes; selector: name=mw2220.codfw.wmnet
[04:26:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:27:53] <logmsgbot>	 !log dzahn@neodymium conftool action : set/pooled=yes; selector: name=mw2221.codfw.wmnet
[04:27:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:30:20] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw2219 is OK: OK
[04:30:40] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw2220 is OK: OK
[04:31:00] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw2221 is OK: OK
[05:12:30] <icinga-wm>	 RECOVERY - Maps - OSM synchronization lag - eqiad on einsteinium is OK: (C)1.728e+05 ge (W)9e+04 ge 1.874e+04 https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=11&fullscreen&orgId=1
[05:14:21] <wikibugs_>	 (PS1) Marostegui: Revert "db-eqiad.php: Depool db1097:3314" [mediawiki-config] - https://gerrit.wikimedia.org/r/431698
[05:16:00] <wikibugs_>	 (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1097:3314" [mediawiki-config] - https://gerrit.wikimedia.org/r/431698 (owner: Marostegui)
[05:16:43] <wikibugs_>	 Operations, ops-codfw, DBA: Degraded RAID on db2067 - https://phabricator.wikimedia.org/T194103#4189150 (Marostegui) a:Papaul @Papaul can we get a new disk for this one? Thanks!
[05:17:15] <wikibugs_>	 (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1097:3314" [mediawiki-config] - https://gerrit.wikimedia.org/r/431698 (owner: Marostegui)
[05:17:19] <wikibugs_>	 (PS2) Marostegui: Revert "wiki replicas: Depool labsdb1011 for MCR table changes" [puppet] - https://gerrit.wikimedia.org/r/431672 (owner: Bstorm)
[05:18:35] <icinga-wm>	 ACKNOWLEDGEMENT - Device not healthy -SMART- on db2067 is CRITICAL: cluster=mysql device=cciss,11 instance=db2067:9100 job=node site=codfw Marostegui https://phabricator.wikimedia.org/T194103 https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2067&var-datasource=codfw%2520prometheus%252Fops
[05:18:41] <wikibugs_>	 (CR) Marostegui: [C: 2] Revert "wiki replicas: Depool labsdb1011 for MCR table changes" [puppet] - https://gerrit.wikimedia.org/r/431672 (owner: Bstorm)
[05:18:59] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1097:3314 after alter table (duration: 01m 00s)
[05:19:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:19:30] <wikibugs_>	 (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1097:3314" [mediawiki-config] - https://gerrit.wikimedia.org/r/431698 (owner: Marostegui)
[05:19:33] <marostegui>	 !log Reload haproxy on dbproxy1010 to repool labsdb1011 - https://phabricator.wikimedia.org/T174047
[05:19:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:22:11] <wikibugs_>	 (PS1) Marostegui: db-eqiad.php: Depool db1121 [mediawiki-config] - https://gerrit.wikimedia.org/r/431701 (https://phabricator.wikimedia.org/T190148)
[05:23:39] <wikibugs_>	 (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1121 [mediawiki-config] - https://gerrit.wikimedia.org/r/431701 (https://phabricator.wikimedia.org/T190148) (owner: Marostegui)
[05:24:56] <wikibugs_>	 (Merged) jenkins-bot: db-eqiad.php: Depool db1121 [mediawiki-config] - https://gerrit.wikimedia.org/r/431701 (https://phabricator.wikimedia.org/T190148) (owner: Marostegui)
[05:25:35] <wikibugs_>	 (CR) jenkins-bot: db-eqiad.php: Depool db1121 [mediawiki-config] - https://gerrit.wikimedia.org/r/431701 (https://phabricator.wikimedia.org/T190148) (owner: Marostegui)
[05:26:10] <wikibugs_>	 (PS2) Marostegui: mariadb: db1069 is now x1 master [puppet] - https://gerrit.wikimedia.org/r/431568 (https://phabricator.wikimedia.org/T186320)
[05:26:12] <wikibugs_>	 (PS2) Marostegui: db-eqiad.php: Promote db1069 to be x1 master [mediawiki-config] - https://gerrit.wikimedia.org/r/431566 (https://phabricator.wikimedia.org/T186320)
[05:26:18] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1121 for alter table (duration: 01m 00s)
[05:26:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:26:22] <marostegui>	 !log Deploy schema change on db1121 with replication (this will generate lag on labs on s4) - T191519 T188299 T190148
[05:26:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:26:27] <stashbot>	 T191519: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519
[05:26:28] <stashbot>	 T190148: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148
[05:26:28] <stashbot>	 T188299: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299
[05:28:22] <marostegui>	 !log Disable gtid on db1069 and db2034 before x1 failover - T186320
[05:28:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:28:26] <stashbot>	 T186320: Decommission db1051-db1060 (DBA tracking) - https://phabricator.wikimedia.org/T186320
[05:29:48] <marostegui>	 !log Disable puppet on db1055 and db1069 before x1 failover - T186320
[05:29:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:36:35] <marostegui>	 !log Move dbstore1002:x1 under db1069 for x1 failover - T186320
[05:36:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:36:39] <stashbot>	 T186320: Decommission db1051-db1060 (DBA tracking) - https://phabricator.wikimedia.org/T186320
[05:41:22] <marostegui>	 !log Move db2034 under db1069 for x1 failover - T186320
[05:41:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:42:30] <wikibugs_>	 (CR) Marostegui: [C: 2] mariadb: db1069 is now x1 master [puppet] - https://gerrit.wikimedia.org/r/431568 (https://phabricator.wikimedia.org/T186320) (owner: Marostegui)
[05:58:07] <marostegui>	 We are starting the x1 failover in 2 minutes
[05:59:08] <marostegui>	 Going to merge: https://gerrit.wikimedia.org/r/#/c/431566/ without deploying
[05:59:37] <wikibugs_>	 (CR) Marostegui: [C: 2] db-eqiad.php: Promote db1069 to be x1 master [mediawiki-config] - https://gerrit.wikimedia.org/r/431566 (https://phabricator.wikimedia.org/T186320) (owner: Marostegui)
[06:00:04] <jouncebot>	 marostegui and jynus: Your horoscope predicts another unfortunate x1 master switchover deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180508T0600).
[06:00:05] <marostegui>	 jynus: ready?
[06:00:10] <marostegui>	 hahaha
[06:00:12] <jynus>	 yes
[06:00:16] <marostegui>	 let's go then
[06:00:24] <marostegui>	 !log Start x1 failover
[06:00:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:29] <marostegui>	 !log Set db1055 read only
[06:00:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:45] <marostegui>	 done
[06:00:54] <wikibugs_>	 (Merged) jenkins-bot: db-eqiad.php: Promote db1069 to be x1 master [mediawiki-config] - https://gerrit.wikimedia.org/r/431566 (https://phabricator.wikimedia.org/T186320) (owner: Marostegui)
[06:01:09] <wikibugs_>	 (CR) jenkins-bot: db-eqiad.php: Promote db1069 to be x1 master [mediawiki-config] - https://gerrit.wikimedia.org/r/431566 (https://phabricator.wikimedia.org/T186320) (owner: Marostegui)
[06:02:11] <jynus>	 db1069-bin.000008:928243708
[06:02:21] <marostegui>	 yep!
[06:03:23] <marostegui>	 running puppet
[06:03:57] <marostegui>	 all looking good
[06:03:59] <jynus>	 I see db1055 advancing its master log
[06:04:01] <marostegui>	 going to deploy mediawiki
[06:04:23] <marostegui>	 good from your side?
[06:04:31] <jynus>	 sure
[06:04:35] <marostegui>	 deploying
[06:05:38] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Promote db1069 as new x1 master (duration: 01m 00s)
[06:05:39] <marostegui>	 going to disable read_only on db1069
[06:05:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:05:52] <marostegui>	 !log Read_only=off on db1069 to finish with the x1 failover
[06:05:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:05:56] <marostegui>	 done
[06:06:16] <jynus>	 I'm going to update tendril to confirm
[06:06:20] <marostegui>	 good!
[06:07:18] <marostegui>	 I can see writes coming to db1069
[06:07:58] <jynus>	 replication errors stopped
[06:08:10] <marostegui>	 fatals also
[06:08:43] <jynus>	 errors from 6:01:30 to 6:05:30
[06:10:05] <marostegui>	 matches the read only times yep
[06:10:17] <wikibugs_>	 (CR) Marostegui: [C: 2] x1.hosts: db1069 is the new x1 master [software] - https://gerrit.wikimedia.org/r/431567 (https://phabricator.wikimedia.org/T186320) (owner: Marostegui)
[06:11:05] <wikibugs_>	 (Merged) jenkins-bot: x1.hosts: db1069 is the new x1 master [software] - https://gerrit.wikimedia.org/r/431567 (https://phabricator.wikimedia.org/T186320) (owner: Marostegui)
[06:21:11] <wikibugs_>	 (PS1) Marostegui: db-eqiad,db-codfw.php: Remove db1060 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/431703 (https://phabricator.wikimedia.org/T193732)
[06:22:50] <wikibugs_>	 (CR) Marostegui: [C: 2] db-eqiad,db-codfw.php: Remove db1060 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/431703 (https://phabricator.wikimedia.org/T193732) (owner: Marostegui)
[06:24:05] <wikibugs_>	 (Merged) jenkins-bot: db-eqiad,db-codfw.php: Remove db1060 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/431703 (https://phabricator.wikimedia.org/T193732) (owner: Marostegui)
[06:25:30] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Remove db1060 from config - T193732 (duration: 00m 59s)
[06:25:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:25:34] <stashbot>	 T193732: Decommission db1060 - https://phabricator.wikimedia.org/T193732
[06:26:37] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-codfw.php: Remove db1060 from config - T193732 (duration: 01m 01s)
[06:26:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:29:39] <wikibugs_>	 (CR) jenkins-bot: db-eqiad,db-codfw.php: Remove db1060 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/431703 (https://phabricator.wikimedia.org/T193732) (owner: Marostegui)
[06:29:51] <icinga-wm>	 PROBLEM - puppet last run on labstore1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/apparmor.d/abstractions/ssl_certs]
[06:31:02] <wikibugs_>	 (PS1) Marostegui: mariadb: Set db1060 as spare [puppet] - https://gerrit.wikimedia.org/r/431704 (https://phabricator.wikimedia.org/T193732)
[06:31:40] <wikibugs_>	 (PS2) Marostegui: mariadb: Set db1060 as spare [puppet] - https://gerrit.wikimedia.org/r/431704 (https://phabricator.wikimedia.org/T193732)
[06:39:29] <wikibugs_>	 (CR) Marostegui: [C: 2] mariadb: Set db1060 as spare [puppet] - https://gerrit.wikimedia.org/r/431704 (https://phabricator.wikimedia.org/T193732) (owner: Marostegui)
[06:47:55] <wikibugs_>	 (PS3) Jcrespo: tendril: Move cron jobs to dbmonitor, remove proxysql from terbium [puppet] - https://gerrit.wikimedia.org/r/431529 (https://phabricator.wikimedia.org/T193919)
[06:50:04] <moritzm>	 !log reimaging mw1313, mw1343, mw1344 to stretch
[06:50:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:51:16] <marostegui>	 !log Stop MySQL on db1060 as it will be decommissioned - T193732
[06:51:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:51:20] <stashbot>	 T193732: Decommission db1060 - https://phabricator.wikimedia.org/T193732
[06:51:54] <wikibugs_>	 (CR) Jcrespo: [C: 2] tendril: Move cron jobs to dbmonitor, remove proxysql from terbium [puppet] - https://gerrit.wikimedia.org/r/431529 (https://phabricator.wikimedia.org/T193919) (owner: Jcrespo)
[06:53:38] <wikibugs_>	 (PS1) Marostegui: s2.hosts: Remove db1060 [software] - https://gerrit.wikimedia.org/r/431705 (https://phabricator.wikimedia.org/T193732)
[06:55:11] <wikibugs_>	 (CR) Marostegui: [C: 2] s2.hosts: Remove db1060 [software] - https://gerrit.wikimedia.org/r/431705 (https://phabricator.wikimedia.org/T193732) (owner: Marostegui)
[06:55:56] <wikibugs_>	 (Merged) jenkins-bot: s2.hosts: Remove db1060 [software] - https://gerrit.wikimedia.org/r/431705 (https://phabricator.wikimedia.org/T193732) (owner: Marostegui)
[06:59:00] <wikibugs_>	 (CR) Nikerabbit: [C: 1] cawiki: remove gendered namespace aliases, already on MW core [mediawiki-config] - https://gerrit.wikimedia.org/r/429989 (https://phabricator.wikimedia.org/T113616) (owner: MarcoAurelio)
[07:00:08] <icinga-wm>	 RECOVERY - puppet last run on labstore1003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:02:08] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received: /en.wikipedia.org/v1/page/metadata/{title}{/revision} (Get extended metadata of a test page) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for Apri
[07:02:08] <icinga-wm>	 ut before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received
[07:02:28] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received
[07:03:08] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy
[07:03:29] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy
[07:05:41] <wikibugs_>	 Operations, ops-eqiad, DBA, decommission, Patch-For-Review: Decommission db1060 - https://phabricator.wikimedia.org/T193732#4189275 (Marostegui) a:Marostegui>RobH This is ready for @RobH and DC-Ops to take over
[07:06:14] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received
[07:06:27] <wikibugs_>	 (PS1) Muehlenhoff: Move scap proxy in A3 to mw2216 [puppet] - https://gerrit.wikimedia.org/r/431706
[07:06:43] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[07:06:43] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[07:07:21] <wikibugs_>	 Operations, HHVM, Patch-For-Review, User-Elukey: Upgrade mw* servers to Debian Stretch (using HHVM) - https://phabricator.wikimedia.org/T174431#4189280 (MoritzMuehlenhoff) All application servers are now running stretch (excluding job runners and API servers).
[07:07:33] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy
[07:07:34] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy
[07:07:45] <wikibugs_>	 Operations, HHVM, Patch-For-Review, User-Elukey: Upgrade mw* servers to Debian Stretch (using HHVM) - https://phabricator.wikimedia.org/T174431#4189281 (MoritzMuehlenhoff)
[07:08:13] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy
[07:08:44] <wikibugs_>	 Operations, DBA, Goal, Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4189282 (Marostegui) What if we temporarily convert db2092 (s1) to codfw sanitarium, copy db1116's data to db2092. Once the n...
[07:09:13] <icinga-wm>	 PROBLEM - eventstreams on scb1002 is CRITICAL: connect to address 10.64.16.21 and port 8092: Connection refused
[07:11:06] <wikibugs_>	 (CR) Muehlenhoff: [C: 2] Move scap proxy in A3 to mw2216 [puppet] - https://gerrit.wikimedia.org/r/431706 (owner: Muehlenhoff)
[07:11:43] <wikibugs_>	 (PS1) Jcrespo: tendril: Explicit perl package dependencies on maintenance [puppet] - https://gerrit.wikimedia.org/r/431707 (https://phabricator.wikimedia.org/T184797)
[07:12:08] <wikibugs_>	 (PS2) Jcrespo: tendril: Explicit perl package dependencies on maintenance [puppet] - https://gerrit.wikimedia.org/r/431707 (https://phabricator.wikimedia.org/T184797)
[07:12:13] <icinga-wm>	 RECOVERY - eventstreams on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 1066 bytes in 0.023 second response time
[07:13:32] <wikibugs_>	 Operations, DBA, Goal, Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4189285 (jcrespo) But one host will not be enough, we need 2.
[07:14:30] <wikibugs_>	 (CR) Jcrespo: [C: 2] tendril: Explicit perl package dependencies on maintenance [puppet] - https://gerrit.wikimedia.org/r/431707 (https://phabricator.wikimedia.org/T184797) (owner: Jcrespo)
[07:15:29] <wikibugs_>	 Operations, DBA, Goal, Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4189287 (Marostegui) >>! In T190704#4189285, @jcrespo wrote: > But one host will not be enough, we need 2.  Yes, but for that...
[07:30:57] <jynus>	 !log cleaning up maintenance hosts (terbium, etc.) from tendril maintenance files
[07:31:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:34:21] <wikibugs_>	 (PS1) Marostegui: db-codfw.php: Depool db2092 [mediawiki-config] - https://gerrit.wikimedia.org/r/431708 (https://phabricator.wikimedia.org/T190704)
[07:41:17] <wikibugs_>	 Operations, Dumps-Generation, Patch-For-Review: data retrieval/write issues via NFS on dumpsdata1001, impacting some dump jobs - https://phabricator.wikimedia.org/T191177#4189336 (ArielGlenn) Open>Resolved This month's run looks good, no nulls in stub files, no other weirdness either so I'm g...
[07:41:35] <wikibugs_>	 (PS1) Marostegui: mariadb: Convert db2092 to sanitarium multi-instance [puppet] - https://gerrit.wikimedia.org/r/431709 (https://phabricator.wikimedia.org/T190704)
[07:41:48] <wikibugs_>	 Operations, Patch-For-Review: setup replacement for terbium (maintenance_server) on stretch - https://phabricator.wikimedia.org/T192092#4189342 (jcrespo)
[07:41:50] <wikibugs_>	 Operations, Patch-For-Review: provide proxysql for stretch, add package to puppet - https://phabricator.wikimedia.org/T193919#4189339 (jcrespo) Open>Resolved a:jcrespo * proxysql and tendril maintenance have been removed from mediawiki maintenance * proxysql for stretch package has been uploa...
[07:44:27] <wikibugs_>	 (PS8) Elukey: role::aqs: deprecate cassandra-metrics-collector [puppet] - https://gerrit.wikimedia.org/r/431546 (https://phabricator.wikimedia.org/T186567)
[07:44:36] <wikibugs_>	 (CR) Marostegui: "https://puppet-compiler.wmflabs.org/compiler02/11155/" [puppet] - https://gerrit.wikimedia.org/r/431709 (https://phabricator.wikimedia.org/T190704) (owner: Marostegui)
[07:45:13] <wikibugs_>	 (PS2) Marostegui: mariadb: Convert db2092 to sanitarium multi-instance [puppet] - https://gerrit.wikimedia.org/r/431709 (https://phabricator.wikimedia.org/T190704)
[07:45:26] <wikibugs_>	 (CR) Elukey: [C: 2] role::aqs: deprecate cassandra-metrics-collector [puppet] - https://gerrit.wikimedia.org/r/431546 (https://phabricator.wikimedia.org/T186567) (owner: Elukey)
[07:46:33] <wikibugs_>	 (PS3) Marostegui: mariadb: Convert db2092 to sanitarium multi-instance [puppet] - https://gerrit.wikimedia.org/r/431709 (https://phabricator.wikimedia.org/T190704)
[07:47:00] <wikibugs_>	 (PS4) Marostegui: mariadb: Convert db2092 to sanitarium multi-instance [puppet] - https://gerrit.wikimedia.org/r/431709 (https://phabricator.wikimedia.org/T190704)
[07:48:03] <wikibugs_>	 (CR) Marostegui: [C: 2] mariadb: Convert db2092 to sanitarium multi-instance [puppet] - https://gerrit.wikimedia.org/r/431709 (https://phabricator.wikimedia.org/T190704) (owner: Marostegui)
[07:48:40] <icinga-wm>	 PROBLEM - Check systemd state on aqs1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[07:49:44] <elukey>	 this is me --^
[07:49:49] <icinga-wm>	 RECOVERY - Check systemd state on aqs1004 is OK: OK - running: The system is fully operational
[07:49:52] <elukey>	 I am removing cassandra metrics collector
[07:51:31] <wikibugs_>	 Operations, DBA, Goal, Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4081506 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` db209...
[07:52:10] <icinga-wm>	 PROBLEM - Check systemd state on aqs1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[07:52:21] <moritzm>	 elukey: you'll also need to remove the wmf_auto_restart for the cassandra-metrics-collector
[07:52:44] <elukey>	 moritzm: yep yep
[07:53:19] <icinga-wm>	 RECOVERY - Check systemd state on aqs1008 is OK: OK - running: The system is fully operational
[07:53:25] <elukey>	 !log second attempt to remove the cassandra-metrics-collector (+ cleanup) from aqs*
[07:53:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:25] <elukey>	 moritzm: ah I didn't see the ensure for that class! Amending puppet now
[07:55:44] <moritzm>	 ack!
[07:56:19] <wikibugs_>	 (PS1) Elukey: cassandra::metrics: propagate ensure parameter to wmf-auto-restart [puppet] - https://gerrit.wikimedia.org/r/431710 (https://phabricator.wikimedia.org/T186567)
[07:56:31] <wikibugs_>	 (CR) Marostegui: [C: 2] db-codfw.php: Depool db2092 [mediawiki-config] - https://gerrit.wikimedia.org/r/431708 (https://phabricator.wikimedia.org/T190704) (owner: Marostegui)
[07:56:44] <wikibugs_>	 (PS1) Marostegui: sX.hosts: db2092 is now multiinstance [software] - https://gerrit.wikimedia.org/r/431711 (https://phabricator.wikimedia.org/T190704)
[07:56:54] <wikibugs_>	 (CR) Elukey: [C: 2] cassandra::metrics: propagate ensure parameter to wmf-auto-restart [puppet] - https://gerrit.wikimedia.org/r/431710 (https://phabricator.wikimedia.org/T186567) (owner: Elukey)
[07:57:42] <wikibugs_>	 (CR) Marostegui: [C: 2] sX.hosts: db2092 is now multiinstance [software] - https://gerrit.wikimedia.org/r/431711 (https://phabricator.wikimedia.org/T190704) (owner: Marostegui)
[07:57:51] <wikibugs_>	 (Merged) jenkins-bot: db-codfw.php: Depool db2092 [mediawiki-config] - https://gerrit.wikimedia.org/r/431708 (https://phabricator.wikimedia.org/T190704) (owner: Marostegui)
[07:58:28] <wikibugs_>	 (Merged) jenkins-bot: sX.hosts: db2092 is now multiinstance [software] - https://gerrit.wikimedia.org/r/431711 (https://phabricator.wikimedia.org/T190704) (owner: Marostegui)
[07:59:04] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2092 T190704 (duration: 00m 57s)
[07:59:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:59:08] <stashbot>	 T190704: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704
[07:59:39] <wikibugs_>	 (CR) jenkins-bot: db-codfw.php: Depool db2092 [mediawiki-config] - https://gerrit.wikimedia.org/r/431708 (https://phabricator.wikimedia.org/T190704) (owner: Marostegui)
[08:03:24] <marostegui>	 !log Stop MySQL on db1116 to transfer its content to db2092 - T190704
[08:03:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:07] <wikibugs_>	 Operations, Puppet, DBA, Patch-For-Review: Move mariadb_maintenance away from terbium/wasat (mediawiki_maintenance) - https://phabricator.wikimedia.org/T184797#4189395 (jcrespo) Open>Resolved a:jcrespo Done, no maintenance code yet for database maintenance, but that is still on terbiu...
[08:12:12] <wikibugs_>	 Operations, DBA, Goal, Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4189416 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db2092.codfw.wmnet'] ```  and were **ALL** successful.
[08:14:22] <wikibugs_>	 Operations, MediaWiki-Parser, MediaWiki-Platform-Team, Parsing-Team, and 2 others: Different production servers have different versions of tidy installed, resulting in varying output - https://phabricator.wikimedia.org/T193414#4168635 (MoritzMuehlenhoff) I've created some test packages at https:/...
[08:25:14] <wikibugs_>	 (PS1) Vgutierrez: mtail: Fix varnishrls regex [WIP] [puppet] - https://gerrit.wikimedia.org/r/431712 (https://phabricator.wikimedia.org/T184942)
[08:25:20] <wikibugs_>	 (PS2) Jcrespo: proxysql: require proxysql package installation for module proxysql [puppet] - https://gerrit.wikimedia.org/r/431584 (https://phabricator.wikimedia.org/T193919)
[08:25:42] <wikibugs_>	 (CR) jerkins-bot: [V: -1] mtail: Fix varnishrls regex [WIP] [puppet] - https://gerrit.wikimedia.org/r/431712 (https://phabricator.wikimedia.org/T184942) (owner: Vgutierrez)
[08:26:52] <wikibugs_>	 (PS8) Jcrespo: Install parallel gzip (pigz) and parallel xz (pxz) on all servers [puppet] - https://gerrit.wikimedia.org/r/419709
[08:27:11] <moritzm>	 !log reimaging mw1308, mw1309 (job runners) to stretch
[08:27:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:28:32] <wikibugs_>	 (CR) Jcrespo: [C: 2] Install parallel gzip (pigz) and parallel xz (pxz) on all servers [puppet] - https://gerrit.wikimedia.org/r/419709 (owner: Jcrespo)
[08:28:45] <wikibugs_>	 (PS2) Vgutierrez: mtail: Fix varnishrls regex [WIP] [puppet] - https://gerrit.wikimedia.org/r/431712 (https://phabricator.wikimedia.org/T184942)
[08:29:12] <wikibugs_>	 (CR) jerkins-bot: [V: -1] mtail: Fix varnishrls regex [WIP] [puppet] - https://gerrit.wikimedia.org/r/431712 (https://phabricator.wikimedia.org/T184942) (owner: Vgutierrez)
[08:30:13] <moritzm>	 !log reimaging mw2156, mw2157, mw2158 (job runners) to stretch
[08:30:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:19] <wikibugs_>	 (PS3) Vgutierrez: mtail: Fix varnishrls regex [WIP] [puppet] - https://gerrit.wikimedia.org/r/431712 (https://phabricator.wikimedia.org/T184942)
[08:43:22] <wikibugs_>	 (PS4) Vgutierrez: mtail: Fix varnishrls regex [puppet] - https://gerrit.wikimedia.org/r/431712 (https://phabricator.wikimedia.org/T184942)
[08:45:10] <kart_>	 akosiaris: any issue with scb1002?
[08:49:25] <icinga-wm>	 RECOVERY - MegaRAID on analytics1032 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy
[08:53:27] <wikibugs_>	 (PS1) Marostegui: mariadb: Enable innodb_strict_mode on the last two roles [puppet] - https://gerrit.wikimedia.org/r/431715 (https://phabricator.wikimedia.org/T150949)
[08:55:22] <wikibugs_>	 (CR) Marostegui: [C: 2] "https://puppet-compiler.wmflabs.org/compiler02/11156/" [puppet] - https://gerrit.wikimedia.org/r/431715 (https://phabricator.wikimedia.org/T150949) (owner: Marostegui)
[08:56:40] <moritzm>	 !log reimaging mw1345, mw1346 (API servers) to stretch
[08:56:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:24] <wikibugs_>	 Operations, Traffic, Goal, Patch-For-Review, User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942#4189645 (Vgutierrez) >>! In T184942#4187502, @Krinkle wrote: > @Vgutierrez @ema I'm working on using the Prometheus metrics for the ResourceLoader dash...
[09:04:51] <wikibugs_>	 (PS3) MarcoAurelio: cawiki: remove gendered namespace aliases, already on MW core [mediawiki-config] - https://gerrit.wikimedia.org/r/429989 (https://phabricator.wikimedia.org/T113616)
[09:08:24] <wikibugs_>	 (PS1) Jcrespo: admin: Adjustments to jynus' defaults and aliases [puppet] - https://gerrit.wikimedia.org/r/431716
[09:10:23] <wikibugs_>	 (CR) Jcrespo: [C: 2] admin: Adjustments to jynus' defaults and aliases [puppet] - https://gerrit.wikimedia.org/r/431716 (owner: Jcrespo)
[09:12:12] <wikibugs_>	 Operations, Traffic: Consider adding expect-CT: header to enforce certificate transparency - https://phabricator.wikimedia.org/T193521#4189695 (Vgutierrez) p:Triage>Normal
[09:17:39] <gehel>	 !log reducing replication factor on cassandra v3 (unused) keyspace for maps
[09:17:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:35] <icinga-wm>	 PROBLEM - MegaRAID on analytics1032 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough
[09:20:22] <elukey>	 !log forced a BBU re-learn cycle on analytics1032
[09:20:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:20:45] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 33821672
[09:21:32] <wikibugs_>	 (PS1) Marostegui: Revert "db-eqiad.php: Depool db1121" [mediawiki-config] - https://gerrit.wikimedia.org/r/431717
[09:21:33] <wikibugs_>	 (PS2) Marostegui: Revert "db-eqiad.php: Depool db1121" [mediawiki-config] - https://gerrit.wikimedia.org/r/431717
[09:21:35] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 49390072
[09:21:45] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 794840
[09:23:17] <wikibugs_>	 (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1121" [mediawiki-config] - https://gerrit.wikimedia.org/r/431717 (owner: Marostegui)
[09:23:55] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 45290216
[09:24:28] <wikibugs_>	 (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1121" [mediawiki-config] - https://gerrit.wikimedia.org/r/431717 (owner: Marostegui)
[09:24:46] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 21906160
[09:25:05] <icinga-wm>	 PROBLEM - Disk space on maps2004 is CRITICAL: DISK CRITICAL - free space: /srv 53039 MB (3% inode=99%)
[09:25:44] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1121 after alter table (duration: 01m 00s)
[09:25:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:58] <wikibugs_>	 (Abandoned) Jcrespo: [WIP] Move all misc db scripts to db_maintenance module [puppet] - https://gerrit.wikimedia.org/r/295654 (owner: Jcrespo)
[09:26:06] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 47955288
[09:27:15] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 122681864
[09:28:06] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 154968
[09:29:51] <wikibugs_>	 (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1121" [mediawiki-config] - https://gerrit.wikimedia.org/r/431717 (owner: Marostegui)
[09:32:17] <icinga-wm>	 PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[09:33:16] <elukey>	 I am guessing that these are the last reimages --^
[09:33:40] <icinga-wm>	 PROBLEM - HHVM jobrunner on mw2157 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:33:58] <elukey>	 mw1309/08 are the offenders
[09:34:20] <elukey>	 so yeah good :)
[09:34:31] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw2156 is CRITICAL: Host mw2156 is not in mediawiki-installation dsh group
[09:34:40] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 86122968
[09:35:40] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0
[09:36:20] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw2158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:37:05] <wikibugs_>	 (PS1) Jcrespo: proxysql: Changes added (simplifications) to the proxysql class [puppet] - https://gerrit.wikimedia.org/r/431720 (https://phabricator.wikimedia.org/T171071)
[09:37:24] <wikibugs_>	 (Abandoned) Jcrespo: proxysql: Changes added (simplifications) to the proxysql class [puppet] - https://gerrit.wikimedia.org/r/404154 (https://phabricator.wikimedia.org/T171071) (owner: Jcrespo)
[09:38:21] <icinga-wm>	 RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[09:38:31] <icinga-wm>	 PROBLEM - HHVM jobrunner on mw2156 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:39:42] <wikibugs_>	 (CR) Jcrespo: [C: 2] proxysql: Changes added (simplifications) to the proxysql class [puppet] - https://gerrit.wikimedia.org/r/431720 (https://phabricator.wikimedia.org/T171071) (owner: Jcrespo)
[09:40:30] <moritzm>	 yeah, silencing
[09:41:10] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw2157 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:41:56] <wikibugs_>	 (Abandoned) Jcrespo: mw-maintenance: move mariadb maintenance to tendril [puppet] - https://gerrit.wikimedia.org/r/403978 (https://phabricator.wikimedia.org/T184797) (owner: Dzahn)
[09:46:50] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 29262672
[09:47:00] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2004 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 181536
[09:47:51] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 455056
[09:47:57] <wikibugs_>	 (CR) Ema: [C: 1] mtail: Fix varnishrls regex [puppet] - https://gerrit.wikimedia.org/r/431712 (https://phabricator.wikimedia.org/T184942) (owner: Vgutierrez)
[09:48:22] <wikibugs_>	 (PS1) Jcrespo: mariadb: Depool db1055 [mediawiki-config] - https://gerrit.wikimedia.org/r/431722 (https://phabricator.wikimedia.org/T194118)
[09:48:55] <wikibugs_>	 (CR) Vgutierrez: [C: 2] mtail: Fix varnishrls regex [puppet] - https://gerrit.wikimedia.org/r/431712 (https://phabricator.wikimedia.org/T184942) (owner: Vgutierrez)
[09:49:02] <wikibugs_>	 (PS5) Vgutierrez: mtail: Fix varnishrls regex [puppet] - https://gerrit.wikimedia.org/r/431712 (https://phabricator.wikimedia.org/T184942)
[09:51:20] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 37410272
[09:51:41] <wikibugs_>	 (CR) Jcrespo: [C: 2] mariadb: Depool db1055 [mediawiki-config] - https://gerrit.wikimedia.org/r/431722 (https://phabricator.wikimedia.org/T194118) (owner: Jcrespo)
[09:52:08] <wikibugs_>	 (CR) Ema: prometheus: varnish_thumbnails aggregation rule (1 comment) [puppet] - https://gerrit.wikimedia.org/r/431528 (https://phabricator.wikimedia.org/T184942) (owner: Ema)
[09:52:55] <wikibugs_>	 (Merged) jenkins-bot: mariadb: Depool db1055 [mediawiki-config] - https://gerrit.wikimedia.org/r/431722 (https://phabricator.wikimedia.org/T194118) (owner: Jcrespo)
[09:53:10] <wikibugs_>	 (CR) jenkins-bot: mariadb: Depool db1055 [mediawiki-config] - https://gerrit.wikimedia.org/r/431722 (https://phabricator.wikimedia.org/T194118) (owner: Jcrespo)
[09:55:40] <icinga-wm>	 RECOVERY - Disk space on maps2004 is OK: DISK OK
[09:58:38] <logmsgbot>	 !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1055 (duration: 00m 54s)
[09:58:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:03:30] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 906600
[10:09:40] <icinga-wm>	 PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[10:11:50] <icinga-wm>	 RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[10:15:19] <moritzm>	 !log reimaging mw1310, mw1311 (job runners) to stretch
[10:15:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:20:38] <wikibugs_>	 (PS1) Jcrespo: mariadb: Move db1064 from s4 to x1 [puppet] - https://gerrit.wikimedia.org/r/431724 (https://phabricator.wikimedia.org/T194118)
[10:21:24] <wikibugs_>	 (CR) Jcrespo: [C: 2] mariadb: Move db1064 from s4 to x1 [puppet] - https://gerrit.wikimedia.org/r/431724 (https://phabricator.wikimedia.org/T194118) (owner: Jcrespo)
[10:22:53] <jynus>	 !log stop mariadb on db1055 to clone it to db1064
[10:22:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:29:10] <moritzm>	 !log reimaging mw1347, mw1348 (API servers) to stretch (last two remaining API servers in eqiad)
[10:29:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:37:57] <wikibugs_>	 Operations, SRE-Access-Requests: Access to Google Search Console for Go Fish Digital - https://phabricator.wikimedia.org/T192893#4189871 (Deskana) >>! In T192893#4185804, @faidon wrote: > I'm not sure if this needs my approval, but if it does, it has it, as long as: > - The console data contain PII, so a...
[10:38:04] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 37769144
[10:40:48] <wikibugs_>	 (PS1) Jcrespo: mariadb: Really depool db2092 [mediawiki-config] - https://gerrit.wikimedia.org/r/431727
[10:46:13] <wikibugs_>	 (CR) Ladsgroup: BETA ONLY - WikibaseLexeme config (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/431306 (https://phabricator.wikimedia.org/T184745) (owner: Addshore)
[10:46:34] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 33544504
[10:47:43] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2004 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 237792
[10:47:44] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 89165568
[10:49:54] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 249104
[10:52:04] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 17801672
[10:57:43] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 18855632
[10:58:02] <wikibugs_>	 (CR) Sbisson: [C: 1] "oups, my bad" [mediawiki-config] - https://gerrit.wikimedia.org/r/431609 (owner: Catrope)
[10:58:46] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 214528
[11:00:56] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 40704072
[11:01:36] <icinga-wm>	 PROBLEM - HHVM jobrunner on mw1310 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:02:56] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 22352296
[11:03:47] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 41370656
[11:04:16] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1311 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:04:47] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 376
[11:04:56] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2004 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0
[11:04:57] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0
[11:09:07] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1310 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:10:46] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw1311 is CRITICAL: Host mw1311 is not in mediawiki-installation dsh group
[11:10:55] <moritzm>	 ^ silenced
[11:13:56] <icinga-wm>	 RECOVERY - HHVM jobrunner on mw1310 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time
[11:14:45] <wikibugs_>	 (CR) Marostegui: [C: 2] "Thanks for catching this" [mediawiki-config] - https://gerrit.wikimedia.org/r/431727 (owner: Jcrespo)
[11:15:58] <wikibugs_>	 (Merged) jenkins-bot: mariadb: Really depool db2092 [mediawiki-config] - https://gerrit.wikimedia.org/r/431727 (owner: Jcrespo)
[11:16:26] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1310 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.120 second response time
[11:16:37] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1311 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.010 second response time
[11:18:05] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-codfw.php: Really depool db2092 (duration: 00m 53s)
[11:18:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:18:19] * addshore can't access phab.wm.o..... stupid wifi...
[11:18:47] <icinga-wm>	 RECOVERY - HHVM jobrunner on mw2157 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.085 second response time
[11:18:49] <addshore>	 https://www.irccloud.com/pastebin/mrcV05il/
[11:19:26] <wikibugs_>	 (CR) jenkins-bot: mariadb: Really depool db2092 [mediawiki-config] - https://gerrit.wikimedia.org/r/431727 (owner: Jcrespo)
[11:20:27] <icinga-wm>	 PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[11:20:47] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw2157 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.152 second response time
[11:22:27] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw2158 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.165 second response time
[11:23:48] <icinga-wm>	 RECOVERY - HHVM jobrunner on mw2156 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.074 second response time
[11:24:58] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 18159728
[11:25:58] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2004 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 396568
[11:27:30] <icinga-wm>	 RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[11:27:41] <wikibugs_>	 Operations, SRE-Access-Requests: Access to Google Search Console for Go Fish Digital - https://phabricator.wikimedia.org/T192893#4190022 (faidon) Thanks @Deskana :) I think that all seems sufficient and we should just go ahead with this. 2018-08-01 sounds reasonable, and we can always extend this if ther...
[11:28:30] <addshore>	 Anyone any idea why I would be getting "You don't have permission to access / on this server." for phabricator.wikimedia.org ? :/
[11:28:50] <_joe_>	 addshore: specific url please
[11:28:58] <addshore>	 https://phabricator.wikimedia.org/
[11:29:08] <elukey>	 ip banned?
[11:29:11] <addshore>	 I'm thinking it's something to do with the wifi I'm on
[11:29:23] <addshore>	 ooh, is there a way to check that?
[11:29:41] <_joe_>	 yes I think that's the more probable cause
[11:30:11] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 30717768
[11:30:11] <icinga-wm>	 RECOVERY - MegaRAID on analytics1032 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy
[11:30:31] <addshore>	 it seems to be hitting the wmf server afaict
[11:30:39] <marostegui>	 jouncebot: next
[11:30:39] <jouncebot>	 In 1 hour(s) and 29 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180508T1300)
[11:31:00] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 22205400
[11:32:00] <wikibugs_>	 (PS1) Marostegui: db-eqiad.php: Depool db1103:3314 [mediawiki-config] - https://gerrit.wikimedia.org/r/431733 (https://phabricator.wikimedia.org/T190148)
[11:32:11] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 171733376
[11:32:13] <elukey>	 addshore: can you tell me in pvt your external IP address?
[11:33:00] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0
[11:33:11] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0
[11:33:11] <addshore>	 elukey: yes
[11:33:24] <wikibugs_>	 (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1103:3314 [mediawiki-config] - https://gerrit.wikimedia.org/r/431733 (https://phabricator.wikimedia.org/T190148) (owner: Marostegui)
[11:34:39] <wikibugs_>	 (Merged) jenkins-bot: db-eqiad.php: Depool db1103:3314 [mediawiki-config] - https://gerrit.wikimedia.org/r/431733 (https://phabricator.wikimedia.org/T190148) (owner: Marostegui)
[11:36:00] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1103:3314 for alter table (duration: 00m 59s)
[11:36:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:36:13] <marostegui>	 !log Deploy schema change on db1103:3314 - T191519 T188299 T190148
[11:36:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:36:19] <stashbot>	 T191519: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519
[11:36:19] <stashbot>	 T190148: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148
[11:36:19] <stashbot>	 T188299: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299
[11:39:27] <wikibugs_>	 (CR) jenkins-bot: db-eqiad.php: Depool db1103:3314 [mediawiki-config] - https://gerrit.wikimedia.org/r/431733 (https://phabricator.wikimedia.org/T190148) (owner: Marostegui)
[11:45:21] <wikibugs_>	 (PS1) Jcrespo: mariadb: Pool db1064 into x1 with low load [mediawiki-config] - https://gerrit.wikimedia.org/r/431734 (https://phabricator.wikimedia.org/T194118)
[11:46:47] <wikibugs_>	 (PS1) Jcrespo: mariadb: Remove references to db1055, to be decom [mediawiki-config] - https://gerrit.wikimedia.org/r/431735 (https://phabricator.wikimedia.org/T194118)
[11:58:02] <wikibugs_>	 (CR) Jcrespo: [C: 2] mariadb: Pool db1064 into x1 with low load [mediawiki-config] - https://gerrit.wikimedia.org/r/431734 (https://phabricator.wikimedia.org/T194118) (owner: Jcrespo)
[11:59:18] <wikibugs_>	 (Merged) jenkins-bot: mariadb: Pool db1064 into x1 with low load [mediawiki-config] - https://gerrit.wikimedia.org/r/431734 (https://phabricator.wikimedia.org/T194118) (owner: Jcrespo)
[11:59:34] <wikibugs_>	 (PS1) Giuseppe Lavagetto: mcrouter: add support for listening on the ssl port [puppet] - https://gerrit.wikimedia.org/r/431736 (https://phabricator.wikimedia.org/T192370)
[11:59:36] <wikibugs_>	 (CR) jenkins-bot: mariadb: Pool db1064 into x1 with low load [mediawiki-config] - https://gerrit.wikimedia.org/r/431734 (https://phabricator.wikimedia.org/T194118) (owner: Jcrespo)
[11:59:38] <wikibugs_>	 (PS1) Giuseppe Lavagetto: profile::mediawiki::mcrouter_wancache: add ssl, proxy support [puppet] - https://gerrit.wikimedia.org/r/431737 (https://phabricator.wikimedia.org/T192370)
[11:59:40] <wikibugs_>	 (PS1) Giuseppe Lavagetto: puppet_ecdsacert: allow IP-based SANs [puppet] - https://gerrit.wikimedia.org/r/431738 (https://phabricator.wikimedia.org/T192370)
[12:00:17] <icinga-wm>	 PROBLEM - MegaRAID on analytics1032 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough
[12:00:22] <wikibugs_>	 (03CR) 10jerkins-bot: [V: 04-1] profile::mediawiki::mcrouter_wancache: add ssl, proxy support [puppet] - 10https://gerrit.wikimedia.org/r/431737 (https://phabricator.wikimedia.org/T192370) (owner: 10Giuseppe Lavagetto)
[12:02:11] <wikibugs_>	 (03PS3) 10Thiemo Kreuz (WMDE): Update Wikidata property blacklist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430045 (owner: 10Matěj Suchánek)
[12:02:14] <logmsgbot>	 !log jynus@tin Synchronized wmf-config/db-eqiad.php: Pool db1064 with low load (duration: 00m 59s)
[12:02:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:02:30] <wikibugs_>	 (03CR) 10Thiemo Kreuz (WMDE): [C: 031] "Both additions make sense and fit well with the other properties listed." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430045 (owner: 10Matěj Suchánek)
[12:05:47] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 36843672
[12:06:47] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 25792272
[12:07:47] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2004 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1816
[12:07:47] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 35048
[12:10:17] <icinga-wm>	 RECOVERY - MegaRAID on analytics1032 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy
[12:12:42] <wikibugs_>	 (03PS2) 10Jcrespo: mariadb: Remove references to db1055, to be decom [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431735 (https://phabricator.wikimedia.org/T194118)
[12:12:44] <wikibugs_>	 (03PS1) 10Jcrespo: mariab: Fully pool db1064 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431740 (https://phabricator.wikimedia.org/T194118)
[12:13:33] <wikibugs_>	 (03CR) 10Jcrespo: [C: 032] mariadb: Remove references to db1055, to be decom [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431735 (https://phabricator.wikimedia.org/T194118) (owner: 10Jcrespo)
[12:14:46] <wikibugs_>	 (03Merged) 10jenkins-bot: mariadb: Remove references to db1055, to be decom [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431735 (https://phabricator.wikimedia.org/T194118) (owner: 10Jcrespo)
[12:16:51] <moritzm>	 !log upgrading app servers in beta to 
[12:16:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:17:16] <moritzm>	 !log upgrading app servers in beta to wikidiff 1.6.0 (T190717)
[12:17:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:17:20] <stashbot>	 T190717: Update wikidiff2 library on the WMF production cluster - https://phabricator.wikimedia.org/T190717
[12:18:38] <logmsgbot>	 !log jynus@tin Synchronized wmf-config/db-codfw.php: Remove db1055 (duration: 00m 59s)
[12:18:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:19:12] <moritzm>	 !log reimaging mw2159, mw2160, mw2161 (job runners) to stretch
[12:19:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:19:49] <wikibugs_>	 (03CR) 10jenkins-bot: mariadb: Remove references to db1055, to be decom [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431735 (https://phabricator.wikimedia.org/T194118) (owner: 10Jcrespo)
[12:20:52] <logmsgbot>	 !log jynus@tin Synchronized wmf-config/db-eqiad.php: Remove db1055 (duration: 00m 59s)
[12:20:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:26:31] <wikibugs_>	 (03PS1) 10Jcrespo: mariadb: Upgrade proxysql package [software] - 10https://gerrit.wikimedia.org/r/431742 (https://phabricator.wikimedia.org/T175672)
[12:27:27] <wikibugs_>	 (03CR) 10Jcrespo: [C: 032] mariadb: Upgrade proxysql package [software] - 10https://gerrit.wikimedia.org/r/431742 (https://phabricator.wikimedia.org/T175672) (owner: 10Jcrespo)
[12:28:38] <doctaxon>	 Hi! Any logins of user TaxonBot (dewiki) return login {result Failed reason {You have made too many recent login attempts. Please wait 2 days before trying again.}}
[12:28:42] <doctaxon>	 chasemp:  
[12:28:48] <doctaxon>	 chasemp:  ^^
[12:29:43] <wikibugs_>	 (03PS4) 10Filippo Giunchedi: base: alert on edac (un)correctable errors [puppet] - 10https://gerrit.wikimedia.org/r/422110 (https://phabricator.wikimedia.org/T183177)
[12:33:34] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 32921368
[12:35:34] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0
[12:38:01] <wikibugs_>	 (03PS5) 10Filippo Giunchedi: base: alert on EDAC correctable errors [puppet] - 10https://gerrit.wikimedia.org/r/422110 (https://phabricator.wikimedia.org/T183177)
[12:39:14] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 89698744
[12:39:34] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 124811000
[12:39:44] <wikibugs_>	 10Operations, 10MediaWiki-General-or-Unknown: Investigate spike in 500s during asw-c2-eqiad replacement - https://phabricator.wikimedia.org/T156475#4190160 (10jcrespo)
[12:40:15] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 252976
[12:40:48] <wikibugs_>	 10Operations, 10WMDE-QWERTY-Team, 10wikidiff2, 10Patch-For-Review: Update wikidiff2 library on the WMF production cluster - https://phabricator.wikimedia.org/T190717#4190164 (10MoritzMuehlenhoff) > In the mean time deployment-prep was also migrated to stretch, so as a preparatory step I'll prepare wikidiff...
[12:42:44] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2004 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 16
[12:48:04] <wikibugs_>	 (03CR) 10Filippo Giunchedi: [C: 032] base: alert on EDAC correctable errors [puppet] - 10https://gerrit.wikimedia.org/r/422110 (https://phabricator.wikimedia.org/T183177) (owner: 10Filippo Giunchedi)
[12:48:31] <wikibugs_>	 (03PS1) 10Jcrespo: dbhosts: Promote db1069 as master, remove db1055 [software] - 10https://gerrit.wikimedia.org/r/431747 (https://phabricator.wikimedia.org/T194118)
[12:49:04] <wikibugs_>	 (03PS2) 10Jcrespo: dbhosts: Remove db1055 [software] - 10https://gerrit.wikimedia.org/r/431747 (https://phabricator.wikimedia.org/T194118)
[12:50:03] <wikibugs_>	 (03CR) 10Jcrespo: [C: 032] dbhosts: Remove db1055 [software] - 10https://gerrit.wikimedia.org/r/431747 (https://phabricator.wikimedia.org/T194118) (owner: 10Jcrespo)
[12:50:17] <wikibugs_>	 (03CR) 10Jcrespo: [V: 032 C: 032] dbhosts: Remove db1055 [software] - 10https://gerrit.wikimedia.org/r/431747 (https://phabricator.wikimedia.org/T194118) (owner: 10Jcrespo)
[12:50:32] <wikibugs_>	 (03CR) 10Filippo Giunchedi: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/431057 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn)
[12:54:52] <wikibugs_>	 (03PS1) 10Jcrespo: mariadb: Set db1055 as spare before decommission [puppet] - 10https://gerrit.wikimedia.org/r/431748 (https://phabricator.wikimedia.org/T194118)
[12:56:17] <wikibugs_>	 (03CR) 10Jcrespo: [C: 032] mariadb: Set db1055 as spare before decommission [puppet] - 10https://gerrit.wikimedia.org/r/431748 (https://phabricator.wikimedia.org/T194118) (owner: 10Jcrespo)
[13:00:04] <jouncebot>	 addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor My software never has bugs. It just develops random features. Rise for European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180508T1300).
[13:00:04] <jouncebot>	 chiborg, stephanebisson, and Nikerabbit: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:15] <zeljkof>	 I can SWAT today
[13:00:18] <stephanebisson>	 hello
[13:00:29] <chiborg>	 Hi all
[13:00:35] <icinga-wm>	 PROBLEM - MegaRAID on analytics1032 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough
[13:00:52] <addshore>	 \o
[13:00:56] <Nikerabbit>	 purrr
[13:01:12] <addshore>	 oooh, chiborg, it is advanced search time :D
[13:01:29] <chiborg>	 \o/
[13:02:10] <zeljkof>	 ok everybody, if there is nothing urgent, I'll just deploy in calendar order, ok?
[13:02:42] <zeljkof>	 chiborg: I'll ping you in a few minutes when your patch is at mwdebug1002, so you can test it there
[13:02:53] <marostegui>	 !log Manually fail disk #9 on db1073 to get it replaced
[13:02:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:03:29] <wikibugs_>	 (03PS2) 10Zfilipin: Enable AdvancedSearch BetaFeature on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430388 (https://phabricator.wikimedia.org/T193182) (owner: 10Gabriel Birke)
[13:04:31] <wikibugs_>	 (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430388 (https://phabricator.wikimedia.org/T193182) (owner: 10Gabriel Birke)
[13:05:43] <wikibugs_>	 (03Merged) 10jenkins-bot: Enable AdvancedSearch BetaFeature on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430388 (https://phabricator.wikimedia.org/T193182) (owner: 10Gabriel Birke)
[13:07:24] <zeljkof>	 chiborg: your patch is at mwdebug1002, please test and let me know if I can deploy; let me know if you do not know how to test there
[13:08:38] <wikibugs_>	 (03PS2) 10Zfilipin: Enable maps i18n everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431628 (owner: 10Sbisson)
[13:08:39] <chiborg>	 sorry zeljkof, what is the full url?
[13:09:10] <zeljkof>	 chiborg: instructions on how to test https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug
[13:09:52] <zeljkof>	 in short, install the chrome extension, enable it for mwdebug1002, go to any wikimedia site and the extension will make sure you reach mwdebug1002
[13:09:53] <wikibugs_>	 (03CR) 10jenkins-bot: Enable AdvancedSearch BetaFeature on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430388 (https://phabricator.wikimedia.org/T193182) (owner: 10Gabriel Birke)
[13:10:29] <zeljkof>	 let me know if the docs are not clear on how to do it
[13:10:46] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw1311 is OK: OK
[13:12:22] <wikibugs_>	 10Operations, 10ops-eqiad, 10netops, 10Patch-For-Review: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#4190310 (10Marostegui) @ayounsi today we have failed over x1 master which was in row C, to a new host in row A. The x1 blocker is now gone and you should be go...
[13:13:04] <chiborg>	 zeljkof Yay, it works! I've tried the english wikipedia, activated the extension as a beta feature and went to the search page.
[13:13:26] <zeljkof>	 chiborg: ok to deploy?
[13:13:37] <chiborg>	 zeljkof yes
[13:13:38] <wikibugs_>	 (03PS1) 10Ema: prometheus: aggregate varnish uptime resets [puppet] - 10https://gerrit.wikimedia.org/r/431749
[13:13:50] <zeljkof>	 chiborg: ok, deploying
[13:14:54] <logmsgbot>	 !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:430388|Enable AdvancedSearch BetaFeature on all wikis (T193182)]] (duration: 01m 00s)
[13:14:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:14:57] <stashbot>	 T193182: Enable AdvancedSearch as a beta feature on all wikis - https://phabricator.wikimedia.org/T193182
[13:15:29] <zeljkof>	 chiborg: deployed; please disable the extension, test on any wiki and thanks for deploying with #releng! ;)
[13:16:01] <zeljkof>	 stephanebisson: please stand by, I'll ping you in a few minutes when the patch is at mwdebug
[13:16:32] <zeljkof>	 stephanebisson: there is no related phab task for 431628?
[13:16:41] <zeljkof>	 (I don't see one in commit message)
[13:17:04] <wikibugs_>	 10Operations, 10MediaWiki-Parser, 10MediaWiki-Platform-Team, 10Parsing-Team, and 2 others: Different production servers have different versions of tidy installed, resulting in varying output - https://phabricator.wikimedia.org/T193414#4190334 (10ssastry) >>! In T193414#4189417, @MoritzMuehlenhoff wrote: >...
[13:17:08] <stephanebisson>	 zeljkof: This is the task. I forgot to link it: https://phabricator.wikimedia.org/T191655
[13:17:43] <icinga-wm>	 PROBLEM - HHVM jobrunner on mw2159 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:17:45] <zeljkof>	 stephanebisson: could you please amend the commit message?
[13:17:53] <stephanebisson>	 yep
[13:18:29] <wikibugs_>	 (03PS3) 10Sbisson: Enable maps i18n everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431628 (https://phabricator.wikimedia.org/T191655)
[13:18:36] <stephanebisson>	 done
[13:18:47] <zeljkof>	 thanks!
[13:18:52] <wikibugs_>	 10Operations, 10CirrusSearch, 10Discovery, 10Elasticsearch, 10Discovery-Search (Current work): Alert when elasticsearch writes are frozen for too long - https://phabricator.wikimedia.org/T193605#4173733 (10Gehel) a:03Gehel
[13:19:21] <wikibugs_>	 (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431628 (https://phabricator.wikimedia.org/T191655) (owner: 10Sbisson)
[13:20:03] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw2161 is CRITICAL: Host mw2161 is not in mediawiki-installation dsh group
[13:20:04] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw2160 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:20:31] <wikibugs_>	 (03Merged) 10jenkins-bot: Enable maps i18n everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431628 (https://phabricator.wikimedia.org/T191655) (owner: 10Sbisson)
[13:21:24] <zeljkof>	 stephanebisson: the
[13:21:43] <zeljkof>	 sorry, stephanebisson: the patch is at mwdebug, let me know if it's ok to deploy
[13:21:52] <stephanebisson>	 ok, testing now
[13:22:38] <stephanebisson>	 zeljkof: mwdebug1001 or 1002?
[13:22:44] <ema>	 godog: there are a bunch of unknowns related to memory correctable errors, perhaps due to https://gerrit.wikimedia.org/r/#/c/422110/?
[13:22:57] <godog>	 ema: indeed, I'll take a look
[13:23:00] <zeljkof>	 stephanebisson: sorry, it's always 1002 :D
[13:23:20] <zeljkof>	 I'm strictly following instructions https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#Canary
[13:23:43] <icinga-wm>	 PROBLEM - HHVM jobrunner on mw2161 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:24:12] <stephanebisson>	 zeljkof: looks good
[13:24:17] <stephanebisson>	 zeljkof: you can deploy
[13:24:18] <zeljkof>	 Nikerabbit: please stand by, I'll ping you in a few minutes when your patch is ready for testing
[13:24:23] <zeljkof>	 stephanebisson: ok, deploying
[13:25:04] <Nikerabbit>	 zeljkof: fyi, my patch cannot fully be tested because it interacts with jobqueue – it's a request by mobrovac and we will monitor the jobs once they can switch it to new jobrunner
[13:25:16] <chiborg>	 zeljkof We've tested on en, nl, es, fr, ru and bg, looks fine there. 
[13:25:28] <logmsgbot>	 !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:431628|Enable maps i18n everywhere (T191655)]] (duration: 01m 00s)
[13:25:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:25:32] <stashbot>	 T191655: Deploy maps internationalization to production - https://phabricator.wikimedia.org/T191655
[13:25:35] <zeljkof>	 Nikerabbit: ok, so I can deploy without mwdebug? or should I deploy there first?
[13:25:42] <zeljkof>	 chiborg: great!
[13:25:55] <zeljkof>	 stephanebisson: deployed, please test and thanks for deploying with #releng! ;)
[13:26:03] <Nikerabbit>	 zeljkof: without mwdebug is good
[13:26:13] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw2160 is CRITICAL: Host mw2160 is not in mediawiki-installation dsh group
[13:26:13] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw2159 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:26:38] <wikibugs_>	 (03CR) 10jenkins-bot: Enable maps i18n everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431628 (https://phabricator.wikimedia.org/T191655) (owner: 10Sbisson)
[13:26:45] <zeljkof>	 Nikerabbit: ok, I'll ping you when it's deployed then, depends on how fast CI will be :)
[13:27:24] <icinga-wm>	 PROBLEM - MegaRAID on db1073 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded)
[13:27:25] <icinga-wm>	 ACKNOWLEDGEMENT - MegaRAID on db1073 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T194155
[13:27:27] <stephanebisson>	 zeljkof: thank you!
[13:27:29] <Nikerabbit>	 zeljkof: :+1:
[13:27:44] <wikibugs_>	 10Operations, 10MediaWiki-Parser, 10MediaWiki-Platform-Team, 10Parsing-Team, and 2 others: Different production servers have different versions of tidy installed, resulting in varying output - https://phabricator.wikimedia.org/T193414#4190380 (10MoritzMuehlenhoff) Ack, at this point only four job runners i...
[13:28:04] <wikibugs_>	 10Operations, 10HHVM, 10Patch-For-Review, 10User-Elukey: Upgrade mw* servers to Debian Stretch (using HHVM) - https://phabricator.wikimedia.org/T174431#4190391 (10MoritzMuehlenhoff) All API servers in eqiad are now running stretch.
[13:28:06] <wikibugs_>	 10Operations, 10ops-eqiad: Degraded RAID on db1073 - https://phabricator.wikimedia.org/T194155#4190393 (10ops-monitoring-bot)
[13:28:21] <wikibugs_>	 (03PS1) 10Filippo Giunchedi: base: sum EDAC correctable errors [puppet] - 10https://gerrit.wikimedia.org/r/431750 (https://phabricator.wikimedia.org/T183177)
[13:28:34] <zeljkof>	 Nikerabbit: I did not notice that your patch is for an extension, I would merge it first and deploy last, since it is usually slow in CI...
[13:28:43] <zeljkof>	 anyway, it should not take long, a few minutes
[13:29:17] <wikibugs_>	 (03CR) 10Filippo Giunchedi: [C: 032] base: sum EDAC correctable errors [puppet] - 10https://gerrit.wikimedia.org/r/431750 (https://phabricator.wikimedia.org/T183177) (owner: 10Filippo Giunchedi)
[13:29:23] <wikibugs_>	 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4190417 (10Marostegui) db2092 is now a temporary multi-instance sanitarium host in codfw, replicating the same sections as db11...
[13:29:44] <icinga-wm>	 PROBLEM - HHVM jobrunner on mw2160 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:31:45] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw2159 is CRITICAL: Host mw2159 is not in mediawiki-installation dsh group
[13:31:54] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw2161 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:32:55] <moritzm>	 ^ silencing
[13:33:40] <wikibugs_>	 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1073 - https://phabricator.wikimedia.org/T194155#4190471 (10Marostegui) p:05Triage>03High a:03Cmjohnson @Cmjohnson this host has 2 disks with smart alert. I have manually failed disk #9, let's change that one first, let it rebuild and then we can man...
[13:34:34] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw2156 is OK: OK
[13:38:36] <zeljkof>	 Nikerabbit: merged, deploying... 
[13:41:02] <icinga-wm>	 PROBLEM - Memory correctable errors -EDAC- on wtp2013 is CRITICAL: 840 ge 3 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=wtp2013&var-datasource=codfw%2520prometheus%252Fops
[13:41:03] <Nikerabbit>	 this is always so exciting!
[13:41:11] <icinga-wm>	 PROBLEM - Memory correctable errors -EDAC- on db1053 is CRITICAL: 8 ge 3 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=db1053&var-datasource=eqiad%2520prometheus%252Fops
[13:41:12] <icinga-wm>	 PROBLEM - Memory correctable errors -EDAC- on scb1002 is CRITICAL: 32 ge 3 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=scb1002&var-datasource=eqiad%2520prometheus%252Fops
[13:41:20] <godog>	 sigh, sorry about the spam
[13:41:21] <wikibugs_>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] "Some minor comments inline, rest LGTM" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/430881 (https://phabricator.wikimedia.org/T191299) (owner: 10Volans)
[13:41:22] <icinga-wm>	 PROBLEM - Memory correctable errors -EDAC- on rdb2002 is CRITICAL: 315 ge 3 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=rdb2002&var-datasource=codfw%2520prometheus%252Fops
[13:41:31] <icinga-wm>	 PROBLEM - Memory correctable errors -EDAC- on elastic1029 is CRITICAL: 5 ge 3 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=elastic1029&var-datasource=eqiad%2520prometheus%252Fops
[13:42:22] <logmsgbot>	 !log zfilipin@tin Synchronized php-1.32.0-wmf.2/extensions/Translate: SWAT: [[gerrit:431744|Refactor TranslationUpdateJob to use only primitive types for parameters (T192111)]] (duration: 01m 11s)
[13:42:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:42:27] <stashbot>	 T192111: Make TranslationsUpdateJob JSON-serializable - https://phabricator.wikimedia.org/T192111
[13:42:33] <godog>	 though it is actually the case that there are correctable errors, according to the kernel anyway
[13:42:58] <zeljkof>	 Nikerabbit: deployed, please test and thanks for deploying with #releng! ;)
[13:43:41] <zeljkof>	 !log EU SWAT finished
[13:43:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:44:56] <gehel>	 godog: you had me worried for an instant :)
[13:45:44] <godog>	 gehel: sudden magnetic storm!
[13:45:49] <Nikerabbit>	 zeljkof: yep, thanks!
[13:46:01] <gehel>	 godog: sounds like an interesting attack vector :)
[13:46:50] <bblack>	 woah nice, working EDAC errors in icinga :)
[13:48:32] <godog>	 heheh getting there
[13:48:49] <godog>	 it'll spam some more as alerts are added
[13:51:01] <icinga-wm>	 PROBLEM - Memory correctable errors -EDAC- on wtp2020 is CRITICAL: 47 ge 3 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=wtp2020&var-datasource=codfw%2520prometheus%252Fops
[13:56:02] <wikibugs_>	 (03PS5) 10Ema: numa_networking: move setting to tlsproxy::instance [puppet] - 10https://gerrit.wikimedia.org/r/430902 (https://phabricator.wikimedia.org/T193865)
[13:57:18] <wikibugs_>	 (03CR) 10Muehlenhoff: debmonitor: add server side puppettization (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/430881 (https://phabricator.wikimedia.org/T191299) (owner: 10Volans)
[14:02:05] <icinga-wm>	 PROBLEM - Memory correctable errors -EDAC- on cp1068 is CRITICAL: 5 ge 3 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=cp1068&var-datasource=eqiad%2520prometheus%252Fops
[14:07:12] <wikibugs_>	 (03PS1) 10Gehel: elasticsearch: alert when cirrus writes are frozen for too long [puppet] - 10https://gerrit.wikimedia.org/r/431754 (https://phabricator.wikimedia.org/T193605)
[14:07:46] <wikibugs_>	 (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: alert when cirrus writes are frozen for too long [puppet] - 10https://gerrit.wikimedia.org/r/431754 (https://phabricator.wikimedia.org/T193605) (owner: 10Gehel)
[14:09:29] <wikibugs_>	 10Operations: Something is wrong with installer root disk stuff - https://phabricator.wikimedia.org/T149845#4190580 (10MoritzMuehlenhoff) 05Open>03Resolved a:03MoritzMuehlenhoff >>! In T149845#4183217, @RobH wrote: > This seems fixed by adding the rootdelay for jessie and older, and stretch has it go away....
[14:10:03] <icinga-wm>	 PROBLEM - Memory correctable errors -EDAC- on thumbor1004 is CRITICAL: 533 ge 3 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor1004&var-datasource=eqiad%2520prometheus%252Fops
[14:10:12] <icinga-wm>	 PROBLEM - Memory correctable errors -EDAC- on mw2213 is CRITICAL: 439 ge 3 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=mw2213&var-datasource=codfw%2520prometheus%252Fops
[14:10:42] <wikibugs_>	 (03PS2) 10Gehel: elasticsearch: alert when cirrus writes are frozen for too long [puppet] - 10https://gerrit.wikimedia.org/r/431754 (https://phabricator.wikimedia.org/T193605)
[14:11:15] <wikibugs_>	 (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: alert when cirrus writes are frozen for too long [puppet] - 10https://gerrit.wikimedia.org/r/431754 (https://phabricator.wikimedia.org/T193605) (owner: 10Gehel)
[14:11:21] <wikibugs_>	 (03PS6) 10Ema: numa_networking: move setting to tlsproxy::instance [puppet] - 10https://gerrit.wikimedia.org/r/430902 (https://phabricator.wikimedia.org/T193865)
[14:11:50] <wikibugs_>	 10Operations, 10User-fgiunchedi: rack/setup/install ms-be104[0-3].eqiad.wmnet - https://phabricator.wikimedia.org/T190081#4190607 (10fgiunchedi) 05Open>03Resolved Rebalance has completed, resolving
[14:12:41] <doctaxon>	 T194160
[14:12:41] <stashbot>	 T194160: Unlock the login of bot user TaxonBot@TaxonBot to dewiki - https://phabricator.wikimedia.org/T194160
[14:12:42] <wikibugs_>	 (03PS3) 10Gehel: elasticsearch: alert when cirrus writes are frozen for too long [puppet] - 10https://gerrit.wikimedia.org/r/431754 (https://phabricator.wikimedia.org/T193605)
[14:14:47] <wikibugs_>	 10Operations, 10monitoring, 10Graphite, 10User-fgiunchedi: Programmatic generation of grafana dashboards - https://phabricator.wikimedia.org/T171482#4190625 (10fgiunchedi)
[14:14:49] <wikibugs_>	 10Operations, 10monitoring, 10User-fgiunchedi: save grafana dashboards in revision control / puppet - https://phabricator.wikimedia.org/T133392#4190627 (10fgiunchedi)
[14:15:50] <mutante>	 !log mw2215,mw2222,mw2223 - reinstalling with stretch
[14:15:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:15:56] <wikibugs_>	 (03CR) 10Filippo Giunchedi: [C: 031] prometheus: aggregate varnish uptime resets [puppet] - 10https://gerrit.wikimedia.org/r/431749 (owner: 10Ema)
[14:16:22] <icinga-wm>	 PROBLEM - Memory correctable errors -EDAC- on db1051 is CRITICAL: 109 ge 3 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=db1051&var-datasource=eqiad%2520prometheus%252Fops
[14:19:04] <logmsgbot>	 !log ppchelko@tin Started restart [changeprop/deploy@7e86531]: Restart changeprop to try forcing it rebalancing topics
[14:19:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:19:49] <wikibugs_>	 (03PS2) 10Ema: prometheus: aggregate varnish uptime resets [puppet] - 10https://gerrit.wikimedia.org/r/431749
[14:20:06] <wikibugs_>	 (03CR) 10Ema: [C: 032] prometheus: aggregate varnish uptime resets [puppet] - 10https://gerrit.wikimedia.org/r/431749 (owner: 10Ema)
[14:23:09] <wikibugs_>	 (03PS2) 10Ottomata: Stop main-eqiad -> main-codfw MirrorMaker during Kafka main upgrade [puppet] - 10https://gerrit.wikimedia.org/r/431588 (https://phabricator.wikimedia.org/T167039)
[14:23:14] <wikibugs_>	 (03PS1) 10Dzahn: admins: add Shannon Bailey to ldap_only_admins [puppet] - 10https://gerrit.wikimedia.org/r/431756 (https://phabricator.wikimedia.org/T194091)
[14:23:33] <wikibugs_>	 (03CR) 10jerkins-bot: [V: 04-1] admins: add Shannon Bailey to ldap_only_admins [puppet] - 10https://gerrit.wikimedia.org/r/431756 (https://phabricator.wikimedia.org/T194091) (owner: 10Dzahn)
[14:23:51] <wikibugs_>	 (03PS2) 10Dzahn: admins: add Shannon Bailey to ldap_only_admins [puppet] - 10https://gerrit.wikimedia.org/r/431756 (https://phabricator.wikimedia.org/T194091)
[14:24:39] <wikibugs_>	 (03CR) 10EBernhardson: elasticsearch: alert when cirrus writes are frozen for too long (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/431754 (https://phabricator.wikimedia.org/T193605) (owner: 10Gehel)
[14:26:08] <wikibugs_>	 (03CR) 10Gehel: elasticsearch: alert when cirrus writes are frozen for too long (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/431754 (https://phabricator.wikimedia.org/T193605) (owner: 10Gehel)
[14:26:53] <icinga-wm>	 PROBLEM - Memory correctable errors -EDAC- on kafka1023 is CRITICAL: 13 ge 3 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=kafka1023&var-datasource=eqiad%2520prometheus%252Fops
[14:34:18] <wikibugs_>	 (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler02/11160/kafka2001.codfw.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/431588 (https://phabricator.wikimedia.org/T167039) (owner: 10Ottomata)
[14:35:47] <wikibugs_>	 (03Abandoned) 10Dzahn: nutcracker: puppetize missing /var/run/nutcracker dir [puppet] - 10https://gerrit.wikimedia.org/r/431057 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn)
[14:37:12] <icinga-wm>	 RECOVERY - MegaRAID on analytics1032 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy
[14:37:40] <mutante>	 !log LDAP: added 'sbailey' to group 'wmf' (T194091)
[14:37:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:37:45] <stashbot>	 T194091: Add sbailey to wmf and other ldap groups - https://phabricator.wikimedia.org/T194091
[14:38:32] <wikibugs_>	 (03PS1) 10Pmiazga: Remove unused PopupsAnonsExperimentalGroupSize config variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431759 (https://phabricator.wikimedia.org/T173952)
[14:41:04] <wikibugs_>	 (03PS2) 10Pmiazga: Remove unused PopupsAnonsExperimentalGroupSize config variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431759 (https://phabricator.wikimedia.org/T173952)
[14:43:35] <wikibugs_>	 (03CR) 10Dzahn: [C: 032] "done in LDAP, this to reflect the status quo" [puppet] - 10https://gerrit.wikimedia.org/r/431756 (https://phabricator.wikimedia.org/T194091) (owner: 10Dzahn)
[14:48:27] <XioNoX>	 !log disabling pybal on lvs2004 - T193677
[14:48:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:48:31] <stashbot>	 T193677: Interface errors on asw-d-codfw:xe-2/0/47 - https://phabricator.wikimedia.org/T193677
[14:51:01] <wikibugs_>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] "A few comments here and there, but I 've finally reviewed all of it. Nice work!" (038 comments) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394620 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans)
[14:52:29] <wikibugs_>	 (03CR) 10Alexandros Kosiaris: [C: 031] Add basic test coverage [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394621 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans)
[14:53:47] <wikibugs_>	 (03CR) 10Alexandros Kosiaris: [C: 031] Add CLI script to be installed in the target hosts [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394990 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans)
[14:53:52] <XioNoX>	 !log re-enable pybal on lvs2004 - T193677
[14:53:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:53:57] <stashbot>	 T193677: Interface errors on asw-d-codfw:xe-2/0/47 - https://phabricator.wikimedia.org/T193677
[14:55:30] <wikibugs_>	 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2067 - https://phabricator.wikimedia.org/T194103#4190864 (10Papaul) a:05Papaul>03Marostegui @Marostegui     Disk replacement complete
[14:56:25] <wikibugs_>	 10Operations, 10ops-codfw: Degraded RAID on wasat - https://phabricator.wikimedia.org/T193394#4190867 (10Papaul) Dear Papaul Tshibamba,     We are contacting you in regards to your case ID# 5329190939.  Please be aware that a functional equivalent part (656108-001) (SPS-DRV HD 1TB 6G SATA 7.2K 2.5 MDL SC) has...
[14:56:53] <wikibugs_>	 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2067 - https://phabricator.wikimedia.org/T194103#4190868 (10Marostegui) Thanks!  ``` physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, Rebuilding) ```
[14:57:35] <wikibugs_>	 (03CR) 10Alexandros Kosiaris: [C: 031] Add server side validation of client certificates [software/debmonitor] - 10https://gerrit.wikimedia.org/r/428302 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans)
[15:02:24] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw2215 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[15:02:55] <icinga-wm>	 PROBLEM - HHVM rendering on mw2215 is CRITICAL: connect to address 10.192.0.40 and port 80: Connection refused
[15:02:55] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw2222 is CRITICAL: connect to address 10.192.0.47 and port 443: Connection refused
[15:02:55] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw2223 is CRITICAL: connect to address 10.192.0.48 and port 443: Connection refused
[15:02:55] <icinga-wm>	 PROBLEM - nutcracker port on mw2215 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[15:02:55] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw2222 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[15:02:55] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw2223 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[15:02:56] <wikibugs_>	 10Operations, 10ops-codfw, 10DC-Ops: rigel.frack.codfw.wmnet (fundraising codfw bastion) will not boot after a power cycle - https://phabricator.wikimedia.org/T193891#4190893 (10Papaul) @jgree as  requested, the server is back up again
[15:04:24] <icinga-wm>	 PROBLEM - nutcracker process on mw2215 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[15:04:24] <icinga-wm>	 PROBLEM - DPKG on mw2222 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[15:04:24] <icinga-wm>	 PROBLEM - DPKG on mw2223 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[15:05:54] <icinga-wm>	 PROBLEM - puppet last run on mw2215 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[15:05:54] <icinga-wm>	 PROBLEM - configured eth on mw2222 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[15:05:54] <icinga-wm>	 PROBLEM - configured eth on mw2223 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[15:06:27] <ottomata>	 !log beginning Kafka upgrade of main-codfw: T167039
[15:06:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:06:31] <stashbot>	 T167039: Upgrade Kafka on main cluster with security features - https://phabricator.wikimedia.org/T167039
[15:07:08] <wikibugs_>	 (03CR) 10Ottomata: [C: 032] Stop main-eqiad -> main-codfw MirrorMaker during Kafka main upgrade [puppet] - 10https://gerrit.wikimedia.org/r/431588 (https://phabricator.wikimedia.org/T167039) (owner: 10Ottomata)
[15:07:21] <wikibugs_>	 (03PS3) 10Ottomata: Stop main-eqiad -> main-codfw MirrorMaker during Kafka main upgrade [puppet] - 10https://gerrit.wikimedia.org/r/431588 (https://phabricator.wikimedia.org/T167039)
[15:07:22] <wikibugs_>	 (03CR) 10Ottomata: [V: 032 C: 032] Stop main-eqiad -> main-codfw MirrorMaker during Kafka main upgrade [puppet] - 10https://gerrit.wikimedia.org/r/431588 (https://phabricator.wikimedia.org/T167039) (owner: 10Ottomata)
[15:07:24] <icinga-wm>	 PROBLEM - Apache HTTP on mw2215 is CRITICAL: connect to address 10.192.0.40 and port 80: Connection refused
[15:07:25] <icinga-wm>	 PROBLEM - Disk space on mw2222 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[15:07:25] <icinga-wm>	 PROBLEM - Disk space on mw2223 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[15:07:25] <icinga-wm>	 PROBLEM - dhclient process on mw2222 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[15:07:25] <icinga-wm>	 PROBLEM - dhclient process on mw2223 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[15:07:44] <icinga-wm>	 PROBLEM - MegaRAID on analytics1032 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough
[15:08:14] <icinga-wm>	 RECOVERY - HHVM jobrunner on mw2161 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.075 second response time
[15:08:34] <icinga-wm>	 RECOVERY - Apache HTTP on mw2215 is OK: HTTP OK: HTTP/1.1 200 OK - 10975 bytes in 0.073 second response time
[15:08:54] <icinga-wm>	 RECOVERY - HHVM jobrunner on mw2159 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.085 second response time
[15:08:54] <icinga-wm>	 RECOVERY - HHVM jobrunner on mw2160 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.074 second response time
[15:08:55] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw2222 is CRITICAL: Host mw2222 is not in mediawiki-installation dsh group
[15:08:55] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw2223 is CRITICAL: Host mw2223 is not in mediawiki-installation dsh group
[15:08:55] <icinga-wm>	 PROBLEM - Check size of conntrack table on mw2215 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[15:08:55] <icinga-wm>	 PROBLEM - MD RAID on mw2215 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[15:08:56] <icinga-wm>	 PROBLEM - HHVM processes on mw2223 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[15:08:56] <icinga-wm>	 PROBLEM - HHVM processes on mw2222 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[15:09:23] <wikibugs_>	 (03PS3) 10Rduran: [WIP] Use Cumin to implement the comunication for the transfer [puppet] - 10https://gerrit.wikimedia.org/r/430868
[15:09:25] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw2161 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.165 second response time
[15:09:36] <ottomata>	 stopping mm instance in codfw
[15:09:41] <XioNoX>	 !log stopping pybal on lvs2001 - T193677
[15:09:45] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw2222 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[15:09:45] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw2223 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[15:09:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:09:45] <stashbot>	 T193677: Interface errors on asw-d-codfw:xe-2/0/47 - https://phabricator.wikimedia.org/T193677
[15:10:05] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw2159 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.175 second response time
[15:10:34] <icinga-wm>	 PROBLEM - Check systemd state on mw2215 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[15:10:34] <icinga-wm>	 PROBLEM - nutcracker port on mw2222 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[15:10:34] <icinga-wm>	 PROBLEM - nutcracker port on mw2223 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[15:10:58] <ottomata>	 ok mm stopped in codfw
[15:11:03] <akosiaris>	 cool
[15:11:10] <ottomata>	 stopping puppet etc.
[15:12:01] <ottomata>	 beginning package upgrade rolling restarts...
[15:12:05] <icinga-wm>	 PROBLEM - Check the NTP synchronisation status of timesyncd on mw2215 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[15:12:05] <icinga-wm>	 PROBLEM - nutcracker process on mw2222 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[15:12:05] <icinga-wm>	 PROBLEM - nutcracker process on mw2223 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[15:13:26] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw2215 is CRITICAL: connect to address 10.192.0.40 and port 443: Connection refused
[15:13:27] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw2215 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[15:13:37] <akosiaris>	 I'll schedule downtime for mw2215, mw2222, mw2223
[15:13:47] <akosiaris>	 no reason to have them pollute the channel
[15:13:50] <ottomata>	 k
[15:14:37] <icinga-wm>	 PROBLEM - DPKG on mw2215 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[15:14:47] <icinga-wm>	 PROBLEM - HHVM rendering on mw2223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:14:49] <mutante>	 akosiaris: sorry, i got it
[15:15:15] <ottomata>	 2001 upgraded, moving on
[15:15:18] <mutante>	 the last ones worked without this
[15:16:14] <akosiaris>	 mutante: this ones probably took a bit longer. It's a race condition. Without setting the hiera flag profile::base::notifications_enabled to 0 it's expected to happen every now and then
[15:16:18] <akosiaris>	 these*
[15:16:29] <akosiaris>	 anyway, I've downtimed them in icinga
[15:16:59] <mutante>	 thank you, ok @ hiera
[15:17:54] <ottomata>	 2002 upgraded, moving on
[15:17:56] <wikibugs_>	 (03CR) 10Vgutierrez: "IMHO this could benefit from exposing clustershell file copy features in cumin - http://clustershell.readthedocs.io/en/latest/tools/clush." [puppet] - 10https://gerrit.wikimedia.org/r/430868 (owner: 10Rduran)
[15:19:27] <wikibugs_>	 (03CR) 10Alexandros Kosiaris: [C: 031] Add login and LDAP support [software/debmonitor] - 10https://gerrit.wikimedia.org/r/425417 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans)
[15:20:37] <elukey>	 Pchelolo: o/ - can you check the changeprop codfw consumers? 
[15:20:38] <elukey>	 https://grafana.wikimedia.org/dashboard/db/kafka?refresh=5m&panelId=46&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus%2Fops&var-kafka_cluster=main-codfw&var-cluster=eventbus&var-kafka_broker=All
[15:20:40] <ottomata>	 ok 2003 upgraded, package upgrades complete
[15:21:23] <elukey>	 timing-wise they seem ok; the one going down is due to mm, right?
[15:21:50] <wikibugs_>	 (03CR) 10Imarlier: "> Confirmed that the following all respond the same way from" [puppet] - 10https://gerrit.wikimedia.org/r/431659 (https://phabricator.wikimedia.org/T159354) (owner: 10Imarlier)
[15:21:53] <ottomata>	 elukey:  yeah that makes sense i think
[15:22:03] <ottomata>	 moving on to restart 2, to set broker protocol version
[15:22:16] <wikibugs_>	 (03PS2) 10Ottomata: Kafka main-codfw patch 1: inter_broker_protocol_version: 1.1.0 [puppet] - 10https://gerrit.wikimedia.org/r/430449 (https://phabricator.wikimedia.org/T167039)
[15:22:20] <elukey>	 ack
[15:22:34] <wikibugs_>	 (03CR) 10Ottomata: [V: 032 C: 032] Kafka main-codfw patch 1: inter_broker_protocol_version: 1.1.0 [puppet] - 10https://gerrit.wikimedia.org/r/430449 (https://phabricator.wikimedia.org/T167039) (owner: 10Ottomata)
[15:22:35] <elukey>	 there was an increase in throughput but it was before you started, so all good
[15:22:46] <Pchelolo>	 elukey: all seems fine
[15:23:15] <elukey>	 super thanks for checking
[15:23:16] <Pchelolo>	 elukey: that's when I restarted CP for that bug when it got stuck on the transclusion topic
[15:23:29] <Pchelolo>	 so not related
[15:23:30] <elukey>	 ack!
[15:23:43] <wikibugs_>	 (03PS1) 10BBlack: Block some networks [puppet] - 10https://gerrit.wikimedia.org/r/431769 (https://phabricator.wikimedia.org/T193762)
[15:24:59] <ottomata>	 bouncing 2001
[15:26:36] <godog>	 !log (un)load edac kernel modules on thumbor1004 to test resetting counters - T183177
[15:26:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:26:40] <stashbot>	 T183177: memory errors not showing in icinga - https://phabricator.wikimedia.org/T183177
[15:27:44] <icinga-wm>	 RECOVERY - Memory correctable errors -EDAC- on thumbor1004 is OK: (C)3 ge (W)1 ge 0 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor1004&var-datasource=eqiad%2520prometheus%252Fops
[15:28:07] <godog>	 expected ^
[15:28:12] <ottomata>	 bouncing 2002
[15:29:07] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw2160 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.158 second response time
[15:29:11] <wikibugs_>	 (03CR) 10Jcrespo: "> IMHO this could benefit from exposing clustershell file copy" [puppet] - 10https://gerrit.wikimedia.org/r/430868 (owner: 10Rduran)
[15:30:37] <ottomata>	 bouncing 2003
[15:30:53] <XioNoX>	 !log starting pybal on lvs2001 - T193677
[15:30:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:30:57] <stashbot>	 T193677: Interface errors on asw-d-codfw:xe-2/0/47 - https://phabricator.wikimedia.org/T193677
[15:31:53] <wikibugs_>	 10Operations, 10ops-codfw, 10DC-Ops: rigel.frack.codfw.wmnet (fundraising codfw bastion) will not boot after a power cycle - https://phabricator.wikimedia.org/T193891#4190980 (10Papaul) Rigel was set by default to boot first from NIC  so every time the server  reboots, it stuck and the error bellow so I chan...
[15:32:14] <wikibugs_>	 (03PS3) 10Ottomata: Kafka main-codfw patch 2 [puppet] - 10https://gerrit.wikimedia.org/r/430450 (https://phabricator.wikimedia.org/T167039)
[15:32:18] <wikibugs_>	 (03CR) 10Dzahn: [C: 031] "i can deploy this, a +1 from traffic never hurts though" [puppet] - 10https://gerrit.wikimedia.org/r/431659 (https://phabricator.wikimedia.org/T159354) (owner: 10Imarlier)
[15:32:23] <ottomata>	  restart 2 finished
[15:32:30] <ottomata>	 now time to upgrade client api versions :)
[15:33:06] <wikibugs_>	 (03PS4) 10Ottomata: Kafka main-codfw patch 3 - remove api.version [puppet] - 10https://gerrit.wikimedia.org/r/430640 (https://phabricator.wikimedia.org/T167039)
[15:33:24] <Pchelolo>	 ok ottomata shoot me when it's time for me to deploy consumers
[15:33:30] <wikibugs_>	 (03CR) 10BBlack: [C: 031] performance.wikimedia.org: serve from webperfX001 [puppet] - 10https://gerrit.wikimedia.org/r/431659 (https://phabricator.wikimedia.org/T159354) (owner: 10Imarlier)
[15:33:38] <ottomata>	 Pchelolo:  these can be done at any time, so 
[15:33:38] <ottomata>	 hm
[15:33:41] <wikibugs_>	 (03CR) 10BBlack: [C: 032] Block some networks [puppet] - 10https://gerrit.wikimedia.org/r/431769 (https://phabricator.wikimedia.org/T193762) (owner: 10BBlack)
[15:33:41] <Pchelolo>	 let's not do that simultaneously with you doing your part
[15:33:44] <ottomata>	 yeah
[15:33:44] <ottomata>	 ok
[15:33:50] <ottomata>	 i'll do all mine first one by one and make sure it's ok
[15:33:54] <ottomata>	 then we'll do cp
[15:34:01] <wikibugs_>	 (03CR) 10Ottomata: [C: 032] Kafka main-codfw patch 3 - remove api.version [puppet] - 10https://gerrit.wikimedia.org/r/430640 (https://phabricator.wikimedia.org/T167039) (owner: 10Ottomata)
[15:34:31] <Pchelolo>	 +1
[15:35:15] <wikibugs_>	 (03PS5) 10Ottomata: Kafka main-codfw - remove api.version [puppet] - 10https://gerrit.wikimedia.org/r/430640 (https://phabricator.wikimedia.org/T167039)
[15:35:24] <wikibugs_>	 (03CR) 10Ottomata: [V: 032 C: 032] Kafka main-codfw - remove api.version [puppet] - 10https://gerrit.wikimedia.org/r/430640 (https://phabricator.wikimedia.org/T167039) (owner: 10Ottomata)
[15:39:41] <ottomata>	 eventbus service restarted without api.version
[15:39:49] <akosiaris>	 ok
[15:41:04] <Pchelolo>	 ok ottomata I will start deploying CP
[15:41:22] <Pchelolo>	 ottomata: is it the good time to do that?
[15:41:27] <ottomata>	 ok, i'm about to do statsv varnishkafka instances, but i think you can go ahead with cp
[15:42:20] <ottomata>	 seeing stuff like Broker version identifed as 1.0.0 in eb logs so das good
[15:43:04] <Pchelolo>	 ok, we're still in a meeting so I am a bit distracted so I'll wait for you
[15:43:45] <wikibugs_>	 (03PS4) 10Ottomata: Kafka main-codfw patch 2 [puppet] - 10https://gerrit.wikimedia.org/r/430450 (https://phabricator.wikimedia.org/T167039)
[15:43:56] <wikibugs_>	 10Operations, 10ops-codfw, 10DC-Ops: rigel.frack.codfw.wmnet (fundraising codfw bastion) will not boot after a power cycle - https://phabricator.wikimedia.org/T193891#4191028 (10RobH) Please note my past comment regarding allocation of a spare was discussed in irc between myself and @Jgreen   rigel's ilom is...
[15:45:02] <elukey>	 ottomata: qq - why the interbroker version is set in the common hiera conifg?
[15:45:05] <elukey>	 *config
[15:45:26] <icinga-wm>	 PROBLEM - Host lvs2004 is DOWN: PING CRITICAL - Packet loss = 100%
[15:45:56] <vgutierrez>	 XioNoX: ^^
[15:46:00] <logmsgbot>	 !log demon@tin Pruned MediaWiki: 1.32.0-wmf.1 [keeping static files] (duration: 01m 47s)
[15:46:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:46:11] <bblack>	 vgutierrez: I think he already moved back traffic to 2001
[15:46:12] <wikibugs_>	 10Operations, 10ops-codfw, 10DC-Ops: rigel.frack.codfw.wmnet (fundraising codfw bastion) will not boot after a power cycle - https://phabricator.wikimedia.org/T193891#4191030 (10RobH) p:05Unbreak!>03Normal Lowering to normal, as the server is known bad (ilom malfunction) but out of warranty.   There is a...
[15:46:21] <ottomata>	 elukey:  hang on will answer
[15:46:25] <bblack>	 probably still saw errors, trying something else on the interface
[15:46:25] <ottomata>	 problem with statsv vks
[15:46:30] <XioNoX>	 yep
[15:46:53] <ottomata>	 somehow even though puppet should not have changed the api version, since statsv produces to main-eqiad
[15:46:55] <ottomata>	 it did...
[15:46:59] <ottomata>	 looking
[15:47:01] <ottomata>	 it changed it from 0.9.0.1 to 0.9
[15:47:02] <ottomata>	 which is weird
[15:47:16] <XioNoX>	 but the main interface shouldn't alert, only ens1f1
[15:47:23] <XioNoX>	 papaul: ^
[15:47:27] <wikibugs_>	 10Operations, 10ops-codfw: rdb2002 correctable memory errors - https://phabricator.wikimedia.org/T194171#4191033 (10fgiunchedi)
[15:47:56] <icinga-wm>	 RECOVERY - Host lvs2004 is UP: PING WARNING - Packet loss = 58%, RTA = 36.22 ms
[15:48:23] <wikibugs_>	 (03PS1) 10Ottomata: Force statsv varnishkafka api.version to 0.9.0.1 [puppet] - 10https://gerrit.wikimedia.org/r/431773 (https://phabricator.wikimedia.org/T167039)
[15:48:59] <wikibugs_>	 (03PS2) 10Ottomata: Force statsv varnishkafka api.version to 0.9.0.1 [puppet] - 10https://gerrit.wikimedia.org/r/431773 (https://phabricator.wikimedia.org/T167039)
[15:48:59] <wikibugs_>	 (03CR) 10jerkins-bot: [V: 04-1] Force statsv varnishkafka api.version to 0.9.0.1 [puppet] - 10https://gerrit.wikimedia.org/r/431773 (https://phabricator.wikimedia.org/T167039) (owner: 10Ottomata)
[15:49:36] <wikibugs_>	 (03CR) 10Ottomata: [C: 032] Force statsv varnishkafka api.version to 0.9.0.1 [puppet] - 10https://gerrit.wikimedia.org/r/431773 (https://phabricator.wikimedia.org/T167039) (owner: 10Ottomata)
[15:50:01] <elukey>	 weird indeed
[15:50:13] <wikibugs_>	 (03PS4) 10Dzahn: performance.wikimedia.org: serve from webperfX001 [puppet] - 10https://gerrit.wikimedia.org/r/431659 (https://phabricator.wikimedia.org/T159354) (owner: 10Imarlier)
[15:50:22] <ottomata>	 no time to investigate why, gotta change it back
[15:50:27] <ottomata>	 i think statsv vks are failing to connect
[15:50:50] <ottomata>	 probably gonna have a blip in statsv stuff (if messages are produced from codfw vk instances?) ping marlier
[15:50:50] <wikibugs_>	 (03CR) 10Dzahn: [C: 032] performance.wikimedia.org: serve from webperfX001 [puppet] - 10https://gerrit.wikimedia.org/r/431659 (https://phabricator.wikimedia.org/T159354) (owner: 10Imarlier)
[15:51:13] <marlier>	 ottomata: ack, thanks
[15:51:30] <ottomata>	 mutante:  i just merged your patch
[15:52:03] <elukey>	 ottomata: maybe 0.9 in hiera vs '0.9' in the ? block
[15:52:17] <mutante>	 ottomata: alright, i was sitting at the yes/no prompt ;)
[15:52:33] <ottomata>	 OH, maybe it made it a decimal value you mean?  yeah it probably did
[15:52:34] <ottomata>	 doh
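A minimal sketch of the guess in the exchange above (an unquoted `0.9` being read as a decimal while `'0.9'` stays a string). The log does not confirm this as the cause; `resolve_scalar` is a hypothetical helper that only roughly imitates how a YAML loader types scalars, not the real Hiera or Puppet code path.

```python
def resolve_scalar(text):
    """Rough stand-in for YAML scalar typing (illustration only, hypothetical)."""
    text = text.strip()
    # Quoted scalars always stay strings.
    if len(text) >= 2 and text[0] == text[-1] and text[0] in ("'", '"'):
        return text[1:-1]
    # Unquoted scalars are tried as numbers before falling back to a string.
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text


unquoted = resolve_scalar("0.9")      # parsed as a number
quoted = resolve_scalar("'0.9'")      # kept as a string
full = resolve_scalar("0.9.0.1")      # not a valid number, so kept as a string

print(type(unquoted).__name__, type(quoted).__name__, type(full).__name__)
print(unquoted == quoted)  # a float never equals a string, so a string match would miss
```

Under that assumption, quoting the value in the data file is the usual way to keep a version-like scalar a string.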
[15:53:24] <mutante>	 !log switching performance.wikimedia.org from graphite to webperf backends - running puppet on cache::misc servers (T158837)
[15:53:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:53:29] <stashbot>	 T158837: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837
[15:53:32] <wikibugs_>	 10Operations, 10MediaWiki-Parser, 10MediaWiki-Platform-Team, 10Parsing-Team, and 2 others: Different production servers have different versions of tidy installed, resulting in varying output - https://phabricator.wikimedia.org/T193414#4191075 (10Legoktm) >>! In T193414#4190334, @ssastry wrote: > If all ser...
[15:53:44] <ottomata>	 elukey:  re broker version
[15:53:49] <ottomata>	 i'm overriding it in the site specific hiera
[15:54:11] <ottomata>	 and when setting to new value (the one we want to keep), i remove it from site specific override
[15:54:14] <ottomata>	 and the common one sticks
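The override pattern ottomata describes can be sketched as a first-match lookup over layers, most specific first. This is an illustration under assumptions, not Hiera itself: `lookup`, the dict layout, and the `<previous version>` placeholder are hypothetical; only the key name and the `1.1.0` value come from the patch title in this log.

```python
def lookup(key, layers):
    """Return the value from the first layer that defines the key."""
    for layer in layers:
        if key in layer:
            return layer[key]
    raise KeyError(key)


key = "inter_broker_protocol_version"
common = {key: "1.1.0"}                      # value meant to stay long-term
site_override = {key: "<previous version>"}  # placeholder; actual old value not in the log

# During the upgrade the site-specific layer wins.
before = lookup(key, [site_override, common])

# Removing the override lets the common value take effect.
del site_override[key]
after = lookup(key, [site_override, common])

print(before, "->", after)
```

Keeping the target value in the common layer means the final step is a deletion rather than another value change.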
[15:54:15] <wikibugs_>	 (03CR) 10Ema: [C: 04-1] numa_networking: move setting to tlsproxy::instance (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/430902 (https://phabricator.wikimedia.org/T193865) (owner: 10Ema)
[15:54:23] <elukey>	 ottomata: ack thanks
[15:54:31] <elukey>	 was just triple checking everything :)
[15:54:37] <wikibugs_>	 10Operations, 10DC-Ops, 10Traffic, 10monitoring, and 2 others: memory errors not showing in icinga - https://phabricator.wikimedia.org/T183177#4191082 (10fgiunchedi) The correctable errors check has been deployed and it is yielding some results already. Myself and @herron took at the list of hosts and ther...
[15:58:19] <Pchelolo>	 ok ottomata finally out of the meeting, are you done with your consumers?
[15:58:25] <ottomata>	 yes!
[15:58:26] <ottomata>	 just finished
[15:58:32] <ottomata>	 statsv is fine now
[15:58:35] <ottomata>	 statsv vk
[15:58:40] <ottomata>	 it was all producing to eqiad anyway
[15:58:46] <ottomata>	 shouldn't have even bothered with it today :/
[15:58:56] <ottomata>	 so yes, Pchelolo please proceed with cp/jq
[15:59:42] <Pchelolo>	 ok, cool. going with job queue first, it's not doing anything in codfw
[15:59:45] <ottomata>	 k
[16:00:05] <jouncebot>	 godog, moritzm, and _joe_: That opportune time is upon us again. Time for a Puppet SWAT(Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180508T1600).
[16:00:05] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[16:01:00] <logmsgbot>	 !log ppchelko@tin Started deploy [cpjobqueue/deploy@58935d5]: Allow protocol version negotiation. Codfw only. T167039
[16:01:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:01:04] <stashbot>	 T167039: Upgrade Kafka on main cluster with security features - https://phabricator.wikimedia.org/T167039
[16:01:42] <logmsgbot>	 !log ppchelko@tin Finished deploy [cpjobqueue/deploy@58935d5]: Allow protocol version negotiation. Codfw only. T167039 (duration: 00m 42s)
[16:01:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:02:16] <Pchelolo>	 jobqueue done, give me a minute to look at the logs
[16:02:21] <wikibugs_>	 (03PS1) 10Ema: prometheus: fix aggregate varnish uptime resets expression [puppet] - 10https://gerrit.wikimedia.org/r/431777
[16:02:52] <wikibugs_>	 10Operations, 10ops-codfw: mw2213 correctable memory errors - https://phabricator.wikimedia.org/T194172#4191094 (10fgiunchedi)
[16:03:07] <wikibugs_>	 (03CR) 10Filippo Giunchedi: [C: 031] prometheus: fix aggregate varnish uptime resets expression [puppet] - 10https://gerrit.wikimedia.org/r/431777 (owner: 10Ema)
[16:03:30] <logmsgbot>	 !log ppchelko@tin Started deploy [changeprop/deploy@e468d8e]: Allow protocol version negotiation. Codfw only. T167039
[16:03:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:03:40] <wikibugs_>	 (03CR) 10Ema: [C: 032] prometheus: fix aggregate varnish uptime resets expression [puppet] - 10https://gerrit.wikimedia.org/r/431777 (owner: 10Ema)
[16:03:47] <Pchelolo>	 looks solid, proceeding with change-prop
[16:03:51] <ottomata>	 gr8
[16:04:16] <wikibugs_>	 10Operations, 10Traffic, 10netops: cr1-eqsin 4 onboard interfaces down - https://phabricator.wikimedia.org/T193897#4191107 (10ayounsi) 05Open>03Resolved
[16:04:32] <logmsgbot>	 !log ppchelko@tin Finished deploy [changeprop/deploy@e468d8e]: Allow protocol version negotiation. Codfw only. T167039 (duration: 01m 03s)
[16:04:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:04:55] <godog>	 heya -- there's a new version of scap to be deployed. I wanted to do that today without clashing with the kafka upgrade; how much time do you think is left for the upgrade?
[16:05:11] <elukey>	 20mins more or less
[16:05:20] <wikibugs_>	 10Operations, 10DBA, 10Traffic: Framework to transfer files over the LAN - https://phabricator.wikimedia.org/T156462#4191109 (10jcrespo)
[16:05:26] <ottomata>	 if all goes well and we don't need any rollbacks knock on wood
[16:05:27] <ottomata>	 so far so good
[16:05:40] <godog>	 kk, thanks ottomata elukey !
[16:05:45] <godog>	 thcipriani: ^
[16:05:58] <ottomata>	 i'd reserve another hour godog if that's ok with you. i have another rolling restart of the cluster to do (which only takes a min, but would be nice to watch it a while)
[16:06:00] <thcipriani>	 :)
[16:06:09] <godog>	 ottomata: yup, no problem
[16:06:30] <Pchelolo>	 ottomata: elukey ok, both JQ and CP are doooone, and it looks good - events are being consumed, no issues in logs
[16:07:00] <ottomata>	 greaaat
[16:07:00] <wikibugs_>	 10Operations, 10DBA, 10Traffic: Framework to transfer files over the LAN - https://phabricator.wikimedia.org/T156462#4191122 (10jcrespo) @Vgutierrez suggested using https://github.com/vstakhov/hpenc , which I don't think is a bad idea at all- it would just change some of the executions of openssl and netcat...
[16:07:03] <ottomata>	 yeah all looks good here too
[16:07:06] <elukey>	 \o/
[16:07:11] <elukey>	 didn't spot anything weird
[16:07:17] <akosiaris>	 \o/
[16:08:10] <icinga-wm>	 RECOVERY - Memory correctable errors -EDAC- on db1051 is OK: (C)3 ge (W)1 ge 0 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=db1051&var-datasource=eqiad%2520prometheus%252Fops
[16:08:17] <ottomata>	 ok
[16:08:25] <ottomata>	 proceeding to log message format version step, restart 3.
[16:08:31] <elukey>	 ack
[16:08:48] <wikibugs_>	 (03CR) 10Ottomata: [C: 032] Kafka main-codfw patch 2 [puppet] - 10https://gerrit.wikimedia.org/r/430450 (https://phabricator.wikimedia.org/T167039) (owner: 10Ottomata)
[16:08:50] <wikibugs_>	 (03PS5) 10Ottomata: Kafka main-codfw patch 2 [puppet] - 10https://gerrit.wikimedia.org/r/430450 (https://phabricator.wikimedia.org/T167039)
[16:08:52] <wikibugs_>	 (03CR) 10Ottomata: [V: 032 C: 032] Kafka main-codfw patch 2 [puppet] - 10https://gerrit.wikimedia.org/r/430450 (https://phabricator.wikimedia.org/T167039) (owner: 10Ottomata)
[16:09:41] <XioNoX>	 !log failing traffic over lvs2004 - T193677
[16:09:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:09:45] <stashbot>	 T193677: Interface errors on asw-d-codfw:xe-2/0/47 - https://phabricator.wikimedia.org/T193677
[16:10:34] <ottomata>	 bouncing 2001
[16:10:37] <logmsgbot>	 !log dzahn@neodymium conftool action : set/pooled=yes; selector: name=mw2215.codfw.wmnet
[16:10:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:12:56] <wikibugs_>	 10Operations, 10ops-codfw: wtp2013 memory correctable errors - https://phabricator.wikimedia.org/T194174#4191135 (10fgiunchedi)
[16:14:06] <logmsgbot>	 !log dzahn@neodymium conftool action : set/pooled=yes; selector: name=mw2222.codfw.wmnet
[16:14:08] <ottomata>	 bouncing 2002
[16:14:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:14:22] <ottomata>	 hm before i do
[16:14:24] <ottomata>	 cp	 is cp ok
[16:14:25] <ottomata>	 ?
[16:14:39] <ottomata>	 Member 495153-1a91f0c4-35d4-45ca-8162-247bcfac0088 in group change-prop-on_transclusion_update has failed, removing it from the group etc.
[16:14:43] <ottomata>	 in kafka server logs
[16:14:47] <ottomata>	 could be normal operation, not sure
[16:15:00] <ottomata>	 it is probably from my leader rebalance
[16:15:13] <logmsgbot>	 !log dzahn@neodymium conftool action : set/pooled=yes; selector: name=mw2223.codfw.wmnet
[16:15:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:15:19] <ottomata>	 Pchelolo: ^^^
[16:15:30] <Pchelolo>	 ottomata: silence from me means it's fine
[16:15:33] <ottomata>	 haha ok
[16:15:36] <ottomata>	 ok bouncing 2002
[16:15:54] <elukey>	 burrow is not screaming either
[16:16:49] <wikibugs_>	 10Operations, 10ops-codfw: mw2213 correctable memory errors - https://phabricator.wikimedia.org/T194172#4191150 (10RobH) Unfortunately, this system is out of warranty as of 2018-01-16.  In looking at the service event log, it appears this server has had problems for awhile:   ``` /admin1-> racadm getsel Record...
[16:18:30] <icinga-wm>	 RECOVERY - Memory correctable errors -EDAC- on elastic1029 is OK: (C)3 ge (W)1 ge 0 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=elastic1029&var-datasource=eqiad%2520prometheus%252Fops
[16:19:06] <urandom>	 !log force (split) compaction of wikipedia_T_mobile__ng_lead.data, restbase1016 - T192689
[16:19:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:19:10] <stashbot>	 T192689: Unchecked storage growth - https://phabricator.wikimedia.org/T192689
[16:19:19] <ottomata>	 bounced 2003
[16:19:22] <ottomata>	 allrighty....
[16:19:35] <elukey>	 done done done??
[16:19:36] <ottomata>	 akosiaris:  Pchelolo elukey  codfw upgraded. looking good so far!
[16:19:40] <elukey>	 woooooww
[16:19:42] <akosiaris>	 nice!
[16:19:43] <ottomata>	 mm is down in codfw
[16:19:49] <elukey>	 outstanding ottomata, great work!
[16:19:50] <ottomata>	 and will be til after we upgrade eqiad tomorrow
[16:20:00] <icinga-wm>	 RECOVERY - MegaRAID on db1073 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy
[16:20:02] <wikibugs_>	 (03PS1) 10Imarlier: performance website: allow traffic [puppet] - 10https://gerrit.wikimedia.org/r/431779 (https://phabricator.wikimedia.org/T159354)
[16:20:30] <wikibugs_>	 (03CR) 10jerkins-bot: [V: 04-1] performance website: allow traffic [puppet] - 10https://gerrit.wikimedia.org/r/431779 (https://phabricator.wikimedia.org/T159354) (owner: 10Imarlier)
[16:21:05] <ottomata>	 elukey:  FYI, i think you know this, but auto.leader.rebalance is now enabled
[16:21:09] <Pchelolo>	 that's what I call a well planned operation :)
[16:21:10] <Pchelolo>	 thank you gentlemen.
[16:21:16] <ottomata>	 it works wayyy better in these later versions
[16:21:24] <elukey>	 yep yep I saw it
[16:21:26] <ottomata>	 so you no longer need to do that step after rebooting brokers
[16:21:27] <ottomata>	 :)
[16:21:27] <elukey>	 like we have in jumbo
[16:21:30] <ottomata>	 yuppers
[16:21:30] <ottomata>	 great
[16:21:39] <akosiaris>	 :-)
[16:21:53] <akosiaris>	 same bat channel same bat time tomorrow ? 
[16:22:06] <ottomata>	 1h earlier
[16:22:07] <herron>	 !log cleared low count edac counters on hosts mw2205 dbstore1002 db1051 elastic1029 T183177
[16:22:07] <ottomata>	 14 utc
[16:22:08] <ottomata>	 for eqiad
[16:22:10] <akosiaris>	 cool
[16:22:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:22:11] <stashbot>	 T183177: memory errors not showing in icinga - https://phabricator.wikimedia.org/T183177
[16:22:11] <wikibugs_>	 (03PS1) 10Dzahn: disable icinga notifications on mw22* hosts [puppet] - 10https://gerrit.wikimedia.org/r/431780
[16:22:17] <ottomata>	 ya Pchelolo 14utc still ok tomorrow for you?
[16:22:32] <wikibugs_>	 (03CR) 10jerkins-bot: [V: 04-1] disable icinga notifications on mw22* hosts [puppet] - 10https://gerrit.wikimedia.org/r/431780 (owner: 10Dzahn)
[16:22:34] <wikibugs_>	 (03PS2) 10Imarlier: performance website: allow traffic [puppet] - 10https://gerrit.wikimedia.org/r/431779 (https://phabricator.wikimedia.org/T159354)
[16:22:49] <Pchelolo>	 ottomata: ye, no problem. Just 1 more 6 am wake-up
[16:22:59] <wikibugs_>	 (03PS2) 10Dzahn: disable icinga notifications on mw22* hosts [puppet] - 10https://gerrit.wikimedia.org/r/431780
[16:23:47] <akosiaris>	 ouch
[16:24:20] <ottomata>	 Pchelolo:  we're just taking advantage of you while you have a little bit of jet lag left
[16:24:22] <ottomata>	 i hope
[16:24:36] <ottomata>	 (thank you :) 
[16:24:37] <ottomata>	 )
[16:25:05] <Pchelolo>	 haha that's exactly why I agreed to do that, didn't even need an alarm today
[16:25:37] <wikibugs_>	 (03CR) 10Dzahn: [C: 032] disable icinga notifications on mw22* hosts [puppet] - 10https://gerrit.wikimedia.org/r/431780 (owner: 10Dzahn)
[16:26:41] <wikibugs_>	 10Operations, 10ops-codfw: wtp2020 correctable memory errors - https://phabricator.wikimedia.org/T194176#4191187 (10fgiunchedi)
[16:27:56] <mutante>	 !log mw2251,mw2252,mw2201 - reinstall with stretch
[16:27:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:30:17] <wikibugs_>	 (03PS3) 10Imarlier: performance website: allow traffic [puppet] - 10https://gerrit.wikimedia.org/r/431779 (https://phabricator.wikimedia.org/T159354)
[16:31:04] <wikibugs_>	 (03PS1) 10Alexandros Kosiaris: WIP: Provision RSA keys for ganeti root auth [puppet] - 10https://gerrit.wikimedia.org/r/431782
[16:36:07] <wikibugs_>	 (03CR) 10Imarlier: "bblack and dzahn -- I haven't played with our firewall config things at all, so let me know if I'm missing something." [puppet] - 10https://gerrit.wikimedia.org/r/431779 (https://phabricator.wikimedia.org/T159354) (owner: 10Imarlier)
[16:36:13] <wikibugs_>	 10Operations, 10ops-codfw: Degraded RAID on wasat - https://phabricator.wikimedia.org/T193394#4191265 (10Papaul) a:05Papaul>03RobH @RobH disk replacement complete
[16:36:53] <mutante>	 !log mwmaint1001 - reinstalling one more time after proxysql issues are resolved, PXE booting (T192092)
[16:36:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:36:57] <stashbot>	 T192092: setup replacement for terbium (maintenance_server) on stretch - https://phabricator.wikimedia.org/T192092
[16:37:00] <icinga-wm>	 PROBLEM - Host mwmaint1001 is DOWN: PING CRITICAL - Packet loss = 100%
[16:37:05] <godog>	 to confirm, I can go ahead with scap upgrade elukey ottomata|lunch  ?
[16:37:30] <icinga-wm>	 RECOVERY - Host mwmaint1001 is UP: PING WARNING - Packet loss = 93%, RTA = 0.20 ms
[16:37:42] <elukey>	 godog: yeah everything seems fine
[16:37:54] <godog>	 ack, thanks! cc thcipriani 
[16:38:11] <thcipriani>	 I'm around to test
[16:39:12] <mutante>	 jouncebot: next
[16:39:12] <jouncebot>	 In 0 hour(s) and 20 minute(s): Services – Graphoid / Parsoid / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180508T1700)
[16:39:23] <wikibugs_>	 (03PS1) 10Dzahn: Revert "add mwmaint1001 to scap hosts" [puppet] - 10https://gerrit.wikimedia.org/r/431785
[16:40:05] <wikibugs_>	 (03CR) 10Dzahn: [C: 032] "remove from scap hosts during reinstall to avoid warnings for deployers during scap run" [puppet] - 10https://gerrit.wikimedia.org/r/431785 (owner: 10Dzahn)
[16:40:28] <XioNoX>	 !log re-pooling lvs2001 - T193677
[16:40:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:40:32] <stashbot>	 T193677: Interface errors on asw-d-codfw:xe-2/0/47 - https://phabricator.wikimedia.org/T193677
[16:40:39] <wikibugs_>	 (03PS2) 10Dzahn: Revert "add mwmaint1001 to scap hosts" [puppet] - 10https://gerrit.wikimedia.org/r/431785
[16:40:39] <wikibugs_>	 (03PS2) 10Filippo Giunchedi: Scap: bump version to 3.8.1-1 [puppet] - 10https://gerrit.wikimedia.org/r/430820 (https://phabricator.wikimedia.org/T127762) (owner: 10Thcipriani)
[16:41:10] <wikibugs_>	 10Operations, 10ops-codfw, 10Traffic, 10netops: Interface errors on asw-d-codfw:xe-2/0/47 - https://phabricator.wikimedia.org/T193677#4191293 (10ayounsi) 05Open>03Resolved No more errors.
[16:41:22] <wikibugs_>	 (03CR) 10Filippo Giunchedi: [C: 032] Scap: bump version to 3.8.1-1 [puppet] - 10https://gerrit.wikimedia.org/r/430820 (https://phabricator.wikimedia.org/T127762) (owner: 10Thcipriani)
[16:41:27] <wikibugs_>	 (03PS3) 10Filippo Giunchedi: Scap: bump version to 3.8.1-1 [puppet] - 10https://gerrit.wikimedia.org/r/430820 (https://phabricator.wikimedia.org/T127762) (owner: 10Thcipriani)
[16:41:44] <godog>	 mutante: gah, rebase clashes :(
[16:42:02] <wikibugs_>	 (03CR) 10Filippo Giunchedi: [V: 032 C: 032] Scap: bump version to 3.8.1-1 [puppet] - 10https://gerrit.wikimedia.org/r/430820 (https://phabricator.wikimedia.org/T127762) (owner: 10Thcipriani)
[16:42:28] <mutante>	 godog: :/  just trying to prevent that scap'pers get warnings while i reinstall 
[16:42:52] <wikibugs_>	 10Operations, 10Performance-Team, 10monitoring, 10Patch-For-Review: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837#4191298 (10Imarlier)
[16:42:59] <godog>	 mutante: indeed, sounds like a good idea
[16:43:23] <godog>	 thcipriani: upgraded on tin
[16:43:35] <godog>	 !log upload scap 3.8.1-1 - T127762
[16:43:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:43:37] <stashbot>	 T127762: Update Debian Package for Scap3 - https://phabricator.wikimedia.org/T127762
[16:43:39] <thcipriani>	 godog: cool, testing
[16:43:44] <wikibugs_>	 (03PS1) 10Andrew Bogott: Horizon: added some keystone policy checks [puppet] - 10https://gerrit.wikimedia.org/r/431787
[16:45:18] <logmsgbot>	 !log thcipriani@tin Synchronized README: Testing Scap 3.8.1-1 (duration: 01m 02s)
[16:45:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:45:39] <wikibugs_>	 (03PS3) 10Andrew Bogott: Horizon: add a few config settings for the upcoming wikimediamemberdashboard [puppet] - 10https://gerrit.wikimedia.org/r/431658
[16:45:41] <wikibugs_>	 (03PS2) 10Andrew Bogott: Horizon: added some keystone policy checks [puppet] - 10https://gerrit.wikimedia.org/r/431787
[16:47:09] <thcipriani>	 godog: > Executing check 'Check endpoints for mwdebug1001.eqiad.wmnet' so new checks are running! sync looks like it went fine, thank you for the update!
[16:47:31] <wikibugs_>	 (03PS4) 10Dzahn: performance website: allow traffic [puppet] - 10https://gerrit.wikimedia.org/r/431779 (https://phabricator.wikimedia.org/T159354) (owner: 10Imarlier)
[16:47:32] <godog>	 thcipriani: np! will be rolling out fully at the next puppet run
[16:48:09] <wikibugs_>	 10Operations, 10ops-codfw, 10DC-Ops: rigel.frack.codfw.wmnet (fundraising codfw bastion) will not boot after a power cycle - https://phabricator.wikimedia.org/T193891#4191314 (10Jgreen) 05Open>03Resolved Papaul and I spent some more time on this, and found that "BIOS Serial Console" was set to auto, not...
[16:48:22] <thcipriani>	 godog: awesome, most of the changes were changes that affect deployment-tin apart from changes to git-lfs-backed repos for scap3 so that deploy tested a good chunk of stuff.
[16:48:55] <wikibugs_>	 (03CR) 10Andrew Bogott: [C: 032] Horizon: add a few config settings for the upcoming wikimediamemberdashboard [puppet] - 10https://gerrit.wikimedia.org/r/431658 (owner: 10Andrew Bogott)
[16:49:12] <wikibugs_>	 (03CR) 10Dzahn: [C: 032] performance website: allow traffic [puppet] - 10https://gerrit.wikimedia.org/r/431779 (https://phabricator.wikimedia.org/T159354) (owner: 10Imarlier)
[16:49:29] <wikibugs_>	 (03PS5) 10Dzahn: performance website: allow traffic [puppet] - 10https://gerrit.wikimedia.org/r/431779 (https://phabricator.wikimedia.org/T159354) (owner: 10Imarlier)
[16:50:04] <wikibugs_>	 (03CR) 10Muehlenhoff: [C: 031] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/431779 (https://phabricator.wikimedia.org/T159354) (owner: 10Imarlier)
[16:50:12] <godog>	 thcipriani: excellent!
[16:54:37] <icinga-wm>	 PROBLEM - puppet last run on labtestweb2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 57 seconds ago with 1 failures. Failed resources (up to 3 shown): Package[scap]
[16:55:38] <godog>	 uh oh, I'll take a look
[16:57:05] <godog>	 should be recovering, puppet agent ran fine
[16:59:35] <icinga-wm>	 RECOVERY - puppet last run on labtestweb2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[17:00:04] <jouncebot>	 cscott, arlolra, subbu, halfak, and Amir1: #bothumor My software never has bugs. It just develops random features. Rise for Services – Graphoid / Parsoid / Citoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180508T1700).
[17:00:28] <bawolff>	 !log Clear botpassword throttle for [[User:TaxonBot]] (T194160) 
[17:00:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:00:32] <stashbot>	 T194160: Unlock the login of bot user TaxonBot@TaxonBot to dewiki - https://phabricator.wikimedia.org/T194160
[17:01:46] <wikibugs_>	 (03PS1) 10Imarlier: performance website: remove from graphite hosts [puppet] - 10https://gerrit.wikimedia.org/r/431792 (https://phabricator.wikimedia.org/T159354)
[17:04:41] <wikibugs_>	 10Operations, 10ops-eqiad, 10Discovery, 10Discovery-Search, and 4 others: rack/setup/install wdqs10[09|10].eqiad.wmnet - https://phabricator.wikimedia.org/T194184#4191367 (10RobH) p:05Triage>03Normal
[17:07:25] <wikibugs_>	 10Operations, 10ops-eqiad, 10Discovery, 10Discovery-Search, and 4 others: rack/setup/install wdqs10[09|10].eqiad.wmnet - https://phabricator.wikimedia.org/T194184#4191395 (10RobH)
[17:11:24] <icinga-wm>	 PROBLEM - Check systemd state on kafka2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[17:12:04] <icinga-wm>	 PROBLEM - Check systemd state on kafka2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[17:12:05] <icinga-wm>	 PROBLEM - Check systemd state on kafka2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[17:14:23] <elukey>	 checking --^
[17:14:37] <wikibugs_>	 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team: rack/setup/install cloudelastic100[1-4].eqiad.wmnet systems - https://phabricator.wikimedia.org/T194186#4191413 (10RobH) p:05Triage>03Normal
[17:16:32] <wikibugs_>	 (03PS3) 10Andrew Bogott: Horizon: added some keystone policy checks [puppet] - 10https://gerrit.wikimedia.org/r/431787
[17:16:36] <elukey>	 those are mirror maker instances
[17:17:12] <elukey>	 a systemctl reset-failed is enough, will wait for ottomata
[17:18:11] <ottomata>	 ah
[17:18:17] <ottomata>	 makes sense, puppet removed the mm instance systemd units, ya?
[17:18:20] <ottomata>	 need reset-failed elukey?
[17:19:37] <elukey>	 yeah exactly
[17:20:38] <wikibugs_>	 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team: rack/setup/install cloudelastic100[1-4].eqiad.wmnet systems - https://phabricator.wikimedia.org/T194186#4191439 (10RobH) @bd808 or @chasemp:  Before @Cmjohnson racks these, I'd like to confirm the networking requirements.  These have 10Gbit net...
[17:20:44] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw2161 is OK: OK
[17:25:44] <icinga-wm>	 RECOVERY - Check systemd state on kafka2002 is OK: OK - running: The system is fully operational
[17:25:44] <icinga-wm>	 RECOVERY - Check systemd state on kafka2001 is OK: OK - running: The system is fully operational
[17:26:15] <icinga-wm>	 RECOVERY - Check systemd state on kafka2003 is OK: OK - running: The system is fully operational
[17:26:35] <wikibugs_>	 10Operations, 10DBA, 10Traffic: Framework to transfer files over the LAN - https://phabricator.wikimedia.org/T156462#4191453 (10jcrespo) The recommended cipher, which is an easier change, is chacha20 or, alternatively, AES-GCM rather than the randomly selected one on the commit.
[17:27:05] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw2160 is OK: OK
[17:27:37] <icinga-wm>	 ACKNOWLEDGEMENT - HP RAID on db2067 is CRITICAL: CRITICAL: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:11, 1I:1:12 - Failed: 1I:1:10 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T194187
[17:28:02] <wikibugs_>	 10Operations, 10ops-codfw: Degraded RAID on db2067 - https://phabricator.wikimedia.org/T194187#4191458 (10ops-monitoring-bot)
[17:29:30] <wikibugs_>	 (03CR) 10Chad: [V: 032 C: 032] Update non-core plugins to their respective stable-2.14 tips [software/gerrit/gerrit] (wmf/stable-2.14) - 10https://gerrit.wikimedia.org/r/431675 (owner: 10Chad)
[17:29:44] <wikibugs_>	 (03PS2) 10Framawiki: Create the 'eventcoordinator' user group on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430418 (https://phabricator.wikimedia.org/T193075)
[17:32:11] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw2159 is OK: OK
[17:37:34] <wikibugs_>	 (03PS1) 10Ottomata: Stop main-codfw -> main-eqiad MirrorMaker during Kafka main upgrade [puppet] - 10https://gerrit.wikimedia.org/r/431799 (https://phabricator.wikimedia.org/T167039)
[17:37:36] <wikibugs_>	 (03PS1) 10Ottomata: Kafka main-eqiad inter_broker_protocol_version: 1.1.0 [puppet] - 10https://gerrit.wikimedia.org/r/431800 (https://phabricator.wikimedia.org/T167039)
[17:37:38] <wikibugs_>	 (03PS1) 10Ottomata: Kafka main-eqiad - remove api.version [puppet] - 10https://gerrit.wikimedia.org/r/431801 (https://phabricator.wikimedia.org/T167039)
[17:37:40] <wikibugs_>	 (03PS1) 10Ottomata: Kafka main-eqiad - log.message.format.version [puppet] - 10https://gerrit.wikimedia.org/r/431802 (https://phabricator.wikimedia.org/T167039)
[18:00:04] <jouncebot>	 Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180508T1800)
[18:03:29] <wikibugs_>	 (03PS1) 10Dzahn: Revert "disable icinga notifications on mw22* hosts" [puppet] - 10https://gerrit.wikimedia.org/r/431807
[18:03:36] <wikibugs_>	 (03PS1) 10Dzahn: Revert "Revert "add mwmaint1001 to scap hosts"" [puppet] - 10https://gerrit.wikimedia.org/r/431808
[18:03:47] <logmsgbot>	 !log andrew@tin Started deploy [horizon/deploy@9245ca9]: rolling out member dashboard
[18:03:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:06:35] <logmsgbot>	 !log dzahn@neodymium conftool action : set/pooled=yes; selector: name=mw2251.codfw.wmnet
[18:06:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:07:05] <logmsgbot>	 !log andrew@tin Finished deploy [horizon/deploy@9245ca9]: rolling out member dashboard (duration: 03m 18s)
[18:07:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:08:24] <logmsgbot>	 !log dzahn@neodymium conftool action : set/pooled=yes; selector: name=mw2252.codfw.wmnet
[18:08:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:11:36] <wikibugs_>	 (03Abandoned) 10Krinkle: mtail: Update a /w/load.php test case from a current varnishncsa sample [puppet] - 10https://gerrit.wikimedia.org/r/431608 (https://phabricator.wikimedia.org/T184942) (owner: 10Krinkle)
[18:15:08] <wikibugs_>	 (03PS1) 10Dzahn: mw-maintenance/wikidata: set $ensure for rebuildTermSqlIndex.log [puppet] - 10https://gerrit.wikimedia.org/r/431810 (https://phabricator.wikimedia.org/T192092)
[18:15:36] <wikibugs_>	 (03CR) 10jerkins-bot: [V: 04-1] mw-maintenance/wikidata: set $ensure for rebuildTermSqlIndex.log [puppet] - 10https://gerrit.wikimedia.org/r/431810 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn)
[18:16:27] <wikibugs_>	 (03PS2) 10Dzahn: mw-maintenance/wikidata: set $ensure for rebuildTermSqlIndex.log [puppet] - 10https://gerrit.wikimedia.org/r/431810 (https://phabricator.wikimedia.org/T192092)
[18:17:04] <wikibugs_>	 (03CR) 10Dzahn: [C: 032] mw-maintenance/wikidata: set $ensure for rebuildTermSqlIndex.log [puppet] - 10https://gerrit.wikimedia.org/r/431810 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn)
[18:18:45] <twentyafterfour>	 !log Branching MediaWiki master to wmf/1.32.0-wmf.3 refs T191049
[18:18:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:18:49] <stashbot>	 T191049: 1.32.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T191049
[18:19:02] <wikibugs_>	 (03PS4) 10Andrew Bogott: Horizon: added some keystone policy checks [puppet] - 10https://gerrit.wikimedia.org/r/431787 (https://phabricator.wikimedia.org/T194191)
[18:22:39] <mutante>	 !log mwmaint1001 - rebooting
[18:22:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:23:50] <wikibugs_>	 10Operations, 10ops-codfw: Degraded RAID on db2067 - https://phabricator.wikimedia.org/T194187#4191616 (10Marostegui)
[18:23:55] <wikibugs_>	 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2067 - https://phabricator.wikimedia.org/T194103#4191618 (10Marostegui)
[18:24:21] <wikibugs_>	 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2067 - https://phabricator.wikimedia.org/T194103#4188843 (10Marostegui) The disk has failed to rebuild, can we try another one?: ```       physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, Failed)  ```  Thanks!
[18:27:19] <wikibugs_>	 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1073 - https://phabricator.wikimedia.org/T194155#4191631 (10Marostegui) 05Open>03Resolved Disk #9 finished rebuilding:  ``` root@db1073:~# megacli -PDRbld -ShowProg -PhysDrv [32:9] -aALL  Device(Encl-32 Slot-9) is not in rebuild process  Exit Code: 0x0...
[18:29:32] <wikibugs_>	 10Operations, 10Discovery, 10Maps, 10Maps-Sprint, and 2 others: Decide on Cache-Control headers for map tiles - https://phabricator.wikimedia.org/T186732#4191642 (10jmatazzoni)
[18:30:47] <mutante>	 say hi if you're using terbium to run manual maintenance commands
[18:31:03] <mutante>	 i'll want to move stuff to mwmaint1001 instead and test
[18:31:38] <mutante>	 also let's see if we can maybe puppetize it if there are regular but manual commands left
[18:32:02] <wikibugs_>	 (03PS1) 10Ottomata: Remove version requirements for kafkacat and librdkafka from kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/431815
[18:32:14] <jynus>	 mutante: for the 2 I know about, I have told the users to puppetize them
[18:32:30] <jynus>	 but they may need some logs to control task status
[18:33:07] <mutante>	 jynus: aha, thank you
[18:33:17] <mutante>	 i should write a list mail before i switch the host over
[18:33:25] <jynus>	 you can see them referring to local logs
[18:33:26] <mutante>	 but first mentioning here
[18:33:27] <jynus>	 on puppet
[18:33:50] <mutante>	 it literally just reinstalled and the puppet class is fixed 
[18:33:55] <wikibugs_>	 (03PS2) 10Ottomata: Remove version requirements for kafkacat and librdkafka from kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/431815 (https://phabricator.wikimedia.org/T182163)
[18:33:58] <mutante>	 works without errors on stretch now
[18:34:14] <mutante>	 ok jynus, will check, *nod*
[18:34:29] <wikibugs_>	 (03CR) 10Ottomata: [C: 032] Remove version requirements for kafkacat and librdkafka from kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/431815 (https://phabricator.wikimedia.org/T182163) (owner: 10Ottomata)
[18:35:20] <mutante>	 jynus: and thanks for the merges re: sqlproxy etc, it's all green on mwmaint1001 now :)
[18:35:49] <mutante>	 for the nutcracker part i just had to make sure it gets rebooted once 
[18:36:45] <wikibugs_>	 (03PS2) 10Dzahn: Revert "Revert "add mwmaint1001 to scap hosts"" [puppet] - 10https://gerrit.wikimedia.org/r/431808
[18:41:54] <wikibugs_>	 (03CR) 10Dzahn: [C: 032] Revert "Revert "add mwmaint1001 to scap hosts"" [puppet] - 10https://gerrit.wikimedia.org/r/431808 (owner: 10Dzahn)
[18:43:30] <jynus>	 mutante: see things such as https://gerrit.wikimedia.org/r/#/c/427202/5/modules/mediawiki/manifests/maintenance/wikidata.pp
[18:44:15] <jynus>	 while fully idempotent, it checks /var/log/wikidata/* logs to check progress
[18:45:14] <jynus>	 migrating that host is not trivial anyway, you may want to coordinate a lot
[18:45:24] <jynus>	 joe had lots of issues last time
[18:46:54] <mutante>	 *nod*, i see
[18:47:01] <mutante>	 i just fixed something else related to that /var/log/wikidata dir
[18:47:13] <mutante>	 which was an issue on the inactive host where it's not running
[18:47:31] <mutante>	 indeed i will want to coordinate with hoo on the wikidata part
[18:48:10] <jynus>	 actually it is Amir3-2 who deployed most of those
[18:48:43] <jynus>	 with my help/bugging him to puppetize them
[18:50:48] <wikibugs_>	 10Operations, 10Performance-Team, 10Patch-For-Review: Move coal from graphite#001 nodes to webperf#001 - https://phabricator.wikimedia.org/T159354#4191743 (10Imarlier)
[18:50:52] <mutante>	 oh, good to know, thanks
[18:51:32] <wikibugs_>	 (03PS1) 10Chad: 2.14.8-22-g07c8aa9910 [software/gerrit] (stable-2.14) - 10https://gerrit.wikimedia.org/r/431818
[18:52:32] <marostegui>	 !log Manually fail disk #7 on db1073 to get it replaced
[18:52:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:54:58] <wikibugs_>	 (03CR) 10Ottomata: [C: 032] icinga-downtime - fail if given FQDN [puppet] - 10https://gerrit.wikimedia.org/r/430079 (owner: 10Ottomata)
[18:55:00] <wikibugs_>	 (03PS4) 10Ottomata: icinga-downtime - fail if given FQDN [puppet] - 10https://gerrit.wikimedia.org/r/430079
[18:55:02] <wikibugs_>	 (03CR) 10Ottomata: [V: 032 C: 032] icinga-downtime - fail if given FQDN [puppet] - 10https://gerrit.wikimedia.org/r/430079 (owner: 10Ottomata)
[18:55:16] <wikibugs_>	 (03CR) 10Paladox: [C: 031] 2.14.8-22-g07c8aa9910 [software/gerrit] (stable-2.14) - 10https://gerrit.wikimedia.org/r/431818 (owner: 10Chad)
[18:55:32] <mutante>	 !log mw2202, mw2203, mw2204 - reinstall with stretch
[18:55:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:58:46] <icinga-wm>	 PROBLEM - MegaRAID on db1073 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded)
[18:58:47] <icinga-wm>	 ACKNOWLEDGEMENT - MegaRAID on db1073 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T194197
[18:58:53] <wikibugs_>	 10Operations, 10ops-eqiad: Degraded RAID on db1073 - https://phabricator.wikimedia.org/T194197#4191790 (10ops-monitoring-bot)
[19:00:04] <jouncebot>	 twentyafterfour: That opportune time is upon us again. Time for a MediaWiki train deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180508T1900).
[19:00:39] <wikibugs_>	 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1073 - https://phabricator.wikimedia.org/T194197#4191793 (10Marostegui) p:05Triage>03Normal This disk was manually failed to get it replaced and clear the SMART alert. It has already been swapped by Chris, and it is rebuilding:   ``` root@db1073:~# meg...
[19:08:06] <icinga-wm>	 RECOVERY - Device not healthy -SMART- on db1073 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1073&var-datasource=eqiad%2520prometheus%252Fops
[19:11:48] <twentyafterfour>	 !log updated mediawiki changelog https://www.mediawiki.org/wiki/MediaWiki_1.32/wmf.3/Changelog refs T191049
[19:11:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:11:53] <stashbot>	 T191049: 1.32.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T191049
[19:13:36] <wikibugs_>	 (03PS1) 1020after4: testwikis wikis to 1.32.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431821
[19:13:38] <wikibugs_>	 (03CR) 1020after4: [C: 032] testwikis wikis to 1.32.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431821 (owner: 1020after4)
[19:14:25] <twentyafterfour>	 !log testwikis to 1.32.0-wmf.3 - https://gerrit.wikimedia.org/r/#/c/431821/ refs T191049
[19:14:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:15:02] <wikibugs_>	 (03Merged) 10jenkins-bot: testwikis wikis to 1.32.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431821 (owner: 1020after4)
[19:15:27] <logmsgbot>	 !log twentyafterfour@tin Started scap: testwikis wikis to 1.32.0-wmf.3
[19:15:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:20:06] <wikibugs_>	 (03CR) 10jenkins-bot: testwikis wikis to 1.32.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431821 (owner: 1020after4)
[19:30:49] <wikibugs_>	 (03PS1) 10Bstorm: WIP: wiki replicas - prepare for refactored actor storage [puppet] - 10https://gerrit.wikimedia.org/r/431823 (https://phabricator.wikimedia.org/T188299)
[19:37:26] <wikibugs_>	 (03PS1) 10Bstorm: wiki replicas: remove the SQL reference file for indexes since it is obsolete [puppet] - 10https://gerrit.wikimedia.org/r/431825
[19:50:13] <wikibugs_>	 10Operations, 10MediaWiki-Parser, 10MediaWiki-Platform-Team, 10Parsing-Team, and 2 others: Different production servers have different versions of tidy installed, resulting in varying output - https://phabricator.wikimedia.org/T193414#4191968 (10hashar) T191771 is MediaWiki parser tests failing under CI wh...
[19:54:19] <wikibugs_>	 (03PS1) 10Herron: logstash: add tcp tls input for syslogs [puppet] - 10https://gerrit.wikimedia.org/r/431830 (https://phabricator.wikimedia.org/T193766)
[19:54:46] <wikibugs_>	 (03CR) 10jerkins-bot: [V: 04-1] logstash: add tcp tls input for syslogs [puppet] - 10https://gerrit.wikimedia.org/r/431830 (https://phabricator.wikimedia.org/T193766) (owner: 10Herron)
[19:54:55] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1036 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[19:55:29] <ottomata>	 !
[19:55:30] <ottomata>	 ha
[19:55:31] <ottomata>	  sorry
[19:55:33] <ottomata>	 my downtime expired
[19:55:45] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1034 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[19:55:50] <ottomata>	 fixing
[19:55:54] <wikibugs_>	 (03PS2) 10Herron: logstash: add tcp tls input for syslogs [puppet] - 10https://gerrit.wikimedia.org/r/431830 (https://phabricator.wikimedia.org/T193766)
[19:59:32] <icinga-wm>	 RECOVERY - MegaRAID on db1073 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy
[20:00:45] <wikibugs_>	 (03CR) 10Herron: [C: 04-2] "need to test if the existing filters applied to type syslog will behave the same with tcp input" [puppet] - 10https://gerrit.wikimedia.org/r/431830 (https://phabricator.wikimedia.org/T193766) (owner: 10Herron)
[20:01:48] <mutante>	 !log mw2205,mw2206,mw2207 - reinstalling with stretch - mw2202 - wmf-auto-reimage failed: Timeout of 60 minutes reached waiting for reboot
[20:01:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:02:31] <logmsgbot>	 !log dzahn@neodymium conftool action : set/pooled=yes; selector: name=mw2201.codfw.wmnet
[20:02:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:16:16] <wikibugs_>	 (03PS1) 10Herron: ELK: change elasticsearch index prefix to logstash-syslog for syslog type [puppet] - 10https://gerrit.wikimedia.org/r/431860 (https://phabricator.wikimedia.org/T193766)
[20:17:43] <logmsgbot>	 !log milimetric@tin Started deploy [analytics/refinery@2a4633c]: Deploying renamed geowiki jobs as geoeditors
[20:17:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:23:48] <wikibugs_>	 (03PS5) 10Andrew Bogott: Horizon: added some keystone policy checks [puppet] - 10https://gerrit.wikimedia.org/r/431787 (https://phabricator.wikimedia.org/T194191)
[20:24:51] <logmsgbot>	 !log milimetric@tin Finished deploy [analytics/refinery@2a4633c]: Deploying renamed geowiki jobs as geoeditors (duration: 07m 07s)
[20:24:52] <icinga-wm>	 PROBLEM - etcd request latencies on neon is CRITICAL: instance=10.64.0.40:6443 operation=compareAndSwap https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[20:24:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:25:15] <wikibugs_>	 (03PS3) 10Urbanecm: Create the 'eventcoordinator' user group on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430418 (https://phabricator.wikimedia.org/T193075) (owner: 10Framawiki)
[20:25:31] <icinga-wm>	 PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=PATCH https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[20:26:33] <wikibugs_>	 (03CR) 10Andrew Bogott: [C: 032] Horizon: added some keystone policy checks [puppet] - 10https://gerrit.wikimedia.org/r/431787 (https://phabricator.wikimedia.org/T194191) (owner: 10Andrew Bogott)
[20:29:07] <icinga-wm>	 RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[20:33:57] <icinga-wm>	 RECOVERY - MegaRAID on analytics1032 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy
[20:38:27] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1036 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[20:38:48] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1034 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[20:39:17] <icinga-wm>	 RECOVERY - etcd request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[20:53:13] <wikibugs_>	 (03PS1) 10MaxSem: Deploy CongressLookup to betalabs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431989
[20:55:31] <twentyafterfour>	 mutante: mw2204.codfw.wmnet returned [255]: Permission denied 
[20:55:47] <twentyafterfour>	 same for mw2203
[20:56:15] <legoktm>	 MaxSem: woah can we slow down a bit?
[20:56:54] <Reedy>	 twentyafterfour: being reinstalled
[20:57:42] <Reedy>	 mutante: ^ did they not get depooled?
[20:58:11] <mutante>	 argg. that is a failure of the reinstall script that normally does that automatically
[20:58:40] <Reedy>	 :(
[20:58:59] <logmsgbot>	 !log dzahn@neodymium conftool action : set/pooled=inactive; selector: name=mw2203.codfw.wmnet
[20:59:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:59:09] <logmsgbot>	 !log dzahn@neodymium conftool action : set/pooled=inactive; selector: name=mw2204.codfw.wmnet
[20:59:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:59:55] <mutante>	 twentyafterfour: ^ depooled, normally happens automatically, sorry
[21:04:21] <icinga-wm>	 PROBLEM - MegaRAID on analytics1032 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough
[21:07:50] <wikibugs_>	 (CR) Anomie: WIP: wiki replicas - prepare for refactored actor storage (17 comments) [puppet] - https://gerrit.wikimedia.org/r/431823 (https://phabricator.wikimedia.org/T188299) (owner: Bstorm)
[21:09:49] <logmsgbot>	 !log twentyafterfour@tin Finished scap: testwikis wikis to 1.32.0-wmf.3 (duration: 114m 21s)
[21:09:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:10:21] <MaxSem>	 legoktm: ?
[21:10:39] <legoktm>	 re: CongressLookup. discussing in -dev
[21:12:45] <wikibugs_>	 (PS1) Ayounsi: Revert "Smokeping, remove Rigel" [puppet] - https://gerrit.wikimedia.org/r/431991
[21:12:52] <wikibugs_>	 (PS2) Ayounsi: Revert "Smokeping, remove Rigel" [puppet] - https://gerrit.wikimedia.org/r/431991
[21:13:20] <MaxSem>	 legoktm: the window is in several hours, don't worry:)
[21:14:59] <wikibugs_>	 (CR) Ayounsi: [C: 2] Revert "Smokeping, remove Rigel" [puppet] - https://gerrit.wikimedia.org/r/431991 (owner: Ayounsi)
[21:29:28] <wikibugs_>	 (PS1) Smalyshev: Add string and external-id types to indexing [mediawiki-config] - https://gerrit.wikimedia.org/r/431994 (https://phabricator.wikimedia.org/T163642)
[21:30:44] <wikibugs_>	 (CR) jerkins-bot: [V: -1] Add string and external-id types to indexing [mediawiki-config] - https://gerrit.wikimedia.org/r/431994 (https://phabricator.wikimedia.org/T163642) (owner: Smalyshev)
[21:38:51] <wikibugs_>	 Operations, Cloud-Services, netops: Allocate public v4 IPs for Neutron setup in eqiad - https://phabricator.wikimedia.org/T193496#4192431 (ayounsi) @chasemp Can you provide an ETA for returning the /25?
[21:39:10] <wikibugs_>	 (PS2) Smalyshev: Add string and external-id types to Wikibase indexing [mediawiki-config] - https://gerrit.wikimedia.org/r/431994 (https://phabricator.wikimedia.org/T163642)
[21:40:19] <wikibugs_>	 (CR) jerkins-bot: [V: -1] Add string and external-id types to Wikibase indexing [mediawiki-config] - https://gerrit.wikimedia.org/r/431994 (https://phabricator.wikimedia.org/T163642) (owner: Smalyshev)
[21:41:13] <wikibugs_>	 (CR) Krinkle: [C: 1] performance website: remove from graphite hosts [puppet] - https://gerrit.wikimedia.org/r/431792 (https://phabricator.wikimedia.org/T159354) (owner: Imarlier)
[21:41:37] <wikibugs_>	 (PS1) 20after4: group0 wikis to 1.32.0-wmf.3 [mediawiki-config] - https://gerrit.wikimedia.org/r/431998
[21:41:39] <wikibugs_>	 (CR) 20after4: [C: 2] group0 wikis to 1.32.0-wmf.3 [mediawiki-config] - https://gerrit.wikimedia.org/r/431998 (owner: 20after4)
[21:42:00] <wikibugs_>	 (PS2) Dzahn: performance website: remove from graphite hosts [puppet] - https://gerrit.wikimedia.org/r/431792 (https://phabricator.wikimedia.org/T159354) (owner: Imarlier)
[21:42:49] <mutante>	 marlier: did you want to also manually delete the site from graphite servers etc?
[21:42:54] <wikibugs_>	 (Merged) jenkins-bot: group0 wikis to 1.32.0-wmf.3 [mediawiki-config] - https://gerrit.wikimedia.org/r/431998 (owner: 20after4)
[21:43:03] <wikibugs_>	 (CR) Dzahn: [C: 2] performance website: remove from graphite hosts [puppet] - https://gerrit.wikimedia.org/r/431792 (https://phabricator.wikimedia.org/T159354) (owner: Imarlier)
[21:45:20] <logmsgbot>	 !log twentyafterfour@tin rebuilt and synchronized wikiversions files: group0 wikis to 1.32.0-wmf.3
[21:45:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:48:36] <wikibugs_>	 (CR) Bstorm: WIP: wiki replicas - prepare for refactored actor storage (3 comments) [puppet] - https://gerrit.wikimedia.org/r/431823 (https://phabricator.wikimedia.org/T188299) (owner: Bstorm)
[21:59:06] <twentyafterfour>	 !log MediaWiki train for 1.32.0-wmf.3 group0 is complete. Will resume with group1 tomorrow, same bat time, same bat channel (refs T191049)
[21:59:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:59:10] <stashbot>	 T191049: 1.32.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T191049
[21:59:29] <wikibugs_>	 (CR) jenkins-bot: group0 wikis to 1.32.0-wmf.3 [mediawiki-config] - https://gerrit.wikimedia.org/r/431998 (owner: 20after4)
[22:04:08] <wikibugs_>	 Operations, Fundraising-Backlog, Traffic, fundraising-tech-ops: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561#4192548 (cwdent) The problems I see are:  - content served over http - weak DH supported (https://weakdh.org/) resulting in "B" grade from Qualys  I d...
[22:05:15] <XioNoX>	 !log progressively push updated BGP_sanitize_in bogon ASN filters to routers - T190317
[22:05:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:05:19] <stashbot>	 T190317: Update BGP_sanitize_in filter - https://phabricator.wikimedia.org/T190317
[22:09:51] <wikibugs_>	 (PS3) Smalyshev: Add string and external-id types to Wikibase indexing [mediawiki-config] - https://gerrit.wikimedia.org/r/431994 (https://phabricator.wikimedia.org/T163642)
[22:19:16] <wikibugs_>	 (PS1) Chad: Adding deploy_artifacts.py wrapper to easily push stuff with mvn deploy:deploy-file [software/gerrit] (stable-2.14) - https://gerrit.wikimedia.org/r/432007
[22:33:51] <wikibugs_>	 (PS1) Krinkle: Remove unused vendor/autoload.php from missing.php [mediawiki-config] - https://gerrit.wikimedia.org/r/432010
[22:33:53] <wikibugs_>	 (PS1) Krinkle: multiversion: Remove unused vendor/autoload from getMWVersion. [mediawiki-config] - https://gerrit.wikimedia.org/r/432011
[22:33:55] <wikibugs_>	 (PS1) Krinkle: multiversion: Move vendor/autoload from MWMultiVersion to profiler.php [mediawiki-config] - https://gerrit.wikimedia.org/r/432012
[22:33:57] <wikibugs_>	 (PS1) Krinkle: Move multiversion/vendor/ to vendor/ [mediawiki-config] - https://gerrit.wikimedia.org/r/432013
[22:34:08] <Krinkle>	 no_justification: :)
[22:35:45] <XioNoX>	 !log remove PREFERRED-TRANSIT Tele2-DTAG from esams/knams routers
[22:35:47] <icinga-wm>	 RECOVERY - MegaRAID on analytics1032 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy
[22:35:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:36:59] <no_justification>	 Wheeee
[22:43:13] <wikibugs_>	 (PS2) Chad: Adding deploy_artifacts.py wrapper to easily push stuff with mvn deploy:deploy-file [software/gerrit] (stable-2.14) - https://gerrit.wikimedia.org/r/432007
[22:47:24] <logmsgbot>	 !log awight@tin Started deploy [ores/deploy@5b27205]: Deploy LFS files to ores1002
[22:47:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:49:21] <logmsgbot>	 !log awight@tin Finished deploy [ores/deploy@5b27205]: Deploy LFS files to ores1002 (duration: 01m 59s)
[22:49:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:52:56] <logmsgbot>	 !log awight@tin Started deploy [ores/deploy@bf182e2]: Rollback ores1002 to master
[22:52:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:53:14] <logmsgbot>	 !log awight@tin Finished deploy [ores/deploy@bf182e2]: Rollback ores1002 to master (duration: 00m 19s)
[22:53:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:00:04] <jouncebot>	 addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180508T2300).
[23:00:04] <jouncebot>	 Lucas_WMDE and Smalyshev: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[23:00:13] <Lucas_WMDE>	 o/
[23:00:51] <SMalyshev>	 here
[23:03:55] <Lucas_WMDE>	 I can start with a disclaimer that my backport isn’t directly testable, I don’t have a reliable way to trigger the exception it fixes
[23:04:14] <Lucas_WMDE>	 it also has a CI failure, unfortunately – I have to hope that the SWATter will agree the failure is unrelated :)
[23:04:53] <thcipriani>	 oh boy
[23:04:55] <thcipriani>	 I can SWAT
[23:05:28] <SMalyshev>	 great
[23:06:08] <thcipriani>	 Lucas_WMDE: I guess let's backport https://gerrit.wikimedia.org/r/#/c/430577/ and then rebase your patch.
[23:06:24] <Lucas_WMDE>	 okay
[23:06:38] <Lucas_WMDE>	 (I don’t think a rebase will be necessary, since it’s in a different extension?)
[23:06:55] <thcipriani>	 right :)
[23:07:00] <Lucas_WMDE>	 ok :)
[23:07:20] <Lucas_WMDE>	 also, it looks like I found a semi-reliable way to test my change after all
[23:08:55] <thcipriani>	 nice
[23:09:32] <thcipriani>	 while I wait for jenkins to do its thing I'll get the config change done.
[23:09:45] <Lucas_WMDE>	 ok
[23:09:46] <wikibugs_>	 (PS4) Thcipriani: Add string and external-id types to Wikibase indexing [mediawiki-config] - https://gerrit.wikimedia.org/r/431994 (https://phabricator.wikimedia.org/T163642) (owner: Smalyshev)
[23:09:55] <wikibugs_>	 (CR) Thcipriani: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/431994 (https://phabricator.wikimedia.org/T163642) (owner: Smalyshev)
[23:11:19] <wikibugs_>	 (Merged) jenkins-bot: Add string and external-id types to Wikibase indexing [mediawiki-config] - https://gerrit.wikimedia.org/r/431994 (https://phabricator.wikimedia.org/T163642) (owner: Smalyshev)
[23:11:35] <wikibugs_>	 (CR) jenkins-bot: Add string and external-id types to Wikibase indexing [mediawiki-config] - https://gerrit.wikimedia.org/r/431994 (https://phabricator.wikimedia.org/T163642) (owner: Smalyshev)
[23:12:49] <XioNoX>	 !log lowering ospf metric of ulsfo-codfw to 390
[23:12:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:13:02] <thcipriani>	 SMalyshev: your change is live on mwdebug1002, check please
[23:13:48] <SMalyshev>	 thcipriani: checking
[23:16:49] <SMalyshev>	 thcipriani: hmm actually I am not sure I can check it on mwdebug since it depends on jobs... and jobs run on different hosts, right?
[23:17:28] <SMalyshev>	 I may be able to check it on tin though
[23:17:42] <thcipriani>	 ah, cool, yeah it should be live there as well
[23:17:44] <SMalyshev>	 erh I mean terbium
[23:17:56] * thcipriani pulls to terbium
[23:18:22] <thcipriani>	 SMalyshev: live on terbium
[23:18:31] <wikibugs_>	 (PS3) Chad: Adding deploy_artifacts.py wrapper to easily push stuff with mvn deploy:deploy-file [software/gerrit] (stable-2.14) - https://gerrit.wikimedia.org/r/432007
[23:19:40] <SMalyshev>	 thcipriani: aha, great. Seems to be working just fine
[23:20:00] <thcipriani>	 SMalyshev: cool, going live
[23:21:34] <SMalyshev>	 thanks!
[23:22:45] <logmsgbot>	 !log thcipriani@tin Synchronized wmf-config: SWAT: [[gerrit:431994|Add string and external-id types to Wikibase indexing]] T163642 T99899 (duration: 01m 26s)
[23:22:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:22:50] <stashbot>	 T99899: [Story] Looking up entities by external identifiers - https://phabricator.wikimedia.org/T99899
[23:22:50] <stashbot>	 T163642: Index Wikidata strings in statements in the search engine - https://phabricator.wikimedia.org/T163642
[23:22:51] <thcipriani>	 ^ SMalyshev live now
[23:25:24] <wikibugs_>	 (PS1) Krinkle: Use perftools/xhgui-collector instead of perftools/xhgui [mediawiki-config] - https://gerrit.wikimedia.org/r/432016
[23:26:02] <Krinkle>	 no_justification: -104097, +712 :)
[23:26:15] <no_justification>	 OMG I LOVE YOU
[23:26:49] <legoktm>	 :o
[23:26:57] <icinga-wm>	 PROBLEM - MegaRAID on analytics1032 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough
[23:27:09] <wikibugs_>	 (CR) Krinkle: "I've done a plain git-mv in this commit, but that might not actually work. There's a couple of references in vendor/composer that try to f" [mediawiki-config] - https://gerrit.wikimedia.org/r/432013 (owner: Krinkle)
[23:27:20] <wikibugs_>	 (CR) Krinkle: [C: -1] Move multiversion/vendor/ to vendor/ [mediawiki-config] - https://gerrit.wikimedia.org/r/432013 (owner: Krinkle)
[23:27:40] <wikibugs_>	 (CR) Krinkle: [C: -1] "Untested. Will run tests later this/next week on HHVM and PHP7" [mediawiki-config] - https://gerrit.wikimedia.org/r/432016 (owner: Krinkle)
[23:28:41] <wikibugs_>	 (CR) Krinkle: "TODO: Move within XWD conditional. That would mean one less autoloader on all MediaWIki php requests." [mediawiki-config] - https://gerrit.wikimedia.org/r/432012 (owner: Krinkle)
[23:31:16] <wikibugs_>	 (PS1) Chad: Minimal pom.xml so output from mvn looks sane [software/gerrit] (stable-2.14) - https://gerrit.wikimedia.org/r/432017
[23:32:36] <wikibugs_>	 (CR) Jdlrobson: [C: 1] Remove unused PopupsAnonsExperimentalGroupSize config variable [mediawiki-config] - https://gerrit.wikimedia.org/r/431759 (https://phabricator.wikimedia.org/T173952) (owner: Pmiazga)
[23:42:04] <XioNoX>	 !log progressively push BGP_sanitize_in as-path too-many-hops to routers - T190317
[23:42:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:42:08] <stashbot>	 T190317: Update BGP_sanitize_in filter - https://phabricator.wikimedia.org/T190317
[23:42:30] <mutante>	 jouncebot: now
[23:42:30] <jouncebot>	 For the next 0 hour(s) and 17 minute(s): Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180508T2300)
[23:49:31] <wikibugs_>	 (CR) Chad: [V: 2 C: 2] 2.14.8-22-g07c8aa9910 [software/gerrit] (stable-2.14) - https://gerrit.wikimedia.org/r/431818 (owner: Chad)
[23:50:00] <wikibugs_>	 (CR) Chad: [V: 2 C: 2] Adding deploy_artifacts.py wrapper to easily push stuff with mvn deploy:deploy-file [software/gerrit] (stable-2.14) - https://gerrit.wikimedia.org/r/432007 (owner: Chad)
[23:50:27] <thcipriani>	 Lucas_WMDE: sorry for the delay, your change is on mwdebug1002, check please
[23:50:33] <Lucas_WMDE>	 will do, no problem
[23:51:20] <Lucas_WMDE>	 (my test is pretty simple, open https://www.wikidata.org/wiki/Special:ConstraintReport/Q23 several times and check that I never get a BadMethodCallException)
[23:51:32] <thcipriani>	 :)
[23:54:27] <Lucas_WMDE>	 no errors in about ten requests, that’s good enough for me
[23:54:34] <Lucas_WMDE>	 thcipriani feel free to proceed :)
[23:54:37] * thcipriani does
[23:58:25] <logmsgbot>	 !log thcipriani@tin Synchronized php-1.32.0-wmf.2/extensions/WikibaseQualityConstraints/src/ConstraintCheck/Helper/LoggingHelper.php: SWAT: [[gerrit:431805|Do not try to access null message message key]] T194140 (duration: 01m 32s)
[23:58:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:58:29] <stashbot>	 T194140: Fatal exception of type "BadMethodCallException" on Special:ConstraintReport - https://phabricator.wikimedia.org/T194140
[23:58:51] <thcipriani>	 ^ Lucas_WMDE live everywhere
[23:58:59] <Lucas_WMDE>	 great, thanks!
[23:59:09] <Lucas_WMDE>	 I’ll check logstash tomorrow to see if the errors stopped
[23:59:17] <thcipriani>	 cool, thanks :)