[00:30:36] PROBLEM - High CPU load on API appserver on mw2220 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [00:32:16] PROBLEM - High CPU load on API appserver on mw2221 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [00:32:37] PROBLEM - puppet last run on mw2219 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [00:32:37] PROBLEM - nutcracker process on mw2220 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [00:32:37] PROBLEM - nutcracker port on mw2221 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [00:34:16] PROBLEM - Check size of conntrack table on mw2219 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [00:34:16] PROBLEM - nutcracker process on mw2221 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [00:34:16] PROBLEM - puppet last run on mw2220 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [00:35:56] PROBLEM - Check systemd state on mw2219 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [00:35:56] PROBLEM - MD RAID on mw2219 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [00:35:56] PROBLEM - puppet last run on mw2221 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [00:35:56] PROBLEM - Check size of conntrack table on mw2220 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [00:35:57] RECOVERY - MegaRAID on analytics1032 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy [00:37:36] PROBLEM - Check the NTP synchronisation status of timesyncd on mw2219 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [00:37:36] PROBLEM - Check systemd state on mw2220 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [00:37:36] PROBLEM - MD RAID on mw2220 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[00:37:37] PROBLEM - Check size of conntrack table on mw2221 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [00:38:56] PROBLEM - HHVM rendering on mw2221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:39:26] PROBLEM - Nginx local proxy to apache on mw2219 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:41:06] PROBLEM - Nginx local proxy to apache on mw2220 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:41:16] PROBLEM - Apache HTTP on mw2221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:41:36] PROBLEM - Apache HTTP on mw2220 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:42:06] PROBLEM - Apache HTTP on mw2219 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:42:46] PROBLEM - Nginx local proxy to apache on mw2221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:45:46] PROBLEM - Check systemd state on mw2221 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [00:45:46] PROBLEM - MD RAID on mw2221 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [00:45:47] PROBLEM - nutcracker process on mw2221 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [00:45:47] PROBLEM - High CPU load on API appserver on mw2221 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [00:45:56] PROBLEM - configured eth on mw2219 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [00:45:57] PROBLEM - mediawiki-installation DSH group on mw2219 is CRITICAL: Host mw2219 is not in mediawiki-installation dsh group [00:45:57] PROBLEM - HHVM processes on mw2219 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [00:46:07] PROBLEM - Check systemd state on mw2220 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [00:46:07] PROBLEM - MD RAID on mw2220 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[00:46:07] RECOVERY - Check size of conntrack table on mw2221 is OK: OK: nf_conntrack is 0 % full [00:46:16] PROBLEM - High CPU load on API appserver on mw2220 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [00:46:16] PROBLEM - nutcracker process on mw2220 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [00:46:17] PROBLEM - nutcracker port on mw2221 is CRITICAL: connect to address 127.0.0.1 and port 11212: Connection refused [00:46:36] RECOVERY - Check size of conntrack table on mw2220 is OK: OK: nf_conntrack is 0 % full [00:46:36] PROBLEM - Check systemd state on mw2219 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [00:46:36] RECOVERY - MD RAID on mw2219 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [00:46:56] RECOVERY - MD RAID on mw2221 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [00:46:56] RECOVERY - Check size of conntrack table on mw2219 is OK: OK: nf_conntrack is 1 % full [00:46:56] RECOVERY - High CPU load on API appserver on mw2221 is OK: OK - load average: 4.89, 5.37, 3.57 [00:46:57] RECOVERY - configured eth on mw2219 is OK: OK - interfaces up [00:47:06] RECOVERY - HHVM processes on mw2219 is OK: PROCS OK: 6 processes with command name hhvm [00:47:16] RECOVERY - MD RAID on mw2220 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [00:47:17] RECOVERY - Apache HTTP on mw2219 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 3.908 second response time [00:47:26] RECOVERY - High CPU load on API appserver on mw2220 is OK: OK - load average: 4.68, 5.49, 3.64 [00:47:46] PROBLEM - mediawiki-installation DSH group on mw2220 is CRITICAL: Host mw2220 is not in mediawiki-installation dsh group [00:47:46] PROBLEM - nutcracker port on mw2219 is CRITICAL: connect to address 127.0.0.1 and port 11212: Connection refused [00:49:06] RECOVERY - Nginx local proxy to apache on mw2219 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 3.070 second 
response time [00:49:16] PROBLEM - HHVM rendering on mw2220 is CRITICAL: connect to address 10.192.0.45 and port 80: Connection refused [00:49:17] PROBLEM - mediawiki-installation DSH group on mw2221 is CRITICAL: Host mw2221 is not in mediawiki-installation dsh group [00:49:17] PROBLEM - nutcracker port on mw2220 is CRITICAL: connect to address 127.0.0.1 and port 11212: Connection refused [00:49:17] PROBLEM - nutcracker process on mw2219 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (nutcracker), command name nutcracker [00:50:57] RECOVERY - Check systemd state on mw2219 is OK: OK - unknown: The operational state could not be determined, due to lack of resources or another error cause. [00:53:27] RECOVERY - nutcracker process on mw2221 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker [00:53:47] RECOVERY - nutcracker port on mw2221 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [00:53:57] RECOVERY - Apache HTTP on mw2221 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.291 second response time [00:54:08] RECOVERY - nutcracker port on mw2219 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [00:54:18] RECOVERY - Check systemd state on mw2221 is OK: OK - running: The system is fully operational [00:54:27] RECOVERY - Nginx local proxy to apache on mw2221 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.247 second response time [00:54:38] RECOVERY - Check systemd state on mw2220 is OK: OK - running: The system is fully operational [00:54:47] RECOVERY - nutcracker process on mw2220 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker [00:55:18] RECOVERY - Apache HTTP on mw2220 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.144 second response time [00:55:47] RECOVERY - HHVM rendering on mw2221 is OK: HTTP OK: HTTP/1.1 200 OK - 74933 bytes in 2.153 second response time [00:56:48] RECOVERY - Nginx local proxy to apache on 
mw2220 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 0.366 second response time [00:56:58] RECOVERY - HHVM rendering on mw2220 is OK: HTTP OK: HTTP/1.1 200 OK - 74933 bytes in 3.597 second response time [00:57:47] RECOVERY - puppet last run on mw2219 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:59:31] RECOVERY - puppet last run on mw2220 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [01:00:00] PROBLEM - HP RAID on db2067 is CRITICAL: CRITICAL: Slot 0: Failed: 1I:1:10 - OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:11, 1I:1:12 - Controller: OK - Battery/Capacitor: OK [01:00:01] ACKNOWLEDGEMENT - HP RAID on db2067 is CRITICAL: CRITICAL: Slot 0: Failed: 1I:1:10 - OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:11, 1I:1:12 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T194103 [01:00:29] Operations, ops-codfw: Degraded RAID on db2067 - https://phabricator.wikimedia.org/T194103#4188843 (ops-monitoring-bot) [01:00:31] RECOVERY - nutcracker port on mw2220 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [01:01:10] RECOVERY - puppet last run on mw2221 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [01:03:50] RECOVERY - nutcracker process on mw2219 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker [01:06:01] PROBLEM - MegaRAID on analytics1032 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough [01:06:40] PROBLEM - Device not healthy -SMART- on db2067 is CRITICAL: cluster=mysql device=cciss,11 instance=db2067:9100 job=node site=codfw 
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2067&var-datasource=codfw%2520prometheus%252Fops [01:07:40] RECOVERY - Check the NTP synchronisation status of timesyncd on mw2219 is OK: OK: synced at Tue 2018-05-08 01:07:33 UTC. [01:16:10] RECOVERY - MegaRAID on analytics1032 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy [01:46:21] PROBLEM - MegaRAID on analytics1032 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough [02:33:09] !log l10nupdate@tin scap sync-l10n completed (1.32.0-wmf.2) (duration: 05m 45s) [02:33:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:46:51] RECOVERY - MegaRAID on analytics1032 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy [03:28:00] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 700.40 seconds [03:47:20] PROBLEM - MegaRAID on analytics1032 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough [04:15:00] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 200.10 seconds [04:24:42] !log dzahn@neodymium conftool action : set/pooled=yes; selector: name=mw2219.codfw.wmnet [04:24:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:26:20] !log dzahn@neodymium conftool action : set/pooled=yes; selector: name=mw2220.codfw.wmnet [04:26:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:27:53] !log dzahn@neodymium conftool action : set/pooled=yes; selector: name=mw2221.codfw.wmnet 
[04:27:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:30:20] RECOVERY - mediawiki-installation DSH group on mw2219 is OK: OK [04:30:40] RECOVERY - mediawiki-installation DSH group on mw2220 is OK: OK [04:31:00] RECOVERY - mediawiki-installation DSH group on mw2221 is OK: OK [05:12:30] RECOVERY - Maps - OSM synchronization lag - eqiad on einsteinium is OK: (C)1.728e+05 ge (W)9e+04 ge 1.874e+04 https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=11&fullscreen&orgId=1 [05:14:21] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1097:3314" [mediawiki-config] - https://gerrit.wikimedia.org/r/431698 [05:16:00] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1097:3314" [mediawiki-config] - https://gerrit.wikimedia.org/r/431698 (owner: Marostegui) [05:16:43] Operations, ops-codfw, DBA: Degraded RAID on db2067 - https://phabricator.wikimedia.org/T194103#4189150 (Marostegui) a:Papaul @Papaul can we get a new disk for this one? Thanks! 
[05:17:15] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1097:3314" [mediawiki-config] - https://gerrit.wikimedia.org/r/431698 (owner: Marostegui) [05:17:19] (PS2) Marostegui: Revert "wiki replicas: Depool labsdb1011 for MCR table changes" [puppet] - https://gerrit.wikimedia.org/r/431672 (owner: Bstorm) [05:18:35] ACKNOWLEDGEMENT - Device not healthy -SMART- on db2067 is CRITICAL: cluster=mysql device=cciss,11 instance=db2067:9100 job=node site=codfw Marostegui https://phabricator.wikimedia.org/T194103 https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2067&var-datasource=codfw%2520prometheus%252Fops [05:18:41] (CR) Marostegui: [C: 2] Revert "wiki replicas: Depool labsdb1011 for MCR table changes" [puppet] - https://gerrit.wikimedia.org/r/431672 (owner: Bstorm) [05:18:59] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1097:3314 after alter table (duration: 01m 00s) [05:19:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:19:30] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1097:3314" [mediawiki-config] - https://gerrit.wikimedia.org/r/431698 (owner: Marostegui) [05:19:33] !log Reload haproxy on dbproxy1010 to repool labsdb1011 - https://phabricator.wikimedia.org/T174047 [05:19:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:22:11] (PS1) Marostegui: db-eqiad.php: Depool db1121 [mediawiki-config] - https://gerrit.wikimedia.org/r/431701 (https://phabricator.wikimedia.org/T190148) [05:23:39] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1121 [mediawiki-config] - https://gerrit.wikimedia.org/r/431701 (https://phabricator.wikimedia.org/T190148) (owner: Marostegui) [05:24:56] (Merged) jenkins-bot: db-eqiad.php: Depool db1121 [mediawiki-config] - https://gerrit.wikimedia.org/r/431701 (https://phabricator.wikimedia.org/T190148) (owner: Marostegui) [05:25:35] (CR) jenkins-bot: db-eqiad.php: Depool 
db1121 [mediawiki-config] - https://gerrit.wikimedia.org/r/431701 (https://phabricator.wikimedia.org/T190148) (owner: Marostegui) [05:26:10] (PS2) Marostegui: mariadb: db1069 is now x1 master [puppet] - https://gerrit.wikimedia.org/r/431568 (https://phabricator.wikimedia.org/T186320) [05:26:12] (PS2) Marostegui: db-eqiad.php: Promote db1069 to be x1 master [mediawiki-config] - https://gerrit.wikimedia.org/r/431566 (https://phabricator.wikimedia.org/T186320) [05:26:18] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1121 for alter table (duration: 01m 00s) [05:26:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:26:22] !log Deploy schema change on db1121 with replication (this will generate lag on labs on s4) - T191519 T188299 T190148 [05:26:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:26:27] T191519: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519 [05:26:28] T190148: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148 [05:26:28] T188299: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299 [05:28:22] !log Disable gtid on db1069 and db2034 before x1 failover - T186320 [05:28:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:28:26] T186320: Decommission db1051-db1060 (DBA tracking) - https://phabricator.wikimedia.org/T186320 [05:29:48] !log Disable puppet on db1055 and db1069 before x1 failover - T186320 [05:29:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:36:35] !log Move dbstore1002:x1 under db1069 for x1 failover - T186320 [05:36:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:36:39] T186320: Decommission db1051-db1060 (DBA tracking) - https://phabricator.wikimedia.org/T186320 [05:41:22] !log Move db2034 under db1069 for x1 failover - 
T186320 [05:41:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:42:30] (CR) Marostegui: [C: 2] mariadb: db1069 is now x1 master [puppet] - https://gerrit.wikimedia.org/r/431568 (https://phabricator.wikimedia.org/T186320) (owner: Marostegui) [05:58:07] We are starting the x1 failover in 2 minutes [05:59:08] Going to merge: https://gerrit.wikimedia.org/r/#/c/431566/ without deploying [05:59:37] (CR) Marostegui: [C: 2] db-eqiad.php: Promote db1069 to be x1 master [mediawiki-config] - https://gerrit.wikimedia.org/r/431566 (https://phabricator.wikimedia.org/T186320) (owner: Marostegui) [06:00:04] marostegui and jynus: Your horoscope predicts another unfortunate x1 master switchover deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180508T0600). [06:00:05] jynus: ready? [06:00:10] hahaha [06:00:12] yes [06:00:16] let's go then [06:00:24] !log Start x1 failover [06:00:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:00:29] !log Set db1055 read only [06:00:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:00:45] done [06:00:54] (Merged) jenkins-bot: db-eqiad.php: Promote db1069 to be x1 master [mediawiki-config] - https://gerrit.wikimedia.org/r/431566 (https://phabricator.wikimedia.org/T186320) (owner: Marostegui) [06:01:09] (CR) jenkins-bot: db-eqiad.php: Promote db1069 to be x1 master [mediawiki-config] - https://gerrit.wikimedia.org/r/431566 (https://phabricator.wikimedia.org/T186320) (owner: Marostegui) [06:02:11] db1069-bin.000008:928243708 [06:02:21] yep! [06:03:23] running puppet [06:03:57] all looking good [06:03:59] I see db1055 advancing its master log [06:04:01] going to deploy mediawiki [06:04:23] good from your side? 
[06:04:31] sure [06:04:35] deploying [06:05:38] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Promote db1069 as new x1 master (duration: 01m 00s) [06:05:39] going to disable read_only on db1069 [06:05:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:05:52] !log Read_only=off on db1069 to finish with the x1 failover [06:05:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:05:56] done [06:06:16] I'm going to update tendril to confirm [06:06:20] good! [06:07:18] I can see writes coming to db1069 [06:07:58] replication errors stopped [06:08:10] fatals also [06:08:43] errors from 6:01:30 to 6:05:30 [06:10:05] matches the read only times yep [06:10:17] (CR) Marostegui: [C: 2] x1.hosts: db1069 is the new x1 master [software] - https://gerrit.wikimedia.org/r/431567 (https://phabricator.wikimedia.org/T186320) (owner: Marostegui) [06:11:05] (Merged) jenkins-bot: x1.hosts: db1069 is the new x1 master [software] - https://gerrit.wikimedia.org/r/431567 (https://phabricator.wikimedia.org/T186320) (owner: Marostegui) [06:21:11] (PS1) Marostegui: db-eqiad,db-codfw.php: Remove db1060 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/431703 (https://phabricator.wikimedia.org/T193732) [06:22:50] (CR) Marostegui: [C: 2] db-eqiad,db-codfw.php: Remove db1060 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/431703 (https://phabricator.wikimedia.org/T193732) (owner: Marostegui) [06:24:05] (Merged) jenkins-bot: db-eqiad,db-codfw.php: Remove db1060 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/431703 (https://phabricator.wikimedia.org/T193732) (owner: Marostegui) [06:25:30] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Remove db1060 from config - T193732 (duration: 00m 59s) [06:25:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:25:34] T193732: Decommission db1060 - 
https://phabricator.wikimedia.org/T193732 [06:26:37] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Remove db1060 from config - T193732 (duration: 01m 01s) [06:26:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:29:39] (CR) jenkins-bot: db-eqiad,db-codfw.php: Remove db1060 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/431703 (https://phabricator.wikimedia.org/T193732) (owner: Marostegui) [06:29:51] PROBLEM - puppet last run on labstore1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/apparmor.d/abstractions/ssl_certs] [06:31:02] (PS1) Marostegui: mariadb: Set db1060 as spare [puppet] - https://gerrit.wikimedia.org/r/431704 (https://phabricator.wikimedia.org/T193732) [06:31:40] (PS2) Marostegui: mariadb: Set db1060 as spare [puppet] - https://gerrit.wikimedia.org/r/431704 (https://phabricator.wikimedia.org/T193732) [06:39:29] (CR) Marostegui: [C: 2] mariadb: Set db1060 as spare [puppet] - https://gerrit.wikimedia.org/r/431704 (https://phabricator.wikimedia.org/T193732) (owner: Marostegui) [06:47:55] (PS3) Jcrespo: tendril: Move cron jobs to dbmonitor, remove proxysql from terbium [puppet] - https://gerrit.wikimedia.org/r/431529 (https://phabricator.wikimedia.org/T193919) [06:50:04] !log reimaging mw1313, mw1343, mw1344 to stretch [06:50:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:51:16] !log Stop MySQL on db1060 as it will be decommissioned - T193732 [06:51:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:51:20] T193732: Decommission db1060 - https://phabricator.wikimedia.org/T193732 [06:51:54] (CR) Jcrespo: [C: 2] tendril: Move cron jobs to dbmonitor, remove proxysql from terbium [puppet] - https://gerrit.wikimedia.org/r/431529 (https://phabricator.wikimedia.org/T193919) (owner: Jcrespo) [06:53:38] 
(PS1) Marostegui: s2.hosts: Remove db1060 [software] - https://gerrit.wikimedia.org/r/431705 (https://phabricator.wikimedia.org/T193732) [06:55:11] (CR) Marostegui: [C: 2] s2.hosts: Remove db1060 [software] - https://gerrit.wikimedia.org/r/431705 (https://phabricator.wikimedia.org/T193732) (owner: Marostegui) [06:55:56] (Merged) jenkins-bot: s2.hosts: Remove db1060 [software] - https://gerrit.wikimedia.org/r/431705 (https://phabricator.wikimedia.org/T193732) (owner: Marostegui) [06:59:00] (CR) Nikerabbit: [C: 1] cawiki: remove gendered namespace aliases, already on MW core [mediawiki-config] - https://gerrit.wikimedia.org/r/429989 (https://phabricator.wikimedia.org/T113616) (owner: MarcoAurelio) [07:00:08] RECOVERY - puppet last run on labstore1003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [07:02:08] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received: /en.wikipedia.org/v1/page/metadata/{title}{/revision} (Get extended metadata of a test page) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for Apri [07:02:08] ut before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received [07:02:28] PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received [07:03:08] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy [07:03:29] RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy [07:05:41] Operations, ops-eqiad, DBA, decommission, Patch-For-Review: Decommission 
db1060 - https://phabricator.wikimedia.org/T193732#4189275 (Marostegui) a:Marostegui>RobH This is ready for @RobH and DC-Ops to take over [07:06:14] PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received [07:06:27] (PS1) Muehlenhoff: Move scap proxy in A3 to mw2216 [puppet] - https://gerrit.wikimedia.org/r/431706 [07:06:43] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [07:06:43] PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [07:07:21] Operations, HHVM, Patch-For-Review, User-Elukey: Upgrade mw* servers to Debian Stretch (using HHVM) - https://phabricator.wikimedia.org/T174431#4189280 (MoritzMuehlenhoff) All application servers are now running stretch (excluding job runners and API servers). 
[07:07:33] RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy [07:07:34] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy [07:07:45] Operations, HHVM, Patch-For-Review, User-Elukey: Upgrade mw* servers to Debian Stretch (using HHVM) - https://phabricator.wikimedia.org/T174431#4189281 (MoritzMuehlenhoff) [07:08:13] RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy [07:08:44] Operations, DBA, Goal, Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4189282 (Marostegui) What if we temporarily convert db2092 (s1) to codfw sanitarium, copy db1116's data to db2092. Once the n... [07:09:13] PROBLEM - eventstreams on scb1002 is CRITICAL: connect to address 10.64.16.21 and port 8092: Connection refused [07:11:06] (CR) Muehlenhoff: [C: 2] Move scap proxy in A3 to mw2216 [puppet] - https://gerrit.wikimedia.org/r/431706 (owner: Muehlenhoff) [07:11:43] (PS1) Jcrespo: tendril: Explicit perl package dependencies on maintenance [puppet] - https://gerrit.wikimedia.org/r/431707 (https://phabricator.wikimedia.org/T184797) [07:12:08] (PS2) Jcrespo: tendril: Explicit perl package dependencies on maintenance [puppet] - https://gerrit.wikimedia.org/r/431707 (https://phabricator.wikimedia.org/T184797) [07:12:13] RECOVERY - eventstreams on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 1066 bytes in 0.023 second response time [07:13:32] Operations, DBA, Goal, Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4189285 (jcrespo) But one host will not be enough, we need 2. 
[07:14:30] (CR) Jcrespo: [C: 2] tendril: Explicit perl package dependencies on maintenance [puppet] - https://gerrit.wikimedia.org/r/431707 (https://phabricator.wikimedia.org/T184797) (owner: Jcrespo) [07:15:29] Operations, DBA, Goal, Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4189287 (Marostegui) >>! In T190704#4189285, @jcrespo wrote: > But one host will not be enough, we need 2. Yes, but for that... [07:30:57] !log cleaning up maintenance hosts (terbium, etc.) from tendril maintenance files [07:31:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:34:21] (PS1) Marostegui: db-codfw.php: Depool db2092 [mediawiki-config] - https://gerrit.wikimedia.org/r/431708 (https://phabricator.wikimedia.org/T190704) [07:41:17] Operations, Dumps-Generation, Patch-For-Review: data retrieval/write issues via NFS on dumpsdata1001, impacting some dump jobs - https://phabricator.wikimedia.org/T191177#4189336 (ArielGlenn) Open>Resolved This month's run looks good, no nulls in stub files, no other weirdness either so I'm g... [07:41:35] (PS1) Marostegui: mariadb: Convert db2092 to sanitarium multi-instance [puppet] - https://gerrit.wikimedia.org/r/431709 (https://phabricator.wikimedia.org/T190704) [07:41:48] Operations, Patch-For-Review: setup replacement for terbium (maintenance_server) on stretch - https://phabricator.wikimedia.org/T192092#4189342 (jcrespo) [07:41:50] Operations, Patch-For-Review: provide proxysql for stretch, add package to puppet - https://phabricator.wikimedia.org/T193919#4189339 (jcrespo) Open>Resolved a:jcrespo * proxysql and tendril maintenance have been removed from mediawiki maintenance * proxysql for stretch package has been uploa... 
[07:44:27] (PS8) Elukey: role::aqs: deprecate cassandra-metrics-collector [puppet] - https://gerrit.wikimedia.org/r/431546 (https://phabricator.wikimedia.org/T186567) [07:44:36] (CR) Marostegui: "https://puppet-compiler.wmflabs.org/compiler02/11155/" [puppet] - https://gerrit.wikimedia.org/r/431709 (https://phabricator.wikimedia.org/T190704) (owner: Marostegui) [07:45:13] (PS2) Marostegui: mariadb: Convert db2092 to sanitarium multi-instance [puppet] - https://gerrit.wikimedia.org/r/431709 (https://phabricator.wikimedia.org/T190704) [07:45:26] (CR) Elukey: [C: 2] role::aqs: deprecate cassandra-metrics-collector [puppet] - https://gerrit.wikimedia.org/r/431546 (https://phabricator.wikimedia.org/T186567) (owner: Elukey) [07:46:33] (PS3) Marostegui: mariadb: Convert db2092 to sanitarium multi-instance [puppet] - https://gerrit.wikimedia.org/r/431709 (https://phabricator.wikimedia.org/T190704) [07:47:00] (PS4) Marostegui: mariadb: Convert db2092 to sanitarium multi-instance [puppet] - https://gerrit.wikimedia.org/r/431709 (https://phabricator.wikimedia.org/T190704) [07:48:03] (CR) Marostegui: [C: 2] mariadb: Convert db2092 to sanitarium multi-instance [puppet] - https://gerrit.wikimedia.org/r/431709 (https://phabricator.wikimedia.org/T190704) (owner: Marostegui) [07:48:40] PROBLEM - Check systemd state on aqs1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [07:49:44] this is me --^ [07:49:49] RECOVERY - Check systemd state on aqs1004 is OK: OK - running: The system is fully operational [07:49:52] I am removing cassandra metrics collector [07:51:31] Operations, DBA, Goal, Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4081506 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` db209... 
[07:52:10] PROBLEM - Check systemd state on aqs1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [07:52:21] elukey: you'll also need to remove the wmf_auto_restart for the cassandra-metrics-collector [07:52:44] moritzm: yep yep [07:53:19] RECOVERY - Check systemd state on aqs1008 is OK: OK - running: The system is fully operational [07:53:25] !log second attempt to remove the cassandra-metrics-collector (+ cleanup) from aqs* [07:53:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:25] moritzm: ah I didn't see the ensure for that class! Amending puppet now [07:55:44] ack! [07:56:19] (PS1) Elukey: cassandra::metrics: propagate ensure parameter to wmf-auto-restart [puppet] - https://gerrit.wikimedia.org/r/431710 (https://phabricator.wikimedia.org/T186567) [07:56:31] (CR) Marostegui: [C: 2] db-codfw.php: Depool db2092 [mediawiki-config] - https://gerrit.wikimedia.org/r/431708 (https://phabricator.wikimedia.org/T190704) (owner: Marostegui) [07:56:44] (PS1) Marostegui: sX.hosts: db2092 is now multiinstance [software] - https://gerrit.wikimedia.org/r/431711 (https://phabricator.wikimedia.org/T190704) [07:56:54] (CR) Elukey: [C: 2] cassandra::metrics: propagate ensure parameter to wmf-auto-restart [puppet] - https://gerrit.wikimedia.org/r/431710 (https://phabricator.wikimedia.org/T186567) (owner: Elukey) [07:57:42] (CR) Marostegui: [C: 2] sX.hosts: db2092 is now multiinstance [software] - https://gerrit.wikimedia.org/r/431711 (https://phabricator.wikimedia.org/T190704) (owner: Marostegui) [07:57:51] (Merged) jenkins-bot: db-codfw.php: Depool db2092 [mediawiki-config] - https://gerrit.wikimedia.org/r/431708 (https://phabricator.wikimedia.org/T190704) (owner: Marostegui) [07:58:28] (Merged) jenkins-bot: sX.hosts: db2092 is now multiinstance [software] - https://gerrit.wikimedia.org/r/431711 (https://phabricator.wikimedia.org/T190704) 
(owner: Marostegui) [07:59:04] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2092 T190704 (duration: 00m 57s) [07:59:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:59:08] T190704: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704 [07:59:39] (CR) jenkins-bot: db-codfw.php: Depool db2092 [mediawiki-config] - https://gerrit.wikimedia.org/r/431708 (https://phabricator.wikimedia.org/T190704) (owner: Marostegui) [08:03:24] !log Stop MySQL on db1116 to transfer its content to db2092 - T190704 [08:03:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:07] Operations, Puppet, DBA, Patch-For-Review: Move mariadb_maintenance away from terbium/wasat (mediawiki_maintenance) - https://phabricator.wikimedia.org/T184797#4189395 (jcrespo) Open>Resolved a:jcrespo Done, no maintenance code yet for database maintenance, but that is still on terbiu... [08:12:12] Operations, DBA, Goal, Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4189416 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db2092.codfw.wmnet'] ``` and were **ALL** successful. [08:14:22] Operations, MediaWiki-Parser, MediaWiki-Platform-Team, Parsing-Team, and 2 others: Different production servers have different versions of tidy installed, resulting in varying output - https://phabricator.wikimedia.org/T193414#4168635 (MoritzMuehlenhoff) I've created some test packages at https:/...
[08:25:14] (PS1) Vgutierrez: mtail: Fix varnishrls regex [WIP] [puppet] - https://gerrit.wikimedia.org/r/431712 (https://phabricator.wikimedia.org/T184942) [08:25:20] (PS2) Jcrespo: proxysql: require proxysql package installation for module proxysql [puppet] - https://gerrit.wikimedia.org/r/431584 (https://phabricator.wikimedia.org/T193919) [08:25:42] (CR) jerkins-bot: [V: -1] mtail: Fix varnishrls regex [WIP] [puppet] - https://gerrit.wikimedia.org/r/431712 (https://phabricator.wikimedia.org/T184942) (owner: Vgutierrez) [08:26:52] (PS8) Jcrespo: Install parallel gzip (pigz) and parallel xz (pxz) on all servers [puppet] - https://gerrit.wikimedia.org/r/419709 [08:27:11] !log reimaging mw1308, mw1309 (job runners) to stretch [08:27:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:32] (CR) Jcrespo: [C: 2] Install parallel gzip (pigz) and parallel xz (pxz) on all servers [puppet] - https://gerrit.wikimedia.org/r/419709 (owner: Jcrespo) [08:28:45] (PS2) Vgutierrez: mtail: Fix varnishrls regex [WIP] [puppet] - https://gerrit.wikimedia.org/r/431712 (https://phabricator.wikimedia.org/T184942) [08:29:12] (CR) jerkins-bot: [V: -1] mtail: Fix varnishrls regex [WIP] [puppet] - https://gerrit.wikimedia.org/r/431712 (https://phabricator.wikimedia.org/T184942) (owner: Vgutierrez) [08:30:13] !log reimaging mw2156, mw2157, mw2158 (job runners) to stretch [08:30:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:19] (PS3) Vgutierrez: mtail: Fix varnishrls regex [WIP] [puppet] - https://gerrit.wikimedia.org/r/431712 (https://phabricator.wikimedia.org/T184942) [08:43:22] (PS4) Vgutierrez: mtail: Fix varnishrls regex [puppet] - https://gerrit.wikimedia.org/r/431712 (https://phabricator.wikimedia.org/T184942) [08:45:10] akosiaris: any issue with scb1002?
[08:49:25] RECOVERY - MegaRAID on analytics1032 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy [08:53:27] (PS1) Marostegui: mariadb: Enable innodb_strict_mode on the last two roles [puppet] - https://gerrit.wikimedia.org/r/431715 (https://phabricator.wikimedia.org/T150949) [08:55:22] (CR) Marostegui: [C: 2] "https://puppet-compiler.wmflabs.org/compiler02/11156/" [puppet] - https://gerrit.wikimedia.org/r/431715 (https://phabricator.wikimedia.org/T150949) (owner: Marostegui) [08:56:40] !log reimaging mw1345, mw1346 (API servers) to stretch [08:56:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:24] Operations, Traffic, Goal, Patch-For-Review, User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942#4189645 (Vgutierrez) >>! In T184942#4187502, @Krinkle wrote: > @Vgutierrez @ema I'm working on using the Prometheus metrics for the ResourceLoader dash... [09:04:51] (PS3) MarcoAurelio: cawiki: remove gendered namespace aliases, already on MW core [mediawiki-config] - https://gerrit.wikimedia.org/r/429989 (https://phabricator.wikimedia.org/T113616) [09:08:24] (PS1) Jcrespo: admin: Adjustments to jynus' defaults and aliases [puppet] - https://gerrit.wikimedia.org/r/431716 [09:10:23] (CR) Jcrespo: [C: 2] admin: Adjustments to jynus' defaults and aliases [puppet] - https://gerrit.wikimedia.org/r/431716 (owner: Jcrespo) [09:12:12] Operations, Traffic: Consider adding expect-CT: header to enforce certificate transparency - https://phabricator.wikimedia.org/T193521#4189695 (Vgutierrez) p:Triage>Normal [09:17:39] !log reducing replication factor on cassandra v3 (unused) keyspace for maps [09:17:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:35] PROBLEM - MegaRAID on analytics1032 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough,
WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough [09:20:22] !log forced a BBU re-learn cycle on analytics1032 [09:20:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:45] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 33821672 [09:21:32] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1121" [mediawiki-config] - https://gerrit.wikimedia.org/r/431717 [09:21:33] (PS2) Marostegui: Revert "db-eqiad.php: Depool db1121" [mediawiki-config] - https://gerrit.wikimedia.org/r/431717 [09:21:35] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 49390072 [09:21:45] RECOVERY - Postgres Replication Lag on maps2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 794840 [09:23:17] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1121" [mediawiki-config] - https://gerrit.wikimedia.org/r/431717 (owner: Marostegui) [09:23:55] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 45290216 [09:24:28] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1121" [mediawiki-config] - https://gerrit.wikimedia.org/r/431717 (owner: Marostegui) [09:24:46] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 21906160 [09:25:05] PROBLEM - Disk space on maps2004 is CRITICAL: DISK CRITICAL - free space: /srv 53039 MB (3% inode=99%) [09:25:44] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1121 after alter table (duration: 01m 00s) [09:25:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:58] (Abandoned) Jcrespo: [WIP] Move all misc db
scripts to db_maintenance module [puppet] - https://gerrit.wikimedia.org/r/295654 (owner: Jcrespo) [09:26:06] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 47955288 [09:27:15] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 122681864 [09:28:06] RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 154968 [09:29:51] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1121" [mediawiki-config] - https://gerrit.wikimedia.org/r/431717 (owner: Marostegui) [09:32:17] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [09:33:16] I am guessing that these are the last reimages --^ [09:33:40] PROBLEM - HHVM jobrunner on mw2157 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:33:58] mw1309/08 are the offenders [09:34:20] so yeah good :) [09:34:31] PROBLEM - mediawiki-installation DSH group on mw2156 is CRITICAL: Host mw2156 is not in mediawiki-installation dsh group [09:34:40] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 86122968 [09:35:40] RECOVERY - Postgres Replication Lag on maps2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 [09:36:20] PROBLEM - Nginx local proxy to apache on mw2158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:37:05] (PS1) Jcrespo: proxysql: Changes added (simplifications) to the proxysql class [puppet] - https://gerrit.wikimedia.org/r/431720 (https://phabricator.wikimedia.org/T171071) [09:37:24] (Abandoned) Jcrespo: proxysql: Changes added (simplifications) to the proxysql class [puppet] -
https://gerrit.wikimedia.org/r/404154 (https://phabricator.wikimedia.org/T171071) (owner: Jcrespo) [09:38:21] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [09:38:31] PROBLEM - HHVM jobrunner on mw2156 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:39:42] (CR) Jcrespo: [C: 2] proxysql: Changes added (simplifications) to the proxysql class [puppet] - https://gerrit.wikimedia.org/r/431720 (https://phabricator.wikimedia.org/T171071) (owner: Jcrespo) [09:40:30] yeah, silencing [09:41:10] PROBLEM - Nginx local proxy to apache on mw2157 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:41:56] (Abandoned) Jcrespo: mw-maintenance: move mariadb maintenance to tendril [puppet] - https://gerrit.wikimedia.org/r/403978 (https://phabricator.wikimedia.org/T184797) (owner: Dzahn) [09:46:50] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 29262672 [09:47:00] RECOVERY - Postgres Replication Lag on maps2004 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 181536 [09:47:51] RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 455056 [09:47:57] (CR) Ema: [C: 1] mtail: Fix varnishrls regex [puppet] - https://gerrit.wikimedia.org/r/431712 (https://phabricator.wikimedia.org/T184942) (owner: Vgutierrez) [09:48:22] (PS1) Jcrespo: mariadb: Depool db1055 [mediawiki-config] - https://gerrit.wikimedia.org/r/431722 (https://phabricator.wikimedia.org/T194118) [09:48:55] (CR) Vgutierrez: [C: 2] mtail: Fix varnishrls regex [puppet] - https://gerrit.wikimedia.org/r/431712 (https://phabricator.wikimedia.org/T184942) (owner: Vgutierrez) [09:49:02] (PS5) Vgutierrez: mtail: Fix varnishrls regex [puppet] -
https://gerrit.wikimedia.org/r/431712 (https://phabricator.wikimedia.org/T184942) [09:51:20] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 37410272 [09:51:41] (CR) Jcrespo: [C: 2] mariadb: Depool db1055 [mediawiki-config] - https://gerrit.wikimedia.org/r/431722 (https://phabricator.wikimedia.org/T194118) (owner: Jcrespo) [09:52:08] (CR) Ema: prometheus: varnish_thumbnails aggregation rule (1 comment) [puppet] - https://gerrit.wikimedia.org/r/431528 (https://phabricator.wikimedia.org/T184942) (owner: Ema) [09:52:55] (Merged) jenkins-bot: mariadb: Depool db1055 [mediawiki-config] - https://gerrit.wikimedia.org/r/431722 (https://phabricator.wikimedia.org/T194118) (owner: Jcrespo) [09:53:10] (CR) jenkins-bot: mariadb: Depool db1055 [mediawiki-config] - https://gerrit.wikimedia.org/r/431722 (https://phabricator.wikimedia.org/T194118) (owner: Jcrespo) [09:55:40] RECOVERY - Disk space on maps2004 is OK: DISK OK [09:58:38] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1055 (duration: 00m 54s) [09:58:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:30] RECOVERY - Postgres Replication Lag on maps2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 906600 [10:09:40] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [10:11:50] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [10:15:19] !log reimaging mw1310, mw1311 (job runners) to stretch [10:15:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:38] (PS1)
Jcrespo: mariadb: Move db1064 from s4 to x1 [puppet] - https://gerrit.wikimedia.org/r/431724 (https://phabricator.wikimedia.org/T194118) [10:21:24] (CR) Jcrespo: [C: 2] mariadb: Move db1064 from s4 to x1 [puppet] - https://gerrit.wikimedia.org/r/431724 (https://phabricator.wikimedia.org/T194118) (owner: Jcrespo) [10:22:53] !log stop mariadb on db1055 to clone it to db1064 [10:22:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:10] !log reimaging mw1347, mw1348 (API servers) to stretch (last two remaining API servers in eqiad) [10:29:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:57] Operations, SRE-Access-Requests: Access to Google Search Console for Go Fish Digital - https://phabricator.wikimedia.org/T192893#4189871 (Deskana) >>! In T192893#4185804, @faidon wrote: > I'm not sure if this needs my approval, but if it does, it has it, as long as: > - The console data contain PII, so a... [10:38:04] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 37769144 [10:40:48] (PS1) Jcrespo: mariadb: Really depool db2092 [mediawiki-config] - https://gerrit.wikimedia.org/r/431727 [10:46:13] (CR) Ladsgroup: BETA ONLY - WikibaseLexeme config (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/431306 (https://phabricator.wikimedia.org/T184745) (owner: Addshore) [10:46:34] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 33544504 [10:47:43] RECOVERY - Postgres Replication Lag on maps2004 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 237792 [10:47:44] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 89165568 [10:49:54] RECOVERY - Postgres Replication Lag on maps2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB
template1 (host:localhost) 249104 [10:52:04] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 17801672 [10:57:43] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 18855632 [10:58:02] (CR) Sbisson: [C: 1] "oups, my bad" [mediawiki-config] - https://gerrit.wikimedia.org/r/431609 (owner: Catrope) [10:58:46] RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 214528 [11:00:56] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 40704072 [11:01:36] PROBLEM - HHVM jobrunner on mw1310 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:02:56] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 22352296 [11:03:47] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 41370656 [11:04:16] PROBLEM - Nginx local proxy to apache on mw1311 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:04:47] RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 376 [11:04:56] RECOVERY - Postgres Replication Lag on maps2004 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 [11:04:57] RECOVERY - Postgres Replication Lag on maps2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 [11:09:07] PROBLEM - Nginx local proxy to apache on mw1310 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:10:46] PROBLEM - mediawiki-installation DSH group on mw1311 is CRITICAL: Host mw1311 is not in mediawiki-installation dsh group [11:10:55] ^ silenced [11:13:56] RECOVERY - HHVM jobrunner on mw1310 is OK: HTTP OK: HTTP/1.1 200 OK - 206
bytes in 0.002 second response time [11:14:45] (CR) Marostegui: [C: 2] "Thanks for catching thins" [mediawiki-config] - https://gerrit.wikimedia.org/r/431727 (owner: Jcrespo) [11:15:58] (Merged) jenkins-bot: mariadb: Really depool db2092 [mediawiki-config] - https://gerrit.wikimedia.org/r/431727 (owner: Jcrespo) [11:16:26] RECOVERY - Nginx local proxy to apache on mw1310 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.120 second response time [11:16:37] RECOVERY - Nginx local proxy to apache on mw1311 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.010 second response time [11:18:05] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Really depool db2092 (duration: 00m 53s) [11:18:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:19] * addshore can't access phab.wm.o..... stupid wifi... [11:18:47] RECOVERY - HHVM jobrunner on mw2157 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.085 second response time [11:18:49] https://www.irccloud.com/pastebin/mrcV05il/ [11:19:26] (CR) jenkins-bot: mariadb: Really depool db2092 [mediawiki-config] - https://gerrit.wikimedia.org/r/431727 (owner: Jcrespo) [11:20:27] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [11:20:47] RECOVERY - Nginx local proxy to apache on mw2157 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.152 second response time [11:22:27] RECOVERY - Nginx local proxy to apache on mw2158 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.165 second response time [11:23:48] RECOVERY - HHVM jobrunner on mw2156 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.074 second response time [11:24:58] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 18159728 [11:25:58] RECOVERY - Postgres
Replication Lag on maps2004 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 396568 [11:27:30] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [11:27:41] Operations, SRE-Access-Requests: Access to Google Search Console for Go Fish Digital - https://phabricator.wikimedia.org/T192893#4190022 (faidon) Thanks @Deskana :) I think that all seems sufficient and we should just go ahead with this. 2018-08-01 sounds reasonable, and we can always extend this if ther... [11:28:30] Anyone any idea why I would be getting "You don't have permission to access / on this server." for phabricator.wikimedia.org ? :/ [11:28:50] <_joe_> addshore: specific url please [11:28:58] https://phabricator.wikimedia.org/ [11:29:08] ip banned? [11:29:11] I'm thinking its something to do with the wifi I'm on, [11:29:23] ooh, is there a way to check that?
[11:29:41] <_joe_> yes I think that's the more probable cause [11:30:11] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 30717768 [11:30:11] RECOVERY - MegaRAID on analytics1032 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy [11:30:31] it seems to be hitting the wmf server afaict [11:30:39] jouncebot: next [11:30:39] In 1 hour(s) and 29 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180508T1300) [11:31:00] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 22205400 [11:32:00] (PS1) Marostegui: db-eqiad.php: Depool db1103:3314 [mediawiki-config] - https://gerrit.wikimedia.org/r/431733 (https://phabricator.wikimedia.org/T190148) [11:32:11] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 171733376 [11:32:13] addshore: can you tell me in pvt your external IP address?
[11:33:00] RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 [11:33:11] RECOVERY - Postgres Replication Lag on maps2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 [11:33:11] elukey: yes [11:33:24] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1103:3314 [mediawiki-config] - https://gerrit.wikimedia.org/r/431733 (https://phabricator.wikimedia.org/T190148) (owner: Marostegui) [11:34:39] (Merged) jenkins-bot: db-eqiad.php: Depool db1103:3314 [mediawiki-config] - https://gerrit.wikimedia.org/r/431733 (https://phabricator.wikimedia.org/T190148) (owner: Marostegui) [11:36:00] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1103:3314 for alter table (duration: 00m 59s) [11:36:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:13] !log Deploy schema change on db1103:3314 - T191519 T188299 T190148 [11:36:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:19] T191519: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519 [11:36:19] T190148: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148 [11:36:19] T188299: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299 [11:39:27] (CR) jenkins-bot: db-eqiad.php: Depool db1103:3314 [mediawiki-config] - https://gerrit.wikimedia.org/r/431733 (https://phabricator.wikimedia.org/T190148) (owner: Marostegui) [11:45:21] (PS1) Jcrespo: mariadb: Pool db1064 into x1 with low load [mediawiki-config] - https://gerrit.wikimedia.org/r/431734 (https://phabricator.wikimedia.org/T194118) [11:46:47] (PS1) Jcrespo: mariadb: Remove references to db1055, to be decom [mediawiki-config] - https://gerrit.wikimedia.org/r/431735 (https://phabricator.wikimedia.org/T194118) [11:58:02] (CR) Jcrespo: [C: 2] mariadb: Pool
db1064 into x1 with low load [mediawiki-config] - https://gerrit.wikimedia.org/r/431734 (https://phabricator.wikimedia.org/T194118) (owner: Jcrespo) [11:59:18] (Merged) jenkins-bot: mariadb: Pool db1064 into x1 with low load [mediawiki-config] - https://gerrit.wikimedia.org/r/431734 (https://phabricator.wikimedia.org/T194118) (owner: Jcrespo) [11:59:34] (PS1) Giuseppe Lavagetto: mcrouter: add support for listening on the ssl port [puppet] - https://gerrit.wikimedia.org/r/431736 (https://phabricator.wikimedia.org/T192370) [11:59:36] (CR) jenkins-bot: mariadb: Pool db1064 into x1 with low load [mediawiki-config] - https://gerrit.wikimedia.org/r/431734 (https://phabricator.wikimedia.org/T194118) (owner: Jcrespo) [11:59:38] (PS1) Giuseppe Lavagetto: profile::mediawiki::mcrouter_wancache: add ssl, proxy support [puppet] - https://gerrit.wikimedia.org/r/431737 (https://phabricator.wikimedia.org/T192370) [11:59:40] (PS1) Giuseppe Lavagetto: puppet_ecdsacert: allow IP-based SANs [puppet] - https://gerrit.wikimedia.org/r/431738 (https://phabricator.wikimedia.org/T192370) [12:00:17] PROBLEM - MegaRAID on analytics1032 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough [12:00:22] (CR) jerkins-bot: [V: -1] profile::mediawiki::mcrouter_wancache: add ssl, proxy support [puppet] - https://gerrit.wikimedia.org/r/431737 (https://phabricator.wikimedia.org/T192370) (owner: Giuseppe Lavagetto) [12:02:11] (PS3) Thiemo Kreuz (WMDE): Update Wikidata property blacklist [mediawiki-config] - https://gerrit.wikimedia.org/r/430045 (owner: Matěj Suchánek) [12:02:14] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Pool db1064 with low load (duration: 00m 59s) [12:02:17] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:30] (CR) Thiemo Kreuz (WMDE): [C: 1] "Both additions make sense and fit well with the other properties listed." [mediawiki-config] - https://gerrit.wikimedia.org/r/430045 (owner: Matěj Suchánek) [12:05:47] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 36843672 [12:06:47] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 25792272 [12:07:47] RECOVERY - Postgres Replication Lag on maps2004 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1816 [12:07:47] RECOVERY - Postgres Replication Lag on maps2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 35048 [12:10:17] RECOVERY - MegaRAID on analytics1032 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy [12:12:42] (PS2) Jcrespo: mariadb: Remove references to db1055, to be decom [mediawiki-config] - https://gerrit.wikimedia.org/r/431735 (https://phabricator.wikimedia.org/T194118) [12:12:44] (PS1) Jcrespo: mariab: Fully pool db1064 [mediawiki-config] - https://gerrit.wikimedia.org/r/431740 (https://phabricator.wikimedia.org/T194118) [12:13:33] (CR) Jcrespo: [C: 2] mariadb: Remove references to db1055, to be decom [mediawiki-config] - https://gerrit.wikimedia.org/r/431735 (https://phabricator.wikimedia.org/T194118) (owner: Jcrespo) [12:14:46] (Merged) jenkins-bot: mariadb: Remove references to db1055, to be decom [mediawiki-config] - https://gerrit.wikimedia.org/r/431735 (https://phabricator.wikimedia.org/T194118) (owner: Jcrespo) [12:16:51] !log upgrading app servers in beta to [12:16:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:16] !log upgrading app servers in beta to wikidiff 1.6.0 (T190717) [12:17:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:17:20] T190717: Update wikidiff2 library on the WMF production cluster - https://phabricator.wikimedia.org/T190717 [12:18:38] !log jynus@tin Synchronized wmf-config/db-codfw.php: Remove db1055 (duration: 00m 59s) [12:18:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:12] !log reimaging mw2159, mw2160, mw2161 (job runners) to stretch [12:19:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:49] (CR) jenkins-bot: mariadb: Remove references to db1055, to be decom [mediawiki-config] - https://gerrit.wikimedia.org/r/431735 (https://phabricator.wikimedia.org/T194118) (owner: Jcrespo) [12:20:52] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Remove db1055 (duration: 00m 59s) [12:20:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:31] (PS1) Jcrespo: mariadb: Upgrade proxysql package [software] - https://gerrit.wikimedia.org/r/431742 (https://phabricator.wikimedia.org/T175672) [12:27:27] (CR) Jcrespo: [C: 2] mariadb: Upgrade proxysql package [software] - https://gerrit.wikimedia.org/r/431742 (https://phabricator.wikimedia.org/T175672) (owner: Jcrespo) [12:28:38] Hi! Any logins of user TaxonBot (dewiki) return login {result Failed reason {You have made too many recent login attempts.
Please wait 2 days before trying again.}} [12:28:42] chasemp: [12:28:48] chasemp: ^^ [12:29:43] (PS4) Filippo Giunchedi: base: alert on edac (un)correctable errors [puppet] - https://gerrit.wikimedia.org/r/422110 (https://phabricator.wikimedia.org/T183177) [12:33:34] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 32921368 [12:35:34] RECOVERY - Postgres Replication Lag on maps2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 [12:38:01] (PS5) Filippo Giunchedi: base: alert on EDAC correctable errors [puppet] - https://gerrit.wikimedia.org/r/422110 (https://phabricator.wikimedia.org/T183177) [12:39:14] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 89698744 [12:39:34] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 124811000 [12:39:44] Operations, MediaWiki-General-or-Unknown: Investigate spike in 500s during asw-c2-eqiad replacement - https://phabricator.wikimedia.org/T156475#4190160 (jcrespo) [12:40:15] RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 252976 [12:40:48] Operations, WMDE-QWERTY-Team, wikidiff2, Patch-For-Review: Update wikidiff2 library on the WMF production cluster - https://phabricator.wikimedia.org/T190717#4190164 (MoritzMuehlenhoff) > In the mean time deployment-prep was also migrated to stretch, so as a preparatory step I'll prepare wikidiff...
[12:42:44] RECOVERY - Postgres Replication Lag on maps2004 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 16 [12:48:04] (CR) Filippo Giunchedi: [C: 2] base: alert on EDAC correctable errors [puppet] - https://gerrit.wikimedia.org/r/422110 (https://phabricator.wikimedia.org/T183177) (owner: Filippo Giunchedi) [12:48:31] (PS1) Jcrespo: dbhosts: Promote db1069 as master, remove db1055 [software] - https://gerrit.wikimedia.org/r/431747 (https://phabricator.wikimedia.org/T194118) [12:49:04] (PS2) Jcrespo: dbhosts: Remove db1055 [software] - https://gerrit.wikimedia.org/r/431747 (https://phabricator.wikimedia.org/T194118) [12:50:03] (CR) Jcrespo: [C: 2] dbhosts: Remove db1055 [software] - https://gerrit.wikimedia.org/r/431747 (https://phabricator.wikimedia.org/T194118) (owner: Jcrespo) [12:50:17] (CR) Jcrespo: [V: 2 C: 2] dbhosts: Remove db1055 [software] - https://gerrit.wikimedia.org/r/431747 (https://phabricator.wikimedia.org/T194118) (owner: Jcrespo) [12:50:32] (CR) Filippo Giunchedi: "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/431057 (https://phabricator.wikimedia.org/T192092) (owner: Dzahn) [12:54:52] (PS1) Jcrespo: mariadb: Set db1055 as spare before decommission [puppet] - https://gerrit.wikimedia.org/r/431748 (https://phabricator.wikimedia.org/T194118) [12:56:17] (CR) Jcrespo: [C: 2] mariadb: Set db1055 as spare before decommission [puppet] - https://gerrit.wikimedia.org/r/431748 (https://phabricator.wikimedia.org/T194118) (owner: Jcrespo) [13:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor My software never has bugs. It just develops random features. Rise for European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180508T1300).
[13:00:04] chiborg, stephanebisson, and Nikerabbit: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:15] I can SWAT today [13:00:18] hello [13:00:29] Hi all [13:00:35] PROBLEM - MegaRAID on analytics1032 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough [13:00:52] \o [13:00:56] purrr [13:01:12] oooh, chiborg, it is advanced search time :D [13:01:29] \o/ [13:02:10] ok everybody, if there is nothing urgent, I'll just deploy in calendar order, ok? [13:02:42] chiborg: I'll ping you in a few minutes when your patch is at mwdebug1002, so you can test it there [13:02:53] !log Manually fail disk #9 on db1073 to get it replaced [13:02:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:29] (03PS2) 10Zfilipin: Enable AdvancedSearch BetaFeature on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430388 (https://phabricator.wikimedia.org/T193182) (owner: 10Gabriel Birke) [13:04:31] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430388 (https://phabricator.wikimedia.org/T193182) (owner: 10Gabriel Birke) [13:05:43] (03Merged) 10jenkins-bot: Enable AdvancedSearch BetaFeature on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430388 (https://phabricator.wikimedia.org/T193182) (owner: 10Gabriel Birke) [13:07:24] chiborg: your patch is at mwdebug1002, please test and let me know if I can deploy; let me know if you do not know how to test there [13:08:38] (03PS2) 10Zfilipin: Enable maps i18n everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431628 (owner: 10Sbisson) [13:08:39] sorry zeljkof, 
what is the full url? [13:09:10] chiborg: instructions on how to test https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug [13:09:52] in short, install the chrome extension, enable it for mwdebug1002, go to any wikimedia site and the extension will make sure you reach mwdebug1002 [13:09:53] (03CR) 10jenkins-bot: Enable AdvancedSearch BetaFeature on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430388 (https://phabricator.wikimedia.org/T193182) (owner: 10Gabriel Birke) [13:10:29] let me know if the docs are not clear on how to do it [13:10:46] RECOVERY - mediawiki-installation DSH group on mw1311 is OK: OK [13:12:22] 10Operations, 10ops-eqiad, 10netops, 10Patch-For-Review: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#4190310 (10Marostegui) @ayounsi today we have failed over x1 master which was in row C, to a new host in row A. The x1 blocker is now gone and you should be go... [13:13:04] zeljkof Yay, it works! I've tried the english wikipedia, activated the extension as a beta feature and went to the search page. [13:13:26] chiborg: ok to deploy? [13:13:37] zeljkof yes [13:13:38] (03PS1) 10Ema: prometheus: aggregate varnish uptime resets [puppet] - 10https://gerrit.wikimedia.org/r/431749 [13:13:50] chiborg: ok, deploying [13:14:54] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:430388|Enable AdvancedSearch BetaFeature on all wikis (T193182)]] (duration: 01m 00s) [13:14:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:57] T193182: Enable AdvancedSearch as a beta feature on all wikis - https://phabricator.wikimedia.org/T193182 [13:15:29] chiborg: deployed; please disable the extension, test on any wiki and thanks for deploying with #releng! ;) [13:16:01] stephanebisson: please stand by, I'll ping you in a few minutes when the patch is at mwdebug [13:16:32] stephanebisson: there is no related phab task for 431628? 
[13:16:41] (I don't see one in commit message) [13:17:04] 10Operations, 10MediaWiki-Parser, 10MediaWiki-Platform-Team, 10Parsing-Team, and 2 others: Different production servers have different versions of tidy installed, resulting in varying output - https://phabricator.wikimedia.org/T193414#4190334 (10ssastry) >>! In T193414#4189417, @MoritzMuehlenhoff wrote: >... [13:17:08] zeljkof: This is the task. I forgot to link it: https://phabricator.wikimedia.org/T191655 [13:17:43] PROBLEM - HHVM jobrunner on mw2159 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:17:45] stephanebisson: could you please amend the commit message? [13:17:53] yep [13:18:29] (03PS3) 10Sbisson: Enable maps i18n everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431628 (https://phabricator.wikimedia.org/T191655) [13:18:36] done [13:18:47] thanks! [13:18:52] 10Operations, 10CirrusSearch, 10Discovery, 10Elasticsearch, 10Discovery-Search (Current work): Alert when elasticsearch writes are frozen for too long - https://phabricator.wikimedia.org/T193605#4173733 (10Gehel) a:03Gehel [13:19:21] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431628 (https://phabricator.wikimedia.org/T191655) (owner: 10Sbisson) [13:20:03] PROBLEM - mediawiki-installation DSH group on mw2161 is CRITICAL: Host mw2161 is not in mediawiki-installation dsh group [13:20:04] PROBLEM - Nginx local proxy to apache on mw2160 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:20:31] (03Merged) 10jenkins-bot: Enable maps i18n everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431628 (https://phabricator.wikimedia.org/T191655) (owner: 10Sbisson) [13:21:24] stephanebisson: the [13:21:43] sorry, stephanebisson: the patch is at mwdebug, let me know if it's ok to deploy [13:21:52] ok, testing now [13:22:38] zeljkof: mwdebug1001 or 1002? 
[13:22:44] godog: there are a bunch of unknowns related to memory correctable errors, perhaps due to https://gerrit.wikimedia.org/r/#/c/422110/? [13:22:57] ema: indeed, I'll take a look [13:23:00] stephanebisson: sorry, it's always 1002 :D [13:23:20] I'm strictly following instructions https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#Canary [13:23:43] PROBLEM - HHVM jobrunner on mw2161 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:24:12] zeljkof: looks good [13:24:17] zeljkof: you can deploy [13:24:18] Nikerabbit: please stand by, I'll ping you in a few minutes when your patch is ready for testing [13:24:23] stephanebisson: ok, deploying [13:25:04] zeljkof: fyi, my patch cannot fully be tested because it interacts with jobqueue – it's a request by mobrovac and we will monitor the jobs once they can switch it to new jobrunner [13:25:16] zeljkof We've tested on en, nl, es, fr, ru and bg, looks fine there. [13:25:28] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:431628|Enable maps i18n everywhere (T191655)]] (duration: 01m 00s) [13:25:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:32] T191655: Deploy maps internationalization to production - https://phabricator.wikimedia.org/T191655 [13:25:35] Nikerabbit: ok, so I can deploy without mwdebug? or should I deploy there first? [13:25:42] chiborg: great! [13:25:55] stephanebisson: deployed, please test and thanks for deploying with #releng! 
;) [13:26:03] zeljkof: without mwdebug is good [13:26:13] PROBLEM - mediawiki-installation DSH group on mw2160 is CRITICAL: Host mw2160 is not in mediawiki-installation dsh group [13:26:13] PROBLEM - Nginx local proxy to apache on mw2159 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:26:38] (03CR) 10jenkins-bot: Enable maps i18n everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431628 (https://phabricator.wikimedia.org/T191655) (owner: 10Sbisson) [13:26:45] Nikerabbit: ok, I'll ping you when it's deployed then, depends on how fast CI will be :) [13:27:24] PROBLEM - MegaRAID on db1073 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [13:27:25] ACKNOWLEDGEMENT - MegaRAID on db1073 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T194155 [13:27:27] zeljkof: thank you! [13:27:29] zeljkof: :+1: [13:27:44] 10Operations, 10MediaWiki-Parser, 10MediaWiki-Platform-Team, 10Parsing-Team, and 2 others: Different production servers have different versions of tidy installed, resulting in varying output - https://phabricator.wikimedia.org/T193414#4190380 (10MoritzMuehlenhoff) Ack, at this point only four job runners i... [13:28:04] 10Operations, 10HHVM, 10Patch-For-Review, 10User-Elukey: Upgrade mw* servers to Debian Stretch (using HHVM) - https://phabricator.wikimedia.org/T174431#4190391 (10MoritzMuehlenhoff) All API servers in eqiad are now running stretch. [13:28:06] 10Operations, 10ops-eqiad: Degraded RAID on db1073 - https://phabricator.wikimedia.org/T194155#4190393 (10ops-monitoring-bot) [13:28:21] (03PS1) 10Filippo Giunchedi: base: sum EDAC correctable errors [puppet] - 10https://gerrit.wikimedia.org/r/431750 (https://phabricator.wikimedia.org/T183177) [13:28:34] Nikerabbit: I did not notice that your patch is for an extension, I would merge it first and deploy last, since it is usually slow in CI... 
[13:28:43] anyway, it should not take long, a few minutes [13:29:17] (03CR) 10Filippo Giunchedi: [C: 032] base: sum EDAC correctable errors [puppet] - 10https://gerrit.wikimedia.org/r/431750 (https://phabricator.wikimedia.org/T183177) (owner: 10Filippo Giunchedi) [13:29:23] 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4190417 (10Marostegui) db2092 is now a temporary multi-instance sanitarium host in codfw, replicating the same sections as db11... [13:29:44] PROBLEM - HHVM jobrunner on mw2160 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:31:45] PROBLEM - mediawiki-installation DSH group on mw2159 is CRITICAL: Host mw2159 is not in mediawiki-installation dsh group [13:31:54] PROBLEM - Nginx local proxy to apache on mw2161 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:32:55] ^ silencing [13:33:40] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1073 - https://phabricator.wikimedia.org/T194155#4190471 (10Marostegui) p:05Triage>03High a:03Cmjohnson @Cmjohnson this host has 2 disks with smart alert. I have manually failed disk #9, let's change that one first, let it rebuild and then we can man... [13:34:34] RECOVERY - mediawiki-installation DSH group on mw2156 is OK: OK [13:38:36] Nikerabbit: merged, deploying... [13:41:02] PROBLEM - Memory correctable errors -EDAC- on wtp2013 is CRITICAL: 840 ge 3 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=wtp2013&var-datasource=codfw%2520prometheus%252Fops [13:41:03] this is always so exciting! 
[13:41:11] PROBLEM - Memory correctable errors -EDAC- on db1053 is CRITICAL: 8 ge 3 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=db1053&var-datasource=eqiad%2520prometheus%252Fops [13:41:12] PROBLEM - Memory correctable errors -EDAC- on scb1002 is CRITICAL: 32 ge 3 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=scb1002&var-datasource=eqiad%2520prometheus%252Fops [13:41:20] sigh, sorry about the spam [13:41:21] (CR) Alexandros Kosiaris: [C: -1] "Some minor comments inline, rest LGTM" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/430881 (https://phabricator.wikimedia.org/T191299) (owner: Volans) [13:41:22] PROBLEM - Memory correctable errors -EDAC- on rdb2002 is CRITICAL: 315 ge 3 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=rdb2002&var-datasource=codfw%2520prometheus%252Fops [13:41:31] PROBLEM - Memory correctable errors -EDAC- on elastic1029 is CRITICAL: 5 ge 3 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=elastic1029&var-datasource=eqiad%2520prometheus%252Fops [13:42:22] !log zfilipin@tin Synchronized php-1.32.0-wmf.2/extensions/Translate: SWAT: [[gerrit:431744|Refactor TranslationUpdateJob to use only primitive types for parameters (T192111)]] (duration: 01m 11s) [13:42:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:27] T192111: Make TranslationsUpdateJob JSON-serializable - https://phabricator.wikimedia.org/T192111 [13:42:33] though it is actually the case that the are corractable errors, according to the kernel anyway [13:42:58] Nikerabbit: deployed, please test and thanks for deploying with #releng! ;) [13:43:41] !log EU SWAT finished [13:43:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:56] godog: you had me worried for an instant :) [13:45:44] gehel: sudden magnetic storm! [13:45:49] zeljkof: yep, thanks! 
[13:46:01] godog: sounds like an interesting attack vector :) [13:46:50] woah nice, working EDAC errors in icinga :) [13:48:32] heheh getting there [13:48:49] it'll spam some more as alerts are added [13:51:01] PROBLEM - Memory correctable errors -EDAC- on wtp2020 is CRITICAL: 47 ge 3 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=wtp2020&var-datasource=codfw%2520prometheus%252Fops [13:56:02] (03PS5) 10Ema: numa_networking: move setting to tlsproxy::instance [puppet] - 10https://gerrit.wikimedia.org/r/430902 (https://phabricator.wikimedia.org/T193865) [13:57:18] (03CR) 10Muehlenhoff: debmonitor: add server side puppettization (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/430881 (https://phabricator.wikimedia.org/T191299) (owner: 10Volans) [14:02:05] PROBLEM - Memory correctable errors -EDAC- on cp1068 is CRITICAL: 5 ge 3 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=cp1068&var-datasource=eqiad%2520prometheus%252Fops [14:07:12] (03PS1) 10Gehel: elasticsearch: alert when cirrus writes are frozen for too long [puppet] - 10https://gerrit.wikimedia.org/r/431754 (https://phabricator.wikimedia.org/T193605) [14:07:46] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: alert when cirrus writes are frozen for too long [puppet] - 10https://gerrit.wikimedia.org/r/431754 (https://phabricator.wikimedia.org/T193605) (owner: 10Gehel) [14:09:29] 10Operations: Something is wrong with installer root disk stuff - https://phabricator.wikimedia.org/T149845#4190580 (10MoritzMuehlenhoff) 05Open>03Resolved a:03MoritzMuehlenhoff >>! In T149845#4183217, @RobH wrote: > This seems fixed by adding the rootdelay for jessie and older, and stretch has it go away.... 
[14:10:03] PROBLEM - Memory correctable errors -EDAC- on thumbor1004 is CRITICAL: 533 ge 3 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor1004&var-datasource=eqiad%2520prometheus%252Fops [14:10:12] PROBLEM - Memory correctable errors -EDAC- on mw2213 is CRITICAL: 439 ge 3 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=mw2213&var-datasource=codfw%2520prometheus%252Fops [14:10:42] (03PS2) 10Gehel: elasticsearch: alert when cirrus writes are frozen for too long [puppet] - 10https://gerrit.wikimedia.org/r/431754 (https://phabricator.wikimedia.org/T193605) [14:11:15] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: alert when cirrus writes are frozen for too long [puppet] - 10https://gerrit.wikimedia.org/r/431754 (https://phabricator.wikimedia.org/T193605) (owner: 10Gehel) [14:11:21] (03PS6) 10Ema: numa_networking: move setting to tlsproxy::instance [puppet] - 10https://gerrit.wikimedia.org/r/430902 (https://phabricator.wikimedia.org/T193865) [14:11:50] 10Operations, 10User-fgiunchedi: rack/setup/install ms-be104[0-3].eqiad.wmnet - https://phabricator.wikimedia.org/T190081#4190607 (10fgiunchedi) 05Open>03Resolved Rebalance has completed, resolving [14:12:41] T194160 [14:12:41] T194160: Unlock the login of bot user TaxonBot@TaxonBot to dewiki - https://phabricator.wikimedia.org/T194160 [14:12:42] (03PS3) 10Gehel: elasticsearch: alert when cirrus writes are frozen for too long [puppet] - 10https://gerrit.wikimedia.org/r/431754 (https://phabricator.wikimedia.org/T193605) [14:14:47] 10Operations, 10monitoring, 10Graphite, 10User-fgiunchedi: Programmatic generation of grafana dashboards - https://phabricator.wikimedia.org/T171482#4190625 (10fgiunchedi) [14:14:49] 10Operations, 10monitoring, 10User-fgiunchedi: save grafana dashboards in revision control / puppet - https://phabricator.wikimedia.org/T133392#4190627 (10fgiunchedi) [14:15:50] !log mw2215,mw2222,mw2223 - reinstalling with stretch [14:15:54] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:56] (03CR) 10Filippo Giunchedi: [C: 031] prometheus: aggregate varnish uptime resets [puppet] - 10https://gerrit.wikimedia.org/r/431749 (owner: 10Ema) [14:16:22] PROBLEM - Memory correctable errors -EDAC- on db1051 is CRITICAL: 109 ge 3 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=db1051&var-datasource=eqiad%2520prometheus%252Fops [14:19:04] !log ppchelko@tin Started restart [changeprop/deploy@7e86531]: Restart changeprop to try forcing it rebalancing topics [14:19:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:49] (03PS2) 10Ema: prometheus: aggregate varnish uptime resets [puppet] - 10https://gerrit.wikimedia.org/r/431749 [14:20:06] (03CR) 10Ema: [C: 032] prometheus: aggregate varnish uptime resets [puppet] - 10https://gerrit.wikimedia.org/r/431749 (owner: 10Ema) [14:23:09] (03PS2) 10Ottomata: Stop main-eqiad -> main-codfw MirrorMaker during Kafka main upgrade [puppet] - 10https://gerrit.wikimedia.org/r/431588 (https://phabricator.wikimedia.org/T167039) [14:23:14] (03PS1) 10Dzahn: admins: add Shannon Bailey to ldap_only_admins [puppet] - 10https://gerrit.wikimedia.org/r/431756 (https://phabricator.wikimedia.org/T194091) [14:23:33] (03CR) 10jerkins-bot: [V: 04-1] admins: add Shannon Bailey to ldap_only_admins [puppet] - 10https://gerrit.wikimedia.org/r/431756 (https://phabricator.wikimedia.org/T194091) (owner: 10Dzahn) [14:23:51] (03PS2) 10Dzahn: admins: add Shannon Bailey to ldap_only_admins [puppet] - 10https://gerrit.wikimedia.org/r/431756 (https://phabricator.wikimedia.org/T194091) [14:24:39] (03CR) 10EBernhardson: elasticsearch: alert when cirrus writes are frozen for too long (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/431754 (https://phabricator.wikimedia.org/T193605) (owner: 10Gehel) [14:26:08] (03CR) 10Gehel: elasticsearch: alert when cirrus writes are frozen for too long (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/431754 (https://phabricator.wikimedia.org/T193605) (owner: 10Gehel) [14:26:53] PROBLEM - Memory correctable errors -EDAC- on kafka1023 is CRITICAL: 13 ge 3 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=kafka1023&var-datasource=eqiad%2520prometheus%252Fops [14:34:18] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler02/11160/kafka2001.codfw.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/431588 (https://phabricator.wikimedia.org/T167039) (owner: 10Ottomata) [14:35:47] (03Abandoned) 10Dzahn: nutcracker: puppetize missing /var/run/nutcracker dir [puppet] - 10https://gerrit.wikimedia.org/r/431057 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [14:37:12] RECOVERY - MegaRAID on analytics1032 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy [14:37:40] !log LDAP: added 'sbailey' to group 'wmf' (T194091) [14:37:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:45] T194091: Add sbailey to wmf and other ldap groups - https://phabricator.wikimedia.org/T194091 [14:38:32] (03PS1) 10Pmiazga: Remove unused PopupsAnonsExperimentalGroupSize config variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431759 (https://phabricator.wikimedia.org/T173952) [14:41:04] (03PS2) 10Pmiazga: Remove unused PopupsAnonsExperimentalGroupSize config variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431759 (https://phabricator.wikimedia.org/T173952) [14:43:35] (03CR) 10Dzahn: [C: 032] "done in LDAP, this to reflect the status quo" [puppet] - 10https://gerrit.wikimedia.org/r/431756 (https://phabricator.wikimedia.org/T194091) (owner: 10Dzahn) [14:48:27] !log disabling pybal on lvs2004 - T193677 [14:48:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:31] T193677: Interface errors on asw-d-codfw:xe-2/0/47 - https://phabricator.wikimedia.org/T193677 [14:51:01] (03CR) 10Alexandros Kosiaris: [C: 04-1] "A few 
comments here and there, but I 've finally reviewed all of it. Nice work!" (038 comments) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394620 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [14:52:29] (03CR) 10Alexandros Kosiaris: [C: 031] Add basic test coverage [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394621 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [14:53:47] (03CR) 10Alexandros Kosiaris: [C: 031] Add CLI script to be installed in the target hosts [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394990 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [14:53:52] !log re-enable pybal on lvs2004 - T193677 [14:53:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:57] T193677: Interface errors on asw-d-codfw:xe-2/0/47 - https://phabricator.wikimedia.org/T193677 [14:55:30] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2067 - https://phabricator.wikimedia.org/T194103#4190864 (10Papaul) a:05Papaul>03Marostegui @Marostegui Disk replacement complete [14:56:25] 10Operations, 10ops-codfw: Degraded RAID on wasat - https://phabricator.wikimedia.org/T193394#4190867 (10Papaul) Dear Papaul Tshibamba, We are contacting you in regards to your case ID# 5329190939. Please be aware that a functional equivalent part (656108-001) (SPS-DRV HD 1TB 6G SATA 7.2K 2.5 MDL SC) has... [14:56:53] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2067 - https://phabricator.wikimedia.org/T194103#4190868 (10Marostegui) Thanks! ``` physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, Rebuilding) ``` [14:57:35] (03CR) 10Alexandros Kosiaris: [C: 031] Add server side validation of client certificates [software/debmonitor] - 10https://gerrit.wikimedia.org/r/428302 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [15:02:24] PROBLEM - High CPU load on API appserver on mw2215 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[15:02:55] PROBLEM - HHVM rendering on mw2215 is CRITICAL: connect to address 10.192.0.40 and port 80: Connection refused [15:02:55] PROBLEM - Nginx local proxy to apache on mw2222 is CRITICAL: connect to address 10.192.0.47 and port 443: Connection refused [15:02:55] PROBLEM - Nginx local proxy to apache on mw2223 is CRITICAL: connect to address 10.192.0.48 and port 443: Connection refused [15:02:55] PROBLEM - nutcracker port on mw2215 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:02:55] PROBLEM - Check whether ferm is active by checking the default input chain on mw2222 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:02:55] PROBLEM - Check whether ferm is active by checking the default input chain on mw2223 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:02:56] 10Operations, 10ops-codfw, 10DC-Ops: rigel.frack.codfw.wmnet (fundraising codfw bastion) will not boot after a power cycle - https://phabricator.wikimedia.org/T193891#4190893 (10Papaul) @jgree as requested, the server is back up again [15:04:24] PROBLEM - nutcracker process on mw2215 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:04:24] PROBLEM - DPKG on mw2222 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:04:24] PROBLEM - DPKG on mw2223 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:05:54] PROBLEM - puppet last run on mw2215 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:05:54] PROBLEM - configured eth on mw2222 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:05:54] PROBLEM - configured eth on mw2223 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[15:06:27] !log beginnng Kafka upgrade of main-codfw: T167039 [15:06:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:31] T167039: Upgrade Kafka on main cluster with security features - https://phabricator.wikimedia.org/T167039 [15:07:08] (CR) Ottomata: [C: 2] Stop main-eqiad -> main-codfw MirrorMaker during Kafka main upgrade [puppet] - https://gerrit.wikimedia.org/r/431588 (https://phabricator.wikimedia.org/T167039) (owner: Ottomata) [15:07:21] (PS3) Ottomata: Stop main-eqiad -> main-codfw MirrorMaker during Kafka main upgrade [puppet] - https://gerrit.wikimedia.org/r/431588 (https://phabricator.wikimedia.org/T167039) [15:07:22] (CR) Ottomata: [V: 2 C: 2] Stop main-eqiad -> main-codfw MirrorMaker during Kafka main upgrade [puppet] - https://gerrit.wikimedia.org/r/431588 (https://phabricator.wikimedia.org/T167039) (owner: Ottomata) [15:07:24] PROBLEM - Apache HTTP on mw2215 is CRITICAL: connect to address 10.192.0.40 and port 80: Connection refused [15:07:25] PROBLEM - Disk space on mw2222 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:07:25] PROBLEM - Disk space on mw2223 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:07:25] PROBLEM - dhclient process on mw2222 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:07:25] PROBLEM - dhclient process on mw2223 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[15:07:44] PROBLEM - MegaRAID on analytics1032 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough [15:08:14] RECOVERY - HHVM jobrunner on mw2161 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.075 second response time [15:08:34] RECOVERY - Apache HTTP on mw2215 is OK: HTTP OK: HTTP/1.1 200 OK - 10975 bytes in 0.073 second response time [15:08:54] RECOVERY - HHVM jobrunner on mw2159 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.085 second response time [15:08:54] RECOVERY - HHVM jobrunner on mw2160 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.074 second response time [15:08:55] PROBLEM - mediawiki-installation DSH group on mw2222 is CRITICAL: Host mw2222 is not in mediawiki-installation dsh group [15:08:55] PROBLEM - mediawiki-installation DSH group on mw2223 is CRITICAL: Host mw2223 is not in mediawiki-installation dsh group [15:08:55] PROBLEM - Check size of conntrack table on mw2215 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:08:55] PROBLEM - MD RAID on mw2215 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:08:56] PROBLEM - HHVM processes on mw2223 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:08:56] PROBLEM - HHVM processes on mw2222 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[15:09:23] (03PS3) 10Rduran: [WIP] Use Cumin to implement the comunication for the transfer [puppet] - 10https://gerrit.wikimedia.org/r/430868 [15:09:25] RECOVERY - Nginx local proxy to apache on mw2161 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.165 second response time [15:09:36] stopping mm instance in codfw [15:09:41] !log stopping pybal on lvs2001 - T193677 [15:09:45] PROBLEM - High CPU load on API appserver on mw2222 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:09:45] PROBLEM - High CPU load on API appserver on mw2223 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:09:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:45] T193677: Interface errors on asw-d-codfw:xe-2/0/47 - https://phabricator.wikimedia.org/T193677 [15:10:05] RECOVERY - Nginx local proxy to apache on mw2159 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.175 second response time [15:10:34] PROBLEM - Check systemd state on mw2215 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:10:34] PROBLEM - nutcracker port on mw2222 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:10:34] PROBLEM - nutcracker port on mw2223 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:10:58] ok mm stopped in codfw [15:11:03] cool [15:11:10] stopping puppet etc. [15:12:01] beginning package upgrade rolling restarts... [15:12:05] PROBLEM - Check the NTP synchronisation status of timesyncd on mw2215 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:12:05] PROBLEM - nutcracker process on mw2222 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:12:05] PROBLEM - nutcracker process on mw2223 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[15:13:26] PROBLEM - Nginx local proxy to apache on mw2215 is CRITICAL: connect to address 10.192.0.40 and port 443: Connection refused [15:13:27] PROBLEM - Check whether ferm is active by checking the default input chain on mw2215 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:13:37] I 'll schedule downtime for mw2215, mw2222 mw2223 [15:13:47] no reason to have them pollute the channel [15:13:50] k [15:14:37] PROBLEM - DPKG on mw2215 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:14:47] PROBLEM - HHVM rendering on mw2223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:14:49] akosiaris: sorry, i got it [15:15:15] 2001 upgraded, moving on [15:15:18] the last ones worked without this [15:16:14] mutante: this ones probably took a bit longer. It's a race condition. Without setting the hiera flag profile::base::notifications_enabled to 0 it's expected to happen every now and then [15:16:18] these* [15:16:29] anyway, I 've downtimed them in icinga [15:16:59] thank you, ok @ hiera [15:17:54] 2002 upgraded, moving on [15:17:56] (03CR) 10Vgutierrez: "IMHO this could benefit from exposing clustershell file copy features in cumin - http://clustershell.readthedocs.io/en/latest/tools/clush." [puppet] - 10https://gerrit.wikimedia.org/r/430868 (owner: 10Rduran) [15:19:27] (03CR) 10Alexandros Kosiaris: [C: 031] Add login and LDAP support [software/debmonitor] - 10https://gerrit.wikimedia.org/r/425417 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [15:20:37] Pchelolo: o/ - can you check the changeprop codfw consumers? [15:20:38] https://grafana.wikimedia.org/dashboard/db/kafka?refresh=5m&panelId=46&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus%2Fops&var-kafka_cluster=main-codfw&var-cluster=eventbus&var-kafka_broker=All [15:20:40] ok 2003 upgraed, package upgrades complete [15:21:23] timing wise they seem ok, the one going down it is due to mm right? 
[15:21:50] (03CR) 10Imarlier: "> Confirmed that the following all respond the same way from" [puppet] - 10https://gerrit.wikimedia.org/r/431659 (https://phabricator.wikimedia.org/T159354) (owner: 10Imarlier) [15:21:53] elukey: yeah that makes sense i think [15:22:03] moving on to restart 2, to set broker protocol versin [15:22:16] (03PS2) 10Ottomata: Kafka main-codfw patch 1: inter_broker_protocol_version: 1.1.0 [puppet] - 10https://gerrit.wikimedia.org/r/430449 (https://phabricator.wikimedia.org/T167039) [15:22:20] ack [15:22:34] (03CR) 10Ottomata: [V: 032 C: 032] Kafka main-codfw patch 1: inter_broker_protocol_version: 1.1.0 [puppet] - 10https://gerrit.wikimedia.org/r/430449 (https://phabricator.wikimedia.org/T167039) (owner: 10Ottomata) [15:22:35] there was an increase in throughput but it was before you started, so all good [15:22:46] elukey: all seems fine [15:23:15] super thanks for checking [15:23:16] elukey: that's when I restarted CP for that bug when it got stuck on the transclusion topic [15:23:29] so not related [15:23:30] ack! 
[15:23:43] (03PS1) 10BBlack: Block some networks [puppet] - 10https://gerrit.wikimedia.org/r/431769 (https://phabricator.wikimedia.org/T193762) [15:24:59] bouncing 2001 [15:26:36] !log (un)load edac kernel modules on thumbor1004 to test resetting counters - T183177 [15:26:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:40] T183177: memory errors not showing in icinga - https://phabricator.wikimedia.org/T183177 [15:27:44] RECOVERY - Memory correctable errors -EDAC- on thumbor1004 is OK: (C)3 ge (W)1 ge 0 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor1004&var-datasource=eqiad%2520prometheus%252Fops [15:28:07] expected ^ [15:28:12] bouncing 2002 [15:29:07] RECOVERY - Nginx local proxy to apache on mw2160 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.158 second response time [15:29:11] (03CR) 10Jcrespo: "> IMHO this could benefit from exposing clustershell file copy" [puppet] - 10https://gerrit.wikimedia.org/r/430868 (owner: 10Rduran) [15:30:37] bouncing 2003 [15:30:53] !log starting pybal on lvs2001 - T193677 [15:30:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:57] T193677: Interface errors on asw-d-codfw:xe-2/0/47 - https://phabricator.wikimedia.org/T193677 [15:31:53] 10Operations, 10ops-codfw, 10DC-Ops: rigel.frack.codfw.wmnet (fundraising codfw bastion) will not boot after a power cycle - https://phabricator.wikimedia.org/T193891#4190980 (10Papaul) Rigel was set by default to boot first from NIC so every time the server reboots, it stuck and the error bellow so I chan... 
[15:32:14] (03PS3) 10Ottomata: Kafka main-codfw patch 2 [puppet] - 10https://gerrit.wikimedia.org/r/430450 (https://phabricator.wikimedia.org/T167039) [15:32:18] (03CR) 10Dzahn: [C: 031] "i can deploy this, a +1 from traffic never hurts though" [puppet] - 10https://gerrit.wikimedia.org/r/431659 (https://phabricator.wikimedia.org/T159354) (owner: 10Imarlier) [15:32:23] restart 2 finished [15:32:30] now time to upgrade client api versions :) [15:33:06] (03PS4) 10Ottomata: Kafka main-codfw patch 3 - remove api.version [puppet] - 10https://gerrit.wikimedia.org/r/430640 (https://phabricator.wikimedia.org/T167039) [15:33:24] ok ottomata shoot me when it's time for me to deploy consumers [15:33:30] (03CR) 10BBlack: [C: 031] performance.wikimedia.org: serve from webperfX001 [puppet] - 10https://gerrit.wikimedia.org/r/431659 (https://phabricator.wikimedia.org/T159354) (owner: 10Imarlier) [15:33:38] Pchelolo: these can be done at any time, so [15:33:38] hm [15:33:41] (03CR) 10BBlack: [C: 032] Block some networks [puppet] - 10https://gerrit.wikimedia.org/r/431769 (https://phabricator.wikimedia.org/T193762) (owner: 10BBlack) [15:33:41] let's not do that simultaniously with you making your part [15:33:44] eyah [15:33:44] ok [15:33:50] i'll do all mine first one by one and make sure it sok [15:33:54] then we'll do cp [15:34:01] (03CR) 10Ottomata: [C: 032] Kafka main-codfw patch 3 - remove api.version [puppet] - 10https://gerrit.wikimedia.org/r/430640 (https://phabricator.wikimedia.org/T167039) (owner: 10Ottomata) [15:34:31] +1 [15:35:15] (03PS5) 10Ottomata: Kafka main-codfw - remove api.version [puppet] - 10https://gerrit.wikimedia.org/r/430640 (https://phabricator.wikimedia.org/T167039) [15:35:24] (03CR) 10Ottomata: [V: 032 C: 032] Kafka main-codfw - remove api.version [puppet] - 10https://gerrit.wikimedia.org/r/430640 (https://phabricator.wikimedia.org/T167039) (owner: 10Ottomata) [15:39:41] eventbus service restarted without api.version [15:39:49] ok [15:41:04] ok ottomata I 
will start deploying CP [15:41:22] ottomata: is it the good time to do that? [15:41:27] ok, i'm about to do statsv varnishkafka instances, but i think you can go ahead with cp [15:42:20] seeing stuff like Broker version identifed as 1.0.0 in eb logs so das good [15:43:04] ok, we're still in a meeting so I am a bit distracted so I'll wait for you [15:43:45] (03PS4) 10Ottomata: Kafka main-codfw patch 2 [puppet] - 10https://gerrit.wikimedia.org/r/430450 (https://phabricator.wikimedia.org/T167039) [15:43:56] 10Operations, 10ops-codfw, 10DC-Ops: rigel.frack.codfw.wmnet (fundraising codfw bastion) will not boot after a power cycle - https://phabricator.wikimedia.org/T193891#4191028 (10RobH) Please note my past comment regarding allocation of a spare was discussed in irc between myself and @Jgreen rigel's ilom is... [15:45:02] ottomata: qq - why the interbroker version is set in the common hiera conifg? [15:45:05] *config [15:45:26] PROBLEM - Host lvs2004 is DOWN: PING CRITICAL - Packet loss = 100% [15:45:56] XioNoX: ^^ [15:46:00] !log demon@tin Pruned MediaWiki: 1.32.0-wmf.1 [keeping static files] (duration: 01m 47s) [15:46:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:11] vgutierrez: I think he already moved back traffic to 2001 [15:46:12] 10Operations, 10ops-codfw, 10DC-Ops: rigel.frack.codfw.wmnet (fundraising codfw bastion) will not boot after a power cycle - https://phabricator.wikimedia.org/T193891#4191030 (10RobH) p:05Unbreak!>03Normal Lowering to normal, as the server is known bad (ilom malfunction) but out of warranty. There is a... [15:46:21] elukey: hang on will answer [15:46:25] probably still saw errors, tryingt something else on the interface [15:46:25] problem with statsv vks [15:46:30] yep [15:46:53] somehow even though puppet should nto have changed the api version, since statsv produces to main-eqiad [15:46:55] it did... 
[15:46:59] looking [15:47:01] it changed it from 0.9.0.1 to 0.9 [15:47:02] which is weird [15:47:16] but the main interface shouldn't alert, only ens1f1 [15:47:23] papaul: ^ [15:47:27] 10Operations, 10ops-codfw: rdb2002 correctable memory errors - https://phabricator.wikimedia.org/T194171#4191033 (10fgiunchedi) [15:47:56] RECOVERY - Host lvs2004 is UP: PING WARNING - Packet loss = 58%, RTA = 36.22 ms [15:48:23] (03PS1) 10Ottomata: Force statsv varnishkafka api.version to 0.9.0.1 [puppet] - 10https://gerrit.wikimedia.org/r/431773 (https://phabricator.wikimedia.org/T167039) [15:48:59] (03PS2) 10Ottomata: Force statsv varnishkafka api.version to 0.9.0.1 [puppet] - 10https://gerrit.wikimedia.org/r/431773 (https://phabricator.wikimedia.org/T167039) [15:48:59] (03CR) 10jerkins-bot: [V: 04-1] Force statsv varnishkafka api.version to 0.9.0.1 [puppet] - 10https://gerrit.wikimedia.org/r/431773 (https://phabricator.wikimedia.org/T167039) (owner: 10Ottomata) [15:49:36] (03CR) 10Ottomata: [C: 032] Force statsv varnishkafka api.version to 0.9.0.1 [puppet] - 10https://gerrit.wikimedia.org/r/431773 (https://phabricator.wikimedia.org/T167039) (owner: 10Ottomata) [15:50:01] weird indeed [15:50:13] (03PS4) 10Dzahn: performance.wikimedia.org: serve from webperfX001 [puppet] - 10https://gerrit.wikimedia.org/r/431659 (https://phabricator.wikimedia.org/T159354) (owner: 10Imarlier) [15:50:22] no time to investigate why, gotta change it back [15:50:27] i think statsv vks are failing connecting [15:50:50] probably gonna have a blip in statsv stuff (if messages are produced from codfw vk instances?) ping marlier [15:50:50] (03CR) 10Dzahn: [C: 032] performance.wikimedia.org: serve from webperfX001 [puppet] - 10https://gerrit.wikimedia.org/r/431659 (https://phabricator.wikimedia.org/T159354) (owner: 10Imarlier) [15:51:13] ottomata: ack, thanks [15:51:30] mutante: i just merged your patch [15:52:03] ottomata: maybe 0.9 in hiera vs '0.9' in the ? 
block [15:52:17] ottomata: alright, i was sitting at the yes/no prompt ;) [15:52:33] OH, maybe it made it a decimal value you mean? yeah it probably did [15:52:34] doh [15:53:24] !log switching performance.wikimedia.org from graphite to webperf backends - running puppet on cache::misc servers (T158837) [15:53:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:29] T158837: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837 [15:53:32] 10Operations, 10MediaWiki-Parser, 10MediaWiki-Platform-Team, 10Parsing-Team, and 2 others: Different production servers have different versions of tidy installed, resulting in varying output - https://phabricator.wikimedia.org/T193414#4191075 (10Legoktm) >>! In T193414#4190334, @ssastry wrote: > If all ser... [15:53:44] elukey: re broker version [15:53:49] i'm overriding it in the site specific hiera [15:54:11] and when setting to new value (the one we want to keep), i remove it from site specific override [15:54:14] and the common one sticks [15:54:15] (03CR) 10Ema: [C: 04-1] numa_networking: move setting to tlsproxy::instance (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/430902 (https://phabricator.wikimedia.org/T193865) (owner: 10Ema) [15:54:23] ottomata: ack thanks [15:54:31] was just triple checking everything :) [15:54:37] 10Operations, 10DC-Ops, 10Traffic, 10monitoring, and 2 others: memory errors not showing in icinga - https://phabricator.wikimedia.org/T183177#4191082 (10fgiunchedi) The correctable errors check has been deployed and it is yielding some results already. Myself and @herron took at the list of hosts and ther... [15:58:19] ok ottomata finally out of the meeting, are you done with your consumers? [15:58:25] yes! 
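The suspicion raised above (an unquoted 0.9 in hiera being read as a decimal while the template compares against the string '0.9') can be sketched as follows. This is a minimal illustration only: `resolve_plain_scalar` is a hypothetical helper that roughly mimics how an unquoted YAML scalar gets typed, not the actual Hiera or Puppet code, and the chat never confirms that this was the root cause.

```python
def resolve_plain_scalar(text):
    """Hypothetical helper: type an unquoted scalar as int, then float, else string.

    A simplification of YAML plain-scalar resolution, for illustration only.
    """
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            continue
    return text


unquoted_short = resolve_plain_scalar("0.9")      # becomes a float
unquoted_full = resolve_plain_scalar("0.9.0.1")   # not a number, stays a string
quoted_short = "0.9"                              # a quoted value stays a string

# A float never equals a string, so a check written against '0.9'
# would not match the unquoted value.
print(type(unquoted_short).__name__, unquoted_short == quoted_short)
print(type(unquoted_full).__name__, unquoted_full == "0.9.0.1")
```

Under that assumption, quoting the value in hiera would keep it a string on both sides of the comparison.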
[15:58:26] just finished [15:58:32] statsv is fine now [15:58:35] statsv vk [15:58:40] it was all producing to eqiad anyway [15:58:46] shouldn't have even bothered with it today :/ [15:58:56] so yes, Pchelolo please proceed with cp/jq [15:59:42] ok, cool. going with job queue first, it's not doing anything in codfw [15:59:45] k [16:00:05] godog, moritzm, and _joe_: That opportune time is upon us again. Time for a Puppet SWAT(Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180508T1600). [16:00:05] No GERRIT patches in the queue for this window AFAICS. [16:01:00] !log ppchelko@tin Started deploy [cpjobqueue/deploy@58935d5]: Allow protocol version negotiation. Codfw only. T167039 [16:01:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:04] T167039: Upgrade Kafka on main cluster with security features - https://phabricator.wikimedia.org/T167039 [16:01:42] !log ppchelko@tin Finished deploy [cpjobqueue/deploy@58935d5]: Allow protocol version negotiation. Codfw only. T167039 (duration: 00m 42s) [16:01:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:16] jobqueue done, give me a minute to look at the logs [16:02:21] (03PS1) 10Ema: prometheus: fix aggregate varnish uptime resets expression [puppet] - 10https://gerrit.wikimedia.org/r/431777 [16:02:52] 10Operations, 10ops-codfw: mw2213 correctable memory errors - https://phabricator.wikimedia.org/T194172#4191094 (10fgiunchedi) [16:03:07] (03CR) 10Filippo Giunchedi: [C: 031] prometheus: fix aggregate varnish uptime resets expression [puppet] - 10https://gerrit.wikimedia.org/r/431777 (owner: 10Ema) [16:03:30] !log ppchelko@tin Started deploy [changeprop/deploy@e468d8e]: Allow protocol version negotiation. Codfw only. 
T167039 [16:03:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:40] (03CR) 10Ema: [C: 032] prometheus: fix aggregate varnish uptime resets expression [puppet] - 10https://gerrit.wikimedia.org/r/431777 (owner: 10Ema) [16:03:47] looks solid, proceeding with change-prop [16:03:51] gr8 [16:04:16] 10Operations, 10Traffic, 10netops: cr1-eqsin 4 onboard interfaces down - https://phabricator.wikimedia.org/T193897#4191107 (10ayounsi) 05Open>03Resolved [16:04:32] !log ppchelko@tin Finished deploy [changeprop/deploy@e468d8e]: Allow protocol version negotiation. Codfw only. T167039 (duration: 01m 03s) [16:04:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:55] heya -- there's a new version of scap to be deployed, I wanted to do that today but don't clash with the kafka upgrade, how much time do you think is left for the upgrade? [16:05:11] 20mins more or less [16:05:20] 10Operations, 10DBA, 10Traffic: Framework to transfer files over the LAN - https://phabricator.wikimedia.org/T156462#4191109 (10jcrespo) [16:05:26] if all goes well and we don't need any rollbacks knock on wood [16:05:27] so far so good [16:05:40] kk, thanks ottomata elukey ! [16:05:45] thcipriani: ^ [16:05:58] i'd reserve another hour godog if that's ok with you. i have another rolling restart of the cluster to do (which only takes a min, but would be nice to wathc it a while) [16:06:00] :) [16:06:09] ottomata: yup, no problem [16:06:30] ottomata: elukey ok, both JQ and CP are doooone, and it looks good - events are being consumed, no issues in logs [16:07:00] greaaat [16:07:00] 10Operations, 10DBA, 10Traffic: Framework to transfer files over the LAN - https://phabricator.wikimedia.org/T156462#4191122 (10jcrespo) @Vgutierrez suggested using https://github.com/vstakhov/hpenc , which I don't think is a bad idea at all- it would just change some of the executions of openssl and netcat... 
[16:07:03] yeah all looks good here too [16:07:06] \o/ [16:07:11] didn't spot anything weird [16:07:17] \o/ [16:08:10] RECOVERY - Memory correctable errors -EDAC- on db1051 is OK: (C)3 ge (W)1 ge 0 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=db1051&var-datasource=eqiad%2520prometheus%252Fops [16:08:17] ok [16:08:25] proceeding to log message format version step, restart 3. [16:08:31] ack [16:08:48] (03CR) 10Ottomata: [C: 032] Kafka main-codfw patch 2 [puppet] - 10https://gerrit.wikimedia.org/r/430450 (https://phabricator.wikimedia.org/T167039) (owner: 10Ottomata) [16:08:50] (03PS5) 10Ottomata: Kafka main-codfw patch 2 [puppet] - 10https://gerrit.wikimedia.org/r/430450 (https://phabricator.wikimedia.org/T167039) [16:08:52] (03CR) 10Ottomata: [V: 032 C: 032] Kafka main-codfw patch 2 [puppet] - 10https://gerrit.wikimedia.org/r/430450 (https://phabricator.wikimedia.org/T167039) (owner: 10Ottomata) [16:09:41] !log failing traffic over lvs2004 - T193677 [16:09:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:45] T193677: Interface errors on asw-d-codfw:xe-2/0/47 - https://phabricator.wikimedia.org/T193677 [16:10:34] bouncing 2001 [16:10:37] !log dzahn@neodymium conftool action : set/pooled=yes; selector: name=mw2215.codfw.wmnet [16:10:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:56] 10Operations, 10ops-codfw: wtp2013 memory correctable errors - https://phabricator.wikimedia.org/T194174#4191135 (10fgiunchedi) [16:14:06] !log dzahn@neodymium conftool action : set/pooled=yes; selector: name=mw2222.codfw.wmnet [16:14:08] bouncing 2002 [16:14:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:22] hm before i do [16:14:24] cp is cp ok [16:14:25] ? [16:14:39] Member 495153-1a91f0c4-35d4-45ca-8162-247bcfac0088 in group change-prop-on_transclusion_update has failed, removing it from the group etc. 
[16:14:43] in kafka server logs [16:14:47] could be normal operation, not sure [16:15:00] it is probably from my leader rebalance [16:15:13] !log dzahn@neodymium conftool action : set/pooled=yes; selector: name=mw2223.codfw.wmnet [16:15:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:19] Pchelolo: ^^^ [16:15:30] ottomata: silence from me means it's fine [16:15:33] haha ok [16:15:36] ok bouncing 2002 [16:15:54] burrow is not screaming too [16:16:49] 10Operations, 10ops-codfw: mw2213 correctable memory errors - https://phabricator.wikimedia.org/T194172#4191150 (10RobH) Unfortunately, this system is out of warranty as of 2018-01-16. In looking at the service event log, it appears this server has had problems for awhile: ``` /admin1-> racadm getsel Record... [16:18:30] RECOVERY - Memory correctable errors -EDAC- on elastic1029 is OK: (C)3 ge (W)1 ge 0 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=elastic1029&var-datasource=eqiad%2520prometheus%252Fops [16:19:06] !log force (split) compaction of wikipedia_T_mobile__ng_lead.data, restbase1016 - T192689 [16:19:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:10] T192689: Unchecked storage growth - https://phabricator.wikimedia.org/T192689 [16:19:19] bounced 2003 [16:19:22] allrighty.... [16:19:35] done done done?? [16:19:36] akosiaris: Pchelolo elukey codfw upgraded. looking good so far! [16:19:40] woooooww [16:19:42] nice! [16:19:43] mm is down in codfw [16:19:49] outstanding ottomata, great work! 
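The order of the codfw upgrade as it appears in the log (package upgrades, then the inter-broker protocol version, then the log message format version, each applied by bouncing 2001, 2002 and 2003 one at a time) can be summarised in a small sketch. The broker names assume "2001" to "2003" mean kafka2001 to kafka2003, and `rolling_plan` is a hypothetical helper; it only prints the sequence and restarts nothing. The client api.version removal happened between the second and third phase and is left out because it is not a broker restart.

```python
# Assumed host names; the chat only says "bouncing 2001/2002/2003".
BROKERS = ["kafka2001", "kafka2002", "kafka2003"]

# Phases in the order they were carried out in the log.
PHASES = [
    "upgrade packages",
    "set inter_broker_protocol_version to 1.1.0",
    "set log.message.format.version",
]


def rolling_plan(brokers, phases):
    """Yield (phase, broker) pairs: finish a phase on every broker before the next phase."""
    for phase in phases:
        for broker in brokers:
            yield phase, broker


plan = list(rolling_plan(BROKERS, PHASES))
for phase, broker in plan:
    print(f"{phase}: bounce {broker}")
```

Keeping each phase cluster-wide before starting the next matches the "restart 2" and "restart 3" wording used in the channel.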
[16:19:50] and will be til after we upgrade eqiad tomorrow [16:20:00] RECOVERY - MegaRAID on db1073 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy [16:20:02] (03PS1) 10Imarlier: performance website: allow traffic [puppet] - 10https://gerrit.wikimedia.org/r/431779 (https://phabricator.wikimedia.org/T159354) [16:20:30] (03CR) 10jerkins-bot: [V: 04-1] performance website: allow traffic [puppet] - 10https://gerrit.wikimedia.org/r/431779 (https://phabricator.wikimedia.org/T159354) (owner: 10Imarlier) [16:21:05] elukey: FYI, i think you know this, but auto.leader.rebalacne is now enabled [16:21:09] that's what I call a well planned operation :) [16:21:10] thank you gentlemen. [16:21:16] it works wayyy better in these later versions [16:21:24] yep yep I saw it [16:21:26] so you no longer need to do that step after rebooting brokers [16:21:27] :) [16:21:27] like we have in jumbo [16:21:30] yuppers [16:21:30] great [16:21:39] :-) [16:21:53] same bat channel same bat time tomorrow ? [16:22:06] 1h earlier [16:22:07] !log cleared low count edac counters on hosts mw2205 dbstore1002 db1051 elastic1029 T183177 [16:22:07] 14 utc [16:22:08] for eqiad [16:22:10] cool [16:22:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:11] T183177: memory errors not showing in icinga - https://phabricator.wikimedia.org/T183177 [16:22:11] (03PS1) 10Dzahn: disable icinga notifications on mw22* hosts [puppet] - 10https://gerrit.wikimedia.org/r/431780 [16:22:17] ya Pchelolo 14utc still ok tomorrow for you? [16:22:32] (03CR) 10jerkins-bot: [V: 04-1] disable icinga notifications on mw22* hosts [puppet] - 10https://gerrit.wikimedia.org/r/431780 (owner: 10Dzahn) [16:22:34] (03PS2) 10Imarlier: performance website: allow traffic [puppet] - 10https://gerrit.wikimedia.org/r/431779 (https://phabricator.wikimedia.org/T159354) [16:22:49] ottomata: ye, no problem. 
Just 1 more 6 am wake-up [16:22:59] (03PS2) 10Dzahn: disable icinga notifications on mw22* hosts [puppet] - 10https://gerrit.wikimedia.org/r/431780 [16:23:47] ouch [16:24:20] Pchelolo: we're just taking advantage of you while you have a little bit of jet lag left [16:24:22] i hope [16:24:36] (thank you :) [16:24:37] ) [16:25:05] haha that's exactly why I agree doing that, didn't even need an alarm today [16:25:37] (03CR) 10Dzahn: [C: 032] disable icinga notifications on mw22* hosts [puppet] - 10https://gerrit.wikimedia.org/r/431780 (owner: 10Dzahn) [16:26:41] 10Operations, 10ops-codfw: wtp2020 correctable memory errors - https://phabricator.wikimedia.org/T194176#4191187 (10fgiunchedi) [16:27:56] !log mw2251,mw2252,mw2201 - reinstall with stretch [16:27:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:17] (03PS3) 10Imarlier: performance website: allow traffic [puppet] - 10https://gerrit.wikimedia.org/r/431779 (https://phabricator.wikimedia.org/T159354) [16:31:04] (03PS1) 10Alexandros Kosiaris: WIP: Provision RSA keys for ganeti root auth [puppet] - 10https://gerrit.wikimedia.org/r/431782 [16:36:07] (03CR) 10Imarlier: "bblack and dzahn -- I haven't played with our firewall config things at all, so let me know if I'm missing something." 
[puppet] - 10https://gerrit.wikimedia.org/r/431779 (https://phabricator.wikimedia.org/T159354) (owner: 10Imarlier) [16:36:13] 10Operations, 10ops-codfw: Degraded RAID on wasat - https://phabricator.wikimedia.org/T193394#4191265 (10Papaul) a:05Papaul>03RobH @RobH disk replacement complete [16:36:53] !log mwmaint1001 - reinstalling one more time after proxysql issues are resolved, PXE booting (T192092) [16:36:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:57] T192092: setup replacement for terbium (maintenance_server) on stretch - https://phabricator.wikimedia.org/T192092 [16:37:00] PROBLEM - Host mwmaint1001 is DOWN: PING CRITICAL - Packet loss = 100% [16:37:05] to confirm, I can go ahead with scap upgrade elukey ottomata|lunch ? [16:37:30] RECOVERY - Host mwmaint1001 is UP: PING WARNING - Packet loss = 93%, RTA = 0.20 ms [16:37:42] godog: yeah everything seems fine [16:37:54] ack, thanks! cc thcipriani [16:38:11] I'm around to test [16:39:12] jouncebot: next [16:39:12] In 0 hour(s) and 20 minute(s): Services – Graphoid / Parsoid / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180508T1700) [16:39:23] (03PS1) 10Dzahn: Revert "add mwmaint1001 to scap hosts" [puppet] - 10https://gerrit.wikimedia.org/r/431785 [16:40:05] (03CR) 10Dzahn: [C: 032] "remove from scap hosts during reinstall to avoid warnings for deployers during scap run" [puppet] - 10https://gerrit.wikimedia.org/r/431785 (owner: 10Dzahn) [16:40:28] !log re-pooling lvs2001 - T193677 [16:40:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:32] T193677: Interface errors on asw-d-codfw:xe-2/0/47 - https://phabricator.wikimedia.org/T193677 [16:40:39] (03PS2) 10Dzahn: Revert "add mwmaint1001 to scap hosts" [puppet] - 10https://gerrit.wikimedia.org/r/431785 [16:40:39] (03PS2) 10Filippo Giunchedi: Scap: bump version to 3.8.1-1 [puppet] - 10https://gerrit.wikimedia.org/r/430820 
(https://phabricator.wikimedia.org/T127762) (owner: 10Thcipriani) [16:41:10] 10Operations, 10ops-codfw, 10Traffic, 10netops: Interface errors on asw-d-codfw:xe-2/0/47 - https://phabricator.wikimedia.org/T193677#4191293 (10ayounsi) 05Open>03Resolved No more errors. [16:41:22] (03CR) 10Filippo Giunchedi: [C: 032] Scap: bump version to 3.8.1-1 [puppet] - 10https://gerrit.wikimedia.org/r/430820 (https://phabricator.wikimedia.org/T127762) (owner: 10Thcipriani) [16:41:27] (03PS3) 10Filippo Giunchedi: Scap: bump version to 3.8.1-1 [puppet] - 10https://gerrit.wikimedia.org/r/430820 (https://phabricator.wikimedia.org/T127762) (owner: 10Thcipriani) [16:41:44] mutante: gah, rebase clashes :( [16:42:02] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] Scap: bump version to 3.8.1-1 [puppet] - 10https://gerrit.wikimedia.org/r/430820 (https://phabricator.wikimedia.org/T127762) (owner: 10Thcipriani) [16:42:28] godog: :/ just trying to prevent that scap'pers get warnings while i reinstall [16:42:52] 10Operations, 10Performance-Team, 10monitoring, 10Patch-For-Review: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837#4191298 (10Imarlier) [16:42:59] mutante: indeed, sounds like a good idea [16:43:23] thcipriani: upgraded on tin [16:43:35] !log upload scap 3.8.1-1 - T127762 [16:43:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:37] T127762: Update Debian Package for Scap3 - https://phabricator.wikimedia.org/T127762 [16:43:39] godog: cool, testing [16:43:44] (03PS1) 10Andrew Bogott: Horizon: added some keystone policy checks [puppet] - 10https://gerrit.wikimedia.org/r/431787 [16:45:18] !log thcipriani@tin Synchronized README: Testing Scap 3.8.1-1 (duration: 01m 02s) [16:45:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:39] (03PS3) 10Andrew Bogott: Horizon: add a few config settings for the upcoming wikimediamemberdashboard [puppet] - 
10https://gerrit.wikimedia.org/r/431658 [16:45:41] (03PS2) 10Andrew Bogott: Horizon: added some keystone policy checks [puppet] - 10https://gerrit.wikimedia.org/r/431787 [16:47:09] godog: > Executing check 'Check endpoints for mwdebug1001.eqiad.wmnet' so new checks are running! sync looks like it went fine, thank you for the update! [16:47:31] (03PS4) 10Dzahn: performance website: allow traffic [puppet] - 10https://gerrit.wikimedia.org/r/431779 (https://phabricator.wikimedia.org/T159354) (owner: 10Imarlier) [16:47:32] thcipriani: np! will be rolling out fully at the next puppet run [16:48:09] 10Operations, 10ops-codfw, 10DC-Ops: rigel.frack.codfw.wmnet (fundraising codfw bastion) will not boot after a power cycle - https://phabricator.wikimedia.org/T193891#4191314 (10Jgreen) 05Open>03Resolved Papaul and I spent some more time on this, and found that "BIOS Serial Console" was set to auto, not... [16:48:22] godog: awesome, most of the changes were changes that affect deployment-tin apart from changes to git-lfs-backed repos for scap3 so that deploy tested a good chunk of stuff. [16:48:55] (03CR) 10Andrew Bogott: [C: 032] Horizon: add a few config settings for the upcoming wikimediamemberdashboard [puppet] - 10https://gerrit.wikimedia.org/r/431658 (owner: 10Andrew Bogott) [16:49:12] (03CR) 10Dzahn: [C: 032] performance website: allow traffic [puppet] - 10https://gerrit.wikimedia.org/r/431779 (https://phabricator.wikimedia.org/T159354) (owner: 10Imarlier) [16:49:29] (03PS5) 10Dzahn: performance website: allow traffic [puppet] - 10https://gerrit.wikimedia.org/r/431779 (https://phabricator.wikimedia.org/T159354) (owner: 10Imarlier) [16:50:04] (03CR) 10Muehlenhoff: [C: 031] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/431779 (https://phabricator.wikimedia.org/T159354) (owner: 10Imarlier) [16:50:12] thcipriani: excellent! [16:54:37] PROBLEM - puppet last run on labtestweb2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 57 seconds ago with 1 failures. 
Failed resources (up to 3 shown): Package[scap] [16:55:38] uh oh, I'll take a look [16:57:05] should be recovering, puppet agent ran fine [16:59:35] RECOVERY - puppet last run on labtestweb2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:00:04] cscott, arlolra, subbu, halfak, and Amir1: #bothumor My software never has bugs. It just develops random features. Rise for Services – Graphoid / Parsoid / Citoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180508T1700). [17:00:28] !log Clear botpassword throttle for [[User:TaxonBot]] (T194160) [17:00:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:32] T194160: Unlock the login of bot user TaxonBot@TaxonBot to dewiki - https://phabricator.wikimedia.org/T194160 [17:01:46] (03PS1) 10Imarlier: performance website: remove from graphite hosts [puppet] - 10https://gerrit.wikimedia.org/r/431792 (https://phabricator.wikimedia.org/T159354) [17:04:41] 10Operations, 10ops-eqiad, 10Discovery, 10Discovery-Search, and 4 others: rack/setup/install wdqs10[09|10].eqiad.wmnet - https://phabricator.wikimedia.org/T194184#4191367 (10RobH) p:05Triage>03Normal [17:07:25] 10Operations, 10ops-eqiad, 10Discovery, 10Discovery-Search, and 4 others: rack/setup/install wdqs10[09|10].eqiad.wmnet - https://phabricator.wikimedia.org/T194184#4191395 (10RobH) [17:11:24] PROBLEM - Check systemd state on kafka2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:12:04] PROBLEM - Check systemd state on kafka2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:12:05] PROBLEM - Check systemd state on kafka2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[17:14:23] checking --^ [17:14:37] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team: rack/setup/install cloudelastic100[1-4].eqiad.wmnet systems - https://phabricator.wikimedia.org/T194186#4191413 (10RobH) p:05Triage>03Normal [17:16:32] (03PS3) 10Andrew Bogott: Horizon: added some keystone policy checks [puppet] - 10https://gerrit.wikimedia.org/r/431787 [17:16:36] those are mirror maker instances [17:17:12] a systemctl reset-failed is enough, will wait for ottomata [17:18:11] ah [17:18:17] makes sense, puppet removed the mm instance systemd units, ya? [17:18:20] need reset-failed elukey? [17:19:37] yeah exactly [17:20:38] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team: rack/setup/install cloudelastic100[1-4].eqiad.wmnet systems - https://phabricator.wikimedia.org/T194186#4191439 (10RobH) @bd808 or @chasemp: Before @Cmjohnson racks these, I'd like to confirm the networking requirements. These have 10Gbit net... [17:20:44] RECOVERY - mediawiki-installation DSH group on mw2161 is OK: OK [17:25:44] RECOVERY - Check systemd state on kafka2002 is OK: OK - running: The system is fully operational [17:25:44] RECOVERY - Check systemd state on kafka2001 is OK: OK - running: The system is fully operational [17:26:15] RECOVERY - Check systemd state on kafka2003 is OK: OK - running: The system is fully operational [17:26:35] 10Operations, 10DBA, 10Traffic: Framework to transfer files over the LAN - https://phabricator.wikimedia.org/T156462#4191453 (10jcrespo) The recommended cipher, which is an easier change, is chacha20 or, alternatively, AES-GCM rather than the randomly selected one on the commit. 
[17:27:05] RECOVERY - mediawiki-installation DSH group on mw2160 is OK: OK [17:27:37] ACKNOWLEDGEMENT - HP RAID on db2067 is CRITICAL: CRITICAL: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:11, 1I:1:12 - Failed: 1I:1:10 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T194187 [17:28:02] 10Operations, 10ops-codfw: Degraded RAID on db2067 - https://phabricator.wikimedia.org/T194187#4191458 (10ops-monitoring-bot) [17:29:30] (03CR) 10Chad: [V: 032 C: 032] Update non-core plugins to their respective stable-2.14 tips [software/gerrit/gerrit] (wmf/stable-2.14) - 10https://gerrit.wikimedia.org/r/431675 (owner: 10Chad) [17:29:44] (03PS2) 10Framawiki: Create the 'eventcoordinator' user group on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430418 (https://phabricator.wikimedia.org/T193075) [17:32:11] RECOVERY - mediawiki-installation DSH group on mw2159 is OK: OK [17:37:34] (03PS1) 10Ottomata: Stop main-codfw -> main-eqiad MirrorMaker during Kafka main upgrade [puppet] - 10https://gerrit.wikimedia.org/r/431799 (https://phabricator.wikimedia.org/T167039) [17:37:36] (03PS1) 10Ottomata: Kafka main-eqiad inter_broker_protocol_version: 1.1.0 [puppet] - 10https://gerrit.wikimedia.org/r/431800 (https://phabricator.wikimedia.org/T167039) [17:37:38] (03PS1) 10Ottomata: Kafka main-eqiad - remove api.version [puppet] - 10https://gerrit.wikimedia.org/r/431801 (https://phabricator.wikimedia.org/T167039) [17:37:40] (03PS1) 10Ottomata: Kafka main-eqiad - log.message.format.version [puppet] - 10https://gerrit.wikimedia.org/r/431802 (https://phabricator.wikimedia.org/T167039) [18:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180508T1800) [18:03:29] (03PS1) 10Dzahn: Revert "disable icinga notifications on mw22* hosts" [puppet] - 10https://gerrit.wikimedia.org/r/431807 [18:03:36] (03PS1) 10Dzahn: Revert 
"Revert "add mwmaint1001 to scap hosts"" [puppet] - 10https://gerrit.wikimedia.org/r/431808 [18:03:47] !log andrew@tin Started deploy [horizon/deploy@9245ca9]: rolling out member dashboard [18:03:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:35] !log dzahn@neodymium conftool action : set/pooled=yes; selector: name=mw2251.codfw.wmnet [18:06:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:05] !log andrew@tin Finished deploy [horizon/deploy@9245ca9]: rolling out member dashboard (duration: 03m 18s) [18:07:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:24] !log dzahn@neodymium conftool action : set/pooled=yes; selector: name=mw2252.codfw.wmnet [18:08:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:36] (03Abandoned) 10Krinkle: mtail: Update a /w/load.php test case from a current varnishncsa sample [puppet] - 10https://gerrit.wikimedia.org/r/431608 (https://phabricator.wikimedia.org/T184942) (owner: 10Krinkle) [18:15:08] (03PS1) 10Dzahn: mw-maintenance/wikidata: set $ensure for rebuildTermSqlIndex.log [puppet] - 10https://gerrit.wikimedia.org/r/431810 (https://phabricator.wikimedia.org/T192092) [18:15:36] (03CR) 10jerkins-bot: [V: 04-1] mw-maintenance/wikidata: set $ensure for rebuildTermSqlIndex.log [puppet] - 10https://gerrit.wikimedia.org/r/431810 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [18:16:27] (03PS2) 10Dzahn: mw-maintenance/wikidata: set $ensure for rebuildTermSqlIndex.log [puppet] - 10https://gerrit.wikimedia.org/r/431810 (https://phabricator.wikimedia.org/T192092) [18:17:04] (03CR) 10Dzahn: [C: 032] mw-maintenance/wikidata: set $ensure for rebuildTermSqlIndex.log [puppet] - 10https://gerrit.wikimedia.org/r/431810 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [18:18:45] !log Branching MediaWiki master to wmf/1.32.0-wmf.3 refs T191049 [18:18:49] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:49] T191049: 1.32.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T191049 [18:19:02] (03PS4) 10Andrew Bogott: Horizon: added some keystone policy checks [puppet] - 10https://gerrit.wikimedia.org/r/431787 (https://phabricator.wikimedia.org/T194191) [18:22:39] !log mwmaint1001 - rebooting [18:22:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:50] 10Operations, 10ops-codfw: Degraded RAID on db2067 - https://phabricator.wikimedia.org/T194187#4191616 (10Marostegui) [18:23:55] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2067 - https://phabricator.wikimedia.org/T194103#4191618 (10Marostegui) [18:24:21] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2067 - https://phabricator.wikimedia.org/T194103#4188843 (10Marostegui) The disk has failed to rebuild, can we try another one?: ``` physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, Failed) ``` Thanks! [18:27:19] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1073 - https://phabricator.wikimedia.org/T194155#4191631 (10Marostegui) 05Open>03Resolved Disk #9 finished rebuilding: ``` root@db1073:~# megacli -PDRbld -ShowProg -PhysDrv [32:9] -aALL Device(Encl-32 Slot-9) is not in rebuild process Exit Code: 0x0... 
[18:29:32] 10Operations, 10Discovery, 10Maps, 10Maps-Sprint, and 2 others: Decide on Cache-Control headers for map tiles - https://phabricator.wikimedia.org/T186732#4191642 (10jmatazzoni) [18:30:47] say hi if you're using terbium to run manual maintenance commands [18:31:03] i'll want to move stuff to mwmaint1001 instead and test [18:31:38] also let's see if we can maybe puppetize it if there are regular but manual commands left [18:32:02] (03PS1) 10Ottomata: Remove version requirements for kafkacat and librdkafka from kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/431815 [18:32:14] mutante:for the 2 I know about, I have told the users to puppetize them [18:32:30] but they may need some logs to control task status [18:33:07] jynus: aha, thank you [18:33:17] i should write a list mail before i switch the host over [18:33:25] you can see them referring to local logs [18:33:26] but first mentioning here [18:33:27] on puppet [18:33:50] it literally just reinstalled and the puppet class is fixed [18:33:55] (03PS2) 10Ottomata: Remove version requirements for kafkacat and librdkafka from kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/431815 (https://phabricator.wikimedia.org/T182163) [18:33:58] works without errors on stretch now [18:34:14] ok jynus, will check, *nod* [18:34:29] (03CR) 10Ottomata: [C: 032] Remove version requirements for kafkacat and librdkafka from kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/431815 (https://phabricator.wikimedia.org/T182163) (owner: 10Ottomata) [18:35:20] jynus: and thanks for the merges re: sqlproxy etc, it's all green on mwmaint1001 now :) [18:35:49] for the nutcracker part i just had to make sure it gets rebooted once [18:36:45] (03PS2) 10Dzahn: Revert "Revert "add mwmaint1001 to scap hosts"" [puppet] - 10https://gerrit.wikimedia.org/r/431808 [18:41:54] (03CR) 10Dzahn: [C: 032] Revert "Revert "add mwmaint1001 to scap hosts"" [puppet] - 10https://gerrit.wikimedia.org/r/431808 (owner: 10Dzahn) 
[18:43:30] mutante: see things such as https://gerrit.wikimedia.org/r/#/c/427202/5/modules/mediawiki/manifests/maintenance/wikidata.pp [18:44:15] while fully idempotent, it checks /var/log/wikidata/* logs to check progress [18:45:14] migrating that host is not trivial anyway, you may want to coordinate a lot [18:45:24] joe had lots of issues last time [18:46:54] *nod*, i see [18:47:01] i just fixed something else related to that /var/log/wikidata dir [18:47:13] which was an issue on the inactive host where it's not running [18:47:31] indeed i will want to coordinate with hoo on the wikidata part [18:48:10] actually it is Amir3-2 who deployed most of those [18:48:43] with my help/bugging him to puppetize them [18:50:48] 10Operations, 10Performance-Team, 10Patch-For-Review: Move coal from graphite#001 nodes to webperf#001 - https://phabricator.wikimedia.org/T159354#4191743 (10Imarlier) [18:50:52] oh, good to know, thanks [18:51:32] (03PS1) 10Chad: 2.14.8-22-g07c8aa9910 [software/gerrit] (stable-2.14) - 10https://gerrit.wikimedia.org/r/431818 [18:52:32] !log Manually fail disk #7 on db1073 to get it replaced [18:52:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:58] (03CR) 10Ottomata: [C: 032] icinga-downtime - fail if given FQDN [puppet] - 10https://gerrit.wikimedia.org/r/430079 (owner: 10Ottomata) [18:55:00] (03PS4) 10Ottomata: icinga-downtime - fail if given FQDN [puppet] - 10https://gerrit.wikimedia.org/r/430079 [18:55:02] (03CR) 10Ottomata: [V: 032 C: 032] icinga-downtime - fail if given FQDN [puppet] - 10https://gerrit.wikimedia.org/r/430079 (owner: 10Ottomata) [18:55:16] (03CR) 10Paladox: [C: 031] 2.14.8-22-g07c8aa9910 [software/gerrit] (stable-2.14) - 10https://gerrit.wikimedia.org/r/431818 (owner: 10Chad) [18:55:32] !log mw2202, mw2203, mw2204 - reinstall with stretch [18:55:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:46] PROBLEM - MegaRAID on db1073 is CRITICAL: CRITICAL: 1 
failed LD(s) (Degraded) [18:58:47] ACKNOWLEDGEMENT - MegaRAID on db1073 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T194197 [18:58:53] 10Operations, 10ops-eqiad: Degraded RAID on db1073 - https://phabricator.wikimedia.org/T194197#4191790 (10ops-monitoring-bot) [19:00:04] twentyafterfour: That opportune time is upon us again. Time for a MediaWiki train deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180508T1900). [19:00:39] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1073 - https://phabricator.wikimedia.org/T194197#4191793 (10Marostegui) p:05Triage>03Normal This disk was manually failed to get it replaced and clear the SMART alert. It has already been swapped by Chris, and it is rebuilding: ``` root@db1073:~# meg... [19:08:06] RECOVERY - Device not healthy -SMART- on db1073 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1073&var-datasource=eqiad%2520prometheus%252Fops [19:11:48] !log updated mediawiki changelog https://www.mediawiki.org/wiki/MediaWiki_1.32/wmf.3/Changelog refs T191049 [19:11:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:53] T191049: 1.32.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T191049 [19:13:36] (03PS1) 1020after4: testwikis wikis to 1.32.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431821 [19:13:38] (03CR) 1020after4: [C: 032] testwikis wikis to 1.32.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431821 (owner: 1020after4) [19:14:25] !log testwikis to 1.32.0-wmf.3 - https://gerrit.wikimedia.org/r/#/c/431821/ refs T191049 [19:14:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:02] (03Merged) 10jenkins-bot: testwikis wikis to 1.32.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431821 (owner: 1020after4) [19:15:27] !log 
twentyafterfour@tin Started scap: testwikis wikis to 1.32.0-wmf.3 [19:15:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:06] (03CR) 10jenkins-bot: testwikis wikis to 1.32.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431821 (owner: 1020after4) [19:30:49] (03PS1) 10Bstorm: WIP: wiki replicas - prepare for refactored actor storage [puppet] - 10https://gerrit.wikimedia.org/r/431823 (https://phabricator.wikimedia.org/T188299) [19:37:26] (03PS1) 10Bstorm: wiki replicas: remove the SQL reference file for indexes since it is obsolete [puppet] - 10https://gerrit.wikimedia.org/r/431825 [19:50:13] 10Operations, 10MediaWiki-Parser, 10MediaWiki-Platform-Team, 10Parsing-Team, and 2 others: Different production servers have different versions of tidy installed, resulting in varying output - https://phabricator.wikimedia.org/T193414#4191968 (10hashar) T191771 is MediaWiki parser tests failing under CI wh... [19:54:19] (03PS1) 10Herron: logstash: add tcp tls input for syslogs [puppet] - 10https://gerrit.wikimedia.org/r/431830 (https://phabricator.wikimedia.org/T193766) [19:54:46] (03CR) 10jerkins-bot: [V: 04-1] logstash: add tcp tls input for syslogs [puppet] - 10https://gerrit.wikimedia.org/r/431830 (https://phabricator.wikimedia.org/T193766) (owner: 10Herron) [19:54:55] PROBLEM - Hadoop NodeManager on analytics1036 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [19:55:29] ! 
[19:55:30] ha [19:55:31] sorry [19:55:33] my downtime expired [19:55:45] PROBLEM - Hadoop NodeManager on analytics1034 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [19:55:50] fixing [19:55:54] (03PS2) 10Herron: logstash: add tcp tls input for syslogs [puppet] - 10https://gerrit.wikimedia.org/r/431830 (https://phabricator.wikimedia.org/T193766) [19:59:32] RECOVERY - MegaRAID on db1073 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy [20:00:45] (03CR) 10Herron: [C: 04-2] "need to test if the existing filters applied to type syslog will behave the same with tcp input" [puppet] - 10https://gerrit.wikimedia.org/r/431830 (https://phabricator.wikimedia.org/T193766) (owner: 10Herron) [20:01:48] !log mw2205,mw2206,mw2207 - reinstalling with stretch - mw2202 - wmf-auto-reimage failed: Timeout of 60 minutes reached waiting for reboot [20:01:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:31] !log dzahn@neodymium conftool action : set/pooled=yes; selector: name=mw2201.codfw.wmnet [20:02:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:16] (03PS1) 10Herron: ELK: change elasticsearch index prefix to logstash-syslog for syslog type [puppet] - 10https://gerrit.wikimedia.org/r/431860 (https://phabricator.wikimedia.org/T193766) [20:17:43] !log milimetric@tin Started deploy [analytics/refinery@2a4633c]: Deploying renamed geowiki jobs as geoeditors [20:17:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:48] (03PS5) 10Andrew Bogott: Horizon: added some keystone policy checks [puppet] - 10https://gerrit.wikimedia.org/r/431787 (https://phabricator.wikimedia.org/T194191) [20:24:51] !log milimetric@tin Finished deploy [analytics/refinery@2a4633c]: Deploying renamed geowiki jobs as geoeditors (duration: 07m 07s) [20:24:52] PROBLEM - etcd request latencies on neon is CRITICAL: 
instance=10.64.0.40:6443 operation=compareAndSwap https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:24:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:15] (03PS3) 10Urbanecm: Create the 'eventcoordinator' user group on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430418 (https://phabricator.wikimedia.org/T193075) (owner: 10Framawiki) [20:25:31] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=PATCH https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:26:33] (03CR) 10Andrew Bogott: [C: 032] Horizon: added some keystone policy checks [puppet] - 10https://gerrit.wikimedia.org/r/431787 (https://phabricator.wikimedia.org/T194191) (owner: 10Andrew Bogott) [20:29:07] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:33:57] RECOVERY - MegaRAID on analytics1032 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy [20:38:27] RECOVERY - Hadoop NodeManager on analytics1036 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [20:38:48] RECOVERY - Hadoop NodeManager on analytics1034 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [20:39:17] RECOVERY - etcd request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:53:13] (03PS1) 10MaxSem: Deploy CongressLookup to betalabs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431989 [20:55:31] mutante: mw2204.codfw.wmnet returned [255]: Permission denied [20:55:47] same for mw2203 [20:56:15] MaxSem: woah can we slow down a bit? [20:56:54] twentyafterfour: being reinstalled [20:57:42] mutante: ^ did they not get depooled? [20:58:11] argg. 
that is a failure of the reinstall script that normally does that automatically [20:58:40] :( [20:58:59] !log dzahn@neodymium conftool action : set/pooled=inactive; selector: name=mw2203.codfw.wmnet [20:59:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:09] !log dzahn@neodymium conftool action : set/pooled=inactive; selector: name=mw2204.codfw.wmnet [20:59:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:55] twentyafterfour: ^ depooled, normally happens automatically, sorry [21:04:21] PROBLEM - MegaRAID on analytics1032 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough [21:07:50] (03CR) 10Anomie: WIP: wiki replicas - prepare for refactored actor storage (0317 comments) [puppet] - 10https://gerrit.wikimedia.org/r/431823 (https://phabricator.wikimedia.org/T188299) (owner: 10Bstorm) [21:09:49] !log twentyafterfour@tin Finished scap: testwikis wikis to 1.32.0-wmf.3 (duration: 114m 21s) [21:09:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:21] legoktm: ? [21:10:39] re: CongressLookup. 
discussing in -dev [21:12:45] (03PS1) 10Ayounsi: Revert "Smokeping, remove Rigel" [puppet] - 10https://gerrit.wikimedia.org/r/431991 [21:12:52] (03PS2) 10Ayounsi: Revert "Smokeping, remove Rigel" [puppet] - 10https://gerrit.wikimedia.org/r/431991 [21:13:20] legoktm: the window is in several hours, don't worry:) [21:14:59] (03CR) 10Ayounsi: [C: 032] Revert "Smokeping, remove Rigel" [puppet] - 10https://gerrit.wikimedia.org/r/431991 (owner: 10Ayounsi) [21:29:28] (03PS1) 10Smalyshev: Add string and external-id types to indexing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431994 (https://phabricator.wikimedia.org/T163642) [21:30:44] (03CR) 10jerkins-bot: [V: 04-1] Add string and external-id types to indexing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431994 (https://phabricator.wikimedia.org/T163642) (owner: 10Smalyshev) [21:38:51] 10Operations, 10Cloud-Services, 10netops: Allocate public v4 IPs for Neutron setup in eqiad - https://phabricator.wikimedia.org/T193496#4192431 (10ayounsi) @chasemp Can you provide an ETA for returning the /25? 
[21:39:10] (03PS2) 10Smalyshev: Add string and external-id types to Wikibase indexing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431994 (https://phabricator.wikimedia.org/T163642) [21:40:19] (03CR) 10jerkins-bot: [V: 04-1] Add string and external-id types to Wikibase indexing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431994 (https://phabricator.wikimedia.org/T163642) (owner: 10Smalyshev) [21:41:13] (03CR) 10Krinkle: [C: 031] performance website: remove from graphite hosts [puppet] - 10https://gerrit.wikimedia.org/r/431792 (https://phabricator.wikimedia.org/T159354) (owner: 10Imarlier) [21:41:37] (03PS1) 1020after4: group0 wikis to 1.32.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431998 [21:41:39] (03CR) 1020after4: [C: 032] group0 wikis to 1.32.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431998 (owner: 1020after4) [21:42:00] (03PS2) 10Dzahn: performance website: remove from graphite hosts [puppet] - 10https://gerrit.wikimedia.org/r/431792 (https://phabricator.wikimedia.org/T159354) (owner: 10Imarlier) [21:42:49] marlier: did you want to also manually delete the site from graphite servers etc? [21:42:54] (03Merged) 10jenkins-bot: group0 wikis to 1.32.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431998 (owner: 1020after4) [21:43:03] (03CR) 10Dzahn: [C: 032] performance website: remove from graphite hosts [puppet] - 10https://gerrit.wikimedia.org/r/431792 (https://phabricator.wikimedia.org/T159354) (owner: 10Imarlier) [21:45:20] !log twentyafterfour@tin rebuilt and synchronized wikiversions files: group0 wikis to 1.32.0-wmf.3 [21:45:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:36] (03CR) 10Bstorm: WIP: wiki replicas - prepare for refactored actor storage (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/431823 (https://phabricator.wikimedia.org/T188299) (owner: 10Bstorm) [21:59:06] !log MediaWiki train for 1.32.0-wmf.3 group0 is complete. 
Will resume with group1 tomorrow, same bat time, same bat channel (refs T191049) [21:59:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:10] T191049: 1.32.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T191049 [21:59:29] (03CR) 10jenkins-bot: group0 wikis to 1.32.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431998 (owner: 1020after4) [22:04:08] 10Operations, 10Fundraising-Backlog, 10Traffic, 10fundraising-tech-ops: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561#4192548 (10cwdent) The problems I see are: - content served over http - weak DH supported (https://weakdh.org/) resulting in "B" grade from Qualys I d... [22:05:15] !log progressively push updated BGP_sanitize_in bogon ASN filters to routers - T190317 [22:05:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:05:19] T190317: Update BGP_sanitize_in filter - https://phabricator.wikimedia.org/T190317 [22:09:51] (03PS3) 10Smalyshev: Add string and external-id types to Wikibase indexing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431994 (https://phabricator.wikimedia.org/T163642) [22:19:16] (03PS1) 10Chad: Adding deploy_artifacts.py wrapper to easily push stuff with mvn deploy:deploy-file [software/gerrit] (stable-2.14) - 10https://gerrit.wikimedia.org/r/432007 [22:33:51] (03PS1) 10Krinkle: Remove unused vendor/autoload.php from missing.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/432010 [22:33:53] (03PS1) 10Krinkle: multiversion: Remove unused vendor/autoload from getMWVersion. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/432011 [22:33:55] (03PS1) 10Krinkle: multiversion: Move vendor/autoload from MWMultiVersion to profiler.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/432012 [22:33:57] (03PS1) 10Krinkle: Move multiversion/vendor/ to vendor/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/432013 [22:34:08] no_justification: :) [22:35:45] !log remove PREFERRED-TRANSIT Tele2-DTAG from esams/knams routers [22:35:47] RECOVERY - MegaRAID on analytics1032 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy [22:35:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:36:59] Wheeee [22:43:13] (03PS2) 10Chad: Adding deploy_artifacts.py wrapper to easily push stuff with mvn deploy:deploy-file [software/gerrit] (stable-2.14) - 10https://gerrit.wikimedia.org/r/432007 [22:47:24] !log awight@tin Started deploy [ores/deploy@5b27205]: Deploy LFS files to ores1002 [22:47:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:49:21] !log awight@tin Finished deploy [ores/deploy@5b27205]: Deploy LFS files to ores1002 (duration: 01m 59s) [22:49:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:52:56] !log awight@tin Started deploy [ores/deploy@bf182e2]: Rollback ores1002 to master [22:52:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:14] !log awight@tin Finished deploy [ores/deploy@bf182e2]: Rollback ores1002 to master (duration: 00m 19s) [22:53:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180508T2300). 
[23:00:04] Lucas_WMDE and Smalyshev: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:13] o/ [23:00:51] here [23:03:55] I can start with a disclaimer that my backport isn’t directly testable, I don’t have a reliable way to trigger the exception it fixes [23:04:14] it also has a CI failure, unfortunately – I have to hope that the SWATter will agree the failure is unrelated :) [23:04:53] oh boy [23:04:55] I can SWAT [23:05:28] great [23:06:08] Lucas_WMDE: I guess let's backport https://gerrit.wikimedia.org/r/#/c/430577/ and then rebase your patch. [23:06:24] okay [23:06:38] (I don’t think a rebase will be necessary, since it’s in a different extension?) [23:06:55] right :) [23:07:00] ok :) [23:07:20] also, it looks like I found a semi-reliable way to test my change after all [23:08:55] nice [23:09:32] while I wait for jenkins to do its thing I'll get the config change done. 
[23:09:45] ok [23:09:46] (03PS4) 10Thcipriani: Add string and external-id types to Wikibase indexing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431994 (https://phabricator.wikimedia.org/T163642) (owner: 10Smalyshev) [23:09:55] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431994 (https://phabricator.wikimedia.org/T163642) (owner: 10Smalyshev) [23:11:19] (03Merged) 10jenkins-bot: Add string and external-id types to Wikibase indexing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431994 (https://phabricator.wikimedia.org/T163642) (owner: 10Smalyshev) [23:11:35] (03CR) 10jenkins-bot: Add string and external-id types to Wikibase indexing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431994 (https://phabricator.wikimedia.org/T163642) (owner: 10Smalyshev) [23:12:49] !log lowering ospf metric of ulsfo-codfw to 390 [23:12:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:13:02] SMalyshev: your change is live on mwdebug1002, check please [23:13:48] thcipriani: checking [23:16:49] thcipriani: hmm actually I am not sure I can check it on mwdebug since it depends on jobs... and jobs run on different hosts, right? [23:17:28] I may be able to check it on tin though [23:17:42] ah, cool, yeah it should be live there as well [23:17:44] erh I mean terbium [23:17:56] * thcipriani pulls to terbium [23:18:22] SMalyshev: live on terbium [23:18:31] (03PS3) 10Chad: Adding deploy_artifacts.py wrapper to easily push stuff with mvn deploy:deploy-file [software/gerrit] (stable-2.14) - 10https://gerrit.wikimedia.org/r/432007 [23:19:40] thcipriani: aha, great. Seems to be working just fine [23:20:00] SMalyshev: cool, going live [23:21:34] thanks! 
[23:22:45] !log thcipriani@tin Synchronized wmf-config: SWAT: [[gerrit:431994|Add string and external-id types to Wikibase indexing]] T163642 T99899 (duration: 01m 26s) [23:22:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:50] T99899: [Story] Looking up entities by external identifiers - https://phabricator.wikimedia.org/T99899 [23:22:50] T163642: Index Wikidata strings in statements in the search engine - https://phabricator.wikimedia.org/T163642 [23:22:51] ^ SMalyshev live now [23:25:24] (03PS1) 10Krinkle: Use perftools/xhgui-collector instead of perftools/xhgui [mediawiki-config] - 10https://gerrit.wikimedia.org/r/432016 [23:26:02] no_justification: -104097, +712 :) [23:26:15] OMG I LOVE YOU [23:26:49] :o [23:26:57] PROBLEM - MegaRAID on analytics1032 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough [23:27:09] (03CR) 10Krinkle: "I've done a plain git-mv in this commit, but that might not actually work. There's a couple of references in vendor/composer that try to f" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/432013 (owner: 10Krinkle) [23:27:20] (03CR) 10Krinkle: [C: 04-1] Move multiversion/vendor/ to vendor/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/432013 (owner: 10Krinkle) [23:27:40] (03CR) 10Krinkle: [C: 04-1] "Untested. Will run tests later this/next week on HHVM and PHP7" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/432016 (owner: 10Krinkle) [23:28:41] (03CR) 10Krinkle: "TODO: Move within XWD conditional. That would mean one less autoloader on all MediaWIki php requests." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/432012 (owner: 10Krinkle) [23:31:16] (03PS1) 10Chad: Minimal pom.xml so output from mvn looks sane [software/gerrit] (stable-2.14) - 10https://gerrit.wikimedia.org/r/432017 [23:32:36] (03CR) 10Jdlrobson: [C: 031] Remove unused PopupsAnonsExperimentalGroupSize config variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431759 (https://phabricator.wikimedia.org/T173952) (owner: 10Pmiazga) [23:42:04] !log progressively push BGP_sanitize_in as-path too-many-hops to routers - T190317 [23:42:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:42:08] T190317: Update BGP_sanitize_in filter - https://phabricator.wikimedia.org/T190317 [23:42:30] jouncebot: now [23:42:30] For the next 0 hour(s) and 17 minute(s): Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180508T2300) [23:49:31] (03CR) 10Chad: [V: 032 C: 032] 2.14.8-22-g07c8aa9910 [software/gerrit] (stable-2.14) - 10https://gerrit.wikimedia.org/r/431818 (owner: 10Chad) [23:50:00] (03CR) 10Chad: [V: 032 C: 032] Adding deploy_artifacts.py wrapper to easily push stuff with mvn deploy:deploy-file [software/gerrit] (stable-2.14) - 10https://gerrit.wikimedia.org/r/432007 (owner: 10Chad) [23:50:27] Lucas_WMDE: sorry for the delay, your change is on mwdebug1002, check please [23:50:33] will do, no problem [23:51:20] (my test is pretty simple, open https://www.wikidata.org/wiki/Special:ConstraintReport/Q23 several times and check that I never get a BadMethodCallException) [23:51:32] :) [23:54:27] no errors in about ten requests, that’s good enough for me [23:54:34] thcipriani feel free to proceed :) [23:54:37] * thcipriani does [23:58:25] !log thcipriani@tin Synchronized php-1.32.0-wmf.2/extensions/WikibaseQualityConstraints/src/ConstraintCheck/Helper/LoggingHelper.php: SWAT: [[gerrit:431805|Do not try to access null message message key]] T194140 (duration: 01m 32s) [23:58:29] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:58:29] T194140: Fatal exception of type "BadMethodCallException" on Special:ConstraintReport - https://phabricator.wikimedia.org/T194140 [23:58:51] ^ Lucas_WMDE live everywhere [23:58:59] great, thanks! [23:59:09] I’ll check logstash tomorrow to see if the errors stopped [23:59:17] cool, thanks :)