[00:00:04] Deploy window No deploys - US Holiday (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190527T0000)
[00:01:05] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[00:07:17] PROBLEM - Check systemd state on cloudcontrol1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[00:15:07] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[00:20:43] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[00:26:15] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational
[00:26:19] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[00:30:33] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[00:36:13] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[00:38:57] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[00:40:27] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[00:44:37] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[00:47:01] PROBLEM - Check systemd state on ms-be2033 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[00:50:13] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[01:09:47] 10Operations, 10Discovery-Search, 10Operations-Software-Development, 10User-Joe, 10User-jijiki: Create WDQS reboot cookbook - https://phabricator.wikimedia.org/T224385 (10Mathew.onipe)
[01:09:58] 10Operations, 10Discovery-Search, 10Operations-Software-Development, 10User-Joe, 10User-jijiki: Create WDQS reboot cookbook - https://phabricator.wikimedia.org/T224385 (10Mathew.onipe) p:05Triage→03Normal
[01:12:59] (03PS3) 10Mathew.onipe: cloudelastic: remove ocsp_proxy [puppet] - 10https://gerrit.wikimedia.org/r/511381 (https://phabricator.wikimedia.org/T223519)
[01:16:57] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[01:21:09] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[01:25:17] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational
[01:25:23] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[01:31:03] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[01:32:21] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[01:35:17] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[01:39:11] RECOVERY - Check systemd state on ms-be2033 is OK: OK - running: The system is fully operational
[01:40:51] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[01:45:03] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[01:50:41] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[01:54:51] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[02:00:31] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[02:07:33] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[02:10:21] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[02:14:33] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[02:20:11] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[02:25:41] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational
[02:34:09] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[02:46:57] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[02:51:07] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[02:56:51] PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle.
[02:59:23] 10Operations, 10Cassandra, 10RESTBase, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 2 others: Decommission restbase10(0[7-9]|1[0-5]) - https://phabricator.wikimedia.org/T223976 (10Eevans)
[02:59:56] !log decommissioning restbase1013-a -- T223976
[03:00:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:00:03] T223976: Decommission restbase10(0[7-9]|1[0-5]) - https://phabricator.wikimedia.org/T223976
[03:17:45] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[03:20:35] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[03:22:03] PROBLEM - Disk space on maps2004 is CRITICAL: DISK CRITICAL - free space: /srv 53927 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[03:26:03] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational
[03:27:37] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[03:29:09] PROBLEM - Disk space on maps2004 is CRITICAL: DISK CRITICAL - free space: /srv 54697 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[03:29:11] RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[03:30:25] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[03:33:07] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[03:34:39] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[03:40:17] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[03:44:31] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[03:50:09] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[03:54:19] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[04:17:49] PROBLEM - Host db2091 is DOWN: PING CRITICAL - Packet loss = 100%
[04:20:59] RECOVERY - Host db2091 is UP: PING OK - Packet loss = 0%, RTA = 36.23 ms
[04:21:03] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[04:23:10] !log gilles@deploy1001 Started deploy [performance/asoranking@61039f1]: (no justification provided)
[04:23:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:23:39] !log gilles@deploy1001 Finished deploy [performance/asoranking@61039f1]: (no justification provided) (duration: 00m 28s)
[04:23:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:23:51] PROBLEM - MariaDB Slave IO: s4 on db2091 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[04:24:09] PROBLEM - MariaDB Slave IO: s2 on db2091 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[04:24:15] PROBLEM - MariaDB Slave SQL: s2 on db2091 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[04:24:19] PROBLEM - MariaDB Slave SQL: s4 on db2091 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[04:24:23] PROBLEM - MariaDB read only s2 on db2091 is CRITICAL: Could not connect to localhost:3312 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[04:24:23] PROBLEM - MariaDB read only s4 on db2091 is CRITICAL: Could not connect to localhost:3314 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[04:25:11] PROBLEM - mysqld processes on db2091 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[04:25:17] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[04:25:52] well that ain't good
[04:30:53] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[04:31:58] just leave it (mysql), it's codfw, it's a slave, our dbas will see it when they come on line
[04:32:07] Ah oke doke
[04:32:11] thanks
[04:32:13] sure
[04:32:42] marostegui: ^^ (for when you arrive)
[04:34:09] PROBLEM - MariaDB Slave Lag: s2 on db2091 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[04:34:15] PROBLEM - MariaDB Slave Lag: s4 on db2091 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[04:35:07] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[04:40:45] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[04:44:57] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[04:49:05] PROBLEM - puppet last run on kraz is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle.
[04:50:39] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[04:54:28] <_joe_> ok cloudcontrol will drive me insane
[04:54:53] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[05:00:31] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[05:04:43] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[05:05:10] 10Operations, 10Analytics, 10Performance-Team, 10Traffic: Only serve debug HTTP headers when x-wikimedia-debug is present - https://phabricator.wikimedia.org/T210484 (10Gilles) p:05Low→03Lowest
[05:10:21] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[05:14:35] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[05:16:07] RECOVERY - puppet last run on kraz is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[05:20:15] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[05:24:27] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[05:30:07] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[05:34:19] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[05:51:17] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[06:04:03] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:08:46] 10Operations, 10MediaWiki-Cache, 10serviceops, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 5 others: Use a multi-dc aware store for ObjectCache's MainStash if needed. - https://phabricator.wikimedia.org/T212129 (10Joe) >>! In T212129#5199087, @Krinkle wrote: > Regard...
[06:11:05] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[06:15:17] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:20:53] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[06:25:05] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:28:59] PROBLEM - puppet last run on dbmonitor2001 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle.
[06:29:57] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[06:30:41] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[06:31:13] PROBLEM - puppet last run on etcd1003 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle.
[06:32:01] PROBLEM - puppet last run on acmechief1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/puppet-enabled]
[06:32:39] 10Operations, 10Continuous-Integration-Config: Fix operations/puppet.git "rebase hell" - https://phabricator.wikimedia.org/T224033 (10Joe) I am 100% against having ci handle merges of ops/puppet. Think of the case ci is down and we need puppet for anything. I am also all for moving to rebase-if-necessary and...
[06:34:53] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:35:35] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[06:38:11] 04Critical Alert for device cr1-codfw.wikimedia.org - Juniper alarm active
[06:38:49] I am really having a hard time keeping track of the real alerts
[06:40:31] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[06:42:48] <_joe_> me too
[06:47:33] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:50:21] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[06:50:28] 10Operations, 10DBA: db2091 mysql service stopped running - https://phabricator.wikimedia.org/T224393 (10jijiki)
[06:54:31] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:56:23] 10Operations, 10DBA: db2091 mysql service stopped running - https://phabricator.wikimedia.org/T224393 (10jijiki) p:05Triage→03High
[06:58:11] RECOVERY - puppet last run on etcd1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:58:40] 10Operations, 10Maps: Maps[12]004 /srv disk space is critical - https://phabricator.wikimedia.org/T224395 (10Mathew.onipe) p:05Triage→03Unbreak!
[06:58:57] RECOVERY - puppet last run on acmechief1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[06:59:01] 10Operations, 10MediaWiki-Cache, 10serviceops, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 5 others: Use a multi-dc aware store for ObjectCache's MainStash if needed. - https://phabricator.wikimedia.org/T212129 (10daniel) >>! In T212129#5211137, @EvanProdromou wrote:...
[07:00:07] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[07:00:25] PROBLEM - puppet last run on oresrdb1001 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle.
[07:01:23] RECOVERY - puppet last run on dbmonitor2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:05:13] 10Operations, 10Maps: Maps[12]004 /srv disk space is critical - https://phabricator.wikimedia.org/T224395 (10Gehel) Previous instance of a similar problem: T194966 Note that we've reimaged the servers since then, and we might have lost some configuration in the process.
[07:05:31] !log running nodetool repair on maps2004 -T224395
[07:05:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:05:38] T224395: Maps[12]004 /srv disk space is critical - https://phabricator.wikimedia.org/T224395
[07:11:42] 10Operations, 10DBA: db2091 mysql service stopped running - https://phabricator.wikimedia.org/T224393 (10Marostegui) p:05High→03Normal Decreasing to normal, codfw isn't in use at the moment. Thanks for creating the task and working out the depool patch!
[07:22:17] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:22:33] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:27:23] RECOVERY - puppet last run on oresrdb1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[07:31:15] PROBLEM - cassandra CQL 10.192.48.57:9042 on maps2004 is CRITICAL: connect to address 10.192.48.57 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[07:35:13] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[07:35:29] (03PS1) 10Effie Mouzeli: db-codfw.php: Depool 2109 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512634 (https://phabricator.wikimedia.org/T224393)
[07:37:42] (03CR) 10jerkins-bot: [V: 04-1] db-codfw.php: Depool 2109 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512634 (https://phabricator.wikimedia.org/T224393) (owner: 10Effie Mouzeli)
[07:37:49] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:38:05] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:40:55] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[07:41:40] (03CR) 10KartikMistry: "Ping!" [puppet] - 10https://gerrit.wikimedia.org/r/506043 (https://phabricator.wikimedia.org/T219819) (owner: 10Santhosh)
[07:43:46] (03PS2) 10Effie Mouzeli: db-codfw.php: Depool 2109 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512634 (https://phabricator.wikimedia.org/T224393)
[07:44:44] (03CR) 10jerkins-bot: [V: 04-1] db-codfw.php: Depool 2109 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512634 (https://phabricator.wikimedia.org/T224393) (owner: 10Effie Mouzeli)
[07:45:09] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[07:48:25] 10Operations, 10Traffic: ATS: log mode cannot depend on log filters being configured - https://phabricator.wikimedia.org/T224397 (10Vgutierrez)
[07:49:28] 10Operations, 10Traffic: ATS: log mode cannot depend on log filters being configured - https://phabricator.wikimedia.org/T224397 (10Vgutierrez) p:05Triage→03Normal
[07:50:47] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[07:51:33] (03PS3) 10Effie Mouzeli: db-codfw.php: Depool 2019 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512634 (https://phabricator.wikimedia.org/T224393)
[07:52:37] (03PS4) 10Effie Mouzeli: db-codfw.php: Depool 2019 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512634 (https://phabricator.wikimedia.org/T224393)
[07:53:54] (03PS1) 10Vgutierrez: ATS: Set log mode independently of log filters [puppet] - 10https://gerrit.wikimedia.org/r/512636 (https://phabricator.wikimedia.org/T224397)
[07:54:23] (03CR) 10jerkins-bot: [V: 04-1] db-codfw.php: Depool 2019 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512634 (https://phabricator.wikimedia.org/T224393) (owner: 10Effie Mouzeli)
[07:54:33] PROBLEM - cassandra service on maps2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:54:57] PROBLEM - Check systemd state on maps2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[07:54:59] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[07:56:43] (03PS5) 10Effie Mouzeli: db-codfw.php: Depool 2019 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512634 (https://phabricator.wikimedia.org/T224393)
[07:57:51] ACKNOWLEDGEMENT - MariaDB Slave IO: s2 on db2091 is CRITICAL: CRITICAL slave_io_state could not connect Effie Mouzeli Server restarted - T224393 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[07:57:51] ACKNOWLEDGEMENT - MariaDB Slave IO: s4 on db2091 is CRITICAL: CRITICAL slave_io_state could not connect Effie Mouzeli Server restarted - T224393 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[07:57:51] ACKNOWLEDGEMENT - MariaDB Slave Lag: s2 on db2091 is CRITICAL: CRITICAL slave_sql_lag could not connect Effie Mouzeli Server restarted - T224393 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[07:57:51] ACKNOWLEDGEMENT - MariaDB Slave Lag: s4 on db2091 is CRITICAL: CRITICAL slave_sql_lag could not connect Effie Mouzeli Server restarted - T224393 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[07:57:51] ACKNOWLEDGEMENT - MariaDB Slave SQL: s2 on db2091 is CRITICAL: CRITICAL slave_sql_state could not connect Effie Mouzeli Server restarted - T224393 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[07:57:52] ACKNOWLEDGEMENT - MariaDB Slave SQL: s4 on db2091 is CRITICAL: CRITICAL slave_sql_state could not connect Effie Mouzeli Server restarted - T224393 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[07:57:52] ACKNOWLEDGEMENT - MariaDB read only s2 on db2091 is CRITICAL: Could not connect to localhost:3312 Effie Mouzeli Server restarted - T224393 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[07:57:53] ACKNOWLEDGEMENT - MariaDB read only s4 on db2091 is CRITICAL: Could not connect to localhost:3314 Effie Mouzeli Server restarted - T224393 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[07:57:53] ACKNOWLEDGEMENT - mysqld processes on db2091 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld Effie Mouzeli Server restarted - T224393 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[08:00:24] (03CR) 10Volans: db-codfw.php: Depool 2019 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512634 (https://phabricator.wikimedia.org/T224393) (owner: 10Effie Mouzeli)
[08:00:37] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[08:03:01] !log depool maps2004 - T224395
[08:03:05] (03PS6) 10Effie Mouzeli: db-codfw.php: Depool 2091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512634 (https://phabricator.wikimedia.org/T224393)
[08:03:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:06] T224395: Maps[12]004 /srv disk space is critical - https://phabricator.wikimedia.org/T224395
[08:04:49] RECOVERY - Check systemd state on maps2004 is OK: OK - running: The system is fully operational
[08:04:49] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:06:36] 10Operations: exim paniclog on $HOST has non-zero size - https://phabricator.wikimedia.org/T224399 (10Volans)
[08:07:50] 10Operations: exim paniclog on $HOST has non-zero size - https://phabricator.wikimedia.org/T224399 (10Volans) p:05Triage→03Normal
[08:07:52] (03CR) 10Vgutierrez: "basically a NOOP on existent servers, but it's going to trigger a reload of the configuration on existent ATS instances: https://puppet-co" [puppet] - 10https://gerrit.wikimedia.org/r/512636 (https://phabricator.wikimedia.org/T224397) (owner: 10Vgutierrez)
[08:08:27] (03PS7) 10Effie Mouzeli: db-codfw.php: Depool db2091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512634 (https://phabricator.wikimedia.org/T224393)
[08:09:41] (03CR) 10Volans: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512634 (https://phabricator.wikimedia.org/T224393) (owner: 10Effie Mouzeli)
[08:10:31] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[08:11:53] 10Operations, 10DBA, 10Patch-For-Review: db2091 rebooted unexpectedly - https://phabricator.wikimedia.org/T224393 (10Volans)
[08:14:43] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:15:48] (03CR) 10Effie Mouzeli: [C: 03+2] db-codfw.php: Depool db2091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512634 (https://phabricator.wikimedia.org/T224393) (owner: 10Effie Mouzeli)
[08:17:05] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512634 (https://phabricator.wikimedia.org/T224393) (owner: 10Effie Mouzeli)
[08:17:19] (03CR) 10jenkins-bot: db-codfw.php: Depool db2091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512634 (https://phabricator.wikimedia.org/T224393) (owner: 10Effie Mouzeli)
[08:19:38] 10Operations, 10decommission, 10media-storage, 10User-fgiunchedi: Decom ms-be101[345] - https://phabricator.wikimedia.org/T220590 (10Volans) @fgiunchedi FYI we got some email to `root@` from `ms-be1014` with the following: ` Cron test -x /usr/sbin/anacron || ( cd / && run-parts --report /e...
[08:20:21] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[08:20:54] arturo: you around? cloudcontrol1003 is flapping the systemd alert since 2019-05-25 21:46
[08:22:34] (03PS1) 10Gehel: maps: all maps servers use RAID10 [puppet] - 10https://gerrit.wikimedia.org/r/512639 (https://phabricator.wikimedia.org/T224395)
[08:24:33] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:25:12] 10Operations, 10Maps, 10Patch-For-Review: Maps[12]004 /srv disk space is critical - https://phabricator.wikimedia.org/T224395 (10Gehel) For whatever reason, only maps1004 was reimaged to RAID10 (instead of RAID1) when adding new disks (so we have 2 unused disks in each server). Note that since we have disks...
[08:27:04] !log jiji@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2091 - T224393 (duration: 00m 49s)
[08:27:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:10] T224393: db2091 rebooted unexpectedly - https://phabricator.wikimedia.org/T224393
[08:29:30] (03PS59) 10Vgutierrez: ATS: Provide a TLS terminator profile [puppet] - 10https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594)
[08:30:09] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[08:32:41] (03CR) 10Mathew.onipe: [C: 03+1] maps: all maps servers use RAID10 [puppet] - 10https://gerrit.wikimedia.org/r/512639 (https://phabricator.wikimedia.org/T224395) (owner: 10Gehel)
[08:34:21] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:40:35] PROBLEM - HHVM rendering on mw2239 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[08:41:51] RECOVERY - HHVM rendering on mw2239 is OK: HTTP OK: HTTP/1.1 200 OK - 79975 bytes in 0.402 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[08:45:19] 10Operations, 10SRE-Access-Requests, 10observability: Requesting access to icinga for tonycepo - https://phabricator.wikimedia.org/T224313 (10Aklapper) 05Open→03Stalled
[08:51:06] yes I know volans
[08:51:54] ack
[08:52:45] !log 1 day downtime systemd check for cloudcontrol1003
[08:52:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:26] 10Operations, 10ops-codfw, 10media-storage, 10observability, 10User-fgiunchedi: ms-be2043 'sdd' throwing lots of errors - https://phabricator.wikimedia.org/T222654 (10fgiunchedi) Thanks for taking a look! >>! In T222654#5207032, @faidon wrote: > I'm not at all sure, but I don't see an LD 5 at all. Is it...
[08:57:12] (03CR) 10Gehel: [C: 03+2] maps: all maps servers use RAID10 [puppet] - 10https://gerrit.wikimedia.org/r/512639 (https://phabricator.wikimedia.org/T224395) (owner: 10Gehel)
[08:57:30] 10Operations, 10Release-Engineering-Team, 10SRE-Access-Requests: Request access to analytics cluster for Alaa Sarhan - https://phabricator.wikimedia.org/T223697 (10alaa_wmde) Hi @Nuria yes access to both would be ideal .. thank you
[09:00:55] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[09:04:55] RECOVERY - cassandra service on maps2004 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:10:42] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/510985 (https://phabricator.wikimedia.org/T223496) (owner: 10Dzahn)
[09:11:50] 10Operations, 10Release-Engineering-Team, 10SRE-Access-Requests: Request access to deployment cluster for Alaa Sarhan - https://phabricator.wikimedia.org/T223698 (10alaa_wmde) Thanks @Volans I've fully and carefully read the Server Access Responsibilities document, and signed it. I believe we are awaiting on...
[09:15:19] (03PS1) 10Vgutierrez: ATS: Ensure proper permissions for ATS layouts [puppet] - 10https://gerrit.wikimedia.org/r/512643 (https://phabricator.wikimedia.org/T221217)
[09:16:06] !log remove maps2004 from maps cassandra cluster - T224395
[09:16:11] T224395: Maps[12]004 /srv disk space is critical - https://phabricator.wikimedia.org/T224395
[09:16:12] 10Operations, 10Icinga, 10Core Platform Team Backlog (Watching / External), 10Services (watching): Incorrect icinga settings for mobrovac - https://phabricator.wikimedia.org/T224406 (10mobrovac)
[09:17:05] 10Operations, 10DBA, 10Patch-For-Review: db2091 rebooted unexpectedly - https://phabricator.wikimedia.org/T224393 (10jijiki) `-------------------------------------------------------------------------------- SeqNumber = 138 Message ID = LOG007 Category = Audit AgentID = DE Severity...
[09:18:08] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/510985 (https://phabricator.wikimedia.org/T223496) (owner: 10Dzahn)
[09:18:19] 10Operations, 10Maps, 10Patch-For-Review: Maps[12]004 /srv disk space is critical - https://phabricator.wikimedia.org/T224395 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on cumin2001.codfw.wmnet for hosts: ` ['maps2004.codfw.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimag...
[09:18:31] 10Operations, 10Wikimedia-Site-requests: Add more bad words to fancycaptcha/badwords - https://phabricator.wikimedia.org/T224343 (10Volans) p:05Triage→03Normal
[09:27:31] (03CR) 10Vgutierrez: [C: 03+1] "NOOP on existing instances: https://puppet-compiler.wmflabs.org/compiler1001/16762/. In labs this is enough to start the trafficserver-tls" [puppet] - 10https://gerrit.wikimedia.org/r/512643 (https://phabricator.wikimedia.org/T221217) (owner: 10Vgutierrez)
[09:28:03] PROBLEM - tileratorui on maps2001 is CRITICAL: connect to address 10.192.0.144 and port 6535: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui
[09:28:13] PROBLEM - tilerator on maps2001 is CRITICAL: connect to address 10.192.0.144 and port 6534: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator
[09:28:25] PROBLEM - tileratorui on maps2002 is CRITICAL: connect to address 10.192.16.179 and port 6535: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui
[09:28:27] looking
[09:28:27] PROBLEM - tileratorui on maps2003 is CRITICAL: connect to address 10.192.32.146 and port 6535: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui
[09:28:45] PROBLEM - tilerator on maps2003 is CRITICAL: connect to address 10.192.32.146 and port 6534: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator
[09:28:47] probably related to T224395
[09:28:48] T224395: Maps[12]004 /srv disk space is critical - https://phabricator.wikimedia.org/T224395
[09:29:25] PROBLEM - tilerator on maps2002 is CRITICAL: connect to address 10.192.16.179 and port 6534: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator
[09:29:49] RECOVERY - tileratorui on maps2002 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 0.138 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui
[09:30:31] (03PS1) 10Arturo Borrero Gonzalez: openstack: mitaka: stretch: don't install python3-msgpack from jessie-backports [puppet] - 10https://gerrit.wikimedia.org/r/512646 (https://phabricator.wikimedia.org/T224345)
[09:30:53] 10Operations, 10DBA, 10Patch-For-Review: db2091 rebooted unexpectedly - https://phabricator.wikimedia.org/T224393 (10Volans) Related documentation for the most useful messages: - PWR2262 https://www.dell.com/support/manuals/it/it/itbsdt1/dell-opnmang-sw-v8.2/eemi_13g_v1.3-v2/pwr-event-messages?guid=guid-5bc...
[09:33:05] (03CR) 10Vgutierrez: "traffic_layout verify is happy as well: https://phabricator.wikimedia.org/P8561" [puppet] - 10https://gerrit.wikimedia.org/r/512636 (https://phabricator.wikimedia.org/T224397) (owner: 10Vgutierrez)
[09:33:08] 10Operations, 10DBA, 10Patch-For-Review: db2091 rebooted unexpectedly - https://phabricator.wikimedia.org/T224393 (10Volans) Forgot to mention, nothing in syslog or journalctl for MySQL on s2/s4 units.
[09:34:01] PROBLEM - tileratorui on maps2002 is CRITICAL: connect to address 10.192.16.179 and port 6535: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui
[09:34:30] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: mitaka: stretch: don't install python3-msgpack from jessie-backports [puppet] - 10https://gerrit.wikimedia.org/r/512646 (https://phabricator.wikimedia.org/T224345) (owner: 10Arturo Borrero Gonzalez)
[09:35:01] RECOVERY - tilerator on maps2002 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 0.109 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator
[09:35:05] RECOVERY - tileratorui on maps2001 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 0.105 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui
[09:35:15] RECOVERY - tilerator on maps2001 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 0.106 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator
[09:35:27] RECOVERY - tileratorui on maps2002 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 0.114 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui
[09:35:41] RECOVERY - tileratorui on maps2003 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 0.106 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui
[09:35:47] RECOVERY - tilerator on maps2003 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 0.111 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator
[09:39:09] Ok. time to downtime
[09:39:55] PROBLEM - tileratorui on maps2003 is CRITICAL: connect to address 10.192.32.146 and port 6535: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui
[09:40:45] PROBLEM - tileratorui on maps2001 is CRITICAL: connect to address 10.192.0.144 and port 6535: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui
[09:42:01] RECOVERY - tileratorui on maps2001 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 0.103 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui
[09:46:57] RECOVERY - tileratorui on maps2003 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 0.105 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui
[09:52:20] <_joe_> !log disabling puppet on mw1261, running some tests for T223180
[09:52:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:25] T223180: Monitoring PHP 7 APC usage - https://phabricator.wikimedia.org/T223180
[09:52:39] RECOVERY - Disk space on maps2004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[09:58:20] !log jiji@deploy1001 Started deploy [cpjobqueue/deploy@421c029]: Migrating wikibase-addUsagesForPage to PHP7 - T219148
[09:58:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:58:26] T219148: Use PHP7 to run all async jobs - https://phabricator.wikimedia.org/T219148
[09:59:29] !log jiji@deploy1001 Finished deploy [cpjobqueue/deploy@421c029]: Migrating wikibase-addUsagesForPage to PHP7 - T219148 (duration: 01m 09s)
[09:59:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:03:43] (03PS2) 10Volans: admin: add urbanecm to the deployment group [puppet] - 10https://gerrit.wikimedia.org/r/512401 (https://phabricator.wikimedia.org/T192830)
[10:04:53] 10Operations, 10decommission, 10media-storage, 10User-fgiunchedi: Decom ms-be101[345] - https://phabricator.wikimedia.org/T220590 (10fgiunchedi) >>! In T220590#5213945, @Volans wrote: > @fgiunchedi FYI we got some email to `root@` from `ms-be1014` with the following: thanks! these are spare hosts now so I...
[10:05:17] 10Operations, 10Maps, 10Patch-For-Review: Maps[12]004 /srv disk space is critical - https://phabricator.wikimedia.org/T224395 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['maps2004.codfw.wmnet'] ` and were **ALL** successful.
[10:06:29] onimisionipe: ^
[10:06:43] yep
[10:11:50] (03CR) 10Filippo Giunchedi: [C: 03+1] Include Swift analytics_admin auth .env file in HDFS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/512210 (https://phabricator.wikimedia.org/T219544) (owner: 10Ottomata)
[10:17:57] (03CR) 10Muehlenhoff: [C: 03+1] admin: add urbanecm to the deployment group [puppet] - 10https://gerrit.wikimedia.org/r/512401 (https://phabricator.wikimedia.org/T192830) (owner: 10Volans)
[10:19:32] (03CR) 10Volans: [C: 03+2] admin: add urbanecm to the deployment group [puppet] - 10https://gerrit.wikimedia.org/r/512401 (https://phabricator.wikimedia.org/T192830) (owner: 10Volans)
[10:21:05] 10Operations, 10Release Pipeline, 10Services, 10serviceops, and 4 others: Deploy the RESTBase front-end service (RESTRouter) to Kubernetes - https://phabricator.wikimedia.org/T223953 (10mobrovac) [PR #1141](https://github.com/wikimedia/restbase/pull/1141) adds the needed Blubber config.
[10:21:29] 10Operations, 10Release Pipeline, 10serviceops, 10Core Platform Team (RESTBase Split (CDP2)), and 4 others: Deploy the RESTBase front-end service (RESTRouter) to Kubernetes - https://phabricator.wikimedia.org/T223953 (10mobrovac)
[10:27:34] (03PS1) 10Giuseppe Lavagetto: pbuilder: add proxy configuration for security-cdn.d.o [puppet] - 10https://gerrit.wikimedia.org/r/512650
[10:28:19] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512651 (https://phabricator.wikimedia.org/T128546)
[10:28:21] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), and 3 others: Requesting access to production for SWAT deploy for Urbanecm - https://phabricator.wikimedia.org/T192830 (10Volans) As per docs added `Urbanecm` to the `wmf-deployment` group in Gerrit.
[10:31:55] !log rebooting maps2004 - cassandra unit failed and got stuck
[10:31:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:35:27] !log decommission restbase1013-b - T223976
[10:35:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:35:32] T223976: Decommission restbase10(0[7-9]|1[0-5]) - https://phabricator.wikimedia.org/T223976
[10:35:36] (03CR) 10Alexandros Kosiaris: [C: 03+2] pbuilder: add proxy configuration for security-cdn.d.o [puppet] - 10https://gerrit.wikimedia.org/r/512650 (owner: 10Giuseppe Lavagetto)
[10:44:59] (03PS1) 10Volans: PuppetDB: fix handle of FAILED status [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/512653
[10:48:37] (03CR) 10Faidon Liambotis: [C: 03+2] PuppetDB: fix handle of FAILED status [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/512653 (owner: 10Volans)
[10:52:09] <_joe_> !log uploading service-checker 0.1.5 to {jessie,stretch}-wikimedia
[10:53:43] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: add deprecated-input tag to deprecated inputs [puppet] - 10https://gerrit.wikimedia.org/r/512193 (https://phabricator.wikimedia.org/T220103) (owner: 10Cwhite)
[11:23:19] (03CR) 10Michael Große: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/510204 (https://phabricator.wikimedia.org/T223312) (owner: 10Michael Große)
[11:25:19] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), and 3 others: Requesting access to production for SWAT deploy for Urbanecm - https://phabricator.wikimedia.org/T192830 (10Volans) I've added `urbanecm` to the LDAP group `nda` as per request above given that it's...
[11:27:03] RECOVERY - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is OK: (C)130 ge (W)110 ge 106.5 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen
[11:30:53] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), and 3 others: Requesting access to production for SWAT deploy for Urbanecm - https://phabricator.wikimedia.org/T192830 (10zeljkofilipin) Thanks! I'll resolve it after the first successful deployment.
[11:33:33] !log starting osm initial import on maps2004 - T224395
[11:33:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:33:39] T224395: Maps[12]004 /srv disk space is critical - https://phabricator.wikimedia.org/T224395
[11:39:22] (03CR) 10Lucas Werkmeister (WMDE): Add a list of IDs to skip in production (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/511753 (owner: 10Michael Große)
[11:39:33] (03PS1) 10Arturo Borrero Gonzalez: openstack: clientpackages: mitaka: stretch: special case for py3 client libs [puppet] - 10https://gerrit.wikimedia.org/r/512658 (https://phabricator.wikimedia.org/T224345)
[11:44:05] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10Wikidata, and 5 others: Introduce wikidata termbox SSR to kubernetes - https://phabricator.wikimedia.org/T220402 (10mobrovac) >>! In T220402#5199730, @Tarrow wrote: > Should we be using http rather than https internally? Yes, indeed, sorry th...
[11:44:11] 10Operations, 10Icinga, 10Core Platform Team Backlog (Watching / External), 10Services (watching): Incorrect icinga settings for mobrovac - https://phabricator.wikimedia.org/T224406 (10Volans) p:05Triage→03Normal a:03Volans
[11:45:32] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC as expected: https://puppet-compiler.wmflabs.org/compiler1002/16763/" [puppet] - 10https://gerrit.wikimedia.org/r/512658 (https://phabricator.wikimedia.org/T224345) (owner: 10Arturo Borrero Gonzalez)
[11:46:32] (03CR) 10Michael Große: "More importantly it makes no sense to add this config to beta instead of live 🤦" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/511753 (owner: 10Michael Große)
[11:46:56] (03PS4) 10Michael Große: Add a list of IDs to skip in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/511753
[11:47:56] (03CR) 10jerkins-bot: [V: 04-1] Add a list of IDs to skip in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/511753 (owner: 10Michael Große)
[11:49:50] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10Wikidata, and 5 others: Introduce wikidata termbox SSR to kubernetes - https://phabricator.wikimedia.org/T220402 (10Tarrow) @mobrovac Thanks! I think we've now taken most of this onboard and merged it. @akosiaris could you take a look at out...
[11:49:56] 10Operations, 10Icinga, 10Core Platform Team Backlog (Watching / External), 10Services (watching): Incorrect icinga settings for mobrovac - https://phabricator.wikimedia.org/T224406 (10MoritzMuehlenhoff) When was the last time this worked for you? modules/icinga/files/cgi.cfg has your shell user name, but...
[11:50:36] (03CR) 10Lucas Werkmeister (WMDE): [C: 04-1] Add a list of IDs to skip in production (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/511753 (owner: 10Michael Große)
[11:51:26] 10Operations, 10Icinga, 10Core Platform Team Backlog (Watching / External), 10Services (watching): Incorrect icinga settings for mobrovac - https://phabricator.wikimedia.org/T224406 (10Volans) @mobrovac - Regarding the notification AFAICT the RESTBase alerts notify the `team-services` group (`services@`)....
[11:54:43] 10Operations, 10Icinga, 10Core Platform Team Backlog (Watching / External), 10Services (watching): Incorrect icinga settings for mobrovac - https://phabricator.wikimedia.org/T224406 (10mobrovac) >>! In T224406#5214472, @MoritzMuehlenhoff wrote: > When was the last time this worked for you? modules/icinga/f...
[11:54:54] (03PS1) 10Arturo Borrero Gonzalez: Revert "cloudcontrol: temporarily mark out prometheus classes on Stretch" [puppet] - 10https://gerrit.wikimedia.org/r/512663 (https://phabricator.wikimedia.org/T224345)
[11:56:46] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] Revert "cloudcontrol: temporarily mark out prometheus classes on Stretch" [puppet] - 10https://gerrit.wikimedia.org/r/512663 (https://phabricator.wikimedia.org/T224345) (owner: 10Arturo Borrero Gonzalez)
[11:57:40] 10Operations, 10Icinga, 10Core Platform Team Backlog (Watching / External), 10Services (watching): Incorrect icinga settings for mobrovac - https://phabricator.wikimedia.org/T224406 (10mobrovac) >>! In T224406#5214478, @Volans wrote: > @mobrovac > - Regarding the notification AFAICT the RESTBase alerts not...
[12:03:07] (03PS5) 10Michael Große: Add a list of IDs to skip in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/511753
[12:08:49] PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[12:08:51] PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[12:11:40] 10Operations, 10PHP 7.2 support, 10Performance-Team (Radar): Monitoring PHP 7 APC usage - https://phabricator.wikimedia.org/T223180 (10Joe) 05Open→03Resolved So, I modified the APC dashboards in the php7-transition table to show the information you mentioned in the ticket, and I think it's fair. I realiz...
[12:31:17] RECOVERY - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[12:32:45] RECOVERY - Mediawiki Cirrussearch update rate - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[12:44:11] (03CR) 10Faidon Liambotis: [C: 04-1] "See inline for a few comments. Also, this doesn't seem to add the report to the README." (0316 comments) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/510256 (https://phabricator.wikimedia.org/T221507) (owner: 10CRusnov)
[12:45:15] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [debs/prometheus-varnishkafka-exporter] - 10https://gerrit.wikimedia.org/r/507632 (https://phabricator.wikimedia.org/T196066) (owner: 10Cwhite)
[12:47:38] (03CR) 10Filippo Giunchedi: [C: 03+1] firewall loggin: enable firewall logging on wmcs servers [puppet] - 10https://gerrit.wikimedia.org/r/511701 (https://phabricator.wikimedia.org/T116011) (owner: 10Jbond)
[12:48:31] (03CR) 10Filippo Giunchedi: [C: 03+1] firewall loggin: enable firewall logging on analytics servers [puppet] - 10https://gerrit.wikimedia.org/r/511702 (owner: 10Jbond)
[12:48:49] (03CR) 10Filippo Giunchedi: [C: 03+1] firewall logging: enable firewall logging on external servers [puppet] - 10https://gerrit.wikimedia.org/r/511703 (https://phabricator.wikimedia.org/T116011) (owner: 10Jbond)
[12:49:07] (03CR) 10Filippo Giunchedi: [C: 03+1] firewall logging: Enable logging on external servers [puppet] - 10https://gerrit.wikimedia.org/r/511704 (https://phabricator.wikimedia.org/T116011) (owner: 10Jbond)
[12:49:49] (03CR) 10Filippo Giunchedi: [C: 03+1] firewall logging: enable loggin on internal servers [puppet] - 10https://gerrit.wikimedia.org/r/511700 (https://phabricator.wikimedia.org/T116011) (owner: 10Jbond)
[12:50:00] (03CR) 10Filippo Giunchedi: [C: 03+1] firewall logging: add firewall logging to kafak servers [puppet] - 10https://gerrit.wikimedia.org/r/511705 (https://phabricator.wikimedia.org/T116011) (owner: 10Jbond)
[12:50:25] (03CR) 10Filippo Giunchedi: [C: 03+1] firewall logging: Enable logging on misc services [puppet] - 10https://gerrit.wikimedia.org/r/511706 (https://phabricator.wikimedia.org/T116011) (owner: 10Jbond)
[12:50:37] (03CR) 10Filippo Giunchedi: [C: 03+1] firewall logging: enable logging on ores [puppet] - 10https://gerrit.wikimedia.org/r/511707 (https://phabricator.wikimedia.org/T116011) (owner: 10Jbond)
[12:50:48] (03CR) 10Filippo Giunchedi: [C: 03+1] firewall logging: Enable firewall logging on mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/511708 (https://phabricator.wikimedia.org/T116011) (owner: 10Jbond)
[12:51:06] 10Operations, 10hardware-requests: Two test hosts for SREs - https://phabricator.wikimedia.org/T214024 (10faidon) I don't know what the status of this is, it's been a while it seems. I see it was pending for my approval, which I've missed -- apologies! Approved now.
[12:51:12] (03CR) 10Filippo Giunchedi: [C: 03+1] firewall logging: enable firewall logging on remaining roles [puppet] - 10https://gerrit.wikimedia.org/r/511709 (https://phabricator.wikimedia.org/T116011) (owner: 10Jbond)
[12:54:55] (03CR) 10Filippo Giunchedi: [C: 03+1] Prometheus, add Routinator endpoint [puppet] - 10https://gerrit.wikimedia.org/r/508956 (https://phabricator.wikimedia.org/T220669) (owner: 10Ayounsi)
[12:56:52] (03CR) 10Filippo Giunchedi: [C: 03+1] rsyslog: add netdev_kafka_relay compatibility endpoint [puppet] - 10https://gerrit.wikimedia.org/r/495980 (https://phabricator.wikimedia.org/T224128) (owner: 10Herron)
[12:57:27] (03CR) 10Lucas Werkmeister (WMDE): Add a list of IDs to skip in production (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/511753 (owner: 10Michael Große)
[13:02:58] !log swift eqiad-prod: ms-be1033 weight to 0 - T223518
[13:03:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:03:05] T223518: ms-be1033 not powering up - https://phabricator.wikimedia.org/T223518
[13:05:14] (03PS1) 10Alexandros Kosiaris: Add log_level, tls, openapi config options [deployment-charts] - 10https://gerrit.wikimedia.org/r/512673 (https://phabricator.wikimedia.org/T220401)
[13:09:55] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] "Tested locally, worked fine. We will need a newer service-checker image but otherwise LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/512673 (https://phabricator.wikimedia.org/T220401) (owner: 10Alexandros Kosiaris)
[13:14:37] (03CR) 10Filippo Giunchedi: [C: 03+2] graphite: fix dashboard links for thumbnail alerts [puppet] - 10https://gerrit.wikimedia.org/r/512133 (owner: 10Filippo Giunchedi)
[13:14:45] (03PS2) 10Filippo Giunchedi: graphite: fix dashboard links for thumbnail alerts [puppet] - 10https://gerrit.wikimedia.org/r/512133
[13:18:51] (03CR) 10Michael Große: Add a list of IDs to skip in production (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/511753 (owner: 10Michael Große)
[13:21:36] (03CR) 10Lucas Werkmeister (WMDE): Add a list of IDs to skip in production (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/511753 (owner: 10Michael Große)
[13:22:20] 10Operations, 10Wikimedia-Site-requests: Add more bad words to fancycaptcha/badwords - https://phabricator.wikimedia.org/T224343 (10Reedy) 05Open→03Resolved a:03ArielGlenn Captchas should've been re-generated last night, so these words should have taken affect Whether we want to make the blacklist publi...
[13:32:55] (03CR) 10Michael Große: Add a list of IDs to skip in production (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/511753 (owner: 10Michael Große)
[13:36:34] (03PS1) 10Filippo Giunchedi: graphite: more robust swift thumbnail alerts [puppet] - 10https://gerrit.wikimedia.org/r/512679
[13:42:36] (03PS2) 10Filippo Giunchedi: graphite: more robust swift thumbnail alerts [puppet] - 10https://gerrit.wikimedia.org/r/512679
[13:43:57] (03CR) 10Filippo Giunchedi: [C: 03+2] graphite: more robust swift thumbnail alerts [puppet] - 10https://gerrit.wikimedia.org/r/512679 (owner: 10Filippo Giunchedi)
[13:53:10] (03PS1) 10Volans: icinga: fix Mobrovac case for authorization [puppet] - 10https://gerrit.wikimedia.org/r/512680 (https://phabricator.wikimedia.org/T224406)
[13:56:01] PROBLEM - Check systemd state on ms-be1026 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[13:56:16] (03CR) 10Muehlenhoff: [C: 03+1] icinga: fix Mobrovac case for authorization [puppet] - 10https://gerrit.wikimedia.org/r/512680 (https://phabricator.wikimedia.org/T224406) (owner: 10Volans)
[13:56:37] (03PS2) 10Volans: icinga: fix Mobrovac case for authorization [puppet] - 10https://gerrit.wikimedia.org/r/512680 (https://phabricator.wikimedia.org/T224406)
[13:59:13] (03CR) 10Volans: [C: 03+2] icinga: fix Mobrovac case for authorization [puppet] - 10https://gerrit.wikimedia.org/r/512680 (https://phabricator.wikimedia.org/T224406) (owner: 10Volans)
[14:03:08] RECOVERY - cassandra CQL 10.192.48.57:9042 on maps2004 is OK: TCP OK - 0.036 second response time on 10.192.48.57 port 9042 https://phabricator.wikimedia.org/T93886
[14:03:15] 10Operations, 10Icinga, 10Core Platform Team Backlog (Watching / External), 10Services (watching): Incorrect icinga settings for mobrovac - https://phabricator.wikimedia.org/T224406 (10Maintenance_bot)
[14:05:20] 10Operations, 10Icinga, 10Core Platform Team Backlog (Watching / External), 10Services (watching): Incorrect icinga settings for mobrovac - https://phabricator.wikimedia.org/T224406 (10Volans) @mobrovac can you retry actions on the Icinga UI?
[14:07:54] RECOVERY - Check systemd state on ms-be1026 is OK: OK - running: The system is fully operational
[14:13:40] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[14:13:56] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[14:14:00] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[14:14:11] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[14:14:26] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[14:14:26] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[14:14:34] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[14:14:44] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[14:14:46] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[14:15:04] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[14:15:14] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5
[14:15:58] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[14:17:18] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[14:17:38] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[14:17:52] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[14:17:56] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[14:18:06] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[14:18:18] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[14:18:18] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[14:18:38] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[14:18:40] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[14:21:12] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[14:22:38] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[14:23:06] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[14:23:14] RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5
[14:24:00] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[14:28:04] ACKNOWLEDGEMENT - Check systemd state on restbase1012 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Mobrovac test check
[14:28:58] (03PS4) 10Michael Große: Add feature flag config for breaking Wikibase API change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/510204 (https://phabricator.wikimedia.org/T223312)
[14:30:08] (03PS5) 10Michael Große: Add feature flag config for breaking Wikibase API change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/510204 (https://phabricator.wikimedia.org/T223300)
[14:30:17] (03PS60) 10Vgutierrez: ATS: Provide a TLS terminator profile [puppet] - 10https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594)
[14:30:23] 10Operations, 10Icinga, 10Core Platform Team Backlog (Watching / External), 10Services (watching): Incorrect icinga settings for mobrovac - https://phabricator.wikimedia.org/T224406 (10mobrovac) >>! In T224406#5214843, @Volans wrote: > @mobrovac can you retry actions on the Icinga UI? Icinga UI acking now...
[14:32:09] (03PS2) 10Vgutierrez: ATS: Set log mode independently of log filters [puppet] - 10https://gerrit.wikimedia.org/r/512636 (https://phabricator.wikimedia.org/T224397)
[14:32:11] (03PS61) 10Vgutierrez: ATS: Provide a TLS terminator profile [puppet] - 10https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594)
[14:36:09] 10Operations, 10Traffic: ATS: traffic_layout currently forces to use its own copy of shared libraries - https://phabricator.wikimedia.org/T224428 (10Vgutierrez)
[14:36:20] 10Operations, 10Traffic: ATS: traffic_layout currently forces to use its own copy of shared libraries - https://phabricator.wikimedia.org/T224428 (10Vgutierrez) p:05Triage→03Normal
[15:01:56] 10Operations, 10Icinga, 10Core Platform Team Backlog (Watching / External), 10Services (watching): Incorrect icinga settings for mobrovac - https://phabricator.wikimedia.org/T224406 (10Volans) So after a bit of debugging with @mobrovac it seems that the alarm that is not notifying the `team-services` conta...
[15:14:28] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [15:14:46] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [15:14:58] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [15:14:58] PROBLEM - HTTP availability for Varnish at eqsin on icinga1001 is CRITICAL: job=varnish-text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [15:15:06] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [15:15:16] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [15:15:30] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [15:15:30] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [15:15:44] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] 
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [15:15:50] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [15:16:10] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [15:16:20] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [15:16:24] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [15:16:56] (03PS3) 10Gehel: Convert cirrus data retention from cron to systemd [puppet] - 10https://gerrit.wikimedia.org/r/512235 (https://phabricator.wikimedia.org/T224200) (owner: 10EBernhardson) [15:17:00] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [15:17:06] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] 
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [15:18:38] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [15:18:40] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [15:18:58] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [15:19:10] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [15:19:10] RECOVERY - HTTP availability for Varnish at eqsin on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [15:19:24] (03CR) 10Gehel: [C: 03+2] Convert cirrus data retention from cron to systemd [puppet] - 10https://gerrit.wikimedia.org/r/512235 (https://phabricator.wikimedia.org/T224200) (owner: 10EBernhardson) [15:19:26] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [15:19:40] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [15:19:41] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [15:20:40] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [15:22:26] PROBLEM - Check systemd state on ms-be1024 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:24:04] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [15:24:06] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [15:24:07] (03PS1) 10Gehel: Convert cirrus data retention from cron to systemd. 
[puppet] - 10https://gerrit.wikimedia.org/r/512702 (https://phabricator.wikimedia.org/T224200) [15:24:34] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [15:24:44] RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [15:24:48] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [15:25:24] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [15:32:18] RECOVERY - Check systemd state on ms-be1024 is OK: OK - running: The system is fully operational [15:36:45] !log initialize sessionstore namespace on eqiad/codfw/staging kubernetes clusters T220401 [15:36:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:51] T220401: Introduce kask session storage service to kubernetes - https://phabricator.wikimedia.org/T220401 [15:36:54] (03CR) 10Gehel: [C: 04-1] Add postgres slave init cookbook (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) (owner: 10Mathew.onipe) [15:40:07] !log initialize termbox namespace on eqiad/codfw/staging kubernetes clusters T220402 [15:40:11] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:14] T220402: Introduce wikidata termbox SSR to kubernetes - https://phabricator.wikimedia.org/T220402 [15:42:59] (03PS1) 10Faidon Liambotis: autoinstall: drop 9600 and 57600 baud variants [puppet] - 10https://gerrit.wikimedia.org/r/512708 [15:43:01] (03PS1) 10Faidon Liambotis: autoinstall: drop explicit references to lpxelinux [puppet] - 10https://gerrit.wikimedia.org/r/512709 [15:43:03] (03PS1) 10Faidon Liambotis: autoinstall: cleanup pxelinux options [puppet] - 10https://gerrit.wikimedia.org/r/512710 [15:43:05] (03PS1) 10Faidon Liambotis: autoinstall: configure DHCP for UEFI with syslinux [puppet] - 10https://gerrit.wikimedia.org/r/512711 (https://phabricator.wikimedia.org/T93208) [15:44:06] (03CR) 10jerkins-bot: [V: 04-1] autoinstall: cleanup pxelinux options [puppet] - 10https://gerrit.wikimedia.org/r/512710 (owner: 10Faidon Liambotis) [15:44:21] (03CR) 10jerkins-bot: [V: 04-1] autoinstall: configure DHCP for UEFI with syslinux [puppet] - 10https://gerrit.wikimedia.org/r/512711 (https://phabricator.wikimedia.org/T93208) (owner: 10Faidon Liambotis) [15:46:12] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/512708 (owner: 10Faidon Liambotis) [15:46:27] 10Operations, 10MediaWiki-Logging, 10Wikimedia-Logstash, 10wmerrors, and 6 others: Port mediawiki/php/wmerrors to PHP7 and deploy - https://phabricator.wikimedia.org/T187147 (10Joe) >>! In T187147#5207128, @tstarling wrote: > Basic porting work on wmerrors is hopefully complete. > > It still writes a text... [15:47:20] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/512709 (owner: 10Faidon Liambotis) [15:50:38] 10Operations, 10Patch-For-Review: (U)EFI support - https://phabricator.wikimedia.org/T93208 (10faidon) So I just pushed a change that uses syslinux.efi above. This may prove to be short-lived, as we may switch to another PXE implementation (iPXE or GRUB, more on that later) but should work. 
It /may/ require to... [15:52:41] (03PS2) 10Faidon Liambotis: autoinstall: cleanup pxelinux options [puppet] - 10https://gerrit.wikimedia.org/r/512710 [15:52:43] (03PS2) 10Faidon Liambotis: autoinstall: configure DHCP for UEFI with syslinux [puppet] - 10https://gerrit.wikimedia.org/r/512711 (https://phabricator.wikimedia.org/T93208) [15:56:43] (03CR) 10Volans: [C: 03+2] autoinstall: drop 9600 and 57600 baud variants [puppet] - 10https://gerrit.wikimedia.org/r/512708 (owner: 10Faidon Liambotis) [15:57:19] (03CR) 10Volans: [C: 03+2] autoinstall: drop explicit references to lpxelinux [puppet] - 10https://gerrit.wikimedia.org/r/512709 (owner: 10Faidon Liambotis) [16:04:40] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Add a list of IDs to skip in production (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/511753 (owner: 10Michael Große) [16:07:59] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/512710 (owner: 10Faidon Liambotis) [16:08:44] (03PS1) 10Alexandros Kosiaris: sessionstore: Populate kubernetes stanzas [puppet] - 10https://gerrit.wikimedia.org/r/512720 (https://phabricator.wikimedia.org/T220401) [16:12:03] 10Operations, 10Wikimedia-Site-requests: Add more bad words to fancycaptcha/badwords - https://phabricator.wikimedia.org/T224343 (10Sphilbrick) Sounds great. For what it's worth, I let the individual who originally contacted us know that multiple people were working on resolving this and they seemed impressed... 
[16:17:27] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/512711 (https://phabricator.wikimedia.org/T93208) (owner: 10Faidon Liambotis) [16:17:50] (03CR) 10Volans: [C: 03+2] autoinstall: cleanup pxelinux options [puppet] - 10https://gerrit.wikimedia.org/r/512710 (owner: 10Faidon Liambotis) [16:18:31] (03CR) 10Volans: [C: 03+2] autoinstall: configure DHCP for UEFI with syslinux [puppet] - 10https://gerrit.wikimedia.org/r/512711 (https://phabricator.wikimedia.org/T93208) (owner: 10Faidon Liambotis) [16:19:43] (03PS2) 10Alexandros Kosiaris: sessionstore: Populate kubernetes stanzas [puppet] - 10https://gerrit.wikimedia.org/r/512720 (https://phabricator.wikimedia.org/T220401) [16:23:52] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 57.14% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [16:25:53] (03CR) 10Alexandros Kosiaris: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/16765/deploy1001.eqiad.wmnet/ says fine, merging" [puppet] - 10https://gerrit.wikimedia.org/r/512720 (https://phabricator.wikimedia.org/T220401) (owner: 10Alexandros Kosiaris) [16:27:34] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:27:58] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [16:28:04] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:28:06] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on 
icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:28:26] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [16:28:46] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:28:54] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [16:29:04] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [16:29:44] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [16:30:00] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [16:30:30] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] 
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [16:31:34] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:31:44] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:32:06] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [16:32:14] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:32:14] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:32:36] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [16:34:32] !log decommission restbase1013-c - T223976 [16:34:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:37] T223976: Decommission restbase10(0[7-9]|1[0-5]) - https://phabricator.wikimedia.org/T223976 [16:35:16] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [16:36:02] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [16:36:04] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [16:36:06] PROBLEM - etcd request latencies on acrab is CRITICAL: instance=10.192.16.26:6443 operation=list https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:36:59] probably a result of the apiservers being restarted ^ [16:37:16] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [16:37:30] RECOVERY - etcd request latencies on acrab is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:38:22] RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [16:39:12] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [16:40:43] !log removed unreferenced files in /etc/dhcp/ on install[12]002 [16:40:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:58] (03PS1) 10Bartosz Dziewoński: Fix order of "Edit" tabs when multi-tab mode used on single-tab wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512732 (https://phabricator.wikimedia.org/T223793) [16:50:57] (03CR) 10Elukey: [C: 03+1] eventlogging.my.cnf: Increase buffer pool from 50G to 300G [puppet] - 10https://gerrit.wikimedia.org/r/512365 (https://phabricator.wikimedia.org/T224291) (owner: 10Marostegui) [16:53:20] PROBLEM - Check systemd state on ms-be1024 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[17:10:17] 10Operations, 10Maps, 10Patch-For-Review: Maps[12]004 /srv disk space is critical - https://phabricator.wikimedia.org/T224395 (10Mathew.onipe) p:05Unbreak!→03High [17:11:01] 10Operations, 10Maps: Maps[12]004 /srv disk space is critical - https://phabricator.wikimedia.org/T224395 (10Maintenance_bot) [17:11:04] (03PS1) 10Arturo Borrero Gonzalez: openstack: designate: mitaka: stretch: use pdns 3.x [puppet] - 10https://gerrit.wikimedia.org/r/512734 (https://phabricator.wikimedia.org/T224354) [17:11:16] 10Operations: (U)EFI support - https://phabricator.wikimedia.org/T93208 (10Maintenance_bot) [17:11:45] (03CR) 10jerkins-bot: [V: 04-1] openstack: designate: mitaka: stretch: use pdns 3.x [puppet] - 10https://gerrit.wikimedia.org/r/512734 (https://phabricator.wikimedia.org/T224354) (owner: 10Arturo Borrero Gonzalez) [17:13:33] (03PS2) 10Arturo Borrero Gonzalez: openstack: designate: mitaka: stretch: use pdns 3.x [puppet] - 10https://gerrit.wikimedia.org/r/512734 (https://phabricator.wikimedia.org/T224354) [17:14:25] (03CR) 10jerkins-bot: [V: 04-1] openstack: designate: mitaka: stretch: use pdns 3.x [puppet] - 10https://gerrit.wikimedia.org/r/512734 (https://phabricator.wikimedia.org/T224354) (owner: 10Arturo Borrero Gonzalez) [17:15:22] (03PS3) 10Arturo Borrero Gonzalez: openstack: designate: mitaka: stretch: use pdns 3.x [puppet] - 10https://gerrit.wikimedia.org/r/512734 (https://phabricator.wikimedia.org/T224354) [17:16:12] (03CR) 10jerkins-bot: [V: 04-1] openstack: designate: mitaka: stretch: use pdns 3.x [puppet] - 10https://gerrit.wikimedia.org/r/512734 (https://phabricator.wikimedia.org/T224354) (owner: 10Arturo Borrero Gonzalez) [17:17:01] (03PS4) 10Arturo Borrero Gonzalez: openstack: designate: mitaka: stretch: use pdns 3.x [puppet] - 10https://gerrit.wikimedia.org/r/512734 (https://phabricator.wikimedia.org/T224354) [17:17:53] (03CR) 10jerkins-bot: [V: 04-1] openstack: designate: mitaka: stretch: use pdns 3.x [puppet] - 
10https://gerrit.wikimedia.org/r/512734 (https://phabricator.wikimedia.org/T224354) (owner: 10Arturo Borrero Gonzalez) [17:18:46] 10Operations, 10Continuous-Integration-Infrastructure, 10Release Pipeline, 10Release-Engineering-Team (Kanban): Switch CI Docker Storage Driver to its own partition and to use devicemapper - https://phabricator.wikimedia.org/T178663 (10greg) [17:18:50] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban): contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 (10greg) 05Open→03Stalled stalled until the disks are installed [17:20:15] (03PS5) 10Arturo Borrero Gonzalez: openstack: designate: mitaka: stretch: use pdns 3.x [puppet] - 10https://gerrit.wikimedia.org/r/512734 (https://phabricator.wikimedia.org/T224354) [17:20:46] (03CR) 10jerkins-bot: [V: 04-1] openstack: designate: mitaka: stretch: use pdns 3.x [puppet] - 10https://gerrit.wikimedia.org/r/512734 (https://phabricator.wikimedia.org/T224354) (owner: 10Arturo Borrero Gonzalez) [17:23:06] (03PS6) 10Arturo Borrero Gonzalez: openstack: designate: mitaka: stretch: use pdns 3.x [puppet] - 10https://gerrit.wikimedia.org/r/512734 (https://phabricator.wikimedia.org/T224354) [17:23:37] (03CR) 10jerkins-bot: [V: 04-1] openstack: designate: mitaka: stretch: use pdns 3.x [puppet] - 10https://gerrit.wikimedia.org/r/512734 (https://phabricator.wikimedia.org/T224354) (owner: 10Arturo Borrero Gonzalez) [17:28:12] (03PS7) 10Arturo Borrero Gonzalez: openstack: designate: mitaka: stretch: use pdns 3.x [puppet] - 10https://gerrit.wikimedia.org/r/512734 (https://phabricator.wikimedia.org/T224354) [17:32:36] RECOVERY - Check systemd state on ms-be1024 is OK: OK - running: The system is fully operational [17:37:27] !log refreshing puppet-compiler facts [17:37:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:44] elukey_: If you're around to monitor operational impact on dbs and mcs, 
we can roll it out now if you want [17:40:00] it's a holiday for some so wasn't sure whether there'd be enough ppl [17:44:43] (03PS8) 10Andrew Bogott: openstack: designate: mitaka: stretch: use pdns 3.x [puppet] - 10https://gerrit.wikimedia.org/r/512734 (https://phabricator.wikimedia.org/T224354) (owner: 10Arturo Borrero Gonzalez) [17:44:57] (03CR) 10Mathew.onipe: Add postgres slave init cookbook (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) (owner: 10Mathew.onipe) [17:47:57] (03PS14) 10Mathew.onipe: Add postgres slave init cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) [17:51:03] (03CR) 10Andrew Bogott: [C: 03+2] openstack: designate: mitaka: stretch: use pdns 3.x [puppet] - 10https://gerrit.wikimedia.org/r/512734 (https://phabricator.wikimedia.org/T224354) (owner: 10Arturo Borrero Gonzalez) [17:56:07] !log re-imaging cloudservices1004 in order to make sure our apt magic is working properly [17:56:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:32] Krinkle: I am! [17:59:52] but we can do it another time if you want [18:00:14] I didn't mean to rush you :) [18:00:34] (03PS1) 10Volans: restbase: add team-services to Icinga notifications [puppet] - 10https://gerrit.wikimedia.org/r/512742 (https://phabricator.wikimedia.org/T224406) [18:01:24] PROBLEM - Host 208.80.154.24 is DOWN: PING CRITICAL - Packet loss = 100% [18:02:00] andrewbogott: related to the reimage? 
^^^ [18:02:15] volans: yes [18:02:25] (03PS10) 10Krinkle: errorpages: Remove unused hhvm-fatal-error.php file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412829 (https://phabricator.wikimedia.org/T113114) [18:02:26] and, dammit, no matter how many things I remember to downtime one always gets through [18:02:48] (03CR) 10Krinkle: [C: 03+2] errorpages: Remove unused hhvm-fatal-error.php file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412829 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle) [18:03:12] ACKNOWLEDGEMENT - Host 208.80.154.24 is DOWN: PING CRITICAL - Packet loss = 100% andrew bogott this is me, rebuilding the host [18:04:01] (03Merged) 10jenkins-bot: errorpages: Remove unused hhvm-fatal-error.php file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412829 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle) [18:04:03] elukey: np, rolling out this first^, [18:04:16] (03CR) 10jenkins-bot: errorpages: Remove unused hhvm-fatal-error.php file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412829 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle) [18:05:51] (03PS1) 10Pmiazga: Disable the rdf2latex Collection portlet format [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512743 (https://phabricator.wikimedia.org/T224433) [18:06:33] !log krinkle@deploy1001 Synchronized errorpages/: 4ffcbfc2ba3 (duration: 00m 48s) [18:06:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:33] (03PS2) 10Volans: restbase: add team-services to service::node alert [puppet] - 10https://gerrit.wikimedia.org/r/512742 (https://phabricator.wikimedia.org/T224406) [18:12:00] elukey: ready in 5-6min, at the mercy of Jenkins [18:12:10] Krinkle: ack [18:13:07] Krinkle: as a follow up to this, I'll create bandwidth alarms for the mc hosts so we can catch any issue right after a deployment [18:13:24] PROBLEM - puppet last run on puppetmaster1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:13:34] (03PS3) 10Volans: restbase: add team-services to service::node alert [puppet] - 10https://gerrit.wikimedia.org/r/512742 (https://phabricator.wikimedia.org/T224406) [18:15:32] (03CR) 10Volans: "@mobrovac: if you want that team-services gets alerted for *all* checks on those hosts pick PS1, if you just want that check mentioned in " [puppet] - 10https://gerrit.wikimedia.org/r/512742 (https://phabricator.wikimedia.org/T224406) (owner: 10Volans) [18:16:23] elukey: cool, feel free to tag the perf-radar on a task if there is/willbe a task [18:22:31] RECOVERY - Host 208.80.154.24 is UP: PING OK - Packet loss = 16%, RTA = 0.29 ms [18:23:29] RECOVERY - puppet last run on puppetmaster1001 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [18:32:45] elukey: ok, staging now [18:41:16] !log krinkle@deploy1001 Synchronized php-1.34.0-wmf.6/includes/libs/rdbms: 66556bf37e8 / T223310, T223978 (duration: 00m 50s) [18:41:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:23] T223310: Investigate increase in tx bandwidth usage for mc1033 - https://phabricator.wikimedia.org/T223310 [18:41:24] T223978: 1.34.0-wmf.3 generating lots of temporary tables on MySQL slaves - https://phabricator.wikimedia.org/T223978 [18:45:39] Krinkle: the problem seems fixed! \o/ [18:45:52] I am checking https://grafana.wikimedia.org/d/000000574/t204083-investigation?orgId=1&from=now-1h&to=now&panelId=3&fullscreen&edit [18:45:58] Yeah [18:46:09] mc1033 is down to normal levels [18:46:26] thanks a lot! 
Cc: AaronSchulz [18:46:30] the dbs looking good as well [18:46:34] 10Operations, 10Growth-Team, 10Performance-Team, 10Wikidata, and 4 others: Investigate increase in tx bandwidth usage for mc1033 - https://phabricator.wikimedia.org/T223310 (10Krinkle) 05Open→03Resolved a:05elukey→03aaron [18:48:09] 10Operations, 10Growth-Team, 10Performance-Team, 10Wikidata, and 4 others: Investigate increase in tx bandwidth usage for mc1033 - https://phabricator.wikimedia.org/T223310 (10Krinkle) Recovery: {F29260036} [18:50:09] Krinkle: going afk, thanks a lot! [18:50:21] I cross posted the results to the sre chan as well [18:53:53] (03PS1) 10Andrew Bogott: pdns3hack: don't pin pdns-recursor to the old repo [puppet] - 10https://gerrit.wikimedia.org/r/512744 (https://phabricator.wikimedia.org/T224354) [18:54:48] (03CR) 10Andrew Bogott: [C: 03+2] pdns3hack: don't pin pdns-recursor to the old repo [puppet] - 10https://gerrit.wikimedia.org/r/512744 (https://phabricator.wikimedia.org/T224354) (owner: 10Andrew Bogott) [18:59:16] RECOVERY - Recursive DNS on 208.80.154.24 is OK: DNS OK: 0.282 seconds response time. 
www.wikipedia.org returns 208.80.154.224 https://wikitech.wikimedia.org/wiki/DNS [19:05:53] (PS1) Andrew Bogott: designate: improve firewall rules for memc access [puppet] - https://gerrit.wikimedia.org/r/512745 [19:06:34] PROBLEM - HHVM rendering on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:07:44] RECOVERY - HHVM rendering on mw1234 is OK: HTTP OK: HTTP/1.1 200 OK - 80252 bytes in 0.594 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:09:33] (CR) Andrew Bogott: [C: +2] designate: improve firewall rules for memc access [puppet] - https://gerrit.wikimedia.org/r/512745 (owner: Andrew Bogott) [19:19:36] Operations, DNS, Tools, Traffic: I can't load tools.wmflabs.org - https://phabricator.wikimedia.org/T224442 (Patriccck) [19:22:30] Operations, DNS, Tools, Traffic: I can't load tools.wmflabs.org - https://phabricator.wikimedia.org/T224442 (Patriccck) [19:24:45] (PS1) Andrew Bogott: designate/pdns: allow db access from standby to primary [puppet] - https://gerrit.wikimedia.org/r/512751 [19:29:37] Operations, DNS, Tools, Traffic: I can't load tools.wmflabs.org - https://phabricator.wikimedia.org/T224442 (Krenair) Do you have dig? If so can you `dig wmflabs.org NS @cloud-ns0.wikimedia.org` and `dig wmflabs.org NS`? [19:32:07] Operations, MediaWiki-Cache, serviceops, Core Platform Team (Security, stability, performance and scalability (TEC1)), and 5 others: Use a multi-dc aware store for ObjectCache's MainStash if needed. - https://phabricator.wikimedia.org/T212129 (Krinkle) >>! In T212129#5213676, @Joe wrote: >>>! In...
[19:32:23] Operations, DNS, Tools, Traffic: I can't load tools.wmflabs.org - https://phabricator.wikimedia.org/T224442 (Mbch331) For windows that is `nslookup -type=ns wmflabs.org cloud-ns0.wikimedia.org` and `nslookup -type=ns wmflabs.org` [19:33:29] Operations, ops-codfw, DBA, Patch-For-Review: db2091 rebooted unexpectedly - https://phabricator.wikimedia.org/T224393 (Marostegui) @papaul should we contact the vendor with these logs? [19:40:33] Operations, DNS, Tools, Traffic: I can't load tools.wmflabs.org - https://phabricator.wikimedia.org/T224442 (Andrew) Hello! This is probably something I caused as part of maintenance to our dns setup. Is it better now? [19:46:26] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational [19:47:30] (CR) Andrew Bogott: [C: +2] designate/pdns: allow db access from standby to primary [puppet] - https://gerrit.wikimedia.org/r/512751 (owner: Andrew Bogott) [20:00:15] Operations, Release-Engineering-Team, SRE-Access-Requests: Request access to analytics cluster for Alaa Sarhan - https://phabricator.wikimedia.org/T223697 (Nuria) @alaa_wmde Can you be a bit more specific as to what data you need access to? So you know Logstash and hadoop do not share any data, mayb... [20:00:16] PROBLEM - Host tools.wmflabs.org is DOWN: check_ping: Invalid hostname/address - tools.wmflabs.org [20:00:46] PROBLEM - puppet last run on analytics1048 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle.
[20:01:06] RECOVERY - Host tools.wmflabs.org is UP: PING OK - Packet loss = 0%, RTA = 1.11 ms [20:01:44] hmm [20:01:50] andrewbogott ^^ [20:02:06] paladox: yep, that's me :/ [20:02:09] oh [20:02:26] PROBLEM - Host tools.wmflabs.org is DOWN: check_ping: Invalid hostname/address - tools.wmflabs.org [20:02:54] PROBLEM - Host checker.tools.wmflabs.org is DOWN: /bin/ping -n -U -w 15 -c 5 checker.tools.wmflabs.org [20:03:14] RECOVERY - Host checker.tools.wmflabs.org is UP: PING OK - Packet loss = 0%, RTA = 3.22 ms [20:04:06] PROBLEM - Host checker.tools.wmflabs.org is DOWN: check_ping: Invalid hostname/address - checker.tools.wmflabs.org [20:04:30] RECOVERY - Host checker.tools.wmflabs.org is UP: PING OK - Packet loss = 0%, RTA = 4.77 ms [20:04:40] RECOVERY - Host tools.wmflabs.org is UP: PING OK - Packet loss = 0%, RTA = 9.11 ms [20:05:23] It looks stable to me, I'm not sure why it's flapping [20:05:46] PROBLEM - puppet last run on lvs3004 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle. [20:07:22] PROBLEM - Host checker.tools.wmflabs.org is DOWN: /bin/ping -n -U -w 15 -c 5 checker.tools.wmflabs.org [20:08:08] RECOVERY - Host checker.tools.wmflabs.org is UP: PING OK - Packet loss = 0%, RTA = 1.48 ms [20:11:01] Operations, ops-codfw, DBA: db2091 rebooted unexpectedly - https://phabricator.wikimedia.org/T224393 (Maintenance_bot) [20:16:54] (CR) Volans: "Looks almost ready, very minor small things inline."
(10 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/493138 (https://phabricator.wikimedia.org/T217072) (owner: CRusnov) [20:19:19] !log gilles@deploy1001 Started deploy [performance/asoranking@61039f1]: (no justification provided) [20:19:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:25] !log gilles@deploy1001 Finished deploy [performance/asoranking@61039f1]: (no justification provided) (duration: 00m 06s) [20:19:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:42] PROBLEM - Host checker.tools.wmflabs.org is DOWN: /bin/ping -n -U -w 15 -c 5 checker.tools.wmflabs.org [20:21:12] RECOVERY - Host checker.tools.wmflabs.org is UP: PING OK - Packet loss = 0%, RTA = 2.44 ms [20:27:50] RECOVERY - puppet last run on analytics1048 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [20:30:59] PROBLEM - toolschecker: All Flannel etcd nodes are healthy on checker.tools.wmflabs.org is CRITICAL: Name or service not known https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [20:32:04] (PS1) Marostegui: db2091: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/512768 (https://phabricator.wikimedia.org/T224393) [20:32:44] RECOVERY - puppet last run on lvs3004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [20:33:54] PROBLEM - Host checker.tools.wmflabs.org is DOWN: check_ping: Invalid hostname/address - checker.tools.wmflabs.org [20:34:12] RECOVERY - Host checker.tools.wmflabs.org is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms [20:35:10] PROBLEM - Host tools.wmflabs.org is DOWN: check_ping: Invalid hostname/address - tools.wmflabs.org [20:35:30] (CR) Marostegui: [C: +2] db2091: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/512768 (https://phabricator.wikimedia.org/T224393) (owner: Marostegui) [20:36:24] RECOVERY - toolschecker: All Flannel etcd nodes are
healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 1.267 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [20:36:48] RECOVERY - Host tools.wmflabs.org is UP: PING OK - Packet loss = 0%, RTA = 0.95 ms [20:38:30] Operations, ops-codfw, DBA, Patch-For-Review: db2091 rebooted unexpectedly - https://phabricator.wikimedia.org/T224393 (Marostegui) a:Papaul Also, upgrade firmware and BIOS I guess? [20:43:30] PROBLEM - Host checker.tools.wmflabs.org is DOWN: /bin/ping -n -U -w 15 -c 5 checker.tools.wmflabs.org [20:43:49] RECOVERY - Host checker.tools.wmflabs.org is UP: PING OK - Packet loss = 0%, RTA = 1.63 ms [20:50:36] PROBLEM - Host checker.tools.wmflabs.org is DOWN: /bin/ping -n -U -w 15 -c 5 checker.tools.wmflabs.org [20:50:42] RECOVERY - Host checker.tools.wmflabs.org is UP: PING OK - Packet loss = 0%, RTA = 1.63 ms [20:51:50] PROBLEM - Host tools.wmflabs.org is DOWN: check_ping: Invalid hostname/address - tools.wmflabs.org [20:54:04] RECOVERY - Host tools.wmflabs.org is UP: PING OK - Packet loss = 0%, RTA = 4.38 ms [21:10:52] Operations, ops-codfw, DBA: db2091 rebooted unexpectedly - https://phabricator.wikimedia.org/T224393 (Maintenance_bot) [21:26:59] (CR) Urbanecm: [C: -1] "Per T221933#5215506." [mediawiki-config] - https://gerrit.wikimedia.org/r/507932 (https://phabricator.wikimedia.org/T221933) (owner: 星耀晨曦) [21:32:02] (CR) Urbanecm: Disable the rdf2latex Collection portlet format (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/512743 (https://phabricator.wikimedia.org/T224433) (owner: Pmiazga) [21:41:41] Operations, User-herron: Improve visibility of incoming operations tasks - https://phabricator.wikimedia.org/T197624 (Aklapper) Wondering how to proceed with https://phabricator.wikimedia.org/project/board/1025/ In my understanding: 1. SRE needs to define a date / task ID threshold, up to which task ID...
[21:45:42] PROBLEM - NTP peers on dns5002 is CRITICAL: NTP CRITICAL: Offset 0.998671 secs (CRITICAL) https://wikitech.wikimedia.org/wiki/NTP [21:47:06] RECOVERY - NTP peers on dns5002 is OK: NTP OK: Offset 0.00089 secs https://wikitech.wikimedia.org/wiki/NTP [22:19:03] !log gilles@deploy1001 Started deploy [performance/asoranking@d0c156e]: T224388 [22:19:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:19:09] !log gilles@deploy1001 Finished deploy [performance/asoranking@d0c156e]: T224388 (duration: 00m 05s) [22:19:09] T224388: AS report crashes on generation - https://phabricator.wikimedia.org/T224388 [22:19:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:50:23] Operations, decommission, media-storage, User-fgiunchedi: Decom ms-be101[345] - https://phabricator.wikimedia.org/T220590 (Volans) I've put the state of those hosts in Netbox back to `active` as they are currently "active" for the `spare::system` role and decomissioning should be set once we run... [22:50:38] Operations, ops-codfw, decommission, media-storage, User-fgiunchedi: decom ms-be201[345] - https://phabricator.wikimedia.org/T221068 (Volans) I've put the state of those hosts in Netbox back to `active` as they are currently "active" for the `spare::system` role and decomissioning should be s... [22:52:04] !log gilles@deploy1001 Started deploy [performance/asoranking@bacfc37]: T224388 [22:52:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:52:09] !log gilles@deploy1001 Finished deploy [performance/asoranking@bacfc37]: T224388 (duration: 00m 05s) [22:52:09] T224388: AS report crashes on generation - https://phabricator.wikimedia.org/T224388 [22:52:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:38] !log restarting gerrit due to active threads being stuck being a sendemail thread.
[23:10:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:52] my session survived the logout [23:12:58] er the restart [23:14:36] that's good. It should (although I know it doesn't). I know session cleanup happens at 1am UTC. I have a suspicion that following a restart, that's when folks get logged out [23:15:50] interesting [23:16:06] I'll see how it is tomorrow morning then [23:16:16] PROBLEM - puppet last run on cumin2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/cookbooks] [23:16:17] for now, good night :-) [23:16:52] PROBLEM - puppet last run on webperf2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_performance/docroot] [23:17:16] PROBLEM - puppet last run on stat1006 is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 4 minutes ago with 4 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config],Exec[git_pull_mediawiki/event-schemas],Exec[git_pull_statistics_mediawiki],Exec[git_pull_analytics/reportupdater] [23:17:30] oh, send mail again? hmm. [23:17:58] PROBLEM - puppet last run on webperf1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[git_pull_performance/docroot] [23:18:44] yep, sendmail again, filed https://phabricator.wikimedia.org/T224448 [23:19:28] !log gerrit back after restarting due to T224448 [23:19:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:19:33] T224448: Gerrit http threads stuck behind sendemail thread - https://phabricator.wikimedia.org/T224448 [23:19:34] I doin't think sendmail will show up in the ssh command if threads are stuck thcipriani [23:20:36] right, I added that bit to contrast T131189 [23:20:36] T131189: Only receiving few emails from Gerrit - https://phabricator.wikimedia.org/T131189 [23:22:33] ah ok. [23:43:14] RECOVERY - puppet last run on cumin2001 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [23:43:54] RECOVERY - puppet last run on webperf2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [23:44:16] RECOVERY - puppet last run on stat1006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:44:56] RECOVERY - puppet last run on webperf1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures