[00:00:04] Deploy window No deploys - US Holiday (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190527T0000)
[00:01:05] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[00:07:17] PROBLEM - Check systemd state on cloudcontrol1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[00:15:07] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[00:20:43] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[00:26:15] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational
[00:26:19] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[00:30:33] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[00:36:13] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[00:38:57] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[00:40:27] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[00:44:37] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[00:47:01] PROBLEM - Check systemd state on ms-be2033 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[00:50:13] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[01:09:47] 10Operations, 10Discovery-Search, 10Operations-Software-Development, 10User-Joe, 10User-jijiki: Create WDQS reboot cookbook - https://phabricator.wikimedia.org/T224385 (10Mathew.onipe)
[01:09:58] 10Operations, 10Discovery-Search, 10Operations-Software-Development, 10User-Joe, 10User-jijiki: Create WDQS reboot cookbook - https://phabricator.wikimedia.org/T224385 (10Mathew.onipe) p:05Triage→03Normal
[01:12:59] (03PS3) 10Mathew.onipe: cloudelastic: remove ocsp_proxy [puppet] - 10https://gerrit.wikimedia.org/r/511381 (https://phabricator.wikimedia.org/T223519)
[01:16:57] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[01:21:09] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[01:25:17] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational
[01:25:23] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[01:31:03] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[01:32:21] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[01:35:17] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[01:39:11] RECOVERY - Check systemd state on ms-be2033 is OK: OK - running: The system is fully operational
[01:40:51] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[01:45:03] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[01:50:41] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[01:54:51] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[02:00:31] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[02:07:33] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[02:10:21] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[02:14:33] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[02:20:11] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[02:25:41] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational
[02:34:09] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[02:46:57] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[02:51:07] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[02:56:51] PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle.
[02:59:23] 10Operations, 10Cassandra, 10RESTBase, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 2 others: Decommission restbase10(0[7-9]|1[0-5]) - https://phabricator.wikimedia.org/T223976 (10Eevans)
[02:59:56] !log decommissioning restbase1013-a -- T223976
[03:00:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:00:03] T223976: Decommission restbase10(0[7-9]|1[0-5]) - https://phabricator.wikimedia.org/T223976
[03:17:45] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[03:20:35] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[03:22:03] PROBLEM - Disk space on maps2004 is CRITICAL: DISK CRITICAL - free space: /srv 53927 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[03:26:03] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational
[03:27:37] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[03:29:09] PROBLEM - Disk space on maps2004 is CRITICAL: DISK CRITICAL - free space: /srv 54697 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[03:29:11] RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[03:30:25] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[03:33:07] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[03:34:39] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[03:40:17] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[03:44:31] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[03:50:09] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[03:54:19] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[04:17:49] PROBLEM - Host db2091 is DOWN: PING CRITICAL - Packet loss = 100%
[04:20:59] RECOVERY - Host db2091 is UP: PING OK - Packet loss = 0%, RTA = 36.23 ms
[04:21:03] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[04:23:10] !log gilles@deploy1001 Started deploy [performance/asoranking@61039f1]: (no justification provided)
[04:23:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:23:39] !log gilles@deploy1001 Finished deploy [performance/asoranking@61039f1]: (no justification provided) (duration: 00m 28s)
[04:23:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:23:51] PROBLEM - MariaDB Slave IO: s4 on db2091 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[04:24:09] PROBLEM - MariaDB Slave IO: s2 on db2091 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[04:24:15] PROBLEM - MariaDB Slave SQL: s2 on db2091 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[04:24:19] PROBLEM - MariaDB Slave SQL: s4 on db2091 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[04:24:23] PROBLEM - MariaDB read only s2 on db2091 is CRITICAL: Could not connect to localhost:3312 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[04:24:23] PROBLEM - MariaDB read only s4 on db2091 is CRITICAL: Could not connect to localhost:3314 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[04:25:11] PROBLEM - mysqld processes on db2091 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[04:25:17] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[04:25:52] well that ain't good
[04:30:53] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[04:31:58] just leave it (mysql), it's codfw, it's a slave, our dbas will see it when they come on line
[04:32:07] Ah oke doke
[04:32:11] thanks
[04:32:13] sure
[04:32:42] marostegui: ^^ (for when you arrive)
[04:34:09] PROBLEM - MariaDB Slave Lag: s2 on db2091 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[04:34:15] PROBLEM - MariaDB Slave Lag: s4 on db2091 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[04:35:07] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[04:40:45] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[04:44:57] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[04:49:05] PROBLEM - puppet last run on kraz is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle.
[04:50:39] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[04:54:28] <_joe_> ok cloudcontrol will drive me insane
[04:54:53] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[05:00:31] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[05:04:43] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[05:05:10] 10Operations, 10Analytics, 10Performance-Team, 10Traffic: Only serve debug HTTP headers when x-wikimedia-debug is present - https://phabricator.wikimedia.org/T210484 (10Gilles) p:05Low→03Lowest
[05:10:21] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[05:14:35] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[05:16:07] RECOVERY - puppet last run on kraz is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[05:20:15] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[05:24:27] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[05:30:07] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[05:34:19] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[05:51:17] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[06:04:03] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:08:46] 10Operations, 10MediaWiki-Cache, 10serviceops, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 5 others: Use a multi-dc aware store for ObjectCache's MainStash if needed. - https://phabricator.wikimedia.org/T212129 (10Joe) >>! In T212129#5199087, @Krinkle wrote: > Regard...
[06:11:05] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[06:15:17] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:20:53] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[06:25:05] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:28:59] PROBLEM - puppet last run on dbmonitor2001 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle.
[06:29:57] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[06:30:41] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[06:31:13] PROBLEM - puppet last run on etcd1003 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle.
[06:32:01] PROBLEM - puppet last run on acmechief1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/puppet-enabled]
[06:32:39] 10Operations, 10Continuous-Integration-Config: Fix operations/puppet.git "rebase hell" - https://phabricator.wikimedia.org/T224033 (10Joe) I am 100% against having ci handle merges of ops/puppet. Think of the case ci is down and we need puppet for anything. I am also all for moving to rebase-if-necessary and...
[06:34:53] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:35:35] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[06:38:11] 04Critical Alert for device cr1-codfw.wikimedia.org - Juniper alarm active
[06:38:49] I am really having a hard time keeping track of the real alerts
[06:40:31] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[06:42:48] <_joe_> me too
[06:47:33] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:50:21] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[06:50:28] 10Operations, 10DBA: db2091 mysql service stopped running - https://phabricator.wikimedia.org/T224393 (10jijiki)
[06:54:31] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:56:23] 10Operations, 10DBA: db2091 mysql service stopped running - https://phabricator.wikimedia.org/T224393 (10jijiki) p:05Triage→03High
[06:58:11] RECOVERY - puppet last run on etcd1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:58:40] 10Operations, 10Maps: Maps[12]004 /srv disk space is critical - https://phabricator.wikimedia.org/T224395 (10Mathew.onipe) p:05Triage→03Unbreak!
[06:58:57] RECOVERY - puppet last run on acmechief1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[06:59:01] 10Operations, 10MediaWiki-Cache, 10serviceops, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 5 others: Use a multi-dc aware store for ObjectCache's MainStash if needed. - https://phabricator.wikimedia.org/T212129 (10daniel) >>! In T212129#5211137, @EvanProdromou wrote:...
[07:00:07] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[07:00:25] PROBLEM - puppet last run on oresrdb1001 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle.
[07:01:23] RECOVERY - puppet last run on dbmonitor2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:05:13] 10Operations, 10Maps: Maps[12]004 /srv disk space is critical - https://phabricator.wikimedia.org/T224395 (10Gehel) Previous instance of a similar problem: T194966 Note that we've reimaged the servers since then, and we might have lost some configuration in the process.
[07:05:31] !log running nodetool repair on maps2004 -T224395
[07:05:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:05:38] T224395: Maps[12]004 /srv disk space is critical - https://phabricator.wikimedia.org/T224395
[07:11:42] 10Operations, 10DBA: db2091 mysql service stopped running - https://phabricator.wikimedia.org/T224393 (10Marostegui) p:05High→03Normal Decreasing to normal, codfw isn't in use at the moment. Thanks for creating the task and working out the depool patch!
[07:22:17] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:22:33] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:27:23] RECOVERY - puppet last run on oresrdb1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[07:31:15] PROBLEM - cassandra CQL 10.192.48.57:9042 on maps2004 is CRITICAL: connect to address 10.192.48.57 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[07:35:13] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[07:35:29] (03PS1) 10Effie Mouzeli: db-codfw.php: Depool 2109 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512634 (https://phabricator.wikimedia.org/T224393)
[07:37:42] (03CR) 10jerkins-bot: [V: 04-1] db-codfw.php: Depool 2109 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512634 (https://phabricator.wikimedia.org/T224393) (owner: 10Effie Mouzeli)
[07:37:49] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:38:05] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:40:55] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[07:41:40] (03CR) 10KartikMistry: "Ping!" [puppet] - 10https://gerrit.wikimedia.org/r/506043 (https://phabricator.wikimedia.org/T219819) (owner: 10Santhosh)
[07:43:46] (03PS2) 10Effie Mouzeli: db-codfw.php: Depool 2109 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512634 (https://phabricator.wikimedia.org/T224393)
[07:44:44] (03CR) 10jerkins-bot: [V: 04-1] db-codfw.php: Depool 2109 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512634 (https://phabricator.wikimedia.org/T224393) (owner: 10Effie Mouzeli)
[07:45:09] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[07:48:25] 10Operations, 10Traffic: ATS: log mode cannot depend on log filters being configured - https://phabricator.wikimedia.org/T224397 (10Vgutierrez)
[07:49:28] 10Operations, 10Traffic: ATS: log mode cannot depend on log filters being configured - https://phabricator.wikimedia.org/T224397 (10Vgutierrez) p:05Triage→03Normal
[07:50:47] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[07:51:33] (03PS3) 10Effie Mouzeli: db-codfw.php: Depool 2019 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512634 (https://phabricator.wikimedia.org/T224393)
[07:52:37] (03PS4) 10Effie Mouzeli: db-codfw.php: Depool 2019 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512634 (https://phabricator.wikimedia.org/T224393)
[07:53:54] (03PS1) 10Vgutierrez: ATS: Set log mode independently of log filters [puppet] - 10https://gerrit.wikimedia.org/r/512636 (https://phabricator.wikimedia.org/T224397)
[07:54:23] (03CR) 10jerkins-bot: [V: 04-1] db-codfw.php: Depool 2019 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512634 (https://phabricator.wikimedia.org/T224393) (owner: 10Effie Mouzeli)
[07:54:33] PROBLEM - cassandra service on maps2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:54:57] PROBLEM - Check systemd state on maps2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[07:54:59] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[07:56:43] (03PS5) 10Effie Mouzeli: db-codfw.php: Depool 2019 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512634 (https://phabricator.wikimedia.org/T224393)
[07:57:51] ACKNOWLEDGEMENT - MariaDB Slave IO: s2 on db2091 is CRITICAL: CRITICAL slave_io_state could not connect Effie Mouzeli Server restarted - T224393 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[07:57:51] ACKNOWLEDGEMENT - MariaDB Slave IO: s4 on db2091 is CRITICAL: CRITICAL slave_io_state could not connect Effie Mouzeli Server restarted - T224393 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[07:57:51] ACKNOWLEDGEMENT - MariaDB Slave Lag: s2 on db2091 is CRITICAL: CRITICAL slave_sql_lag could not connect Effie Mouzeli Server restarted - T224393 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[07:57:51] ACKNOWLEDGEMENT - MariaDB Slave Lag: s4 on db2091 is CRITICAL: CRITICAL slave_sql_lag could not connect Effie Mouzeli Server restarted - T224393 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[07:57:51] ACKNOWLEDGEMENT - MariaDB Slave SQL: s2 on db2091 is CRITICAL: CRITICAL slave_sql_state could not connect Effie Mouzeli Server restarted - T224393 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[07:57:52] ACKNOWLEDGEMENT - MariaDB Slave SQL: s4 on db2091 is CRITICAL: CRITICAL slave_sql_state could not connect Effie Mouzeli Server restarted - T224393 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[07:57:52] ACKNOWLEDGEMENT - MariaDB read only s2 on db2091 is CRITICAL: Could not connect to localhost:3312 Effie Mouzeli Server restarted - T224393 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[07:57:53] ACKNOWLEDGEMENT - MariaDB read only s4 on db2091 is CRITICAL: Could not connect to localhost:3314 Effie Mouzeli Server restarted - T224393 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[07:57:53] ACKNOWLEDGEMENT - mysqld processes on db2091 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld Effie Mouzeli Server restarted - T224393 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[08:00:24] (03CR) 10Volans: db-codfw.php: Depool 2019 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512634 (https://phabricator.wikimedia.org/T224393) (owner: 10Effie Mouzeli)
[08:00:37] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[08:03:01] !log depool maps2004 - T224395
[08:03:05] (03PS6) 10Effie Mouzeli: db-codfw.php: Depool 2091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512634 (https://phabricator.wikimedia.org/T224393)
[08:03:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:06] T224395: Maps[12]004 /srv disk space is critical - https://phabricator.wikimedia.org/T224395
[08:04:49] RECOVERY - Check systemd state on maps2004 is OK: OK - running: The system is fully operational
[08:04:49] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:06:36] 10Operations: exim paniclog on $HOST has non-zero size - https://phabricator.wikimedia.org/T224399 (10Volans)
[08:07:50] 10Operations: exim paniclog on $HOST has non-zero size - https://phabricator.wikimedia.org/T224399 (10Volans) p:05Triage→03Normal
[08:07:52] (03CR) 10Vgutierrez: "basically a NOOP on existent servers, but it's going to trigger a reload of the configuration on existent ATS instances: https://puppet-co" [puppet] - 10https://gerrit.wikimedia.org/r/512636 (https://phabricator.wikimedia.org/T224397) (owner: 10Vgutierrez)
[08:08:27] (03PS7) 10Effie Mouzeli: db-codfw.php: Depool db2091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512634 (https://phabricator.wikimedia.org/T224393)
[08:09:41] (03CR) 10Volans: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512634 (https://phabricator.wikimedia.org/T224393) (owner: 10Effie Mouzeli)
[08:10:31] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[08:11:53] 10Operations, 10DBA, 10Patch-For-Review: db2091 rebooted unexpectedly - https://phabricator.wikimedia.org/T224393 (10Volans)
[08:14:43] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:15:48] (03CR) 10Effie Mouzeli: [C: 03+2] db-codfw.php: Depool db2091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512634 (https://phabricator.wikimedia.org/T224393) (owner: 10Effie Mouzeli)
[08:17:05] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512634 (https://phabricator.wikimedia.org/T224393) (owner: 10Effie Mouzeli)
[08:17:19] (03CR) 10jenkins-bot: db-codfw.php: Depool db2091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512634 (https://phabricator.wikimedia.org/T224393) (owner: 10Effie Mouzeli)
[08:19:38] 10Operations, 10decommission, 10media-storage, 10User-fgiunchedi: Decom ms-be101[345] - https://phabricator.wikimedia.org/T220590 (10Volans) @fgiunchedi FYI we got some email to `root@` from `ms-be1014` with the following: ` Cron test -x /usr/sbin/anacron || ( cd / && run-parts --report /e...
[08:20:21] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[08:20:54] arturo: you around? cloudcontrol1003 is flapping the systemd alert since 2019-05-25 21:46
[08:22:34] (03PS1) 10Gehel: maps: all maps servers use RAID10 [puppet] - 10https://gerrit.wikimedia.org/r/512639 (https://phabricator.wikimedia.org/T224395)
[08:24:33] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:25:12] 10Operations, 10Maps, 10Patch-For-Review: Maps[12]004 /srv disk space is critical - https://phabricator.wikimedia.org/T224395 (10Gehel) For whatever reason, only maps1004 was reimaged to RAID10 (instead of RAID1) when adding new disks (so we have 2 unused disks in each server). Note that since we have disks...
[08:27:04] !log jiji@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2091 - T224393 (duration: 00m 49s)
[08:27:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:10] T224393: db2091 rebooted unexpectedly - https://phabricator.wikimedia.org/T224393
[08:29:30] (03PS59) 10Vgutierrez: ATS: Provide a TLS terminator profile [puppet] - 10https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594)
[08:30:09] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[08:32:41] (03CR) 10Mathew.onipe: [C: 03+1] maps: all maps servers use RAID10 [puppet] - 10https://gerrit.wikimedia.org/r/512639 (https://phabricator.wikimedia.org/T224395) (owner: 10Gehel)
[08:34:21] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:40:35] PROBLEM - HHVM rendering on mw2239 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[08:41:51] RECOVERY - HHVM rendering on mw2239 is OK: HTTP OK: HTTP/1.1 200 OK - 79975 bytes in 0.402 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[08:45:19] 10Operations, 10SRE-Access-Requests, 10observability: Requesting access to icinga for tonycepo - https://phabricator.wikimedia.org/T224313 (10Aklapper) 05Open→03Stalled
[08:51:06] yes I know volans
[08:51:54] ack
[08:52:45] !log 1 day downtime systemd check for cloudcontrol1003
[08:52:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:26] 10Operations, 10ops-codfw, 10media-storage, 10observability, 10User-fgiunchedi: ms-be2043 'sdd' throwing lots of errors - https://phabricator.wikimedia.org/T222654 (10fgiunchedi) Thanks for taking a look! >>! In T222654#5207032, @faidon wrote: > I'm not at all sure, but I don't see an LD 5 at all. Is it...
[08:57:12] (03CR) 10Gehel: [C: 03+2] maps: all maps servers use RAID10 [puppet] - 10https://gerrit.wikimedia.org/r/512639 (https://phabricator.wikimedia.org/T224395) (owner: 10Gehel)
[08:57:30] 10Operations, 10Release-Engineering-Team, 10SRE-Access-Requests: Request access to analytics cluster for Alaa Sarhan - https://phabricator.wikimedia.org/T223697 (10alaa_wmde) Hi @Nuria yes access to both would be ideal .. thank you
[09:00:55] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[09:04:55] RECOVERY - cassandra service on maps2004 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:10:42] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/510985 (https://phabricator.wikimedia.org/T223496) (owner: 10Dzahn)
[09:11:50] 10Operations, 10Release-Engineering-Team, 10SRE-Access-Requests: Request access to deployment cluster for Alaa Sarhan - https://phabricator.wikimedia.org/T223698 (10alaa_wmde) Thanks @Volans I've fully and carefully read the Server Access Responsibilities document, and signed it. I believe we are awaiting on...
[09:15:19] (03PS1) 10Vgutierrez: ATS: Ensure proper permissions for ATS layouts [puppet] - 10https://gerrit.wikimedia.org/r/512643 (https://phabricator.wikimedia.org/T221217)
[09:16:06] !log remove maps2004 from maps cassandra cluster - T224395
[09:16:11] T224395: Maps[12]004 /srv disk space is critical - https://phabricator.wikimedia.org/T224395
[09:16:12] 10Operations, 10Icinga, 10Core Platform Team Backlog (Watching / External), 10Services (watching): Incorrect icinga settings for mobrovac - https://phabricator.wikimedia.org/T224406 (10mobrovac)
[09:17:05] 10Operations, 10DBA, 10Patch-For-Review: db2091 rebooted unexpectedly - https://phabricator.wikimedia.org/T224393 (10jijiki) `-------------------------------------------------------------------------------- SeqNumber = 138 Message ID = LOG007 Category = Audit AgentID = DE Severity...
[09:18:08] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/510985 (https://phabricator.wikimedia.org/T223496) (owner: 10Dzahn)
[09:18:19] 10Operations, 10Maps, 10Patch-For-Review: Maps[12]004 /srv disk space is critical - https://phabricator.wikimedia.org/T224395 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on cumin2001.codfw.wmnet for hosts: ` ['maps2004.codfw.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimag...
[09:18:31] 10Operations, 10Wikimedia-Site-requests: Add more bad words to fancycaptcha/badwords - https://phabricator.wikimedia.org/T224343 (10Volans) p:05Triage→03Normal
[09:27:31] (03CR) 10Vgutierrez: [C: 03+1] "NOOP on existing instances: https://puppet-compiler.wmflabs.org/compiler1001/16762/. In labs this is enough to start the trafficserver-tls" [puppet] - 10https://gerrit.wikimedia.org/r/512643 (https://phabricator.wikimedia.org/T221217) (owner: 10Vgutierrez)
[09:28:03] PROBLEM - tileratorui on maps2001 is CRITICAL: connect to address 10.192.0.144 and port 6535: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui
[09:28:13] PROBLEM - tilerator on maps2001 is CRITICAL: connect to address 10.192.0.144 and port 6534: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator
[09:28:25] PROBLEM - tileratorui on maps2002 is CRITICAL: connect to address 10.192.16.179 and port 6535: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui
[09:28:27] looking
[09:28:27] PROBLEM - tileratorui on maps2003 is CRITICAL: connect to address 10.192.32.146 and port 6535: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui
[09:28:45] PROBLEM - tilerator on maps2003 is CRITICAL: connect to address 10.192.32.146 and port 6534: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator
[09:28:47] probably related to T224395
[09:28:48] T224395: Maps[12]004 /srv disk space is critical - https://phabricator.wikimedia.org/T224395
[09:29:25] PROBLEM - tilerator on maps2002 is CRITICAL: connect to address 10.192.16.179 and port 6534: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator
[09:29:49] RECOVERY - tileratorui on maps2002 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 0.138 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui
[09:30:31] (03PS1) 10Arturo Borrero Gonzalez: openstack: mitaka: stretch: don't install python3-msgpack from jessie-backports [puppet] - 10https://gerrit.wikimedia.org/r/512646 (https://phabricator.wikimedia.org/T224345)
[09:30:53] 10Operations, 10DBA, 10Patch-For-Review: db2091 rebooted unexpectedly - https://phabricator.wikimedia.org/T224393 (10Volans) Related documentation for the most useful messages: - PWR2262 https://www.dell.com/support/manuals/it/it/itbsdt1/dell-opnmang-sw-v8.2/eemi_13g_v1.3-v2/pwr-event-messages?guid=guid-5bc...
[09:33:05] (03CR) 10Vgutierrez: "traffic_layout verify is happy as well: https://phabricator.wikimedia.org/P8561" [puppet] - 10https://gerrit.wikimedia.org/r/512636 (https://phabricator.wikimedia.org/T224397) (owner: 10Vgutierrez)
[09:33:08] 10Operations, 10DBA, 10Patch-For-Review: db2091 rebooted unexpectedly - https://phabricator.wikimedia.org/T224393 (10Volans) Forgot to mention, nothing in syslog or journalctl for MySQL on s2/s4 units.
[09:34:01] PROBLEM - tileratorui on maps2002 is CRITICAL: connect to address 10.192.16.179 and port 6535: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui
[09:34:30] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: mitaka: stretch: don't install python3-msgpack from jessie-backports [puppet] - 10https://gerrit.wikimedia.org/r/512646 (https://phabricator.wikimedia.org/T224345) (owner: 10Arturo Borrero Gonzalez)
[09:35:01] RECOVERY - tilerator on maps2002 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 0.109 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator
[09:35:05] RECOVERY - tileratorui on maps2001 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 0.105 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui
[09:35:15] RECOVERY - tilerator on maps2001 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 0.106 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator
[09:35:27] RECOVERY - tileratorui on maps2002 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 0.114 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui
[09:35:41] RECOVERY - tileratorui on maps2003 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 0.106 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui
[09:35:47] RECOVERY - tilerator on maps2003 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 0.111 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator
[09:39:09] Ok. time to downtime
[09:39:55] PROBLEM - tileratorui on maps2003 is CRITICAL: connect to address 10.192.32.146 and port 6535: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui
[09:40:45] PROBLEM - tileratorui on maps2001 is CRITICAL: connect to address 10.192.0.144 and port 6535: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui
[09:42:01] RECOVERY - tileratorui on maps2001 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 0.103 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui
[09:46:57] RECOVERY - tileratorui on maps2003 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 0.105 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui
[09:52:20] <_joe_> !log disabling puppet on mw1261, running some tests for T223180
[09:52:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:25] T223180: Monitoring PHP 7 APC usage - https://phabricator.wikimedia.org/T223180
[09:52:39] RECOVERY - Disk space on maps2004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[09:58:20] !log jiji@deploy1001 Started deploy [cpjobqueue/deploy@421c029]: Migrating wikibase-addUsagesForPage to PHP7 - T219148
[09:58:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:58:26] T219148: Use PHP7 to run all async jobs - https://phabricator.wikimedia.org/T219148
[09:59:29] !log jiji@deploy1001 Finished deploy [cpjobqueue/deploy@421c029]: Migrating wikibase-addUsagesForPage to PHP7 - T219148 (duration: 01m 09s)
[09:59:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:03:43] (03PS2) 10Volans: admin: add urbanecm to the deployment group [puppet] - 10https://gerrit.wikimedia.org/r/512401 (https://phabricator.wikimedia.org/T192830)
[10:04:53] 10Operations, 10decommission, 10media-storage, 10User-fgiunchedi: Decom ms-be101[345] - https://phabricator.wikimedia.org/T220590 (10fgiunchedi) >>! In T220590#5213945, @Volans wrote: > @fgiunchedi FYI we got some email to `root@` from `ms-be1014` with the following: thanks! these are spare hosts now so I...
[10:05:17] 10Operations, 10Maps, 10Patch-For-Review: Maps[12]004 /srv disk space is critical - https://phabricator.wikimedia.org/T224395 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['maps2004.codfw.wmnet'] ` and were **ALL** successful.
[10:06:29] onimisionipe: ^
[10:06:43] yep
[10:11:50] (03CR) 10Filippo Giunchedi: [C: 03+1] Include Swift analytics_admin auth .env file in HDFS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/512210 (https://phabricator.wikimedia.org/T219544) (owner: 10Ottomata)
[10:17:57] (03CR) 10Muehlenhoff: [C: 03+1] admin: add urbanecm to the deployment group [puppet] - 10https://gerrit.wikimedia.org/r/512401 (https://phabricator.wikimedia.org/T192830) (owner: 10Volans)
[10:19:32] (03CR) 10Volans: [C: 03+2] admin: add urbanecm to the deployment group [puppet] - 10https://gerrit.wikimedia.org/r/512401 (https://phabricator.wikimedia.org/T192830) (owner: 10Volans)
[10:21:05] 10Operations, 10Release Pipeline, 10Services, 10serviceops, and 4 others: Deploy the RESTBase front-end service (RESTRouter) to Kubernetes - https://phabricator.wikimedia.org/T223953 (10mobrovac) [PR #1141](https://github.com/wikimedia/restbase/pull/1141) adds the needed Blubber config.
[10:21:29] 10Operations, 10Release Pipeline, 10serviceops, 10Core Platform Team (RESTBase Split (CDP2)), and 4 others: Deploy the RESTBase front-end service (RESTRouter) to Kubernetes - https://phabricator.wikimedia.org/T223953 (10mobrovac)
[10:27:34] (03PS1) 10Giuseppe Lavagetto: pbuilder: add proxy configuration for security-cdn.d.o [puppet] - 10https://gerrit.wikimedia.org/r/512650
[10:28:19] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512651 (https://phabricator.wikimedia.org/T128546)
[10:28:21] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), and 3 others: Requesting access to production for SWAT deploy for Urbanecm - https://phabricator.wikimedia.org/T192830 (10Volans) As per docs added `Urbanecm` to the `wmf-deployment` group in Gerrit.
[10:31:55] !log rebooting maps2004 - cassandra unit failed and got stuck
[10:31:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:35:27] !log decommission restbase1013-b - T223976
[10:35:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:35:32] T223976: Decommission restbase10(0[7-9]|1[0-5]) - https://phabricator.wikimedia.org/T223976
[10:35:36] (03CR) 10Alexandros Kosiaris: [C: 03+2] pbuilder: add proxy configuration for security-cdn.d.o [puppet] - 10https://gerrit.wikimedia.org/r/512650 (owner: 10Giuseppe Lavagetto)
[10:44:59] (03PS1) 10Volans: PuppetDB: fix handle of FAILED status [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/512653
[10:48:37] (03CR) 10Faidon Liambotis: [C: 03+2] PuppetDB: fix handle of FAILED status [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/512653 (owner: 10Volans)
[10:52:09] <_joe_> !log uploading service-checker 0.1.5 to {jessie,stretch}-wikimedia
[10:53:43] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: add deprecated-input tag to deprecated inputs [puppet] - 10https://gerrit.wikimedia.org/r/512193 (https://phabricator.wikimedia.org/T220103) (owner: 10Cwhite)
[11:23:19] (03CR) 10Michael Große: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/510204 (https://phabricator.wikimedia.org/T223312) (owner: 10Michael Große)
[11:25:19] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), and 3 others: Requesting access to production for SWAT deploy for Urbanecm - https://phabricator.wikimedia.org/T192830 (10Volans) I've added `urbanecm` to the LDAP group `nda` as per request above given that it's...
[11:27:03] RECOVERY - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is OK: (C)130 ge (W)110 ge 106.5 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen
[11:30:53] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), and 3 others: Requesting access to production for SWAT deploy for Urbanecm - https://phabricator.wikimedia.org/T192830 (10zeljkofilipin) Thanks! I'll resolve it after the first successful deployment.
[11:33:33] !log starting osm initial import on maps2004 - T224395
[11:33:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:33:39] T224395: Maps[12]004 /srv disk space is critical - https://phabricator.wikimedia.org/T224395
[11:39:22] (03CR) 10Lucas Werkmeister (WMDE): Add a list of IDs to skip in production (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/511753 (owner: 10Michael Große)
[11:39:33] (03PS1) 10Arturo Borrero Gonzalez: openstack: clientpackages: mitaka: stretch: special case for py3 client libs [puppet] - 10https://gerrit.wikimedia.org/r/512658 (https://phabricator.wikimedia.org/T224345)
[11:44:05] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10Wikidata, and 5 others: Introduce wikidata termbox SSR to kubernetes - https://phabricator.wikimedia.org/T220402 (10mobrovac) >>! In T220402#5199730, @Tarrow wrote: > Should we be using http rather than https internally? Yes, indeed, sorry th...
[11:44:11] 10Operations, 10Icinga, 10Core Platform Team Backlog (Watching / External), 10Services (watching): Incorrect icinga settings for mobrovac - https://phabricator.wikimedia.org/T224406 (10Volans) p:05Triage→03Normal a:03Volans
[11:45:32] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC as expected: https://puppet-compiler.wmflabs.org/compiler1002/16763/" [puppet] - 10https://gerrit.wikimedia.org/r/512658 (https://phabricator.wikimedia.org/T224345) (owner: 10Arturo Borrero Gonzalez)
[11:46:32] (03CR) 10Michael Große: "More importantly it makes no sense to add this config to beta instead of live 🤦" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/511753 (owner: 10Michael Große)
[11:46:56] (03PS4) 10Michael Große: Add a list of IDs to skip in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/511753
[11:47:56] (03CR) 10jerkins-bot: [V: 04-1] Add a list of IDs to skip in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/511753 (owner: 10Michael Große)
[11:49:50] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10Wikidata, and 5 others: Introduce wikidata termbox SSR to kubernetes - https://phabricator.wikimedia.org/T220402 (10Tarrow) @mobrovac Thanks! I think we've now taken most of this onboard and merged it. @akosiaris could you take a look at out...
[11:49:56] 10Operations, 10Icinga, 10Core Platform Team Backlog (Watching / External), 10Services (watching): Incorrect icinga settings for mobrovac - https://phabricator.wikimedia.org/T224406 (10MoritzMuehlenhoff) When was the last time this worked for you? modules/icinga/files/cgi.cfg has your shell user name, but...
[11:50:36] (03CR) 10Lucas Werkmeister (WMDE): [C: 04-1] Add a list of IDs to skip in production (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/511753 (owner: 10Michael Große)
[11:51:26] 10Operations, 10Icinga, 10Core Platform Team Backlog (Watching / External), 10Services (watching): Incorrect icinga settings for mobrovac - https://phabricator.wikimedia.org/T224406 (10Volans) @mobrovac - Regarding the notification AFAICT the RESTBase alerts notify the `team-services` group (`services@`)....
[11:54:43] 10Operations, 10Icinga, 10Core Platform Team Backlog (Watching / External), 10Services (watching): Incorrect icinga settings for mobrovac - https://phabricator.wikimedia.org/T224406 (10mobrovac) >>! In T224406#5214472, @MoritzMuehlenhoff wrote: > When was the last time this worked for you? modules/icinga/f...
[11:54:54] (03PS1) 10Arturo Borrero Gonzalez: Revert "cloudcontrol: temporarily mark out prometheus classes on Stretch" [puppet] - 10https://gerrit.wikimedia.org/r/512663 (https://phabricator.wikimedia.org/T224345)
[11:56:46] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] Revert "cloudcontrol: temporarily mark out prometheus classes on Stretch" [puppet] - 10https://gerrit.wikimedia.org/r/512663 (https://phabricator.wikimedia.org/T224345) (owner: 10Arturo Borrero Gonzalez)
[11:57:40] 10Operations, 10Icinga, 10Core Platform Team Backlog (Watching / External), 10Services (watching): Incorrect icinga settings for mobrovac - https://phabricator.wikimedia.org/T224406 (10mobrovac) >>! In T224406#5214478, @Volans wrote: > @mobrovac > - Regarding the notification AFAICT the RESTBase alerts not...
[12:03:07] (03PS5) 10Michael Große: Add a list of IDs to skip in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/511753
[12:08:49] PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[12:08:51] PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[12:11:40] 10Operations, 10PHP 7.2 support, 10Performance-Team (Radar): Monitoring PHP 7 APC usage - https://phabricator.wikimedia.org/T223180 (10Joe) 05Open→03Resolved So, I modified the APC dashboards in the php7-transition table to show the information you mentioned in the ticket, and I think it's fair. I realiz...
[12:31:17] RECOVERY - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[12:32:45] RECOVERY - Mediawiki Cirrussearch update rate - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[12:44:11] (03CR) 10Faidon Liambotis: [C: 04-1] "See inline for a few comments. Also, this doesn't seem to add the report to the README." (0316 comments) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/510256 (https://phabricator.wikimedia.org/T221507) (owner: 10CRusnov)
[12:45:15] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [debs/prometheus-varnishkafka-exporter] - 10https://gerrit.wikimedia.org/r/507632 (https://phabricator.wikimedia.org/T196066) (owner: 10Cwhite)
[12:47:38] (03CR) 10Filippo Giunchedi: [C: 03+1] firewall loggin: enable firewall logging on wmcs servers [puppet] - 10https://gerrit.wikimedia.org/r/511701 (https://phabricator.wikimedia.org/T116011) (owner: 10Jbond)
[12:48:31] (03CR) 10Filippo Giunchedi: [C: 03+1] firewall loggin: enable firewall logging on analytics servers [puppet] - 10https://gerrit.wikimedia.org/r/511702 (owner: 10Jbond)
[12:48:49] (03CR) 10Filippo Giunchedi: [C: 03+1] firewall logging: enable firewall logging on external servers [puppet] - 10https://gerrit.wikimedia.org/r/511703 (https://phabricator.wikimedia.org/T116011) (owner: 10Jbond)
[12:49:07] (03CR) 10Filippo Giunchedi: [C: 03+1] firewall logging: Enable logging on external servers [puppet] - 10https://gerrit.wikimedia.org/r/511704 (https://phabricator.wikimedia.org/T116011) (owner: 10Jbond)
[12:49:49] (03CR) 10Filippo Giunchedi: [C: 03+1] firewall logging: enable loggin on internal servers [puppet] - 10https://gerrit.wikimedia.org/r/511700 (https://phabricator.wikimedia.org/T116011) (owner: 10Jbond)
[12:50:00] (03CR) 10Filippo Giunchedi: [C: 03+1] firewall logging: add firewall logging to kafak servers [puppet] - 10https://gerrit.wikimedia.org/r/511705 (https://phabricator.wikimedia.org/T116011) (owner: 10Jbond)
[12:50:25] (03CR) 10Filippo Giunchedi: [C: 03+1] firewall logging: Enable logging on misc services [puppet] - 10https://gerrit.wikimedia.org/r/511706 (https://phabricator.wikimedia.org/T116011) (owner: 10Jbond)
[12:50:37] (03CR) 10Filippo Giunchedi: [C: 03+1] firewall logging: enable logging on ores [puppet] - 10https://gerrit.wikimedia.org/r/511707 (https://phabricator.wikimedia.org/T116011) (owner: 10Jbond)
[12:50:48] (03CR) 10Filippo Giunchedi: [C: 03+1] firewall logging: Enable firewall logging on mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/511708 (https://phabricator.wikimedia.org/T116011) (owner: 10Jbond)
[12:51:06] 10Operations, 10hardware-requests: Two test hosts for SREs - https://phabricator.wikimedia.org/T214024 (10faidon) I don't know what the status of this is, it's been a while it seems. I see it was pending for my approval, which I've missed -- apologies! Approved now.
[12:51:12] (03CR) 10Filippo Giunchedi: [C: 03+1] firewall logging: enable firewall logging on remaining roles [puppet] - 10https://gerrit.wikimedia.org/r/511709 (https://phabricator.wikimedia.org/T116011) (owner: 10Jbond)
[12:54:55] (03CR) 10Filippo Giunchedi: [C: 03+1] Prometheus, add Routinator endpoint [puppet] - 10https://gerrit.wikimedia.org/r/508956 (https://phabricator.wikimedia.org/T220669) (owner: 10Ayounsi)
[12:56:52] (03CR) 10Filippo Giunchedi: [C: 03+1] rsyslog: add netdev_kafka_relay compatibility endpoint [puppet] - 10https://gerrit.wikimedia.org/r/495980 (https://phabricator.wikimedia.org/T224128) (owner: 10Herron)
[12:57:27] (03CR) 10Lucas Werkmeister (WMDE): Add a list of IDs to skip in production (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/511753 (owner: 10Michael Große)
[13:02:58] !log swift eqiad-prod: ms-be1033 weight to 0 - T223518
[13:03:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:03:05] T223518: ms-be1033 not powering up - https://phabricator.wikimedia.org/T223518
[13:05:14] (03PS1) 10Alexandros Kosiaris: Add log_level, tls, openapi config options [deployment-charts] - 10https://gerrit.wikimedia.org/r/512673 (https://phabricator.wikimedia.org/T220401)
[13:09:55] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] "Tested locally, worked fine. We will need a newer service-checker image but otherwise LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/512673 (https://phabricator.wikimedia.org/T220401) (owner: 10Alexandros Kosiaris)
[13:14:37] (03CR) 10Filippo Giunchedi: [C: 03+2] graphite: fix dashboard links for thumbnail alerts [puppet] - 10https://gerrit.wikimedia.org/r/512133 (owner: 10Filippo Giunchedi)
[13:14:45] (03PS2) 10Filippo Giunchedi: graphite: fix dashboard links for thumbnail alerts [puppet] - 10https://gerrit.wikimedia.org/r/512133
[13:18:51] (03CR) 10Michael Große: Add a list of IDs to skip in production (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/511753 (owner: 10Michael Große)
[13:21:36] (03CR) 10Lucas Werkmeister (WMDE): Add a list of IDs to skip in production (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/511753 (owner: 10Michael Große)
[13:22:20] 10Operations, 10Wikimedia-Site-requests: Add more bad words to fancycaptcha/badwords - https://phabricator.wikimedia.org/T224343 (10Reedy) 05Open→03Resolved a:03ArielGlenn Captchas should've been re-generated last night, so these words should have taken affect Whether we want to make the blacklist publi...
[13:32:55] (03CR) 10Michael Große: Add a list of IDs to skip in production (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/511753 (owner: 10Michael Große)
[13:36:34] (03PS1) 10Filippo Giunchedi: graphite: more robust swift thumbnail alerts [puppet] - 10https://gerrit.wikimedia.org/r/512679
[13:42:36] (03PS2) 10Filippo Giunchedi: graphite: more robust swift thumbnail alerts [puppet] - 10https://gerrit.wikimedia.org/r/512679
[13:43:57] (03CR) 10Filippo Giunchedi: [C: 03+2] graphite: more robust swift thumbnail alerts [puppet] - 10https://gerrit.wikimedia.org/r/512679 (owner: 10Filippo Giunchedi)
[13:53:10] (03PS1) 10Volans: icinga: fix Mobrovac case for authorization [puppet] - 10https://gerrit.wikimedia.org/r/512680 (https://phabricator.wikimedia.org/T224406)
[13:56:01] PROBLEM - Check systemd state on ms-be1026 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[13:56:16] (03CR) 10Muehlenhoff: [C: 03+1] icinga: fix Mobrovac case for authorization [puppet] - 10https://gerrit.wikimedia.org/r/512680 (https://phabricator.wikimedia.org/T224406) (owner: 10Volans)
[13:56:37] (03PS2) 10Volans: icinga: fix Mobrovac case for authorization [puppet] - 10https://gerrit.wikimedia.org/r/512680 (https://phabricator.wikimedia.org/T224406)
[13:59:13] (03CR) 10Volans: [C: 03+2] icinga: fix Mobrovac case for authorization [puppet] - 10https://gerrit.wikimedia.org/r/512680 (https://phabricator.wikimedia.org/T224406) (owner: 10Volans)
[14:03:08] RECOVERY - cassandra CQL 10.192.48.57:9042 on maps2004 is OK: TCP OK - 0.036 second response time on 10.192.48.57 port 9042 https://phabricator.wikimedia.org/T93886
[14:03:15] 10Operations, 10Icinga, 10Core Platform Team Backlog (Watching / External), 10Services (watching): Incorrect icinga settings for mobrovac - https://phabricator.wikimedia.org/T224406 (10Maintenance_bot)
[14:05:20] 10Operations, 10Icinga, 10Core Platform Team Backlog (Watching / External), 10Services (watching): Incorrect icinga settings for mobrovac - https://phabricator.wikimedia.org/T224406 (10Volans) @mobrovac can you retry actions on the Icinga UI?
[14:07:54] RECOVERY - Check systemd state on ms-be1026 is OK: OK - running: The system is fully operational
[14:13:40] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[14:13:56] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[14:14:00] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[14:14:11] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[14:14:26] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[14:14:26] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[14:14:34] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[14:14:44] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[14:14:46] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[14:15:04] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[14:15:14] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5
[14:15:58] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[14:17:18] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[14:17:38] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[14:17:52] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[14:17:56] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[14:18:06] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[14:18:18] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[14:18:18] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[14:18:38] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[14:18:40] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[14:21:12] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[14:22:38] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[14:23:06] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[14:23:14] RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5
[14:24:00] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[14:28:04] ACKNOWLEDGEMENT - Check systemd state on restbase1012 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Mobrovac test check
[14:28:58] (03PS4) 10Michael Große: Add feature flag config for breaking Wikibase API change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/510204 (https://phabricator.wikimedia.org/T223312)
[14:30:08] (03PS5) 10Michael Große: Add feature flag config for breaking Wikibase API change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/510204 (https://phabricator.wikimedia.org/T223300)
[14:30:17] (03PS60) 10Vgutierrez: ATS: Provide a TLS terminator profile [puppet] - 10https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594)
[14:30:23] 10Operations, 10Icinga, 10Core Platform Team Backlog (Watching / External), 10Services (watching): Incorrect icinga settings for mobrovac - https://phabricator.wikimedia.org/T224406 (10mobrovac) >>! In T224406#5214843, @Volans wrote: > @mobrovac can you retry actions on the Icinga UI? Icinga UI acking now...
[14:32:09] (03PS2) 10Vgutierrez: ATS: Set log mode independently of log filters [puppet] - 10https://gerrit.wikimedia.org/r/512636 (https://phabricator.wikimedia.org/T224397)
[14:32:11] (03PS61) 10Vgutierrez: ATS: Provide a TLS terminator profile [puppet] - 10https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594)
[14:36:09] 10Operations, 10Traffic: ATS: traffic_layout currently forces to use its own copy of shared libraries - https://phabricator.wikimedia.org/T224428 (10Vgutierrez)
[14:36:20] 10Operations, 10Traffic: ATS: traffic_layout currently forces to use its own copy of shared libraries - https://phabricator.wikimedia.org/T224428 (10Vgutierrez) p:05Triage→03Normal
[15:01:56] 10Operations, 10Icinga, 10Core Platform Team Backlog (Watching / External), 10Services (watching): Incorrect icinga settings for mobrovac - https://phabricator.wikimedia.org/T224406 (10Volans) So after a bit of debugging with @mobrovac it seems that the alarm that is not notifying the `team-services` conta...
[15:14:28] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [15:14:46] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [15:14:58] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [15:14:58] PROBLEM - HTTP availability for Varnish at eqsin on icinga1001 is CRITICAL: job=varnish-text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [15:15:06] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [15:15:16] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [15:15:30] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [15:15:30] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [15:15:44] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] 
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [15:15:50] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [15:16:10] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [15:16:20] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [15:16:24] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [15:16:56] (03PS3) 10Gehel: Convert cirrus data retention from cron to systemd [puppet] - 10https://gerrit.wikimedia.org/r/512235 (https://phabricator.wikimedia.org/T224200) (owner: 10EBernhardson) [15:17:00] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [15:17:06] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] 
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [15:18:38] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [15:18:40] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [15:18:58] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [15:19:10] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [15:19:10] RECOVERY - HTTP availability for Varnish at eqsin on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [15:19:24] (03CR) 10Gehel: [C: 03+2] Convert cirrus data retention from cron to systemd [puppet] - 10https://gerrit.wikimedia.org/r/512235 (https://phabricator.wikimedia.org/T224200) (owner: 10EBernhardson) [15:19:26] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [15:19:40] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [15:19:41] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [15:20:40] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [15:22:26] PROBLEM - Check systemd state on ms-be1024 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:24:04] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [15:24:06] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [15:24:07] (03PS1) 10Gehel: Convert cirrus data retention from cron to systemd. 
[puppet] - 10https://gerrit.wikimedia.org/r/512702 (https://phabricator.wikimedia.org/T224200) [15:24:34] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [15:24:44] RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [15:24:48] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [15:25:24] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [15:32:18] RECOVERY - Check systemd state on ms-be1024 is OK: OK - running: The system is fully operational [15:36:45] !log initialize sessionstore namespace on eqiad/codfw/staging kubernetes clusters T220401 [15:36:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:51] T220401: Introduce kask session storage service to kubernetes - https://phabricator.wikimedia.org/T220401 [15:36:54] (03CR) 10Gehel: [C: 04-1] Add postgres slave init cookbook (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) (owner: 10Mathew.onipe) [15:40:07] !log initialize termbox namespace on eqiad/codfw/staging kubernetes clusters T220402 [15:40:11] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:14] T220402: Introduce wikidata termbox SSR to kubernetes - https://phabricator.wikimedia.org/T220402 [15:42:59] (03PS1) 10Faidon Liambotis: autoinstall: drop 9600 and 57600 baud variants [puppet] - 10https://gerrit.wikimedia.org/r/512708 [15:43:01] (03PS1) 10Faidon Liambotis: autoinstall: drop explicit references to lpxelinux [puppet] - 10https://gerrit.wikimedia.org/r/512709 [15:43:03] (03PS1) 10Faidon Liambotis: autoinstall: cleanup pxelinux options [puppet] - 10https://gerrit.wikimedia.org/r/512710 [15:43:05] (03PS1) 10Faidon Liambotis: autoinstall: configure DHCP for UEFI with syslinux [puppet] - 10https://gerrit.wikimedia.org/r/512711 (https://phabricator.wikimedia.org/T93208) [15:44:06] (03CR) 10jerkins-bot: [V: 04-1] autoinstall: cleanup pxelinux options [puppet] - 10https://gerrit.wikimedia.org/r/512710 (owner: 10Faidon Liambotis) [15:44:21] (03CR) 10jerkins-bot: [V: 04-1] autoinstall: configure DHCP for UEFI with syslinux [puppet] - 10https://gerrit.wikimedia.org/r/512711 (https://phabricator.wikimedia.org/T93208) (owner: 10Faidon Liambotis) [15:46:12] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/512708 (owner: 10Faidon Liambotis) [15:46:27] 10Operations, 10MediaWiki-Logging, 10Wikimedia-Logstash, 10wmerrors, and 6 others: Port mediawiki/php/wmerrors to PHP7 and deploy - https://phabricator.wikimedia.org/T187147 (10Joe) >>! In T187147#5207128, @tstarling wrote: > Basic porting work on wmerrors is hopefully complete. > > It still writes a text... [15:47:20] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/512709 (owner: 10Faidon Liambotis) [15:50:38] 10Operations, 10Patch-For-Review: (U)EFI support - https://phabricator.wikimedia.org/T93208 (10faidon) So I just pushed a change that uses syslinux.efi above. This may prove to be short-lived, as we may switch to another PXE implementation (iPXE or GRUB, more on that later) but should work. 
It /may/ require to... [15:52:41] (03PS2) 10Faidon Liambotis: autoinstall: cleanup pxelinux options [puppet] - 10https://gerrit.wikimedia.org/r/512710 [15:52:43] (03PS2) 10Faidon Liambotis: autoinstall: configure DHCP for UEFI with syslinux [puppet] - 10https://gerrit.wikimedia.org/r/512711 (https://phabricator.wikimedia.org/T93208) [15:56:43] (03CR) 10Volans: [C: 03+2] autoinstall: drop 9600 and 57600 baud variants [puppet] - 10https://gerrit.wikimedia.org/r/512708 (owner: 10Faidon Liambotis) [15:57:19] (03CR) 10Volans: [C: 03+2] autoinstall: drop explicit references to lpxelinux [puppet] - 10https://gerrit.wikimedia.org/r/512709 (owner: 10Faidon Liambotis) [16:04:40] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Add a list of IDs to skip in production (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/511753 (owner: 10Michael Große) [16:07:59] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/512710 (owner: 10Faidon Liambotis) [16:08:44] (03PS1) 10Alexandros Kosiaris: sessionstore: Populate kubernetes stanzas [puppet] - 10https://gerrit.wikimedia.org/r/512720 (https://phabricator.wikimedia.org/T220401) [16:12:03] 10Operations, 10Wikimedia-Site-requests: Add more bad words to fancycaptcha/badwords - https://phabricator.wikimedia.org/T224343 (10Sphilbrick) Sounds great. For what it's worth, I let the individual who originally contacted us know that multiple people were working on resolving this and they seemed impressed... 
[16:17:27] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/512711 (https://phabricator.wikimedia.org/T93208) (owner: 10Faidon Liambotis) [16:17:50] (03CR) 10Volans: [C: 03+2] autoinstall: cleanup pxelinux options [puppet] - 10https://gerrit.wikimedia.org/r/512710 (owner: 10Faidon Liambotis) [16:18:31] (03CR) 10Volans: [C: 03+2] autoinstall: configure DHCP for UEFI with syslinux [puppet] - 10https://gerrit.wikimedia.org/r/512711 (https://phabricator.wikimedia.org/T93208) (owner: 10Faidon Liambotis) [16:19:43] (03PS2) 10Alexandros Kosiaris: sessionstore: Populate kubernetes stanzas [puppet] - 10https://gerrit.wikimedia.org/r/512720 (https://phabricator.wikimedia.org/T220401) [16:23:52] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 57.14% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [16:25:53] (03CR) 10Alexandros Kosiaris: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/16765/deploy1001.eqiad.wmnet/ says fine, merging" [puppet] - 10https://gerrit.wikimedia.org/r/512720 (https://phabricator.wikimedia.org/T220401) (owner: 10Alexandros Kosiaris) [16:27:34] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:27:58] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [16:28:04] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:28:06] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on 
icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:28:26] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [16:28:46] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:28:54] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [16:29:04] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [16:29:44] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [16:30:00] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [16:30:30] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] 
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [16:31:34] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:31:44] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:32:06] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [16:32:14] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:32:14] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:32:36] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [16:34:32] !log decommission restbase1013-c - T223976 [16:34:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:37] T223976: Decommission restbase10(0[7-9]|1[0-5]) - https://phabricator.wikimedia.org/T223976 [16:35:16] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [16:36:02] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [16:36:04] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [16:36:06] PROBLEM - etcd request latencies on acrab is CRITICAL: instance=10.192.16.26:6443 operation=list https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:36:59] probably a result of the apiservers being restarted ^ [16:37:16] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [16:37:30] RECOVERY - etcd request latencies on acrab is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:38:22] RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [16:39:12] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [16:40:43] !log removed unreferenced files in /etc/dhcp/ on install[12]002 [16:40:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:58] (03PS1) 10Bartosz Dziewoński: Fix order of "Edit" tabs when multi-tab mode used on single-tab wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512732 (https://phabricator.wikimedia.org/T223793) [16:50:57] (03CR) 10Elukey: [C: 03+1] eventlogging.my.cnf: Increase buffer pool from 50G to 300G [puppet] - 10https://gerrit.wikimedia.org/r/512365 (https://phabricator.wikimedia.org/T224291) (owner: 10Marostegui) [16:53:20] PROBLEM - Check systemd state on ms-be1024 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[17:10:17] 10Operations, 10Maps, 10Patch-For-Review: Maps[12]004 /srv disk space is critical - https://phabricator.wikimedia.org/T224395 (10Mathew.onipe) p:05Unbreak!→03High [17:11:01] 10Operations, 10Maps: Maps[12]004 /srv disk space is critical - https://phabricator.wikimedia.org/T224395 (10Maintenance_bot) [17:11:04] (03PS1) 10Arturo Borrero Gonzalez: openstack: designate: mitaka: stretch: use pdns 3.x [puppet] - 10https://gerrit.wikimedia.org/r/512734 (https://phabricator.wikimedia.org/T224354) [17:11:16] 10Operations: (U)EFI support - https://phabricator.wikimedia.org/T93208 (10Maintenance_bot) [17:11:45] (03CR) 10jerkins-bot: [V: 04-1] openstack: designate: mitaka: stretch: use pdns 3.x [puppet] - 10https://gerrit.wikimedia.org/r/512734 (https://phabricator.wikimedia.org/T224354) (owner: 10Arturo Borrero Gonzalez) [17:13:33] (03PS2) 10Arturo Borrero Gonzalez: openstack: designate: mitaka: stretch: use pdns 3.x [puppet] - 10https://gerrit.wikimedia.org/r/512734 (https://phabricator.wikimedia.org/T224354) [17:14:25] (03CR) 10jerkins-bot: [V: 04-1] openstack: designate: mitaka: stretch: use pdns 3.x [puppet] - 10https://gerrit.wikimedia.org/r/512734 (https://phabricator.wikimedia.org/T224354) (owner: 10Arturo Borrero Gonzalez) [17:15:22] (03PS3) 10Arturo Borrero Gonzalez: openstack: designate: mitaka: stretch: use pdns 3.x [puppet] - 10https://gerrit.wikimedia.org/r/512734 (https://phabricator.wikimedia.org/T224354) [17:16:12] (03CR) 10jerkins-bot: [V: 04-1] openstack: designate: mitaka: stretch: use pdns 3.x [puppet] - 10https://gerrit.wikimedia.org/r/512734 (https://phabricator.wikimedia.org/T224354) (owner: 10Arturo Borrero Gonzalez) [17:17:01] (03PS4) 10Arturo Borrero Gonzalez: openstack: designate: mitaka: stretch: use pdns 3.x [puppet] - 10https://gerrit.wikimedia.org/r/512734 (https://phabricator.wikimedia.org/T224354) [17:17:53] (03CR) 10jerkins-bot: [V: 04-1] openstack: designate: mitaka: stretch: use pdns 3.x [puppet] - 
10https://gerrit.wikimedia.org/r/512734 (https://phabricator.wikimedia.org/T224354) (owner: 10Arturo Borrero Gonzalez) [17:18:46] 10Operations, 10Continuous-Integration-Infrastructure, 10Release Pipeline, 10Release-Engineering-Team (Kanban): Switch CI Docker Storage Driver to its own partition and to use devicemapper - https://phabricator.wikimedia.org/T178663 (10greg) [17:18:50] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban): contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 (10greg) 05Open→03Stalled stalled until the disks are installed [17:20:15] (03PS5) 10Arturo Borrero Gonzalez: openstack: designate: mitaka: stretch: use pdns 3.x [puppet] - 10https://gerrit.wikimedia.org/r/512734 (https://phabricator.wikimedia.org/T224354) [17:20:46] (03CR) 10jerkins-bot: [V: 04-1] openstack: designate: mitaka: stretch: use pdns 3.x [puppet] - 10https://gerrit.wikimedia.org/r/512734 (https://phabricator.wikimedia.org/T224354) (owner: 10Arturo Borrero Gonzalez) [17:23:06] (03PS6) 10Arturo Borrero Gonzalez: openstack: designate: mitaka: stretch: use pdns 3.x [puppet] - 10https://gerrit.wikimedia.org/r/512734 (https://phabricator.wikimedia.org/T224354) [17:23:37] (03CR) 10jerkins-bot: [V: 04-1] openstack: designate: mitaka: stretch: use pdns 3.x [puppet] - 10https://gerrit.wikimedia.org/r/512734 (https://phabricator.wikimedia.org/T224354) (owner: 10Arturo Borrero Gonzalez) [17:28:12] (03PS7) 10Arturo Borrero Gonzalez: openstack: designate: mitaka: stretch: use pdns 3.x [puppet] - 10https://gerrit.wikimedia.org/r/512734 (https://phabricator.wikimedia.org/T224354) [17:32:36] RECOVERY - Check systemd state on ms-be1024 is OK: OK - running: The system is fully operational [17:37:27] !log refreshing puppet-compiler facts [17:37:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:44] elukey_: If you're around to monitor operational impact on dbs and mcs, 
we can roll it out now if you want [17:40:00] it's a holiday for some so wasn't sure whether there'd be enough ppl [17:44:43] (03PS8) 10Andrew Bogott: openstack: designate: mitaka: stretch: use pdns 3.x [puppet] - 10https://gerrit.wikimedia.org/r/512734 (https://phabricator.wikimedia.org/T224354) (owner: 10Arturo Borrero Gonzalez) [17:44:57] (03CR) 10Mathew.onipe: Add postgres slave init cookbook (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) (owner: 10Mathew.onipe) [17:47:57] (03PS14) 10Mathew.onipe: Add postgres slave init cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) [17:51:03] (03CR) 10Andrew Bogott: [C: 03+2] openstack: designate: mitaka: stretch: use pdns 3.x [puppet] - 10https://gerrit.wikimedia.org/r/512734 (https://phabricator.wikimedia.org/T224354) (owner: 10Arturo Borrero Gonzalez) [17:56:07] !log re-imaging cloudservices1004 in order to make sure our apt magic is working properly [17:56:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:32] Krinkle: I am! [17:59:52] but we can do it another time if you want [18:00:14] I didn't mean to rush you :) [18:00:34] (03PS1) 10Volans: restbase: add team-services to Icinga notifications [puppet] - 10https://gerrit.wikimedia.org/r/512742 (https://phabricator.wikimedia.org/T224406) [18:01:24] PROBLEM - Host 208.80.154.24 is DOWN: PING CRITICAL - Packet loss = 100% [18:02:00] andrewbogott: related to the reimage? 
^^^ [18:02:15] volans: yes [18:02:25] (03PS10) 10Krinkle: errorpages: Remove unused hhvm-fatal-error.php file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412829 (https://phabricator.wikimedia.org/T113114) [18:02:26] and, dammit, no matter how many things I remember to downtime one always gets through [18:02:48] (03CR) 10Krinkle: [C: 03+2] errorpages: Remove unused hhvm-fatal-error.php file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412829 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle) [18:03:12] ACKNOWLEDGEMENT - Host 208.80.154.24 is DOWN: PING CRITICAL - Packet loss = 100% andrew bogott this is me, rebuilding the host [18:04:01] (03Merged) 10jenkins-bot: errorpages: Remove unused hhvm-fatal-error.php file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412829 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle) [18:04:03] elukey: np, rolling out this first^, [18:04:16] (03CR) 10jenkins-bot: errorpages: Remove unused hhvm-fatal-error.php file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412829 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle) [18:05:51] (03PS1) 10Pmiazga: Disable the rdf2latex Collection portlet format [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512743 (https://phabricator.wikimedia.org/T224433) [18:06:33] !log krinkle@deploy1001 Synchronized errorpages/: 4ffcbfc2ba3 (duration: 00m 48s) [18:06:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:33] (03PS2) 10Volans: restbase: add team-services to service::node alert [puppet] - 10https://gerrit.wikimedia.org/r/512742 (https://phabricator.wikimedia.org/T224406) [18:12:00] elukey: ready in 5-6min, at the mercy of Jenkins [18:12:10] Krinkle: ack [18:13:07] Krinkle: as a follow up to this, I'll create bandwidth alarms for the mc hosts so we can catch any issue right after a deployment [18:13:24] PROBLEM - puppet last run on puppetmaster1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:13:34] (03PS3) 10Volans: restbase: add team-services to service::node alert [puppet] - 10https://gerrit.wikimedia.org/r/512742 (https://phabricator.wikimedia.org/T224406) [18:15:32] (03CR) 10Volans: "@mobrovac: if you want that team-services gets alerted for *all* checks on those hosts pick PS1, if you just want that check mentioned in " [puppet] - 10https://gerrit.wikimedia.org/r/512742 (https://phabricator.wikimedia.org/T224406) (owner: 10Volans) [18:16:23] elukey: cool, feel free to tag the perf-radar on a task if there is/willbe a task [18:22:31] RECOVERY - Host 208.80.154.24 is UP: PING OK - Packet loss = 16%, RTA = 0.29 ms [18:23:29] RECOVERY - puppet last run on puppetmaster1001 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [18:32:45] elukey: ok, staging now [18:41:16] !log krinkle@deploy1001 Synchronized php-1.34.0-wmf.6/includes/libs/rdbms: 66556bf37e8 / T223310, T223978 (duration: 00m 50s) [18:41:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:23] T223310: Investigate increase in tx bandwidth usage for mc1033 - https://phabricator.wikimedia.org/T223310 [18:41:24] T223978: 1.34.0-wmf.3 generating lots of temporary tables on MySQL slaves - https://phabricator.wikimedia.org/T223978 [18:45:39] Krinkle: the problem seems fixed! \o/ [18:45:52] I am checking https://grafana.wikimedia.org/d/000000574/t204083-investigation?orgId=1&from=now-1h&to=now&panelId=3&fullscreen&edit [18:45:58] Yeah [18:46:09] mc1033 is down to normal levels [18:46:26] thanks a lot! 
Cc: AaronSchulz [18:46:30] the dbs looking good as well [18:46:34] 10Operations, 10Growth-Team, 10Performance-Team, 10Wikidata, and 4 others: Investigate increase in tx bandwidth usage for mc1033 - https://phabricator.wikimedia.org/T223310 (10Krinkle) 05Open→03Resolved a:05elukey→03aaron [18:48:09] 10Operations, 10Growth-Team, 10Performance-Team, 10Wikidata, and 4 others: Investigate increase in tx bandwidth usage for mc1033 - https://phabricator.wikimedia.org/T223310 (10Krinkle) Recovery: {F29260036} [18:50:09] Krinkle: going afk, thanks a lot! [18:50:21] I cross posted the results to the sre chan as well [18:53:53] (03PS1) 10Andrew Bogott: pdns3hack: don't pin pdns-recursor to the old repo [puppet] - 10https://gerrit.wikimedia.org/r/512744 (https://phabricator.wikimedia.org/T224354) [18:54:48] (03CR) 10Andrew Bogott: [C: 03+2] pdns3hack: don't pin pdns-recursor to the old repo [puppet] - 10https://gerrit.wikimedia.org/r/512744 (https://phabricator.wikimedia.org/T224354) (owner: 10Andrew Bogott) [18:59:16] RECOVERY - Recursive DNS on 208.80.154.24 is OK: DNS OK: 0.282 seconds response time. 
www.wikipedia.org returns 208.80.154.224 https://wikitech.wikimedia.org/wiki/DNS [19:05:53] (PS1) Andrew Bogott: designate: improve firewall rules for memc access [puppet] - https://gerrit.wikimedia.org/r/512745 [19:06:34] PROBLEM - HHVM rendering on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:07:44] RECOVERY - HHVM rendering on mw1234 is OK: HTTP OK: HTTP/1.1 200 OK - 80252 bytes in 0.594 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:09:33] (CR) Andrew Bogott: [C: +2] designate: improve firewall rules for memc access [puppet] - https://gerrit.wikimedia.org/r/512745 (owner: Andrew Bogott) [19:19:36] Operations, DNS, Tools, Traffic: I can't load tools.wmflabs.org - https://phabricator.wikimedia.org/T224442 (Patriccck) [19:22:30] Operations, DNS, Tools, Traffic: I can't load tools.wmflabs.org - https://phabricator.wikimedia.org/T224442 (Patriccck) [19:24:45] (PS1) Andrew Bogott: designate/pdns: allow db access from standby to primary [puppet] - https://gerrit.wikimedia.org/r/512751 [19:29:37] Operations, DNS, Tools, Traffic: I can't load tools.wmflabs.org - https://phabricator.wikimedia.org/T224442 (Krenair) Do you have dig? If so can you `dig wmflabs.org NS @cloud-ns0.wikimedia.org` and `dig wmflabs.org NS`? [19:32:07] Operations, MediaWiki-Cache, serviceops, Core Platform Team (Security, stability, performance and scalability (TEC1)), and 5 others: Use a multi-dc aware store for ObjectCache's MainStash if needed. - https://phabricator.wikimedia.org/T212129 (Krinkle) >>! In T212129#5213676, @Joe wrote: >>>! In...
[19:32:23] Operations, DNS, Tools, Traffic: I can't load tools.wmflabs.org - https://phabricator.wikimedia.org/T224442 (Mbch331) For windows that is `nslookup -type=ns wmflabs.org cloud-ns0.wikimedia.org` and `nslookup -type=ns wmflabs.org` [19:33:29] Operations, ops-codfw, DBA, Patch-For-Review: db2091 rebooted unexpectedly - https://phabricator.wikimedia.org/T224393 (Marostegui) @papaul should we contact the vendor with these logs? [19:40:33] Operations, DNS, Tools, Traffic: I can't load tools.wmflabs.org - https://phabricator.wikimedia.org/T224442 (Andrew) Hello! This is probably something I caused as part of maintenance to our dns setup. Is it better now? [19:46:26] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational [19:47:30] (CR) Andrew Bogott: [C: +2] designate/pdns: allow db access from standby to primary [puppet] - https://gerrit.wikimedia.org/r/512751 (owner: Andrew Bogott) [20:00:15] Operations, Release-Engineering-Team, SRE-Access-Requests: Request access to analytics cluster for Alaa Sarhan - https://phabricator.wikimedia.org/T223697 (Nuria) @alaa_wmde Can you be a bit more specific as to what data you need access to? So you know Logstash and hadoop do not share any data, mayb... [20:00:16] PROBLEM - Host tools.wmflabs.org is DOWN: check_ping: Invalid hostname/address - tools.wmflabs.org [20:00:46] PROBLEM - puppet last run on analytics1048 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle.
[20:01:06] RECOVERY - Host tools.wmflabs.org is UP: PING OK - Packet loss = 0%, RTA = 1.11 ms [20:01:44] hmm [20:01:50] andrewbogott ^^ [20:02:06] paladox: yep, that's me :/ [20:02:09] oh [20:02:26] PROBLEM - Host tools.wmflabs.org is DOWN: check_ping: Invalid hostname/address - tools.wmflabs.org [20:02:54] PROBLEM - Host checker.tools.wmflabs.org is DOWN: /bin/ping -n -U -w 15 -c 5 checker.tools.wmflabs.org [20:03:14] RECOVERY - Host checker.tools.wmflabs.org is UP: PING OK - Packet loss = 0%, RTA = 3.22 ms [20:04:06] PROBLEM - Host checker.tools.wmflabs.org is DOWN: check_ping: Invalid hostname/address - checker.tools.wmflabs.org [20:04:30] RECOVERY - Host checker.tools.wmflabs.org is UP: PING OK - Packet loss = 0%, RTA = 4.77 ms [20:04:40] RECOVERY - Host tools.wmflabs.org is UP: PING OK - Packet loss = 0%, RTA = 9.11 ms [20:05:23] It looks stable to me, I'm not sure why it's flapping [20:05:46] PROBLEM - puppet last run on lvs3004 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle. [20:07:22] PROBLEM - Host checker.tools.wmflabs.org is DOWN: /bin/ping -n -U -w 15 -c 5 checker.tools.wmflabs.org [20:08:08] RECOVERY - Host checker.tools.wmflabs.org is UP: PING OK - Packet loss = 0%, RTA = 1.48 ms [20:11:01] Operations, ops-codfw, DBA: db2091 rebooted unexpectedly - https://phabricator.wikimedia.org/T224393 (Maintenance_bot) [20:16:54] (CR) Volans: "Looks almost ready, very minor small things inline."
(10 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/493138 (https://phabricator.wikimedia.org/T217072) (owner: CRusnov) [20:19:19] !log gilles@deploy1001 Started deploy [performance/asoranking@61039f1]: (no justification provided) [20:19:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:25] !log gilles@deploy1001 Finished deploy [performance/asoranking@61039f1]: (no justification provided) (duration: 00m 06s) [20:19:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:42] PROBLEM - Host checker.tools.wmflabs.org is DOWN: /bin/ping -n -U -w 15 -c 5 checker.tools.wmflabs.org [20:21:12] RECOVERY - Host checker.tools.wmflabs.org is UP: PING OK - Packet loss = 0%, RTA = 2.44 ms [20:27:50] RECOVERY - puppet last run on analytics1048 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [20:30:59] PROBLEM - toolschecker: All Flannel etcd nodes are healthy on checker.tools.wmflabs.org is CRITICAL: Name or service not known https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [20:32:04] (PS1) Marostegui: db2091: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/512768 (https://phabricator.wikimedia.org/T224393) [20:32:44] RECOVERY - puppet last run on lvs3004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [20:33:54] PROBLEM - Host checker.tools.wmflabs.org is DOWN: check_ping: Invalid hostname/address - checker.tools.wmflabs.org [20:34:12] RECOVERY - Host checker.tools.wmflabs.org is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms [20:35:10] PROBLEM - Host tools.wmflabs.org is DOWN: check_ping: Invalid hostname/address - tools.wmflabs.org [20:35:30] (CR) Marostegui: [C: +2] db2091: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/512768 (https://phabricator.wikimedia.org/T224393) (owner: Marostegui) [20:36:24] RECOVERY - toolschecker: All Flannel etcd nodes are
healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 1.267 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [20:36:48] RECOVERY - Host tools.wmflabs.org is UP: PING OK - Packet loss = 0%, RTA = 0.95 ms [20:38:30] Operations, ops-codfw, DBA, Patch-For-Review: db2091 rebooted unexpectedly - https://phabricator.wikimedia.org/T224393 (Marostegui) a:Papaul Also, upgrade firmware and BIOS I guess? [20:43:30] PROBLEM - Host checker.tools.wmflabs.org is DOWN: /bin/ping -n -U -w 15 -c 5 checker.tools.wmflabs.org [20:43:49] RECOVERY - Host checker.tools.wmflabs.org is UP: PING OK - Packet loss = 0%, RTA = 1.63 ms [20:50:36] PROBLEM - Host checker.tools.wmflabs.org is DOWN: /bin/ping -n -U -w 15 -c 5 checker.tools.wmflabs.org [20:50:42] RECOVERY - Host checker.tools.wmflabs.org is UP: PING OK - Packet loss = 0%, RTA = 1.63 ms [20:51:50] PROBLEM - Host tools.wmflabs.org is DOWN: check_ping: Invalid hostname/address - tools.wmflabs.org [20:54:04] RECOVERY - Host tools.wmflabs.org is UP: PING OK - Packet loss = 0%, RTA = 4.38 ms [21:10:52] Operations, ops-codfw, DBA: db2091 rebooted unexpectedly - https://phabricator.wikimedia.org/T224393 (Maintenance_bot) [21:26:59] (CR) Urbanecm: [C: -1] "Per T221933#5215506." [mediawiki-config] - https://gerrit.wikimedia.org/r/507932 (https://phabricator.wikimedia.org/T221933) (owner: 星耀晨曦) [21:32:02] (CR) Urbanecm: Disable the rdf2latex Collection portlet format (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/512743 (https://phabricator.wikimedia.org/T224433) (owner: Pmiazga) [21:41:41] Operations, User-herron: Improve visibility of incoming operations tasks - https://phabricator.wikimedia.org/T197624 (Aklapper) Wondering how to proceed with https://phabricator.wikimedia.org/project/board/1025/ In my understanding: 1. SRE needs to define a date / task ID threshold, up to which task ID...
[21:45:42] PROBLEM - NTP peers on dns5002 is CRITICAL: NTP CRITICAL: Offset 0.998671 secs (CRITICAL) https://wikitech.wikimedia.org/wiki/NTP [21:47:06] RECOVERY - NTP peers on dns5002 is OK: NTP OK: Offset 0.00089 secs https://wikitech.wikimedia.org/wiki/NTP [22:19:03] !log gilles@deploy1001 Started deploy [performance/asoranking@d0c156e]: T224388 [22:19:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:19:09] !log gilles@deploy1001 Finished deploy [performance/asoranking@d0c156e]: T224388 (duration: 00m 05s) [22:19:09] T224388: AS report crashes on generation - https://phabricator.wikimedia.org/T224388 [22:19:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:50:23] Operations, decommission, media-storage, User-fgiunchedi: Decom ms-be101[345] - https://phabricator.wikimedia.org/T220590 (Volans) I've put the state of those hosts in Netbox back to `active` as they are currently "active" for the `spare::system` role and decomissioning should be set once we run... [22:50:38] Operations, ops-codfw, decommission, media-storage, User-fgiunchedi: decom ms-be201[345] - https://phabricator.wikimedia.org/T221068 (Volans) I've put the state of those hosts in Netbox back to `active` as they are currently "active" for the `spare::system` role and decomissioning should be s... [22:52:04] !log gilles@deploy1001 Started deploy [performance/asoranking@bacfc37]: T224388 [22:52:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:52:09] !log gilles@deploy1001 Finished deploy [performance/asoranking@bacfc37]: T224388 (duration: 00m 05s) [22:52:09] T224388: AS report crashes on generation - https://phabricator.wikimedia.org/T224388 [22:52:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:38] !log restarting gerrit due to active threads being stuck being a sendemail thread.
[23:10:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:52] my session survived the logout [23:12:58] er the restart [23:14:36] that's good. It should (although I know it doesn't). I know session cleanup happens at 1am UTC. I have a suspicion that following a restart, that's when folks get logged out [23:15:50] interesting [23:16:06] I'll see how it is tomorrow morning then [23:16:16] PROBLEM - puppet last run on cumin2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/cookbooks] [23:16:17] for now, good night :-) [23:16:52] PROBLEM - puppet last run on webperf2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_performance/docroot] [23:17:16] PROBLEM - puppet last run on stat1006 is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 4 minutes ago with 4 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config],Exec[git_pull_mediawiki/event-schemas],Exec[git_pull_statistics_mediawiki],Exec[git_pull_analytics/reportupdater] [23:17:30] oh, send mail again? hmm. [23:17:58] PROBLEM - puppet last run on webperf1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[git_pull_performance/docroot] [23:18:44] yep, sendmail again, filed https://phabricator.wikimedia.org/T224448 [23:19:28] !log gerrit back after restarting due to T224448 [23:19:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:19:33] T224448: Gerrit http threads stuck behind sendemail thread - https://phabricator.wikimedia.org/T224448 [23:19:34] I doin't think sendmail will show up in the ssh command if threads are stuck thcipriani [23:20:36] right, I added that bit to contrast T131189 [23:20:36] T131189: Only receiving few emails from Gerrit - https://phabricator.wikimedia.org/T131189 [23:22:33] ah ok. [23:43:14] RECOVERY - puppet last run on cumin2001 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [23:43:54] RECOVERY - puppet last run on webperf2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [23:44:16] RECOVERY - puppet last run on stat1006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:44:56] RECOVERY - puppet last run on webperf1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures