[00:52:57] <icinga-wm>	 PROBLEM - Host db2088 is DOWN: PING CRITICAL - Packet loss = 100%
[00:55:17] <icinga-wm>	 RECOVERY - Host db2088 is UP: PING OK - Packet loss = 0%, RTA = 36.18 ms
[00:57:27] <icinga-wm>	 PROBLEM - MariaDB Slave IO: s2 on db2088 is CRITICAL: CRITICAL slave_io_state could not connect
[00:57:48] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: s1 on db2088 is CRITICAL: CRITICAL slave_sql_state could not connect
[00:57:48] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: s2 on db2088 is CRITICAL: CRITICAL slave_sql_state could not connect
[00:57:57] <icinga-wm>	 PROBLEM - MariaDB read only s2 on db2088 is CRITICAL: Could not connect to localhost:3312
[00:57:57] <icinga-wm>	 PROBLEM - MariaDB read only s1 on db2088 is CRITICAL: Could not connect to localhost:3311
[00:58:00] <icinga-wm>	 PROBLEM - mysqld processes on db2088 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[00:58:18] <icinga-wm>	 PROBLEM - MariaDB Slave IO: s1 on db2088 is CRITICAL: CRITICAL slave_io_state could not connect
[01:05:17] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s1 on db2088 is CRITICAL: CRITICAL slave_sql_lag could not connect
[01:05:47] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s2 on db2088 is CRITICAL: CRITICAL slave_sql_lag could not connect
[01:33:17] <wikibugs>	 (PS1) 星耀晨曦: Allow subpages in main namespace in zhwikiversity [mediawiki-config] - https://gerrit.wikimedia.org/r/455380 (https://phabricator.wikimedia.org/T202007)
[01:34:32] <wikibugs>	 (CR) jerkins-bot: [V: -1] Allow subpages in main namespace in zhwikiversity [mediawiki-config] - https://gerrit.wikimedia.org/r/455380 (https://phabricator.wikimedia.org/T202007) (owner: 星耀晨曦)
[01:34:39] <andrewbogott>	 db2088 is back up and looks ok — I assume that mysql needs a manual start though
[01:34:47] <andrewbogott>	 (and I'm not 100% sure that's the right thing to do…)
[01:39:47] <icinga-wm>	 PROBLEM - Check systemd state on db2088 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[01:47:44] <wikibugs>	 Operations, DBA: db2088 rebooted itself and came back sick - https://phabricator.wikimedia.org/T202822 (Andrew)
[01:50:05] <wikibugs>	 Operations, DBA: db2088 rebooted itself and came back sick - https://phabricator.wikimedia.org/T202822 (Andrew) I don't see anything in the syslog to warn about the coming crash... it just stops dead at 00:50:01
[01:53:58] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on db2088 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. andrew bogott https://phabricator.wikimedia.org/T202822
[01:53:58] <icinga-wm>	 ACKNOWLEDGEMENT - MariaDB Slave IO: s1 on db2088 is CRITICAL: CRITICAL slave_io_state could not connect andrew bogott https://phabricator.wikimedia.org/T202822
[01:53:58] <icinga-wm>	 ACKNOWLEDGEMENT - MariaDB Slave IO: s2 on db2088 is CRITICAL: CRITICAL slave_io_state could not connect andrew bogott https://phabricator.wikimedia.org/T202822
[01:53:58] <icinga-wm>	 ACKNOWLEDGEMENT - MariaDB Slave Lag: s1 on db2088 is CRITICAL: CRITICAL slave_sql_lag could not connect andrew bogott https://phabricator.wikimedia.org/T202822
[01:53:58] <icinga-wm>	 ACKNOWLEDGEMENT - MariaDB Slave Lag: s2 on db2088 is CRITICAL: CRITICAL slave_sql_lag could not connect andrew bogott https://phabricator.wikimedia.org/T202822
[01:53:59] <icinga-wm>	 ACKNOWLEDGEMENT - MariaDB Slave SQL: s1 on db2088 is CRITICAL: CRITICAL slave_sql_state could not connect andrew bogott https://phabricator.wikimedia.org/T202822
[01:53:59] <icinga-wm>	 ACKNOWLEDGEMENT - MariaDB Slave SQL: s2 on db2088 is CRITICAL: CRITICAL slave_sql_state could not connect andrew bogott https://phabricator.wikimedia.org/T202822
[01:54:00] <icinga-wm>	 ACKNOWLEDGEMENT - MariaDB read only s1 on db2088 is CRITICAL: Could not connect to localhost:3311 andrew bogott https://phabricator.wikimedia.org/T202822
[01:54:00] <icinga-wm>	 ACKNOWLEDGEMENT - MariaDB read only s2 on db2088 is CRITICAL: Could not connect to localhost:3312 andrew bogott https://phabricator.wikimedia.org/T202822
[01:54:01] <icinga-wm>	 ACKNOWLEDGEMENT - mysqld processes on db2088 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld andrew bogott https://phabricator.wikimedia.org/T202822
[02:10:36] <wikibugs>	 (PS2) 星耀晨曦: Allow subpages in main namespace in zhwikiversity [mediawiki-config] - https://gerrit.wikimedia.org/r/455380 (https://phabricator.wikimedia.org/T202007)
[02:11:19] <wikibugs>	 (PS3) Krinkle: Document meaning and origin of 'cluster' in mc.php [mediawiki-config] - https://gerrit.wikimedia.org/r/454740
[02:33:06] <wikibugs>	 (CR) Rxy: [C: 1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/455380 (https://phabricator.wikimedia.org/T202007) (owner: 星耀晨曦)
[03:28:08] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 867.38 seconds
[03:36:18] <icinga-wm>	 PROBLEM - puppet last run on analytics1064 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-City.mmdb.gz],File[/usr/share/GeoIP/GeoIP2-City.mmdb.test]
[03:50:48] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 285.62 seconds
[03:59:37] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: instance=kubernetes1003.eqiad.wmnet operation_type={create_container,start_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[03:59:50] <icinga-wm>	 ACKNOWLEDGEMENT - HP RAID on db2058 is CRITICAL: CRITICAL: Slot 0: Failed: 1I:1:11 - OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:12 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T202824
[03:59:55] <wikibugs>	 Operations, ops-codfw: Degraded RAID on db2058 - https://phabricator.wikimedia.org/T202824 (ops-monitoring-bot)
[04:00:38] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes1003 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[04:01:47] <icinga-wm>	 RECOVERY - puppet last run on analytics1064 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:43:56] <wikibugs>	 (PS1) Marostegui: db2088.yaml: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/455383 (https://phabricator.wikimedia.org/T202822)
[05:45:01] <wikibugs>	 (CR) Marostegui: [C: 2] db2088.yaml: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/455383 (https://phabricator.wikimedia.org/T202822) (owner: Marostegui)
[05:48:26] <wikibugs>	 (PS1) Marostegui: db-codfw.php: Depool db2088 [mediawiki-config] - https://gerrit.wikimedia.org/r/455384 (https://phabricator.wikimedia.org/T202822)
[05:50:31] <wikibugs>	 (CR) Marostegui: [C: 2] db-codfw.php: Depool db2088 [mediawiki-config] - https://gerrit.wikimedia.org/r/455384 (https://phabricator.wikimedia.org/T202822) (owner: Marostegui)
[05:51:51] <wikibugs>	 (Merged) jenkins-bot: db-codfw.php: Depool db2088 [mediawiki-config] - https://gerrit.wikimedia.org/r/455384 (https://phabricator.wikimedia.org/T202822) (owner: Marostegui)
[05:51:53] <wikibugs>	 Operations, ops-codfw, DBA: db2058: Disk #11 predictive failure - https://phabricator.wikimedia.org/T202798 (Marostegui) Open>Invalid The disk finally failed, so let's follow up there - T202824
[05:52:29] <wikibugs>	 Operations, ops-codfw, DBA: Degraded RAID on db2058 - https://phabricator.wikimedia.org/T202824 (Marostegui) p:Triage>Normal a:Papaul Let's get the disk replaced! Thanks
[05:53:38] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2088 - T202822 (duration: 00m 54s)
[05:53:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:53:44] <stashbot>	 T202822: db2088 rebooted itself and came back sick - https://phabricator.wikimedia.org/T202822
[05:54:30] <wikibugs>	 Operations, ops-codfw, DBA, Patch-For-Review: db2088 rebooted itself and came back sick - https://phabricator.wikimedia.org/T202822 (Marostegui) p:Triage>Normal a:Papaul Thanks a lot for triaging this @Andrew. HW logs look empty unfortunately but this crash looks really similar to the...
[06:03:28] <wikibugs>	 (CR) jenkins-bot: db-codfw.php: Depool db2088 [mediawiki-config] - https://gerrit.wikimedia.org/r/455384 (https://phabricator.wikimedia.org/T202822) (owner: Marostegui)
[06:28:59] <icinga-wm>	 PROBLEM - puppet last run on mw1308 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ImageMagick-6/policy.xml]
[06:31:48] <icinga-wm>	 PROBLEM - puppet last run on dbproxy1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/puppet-enabled]
[06:56:58] <icinga-wm>	 RECOVERY - puppet last run on dbproxy1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:59:28] <icinga-wm>	 RECOVERY - puppet last run on mw1308 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:26:29] <icinga-wm>	 PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[07:28:38] <icinga-wm>	 RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[07:34:59] <icinga-wm>	 PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[07:39:18] <icinga-wm>	 RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[11:05:00] <wikibugs>	 (PS1) WMDE-leszek: Wikidata: Use new item ID formatter for Q1 [mediawiki-config] - https://gerrit.wikimedia.org/r/455389 (https://phabricator.wikimedia.org/T201832)
[11:14:39] <wikibugs>	 (PS1) WMDE-leszek: Wikidata: Use new item ID formatter for Q1-Q100 [mediawiki-config] - https://gerrit.wikimedia.org/r/455390 (https://phabricator.wikimedia.org/T201833)
[12:10:59] <wikibugs>	 (CR) Rxy: [C: 1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/454213 (https://phabricator.wikimedia.org/T202347) (owner: MarcoAurelio)
[12:23:00] <wikibugs>	 (PS1) Urbanecm: *.pensoft.net should be in wgCopyUploadsDomains whitelist instead of pensoft.net [mediawiki-config] - https://gerrit.wikimedia.org/r/455393 (https://phabricator.wikimedia.org/T202832)
[12:28:19] <wikibugs>	 (PS1) Urbanecm: Create namespace aliases in zhwikiversity [mediawiki-config] - https://gerrit.wikimedia.org/r/455394 (https://phabricator.wikimedia.org/T202821)
[13:30:46] <wikibugs>	 (PS1) Urbanecm: Create new namespaces in zhwikiversity [mediawiki-config] - https://gerrit.wikimedia.org/r/455395 (https://phabricator.wikimedia.org/T201675)
[13:42:48] <icinga-wm>	 PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[13:47:08] <icinga-wm>	 RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[14:14:44] <wikibugs>	 (CR) Chico Venancio: [C: 1] *.pensoft.net should be in wgCopyUploadsDomains whitelist instead of pensoft.net [mediawiki-config] - https://gerrit.wikimedia.org/r/455393 (https://phabricator.wikimedia.org/T202832) (owner: Urbanecm)
[15:41:31] <wikibugs>	 (PS2) Urbanecm: Create new namespaces in zhwikiversity [mediawiki-config] - https://gerrit.wikimedia.org/r/455395 (https://phabricator.wikimedia.org/T201675)
[17:29:59] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[17:32:49] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[17:53:48] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[17:58:49] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[18:07:28] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[18:11:38] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[19:16:28] <icinga-wm>	 PROBLEM - Disk space on elastic1027 is CRITICAL: DISK CRITICAL - free space: /srv 49695 MB (10% inode=99%)
[19:32:28] <icinga-wm>	 RECOVERY - Disk space on elastic1027 is OK: DISK OK
[21:14:19] <wikibugs>	 (CR) Zoranzoki21: "Note for Zeljko or another person who will perform SWAT: An unexpected health problem happened and I go to the dentist. In the period of d" [mediawiki-config] - https://gerrit.wikimedia.org/r/455354 (owner: Zoranzoki21)
[21:14:25] <wikibugs>	 (CR) Zoranzoki21: "Note for Zeljko or another person who will perform SWAT: An unexpected health problem happened and I go to the dentist. In the period of d" [mediawiki-config] - https://gerrit.wikimedia.org/r/455307 (https://phabricator.wikimedia.org/T202808) (owner: Zoranzoki21)
[21:53:26] <wikibugs>	 (PS11) Gergő Tisza: Remove sitewide and user CSS/JS editing from old groups [mediawiki-config] - https://gerrit.wikimedia.org/r/421124 (https://phabricator.wikimedia.org/T190015)
[21:54:23] <wikibugs>	 (PS13) Gergő Tisza: Enforce that interface-admin is the only group that can edit non-own CSS/JS [mediawiki-config] - https://gerrit.wikimedia.org/r/421125 (https://phabricator.wikimedia.org/T190015)
[23:42:58] <icinga-wm>	 PROBLEM - dhclient process on stat1005 is CRITICAL: Return code of 255 is out of bounds
[23:42:58] <icinga-wm>	 PROBLEM - DPKG on stat1005 is CRITICAL: Return code of 255 is out of bounds
[23:42:58] <icinga-wm>	 PROBLEM - Check systemd state on stat1005 is CRITICAL: Return code of 255 is out of bounds
[23:43:29] <icinga-wm>	 PROBLEM - configured eth on stat1005 is CRITICAL: Return code of 255 is out of bounds
[23:43:29] <icinga-wm>	 PROBLEM - MD RAID on stat1005 is CRITICAL: Return code of 255 is out of bounds
[23:44:19] <icinga-wm>	 PROBLEM - puppet last run on stat1005 is CRITICAL: Return code of 255 is out of bounds