[00:52:57] PROBLEM - Host db2088 is DOWN: PING CRITICAL - Packet loss = 100%
[00:55:17] RECOVERY - Host db2088 is UP: PING OK - Packet loss = 0%, RTA = 36.18 ms
[00:57:27] PROBLEM - MariaDB Slave IO: s2 on db2088 is CRITICAL: CRITICAL slave_io_state could not connect
[00:57:48] PROBLEM - MariaDB Slave SQL: s1 on db2088 is CRITICAL: CRITICAL slave_sql_state could not connect
[00:57:48] PROBLEM - MariaDB Slave SQL: s2 on db2088 is CRITICAL: CRITICAL slave_sql_state could not connect
[00:57:57] PROBLEM - MariaDB read only s2 on db2088 is CRITICAL: Could not connect to localhost:3312
[00:57:57] PROBLEM - MariaDB read only s1 on db2088 is CRITICAL: Could not connect to localhost:3311
[00:58:00] PROBLEM - mysqld processes on db2088 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[00:58:18] PROBLEM - MariaDB Slave IO: s1 on db2088 is CRITICAL: CRITICAL slave_io_state could not connect
[01:05:17] PROBLEM - MariaDB Slave Lag: s1 on db2088 is CRITICAL: CRITICAL slave_sql_lag could not connect
[01:05:47] PROBLEM - MariaDB Slave Lag: s2 on db2088 is CRITICAL: CRITICAL slave_sql_lag could not connect
[01:33:17] (PS1) 星耀晨曦: Allow subpages in main namespace in zhwikiversity [mediawiki-config] - https://gerrit.wikimedia.org/r/455380 (https://phabricator.wikimedia.org/T202007)
[01:34:32] (CR) jerkins-bot: [V: -1] Allow subpages in main namespace in zhwikiversity [mediawiki-config] - https://gerrit.wikimedia.org/r/455380 (https://phabricator.wikimedia.org/T202007) (owner: 星耀晨曦)
[01:34:39] db2088 is back up and looks ok — I assume that mysql needs a manual start though
[01:34:47] (and I'm not 100% sure that's the right thing to do…)
[01:39:47] PROBLEM - Check systemd state on db2088 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[01:47:44] Operations, DBA: db2088 rebooted itself and came back sick - https://phabricator.wikimedia.org/T202822 (Andrew)
[01:50:05] Operations, DBA: db2088 rebooted itself and came back sick - https://phabricator.wikimedia.org/T202822 (Andrew) I don't see anything in the syslog to warn about the coming crash... it just stops dead at 00:50:01
[01:53:58] ACKNOWLEDGEMENT - Check systemd state on db2088 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. andrew bogott https://phabricator.wikimedia.org/T202822
[01:53:58] ACKNOWLEDGEMENT - MariaDB Slave IO: s1 on db2088 is CRITICAL: CRITICAL slave_io_state could not connect andrew bogott https://phabricator.wikimedia.org/T202822
[01:53:58] ACKNOWLEDGEMENT - MariaDB Slave IO: s2 on db2088 is CRITICAL: CRITICAL slave_io_state could not connect andrew bogott https://phabricator.wikimedia.org/T202822
[01:53:58] ACKNOWLEDGEMENT - MariaDB Slave Lag: s1 on db2088 is CRITICAL: CRITICAL slave_sql_lag could not connect andrew bogott https://phabricator.wikimedia.org/T202822
[01:53:58] ACKNOWLEDGEMENT - MariaDB Slave Lag: s2 on db2088 is CRITICAL: CRITICAL slave_sql_lag could not connect andrew bogott https://phabricator.wikimedia.org/T202822
[01:53:59] ACKNOWLEDGEMENT - MariaDB Slave SQL: s1 on db2088 is CRITICAL: CRITICAL slave_sql_state could not connect andrew bogott https://phabricator.wikimedia.org/T202822
[01:53:59] ACKNOWLEDGEMENT - MariaDB Slave SQL: s2 on db2088 is CRITICAL: CRITICAL slave_sql_state could not connect andrew bogott https://phabricator.wikimedia.org/T202822
[01:54:00] ACKNOWLEDGEMENT - MariaDB read only s1 on db2088 is CRITICAL: Could not connect to localhost:3311 andrew bogott https://phabricator.wikimedia.org/T202822
[01:54:00] ACKNOWLEDGEMENT - MariaDB read only s2 on db2088 is CRITICAL: Could not connect to localhost:3312 andrew bogott https://phabricator.wikimedia.org/T202822
[01:54:01] ACKNOWLEDGEMENT - mysqld processes on db2088 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld andrew bogott https://phabricator.wikimedia.org/T202822
[02:10:36] (PS2) 星耀晨曦: Allow subpages in main namespace in zhwikiversity [mediawiki-config] - https://gerrit.wikimedia.org/r/455380 (https://phabricator.wikimedia.org/T202007)
[02:11:19] (PS3) Krinkle: Document meaning and origin of 'cluster' in mc.php [mediawiki-config] - https://gerrit.wikimedia.org/r/454740
[02:33:06] (CR) Rxy: [C: 1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/455380 (https://phabricator.wikimedia.org/T202007) (owner: 星耀晨曦)
[03:28:08] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 867.38 seconds
[03:36:18] PROBLEM - puppet last run on analytics1064 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-City.mmdb.gz],File[/usr/share/GeoIP/GeoIP2-City.mmdb.test]
[03:50:48] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 285.62 seconds
[03:59:37] PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: instance=kubernetes1003.eqiad.wmnet operation_type={create_container,start_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[03:59:50] ACKNOWLEDGEMENT - HP RAID on db2058 is CRITICAL: CRITICAL: Slot 0: Failed: 1I:1:11 - OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:12 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T202824
[03:59:55] Operations, ops-codfw: Degraded RAID on db2058 - https://phabricator.wikimedia.org/T202824 (ops-monitoring-bot)
[04:00:38] RECOVERY - kubelet operational latencies on kubernetes1003 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[04:01:47] RECOVERY - puppet last run on analytics1064 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:43:56] (PS1) Marostegui: db2088.yaml: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/455383 (https://phabricator.wikimedia.org/T202822)
[05:45:01] (CR) Marostegui: [C: 2] db2088.yaml: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/455383 (https://phabricator.wikimedia.org/T202822) (owner: Marostegui)
[05:48:26] (PS1) Marostegui: db-codfw.php: Depool db2088 [mediawiki-config] - https://gerrit.wikimedia.org/r/455384 (https://phabricator.wikimedia.org/T202822)
[05:50:31] (CR) Marostegui: [C: 2] db-codfw.php: Depool db2088 [mediawiki-config] - https://gerrit.wikimedia.org/r/455384 (https://phabricator.wikimedia.org/T202822) (owner: Marostegui)
[05:51:51] (Merged) jenkins-bot: db-codfw.php: Depool db2088 [mediawiki-config] - https://gerrit.wikimedia.org/r/455384 (https://phabricator.wikimedia.org/T202822) (owner: Marostegui)
[05:51:53] Operations, ops-codfw, DBA: db2058: Disk #11 predictive failure - https://phabricator.wikimedia.org/T202798 (Marostegui) Open>Invalid The disk finally failed, so let's follow up there - T202824
[05:52:29] Operations, ops-codfw, DBA: Degraded RAID on db2058 - https://phabricator.wikimedia.org/T202824 (Marostegui) p:Triage>Normal a:Papaul Let's get the disk replaced! Thanks
[05:53:38] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2088 - T202822 (duration: 00m 54s)
[05:53:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:53:44] T202822: db2088 rebooted itself and came back sick - https://phabricator.wikimedia.org/T202822
[05:54:30] Operations, ops-codfw, DBA, Patch-For-Review: db2088 rebooted itself and came back sick - https://phabricator.wikimedia.org/T202822 (Marostegui) p:Triage>Normal a:Papaul Thanks a lot for triaging this @Andrew. HW logs look empty unfortunately but this crash looks really similar to the...
[06:03:28] (CR) jenkins-bot: db-codfw.php: Depool db2088 [mediawiki-config] - https://gerrit.wikimedia.org/r/455384 (https://phabricator.wikimedia.org/T202822) (owner: Marostegui)
[06:28:59] PROBLEM - puppet last run on mw1308 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ImageMagick-6/policy.xml]
[06:31:48] PROBLEM - puppet last run on dbproxy1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/puppet-enabled]
[06:56:58] RECOVERY - puppet last run on dbproxy1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:59:28] RECOVERY - puppet last run on mw1308 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:26:29] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[07:28:38] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[07:34:59] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[07:39:18] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[11:05:00] (PS1) WMDE-leszek: Wikidata: Use new item ID formatter for Q1 [mediawiki-config] - https://gerrit.wikimedia.org/r/455389 (https://phabricator.wikimedia.org/T201832)
[11:14:39] (PS1) WMDE-leszek: Wikidata: Use new item ID formatter for Q1-Q100 [mediawiki-config] - https://gerrit.wikimedia.org/r/455390 (https://phabricator.wikimedia.org/T201833)
[12:10:59] (CR) Rxy: [C: 1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/454213 (https://phabricator.wikimedia.org/T202347) (owner: MarcoAurelio)
[12:23:00] (PS1) Urbanecm: *.pensoft.net should be in wgCopyUploadsDomains whitelist instead of pensoft.net [mediawiki-config] - https://gerrit.wikimedia.org/r/455393 (https://phabricator.wikimedia.org/T202832)
[12:28:19] (PS1) Urbanecm: Create namespace aliases in zhwikiversity [mediawiki-config] - https://gerrit.wikimedia.org/r/455394 (https://phabricator.wikimedia.org/T202821)
[13:30:46] (PS1) Urbanecm: Create new namespaces in zhwikiversity [mediawiki-config] - https://gerrit.wikimedia.org/r/455395 (https://phabricator.wikimedia.org/T201675)
[13:42:48] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[13:47:08] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[14:14:44] (CR) Chico Venancio: [C: 1] *.pensoft.net should be in wgCopyUploadsDomains whitelist instead of pensoft.net [mediawiki-config] - https://gerrit.wikimedia.org/r/455393 (https://phabricator.wikimedia.org/T202832) (owner: Urbanecm)
[15:41:31] (PS2) Urbanecm: Create new namespaces in zhwikiversity [mediawiki-config] - https://gerrit.wikimedia.org/r/455395 (https://phabricator.wikimedia.org/T201675)
[17:29:59] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[17:32:49] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[17:53:48] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[17:58:49] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[18:07:28] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[18:11:38] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[19:16:28] PROBLEM - Disk space on elastic1027 is CRITICAL: DISK CRITICAL - free space: /srv 49695 MB (10% inode=99%)
[19:32:28] RECOVERY - Disk space on elastic1027 is OK: DISK OK
[21:14:19] (CR) Zoranzoki21: "Note for Zeljko or another person who will perform SWAT: An unexpected health problem happened and I go to the dentist. In the period of d" [mediawiki-config] - https://gerrit.wikimedia.org/r/455354 (owner: Zoranzoki21)
[21:14:25] (CR) Zoranzoki21: "Note for Zeljko or another person who will perform SWAT: An unexpected health problem happened and I go to the dentist. In the period of d" [mediawiki-config] - https://gerrit.wikimedia.org/r/455307 (https://phabricator.wikimedia.org/T202808) (owner: Zoranzoki21)
[21:53:26] (PS11) Gergő Tisza: Remove sitewide and user CSS/JS editing from old groups [mediawiki-config] - https://gerrit.wikimedia.org/r/421124 (https://phabricator.wikimedia.org/T190015)
[21:54:23] (PS13) Gergő Tisza: Enforce that interface-admin is the only group that can edit non-own CSS/JS [mediawiki-config] - https://gerrit.wikimedia.org/r/421125 (https://phabricator.wikimedia.org/T190015)
[23:42:58] PROBLEM - dhclient process on stat1005 is CRITICAL: Return code of 255 is out of bounds
[23:42:58] PROBLEM - DPKG on stat1005 is CRITICAL: Return code of 255 is out of bounds
[23:42:58] PROBLEM - Check systemd state on stat1005 is CRITICAL: Return code of 255 is out of bounds
[23:43:29] PROBLEM - configured eth on stat1005 is CRITICAL: Return code of 255 is out of bounds
[23:43:29] PROBLEM - MD RAID on stat1005 is CRITICAL: Return code of 255 is out of bounds
[23:44:19] PROBLEM - puppet last run on stat1005 is CRITICAL: Return code of 255 is out of bounds