[00:02:17] PROBLEM - HHVM rendering on mw2172 is CRITICAL: connect to address 10.192.32.60 and port 80: Connection refused
[00:02:17] PROBLEM - HHVM processes on mw2171 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:02:18] PROBLEM - nutcracker port on mw2171 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:02:18] PROBLEM - nutcracker process on mw2172 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:02:18] PROBLEM - puppet last run on mw2173 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:03:57] PROBLEM - HHVM rendering on mw2171 is CRITICAL: connect to address 10.192.32.59 and port 80: Connection refused
[00:03:58] PROBLEM - nutcracker process on mw2171 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:03:58] PROBLEM - puppet last run on mw2172 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:05:38] PROBLEM - Apache HTTP on mw2173 is CRITICAL: connect to address 10.192.32.61 and port 80: Connection refused
[00:05:38] PROBLEM - puppet last run on mw2171 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:05:38] PROBLEM - MD RAID on mw2173 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:07:27] PROBLEM - Apache HTTP on mw2172 is CRITICAL: connect to address 10.192.32.60 and port 80: Connection refused
[00:07:27] PROBLEM - MD RAID on mw2172 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:07:27] PROBLEM - Check size of conntrack table on mw2173 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:09:07] PROBLEM - Apache HTTP on mw2171 is CRITICAL: connect to address 10.192.32.59 and port 80: Connection refused
[00:09:07] PROBLEM - Nginx local proxy to apache on mw2173 is CRITICAL: connect to address 10.192.32.61 and port 443: Connection refused
[00:09:07] PROBLEM - MD RAID on mw2171 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:09:08] PROBLEM - Check systemd state on mw2173 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:09:08] PROBLEM - Check size of conntrack table on mw2172 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:10:47] PROBLEM - Nginx local proxy to apache on mw2172 is CRITICAL: connect to address 10.192.32.60 and port 443: Connection refused
[00:10:48] PROBLEM - Check size of conntrack table on mw2171 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:10:48] PROBLEM - Check systemd state on mw2172 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:10:48] PROBLEM - Check the NTP synchronisation status of timesyncd on mw2173 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:12:37] PROBLEM - Nginx local proxy to apache on mw2171 is CRITICAL: connect to address 10.192.32.59 and port 443: Connection refused
[00:12:37] PROBLEM - Check systemd state on mw2171 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:12:37] PROBLEM - Check the NTP synchronisation status of timesyncd on mw2172 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:12:37] PROBLEM - Check whether ferm is active by checking the default input chain on mw2173 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:12:37] PROBLEM - configured eth on mw2173 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:14:17] PROBLEM - Check the NTP synchronisation status of timesyncd on mw2171 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:14:17] PROBLEM - Check whether ferm is active by checking the default input chain on mw2172 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:14:17] PROBLEM - configured eth on mw2172 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:14:17] PROBLEM - DPKG on mw2173 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:14:17] PROBLEM - dhclient process on mw2173 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:14:58] sigh, yes, i got that
[00:15:49] always starts when i look away for 5 min, heh
[00:17:47] PROBLEM - DPKG on mw2171 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:17:47] PROBLEM - mediawiki-installation DSH group on mw2172 is CRITICAL: Host mw2172 is not in mediawiki-installation dsh group
[00:17:47] PROBLEM - dhclient process on mw2171 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:17:47] PROBLEM - HHVM processes on mw2173 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:17:47] PROBLEM - Disk space on mw2172 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:17:47] PROBLEM - nutcracker port on mw2173 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:19:28] PROBLEM - HHVM rendering on mw2173 is CRITICAL: connect to address 10.192.32.61 and port 80: Connection refused
[00:19:28] PROBLEM - mediawiki-installation DSH group on mw2171 is CRITICAL: Host mw2171 is not in mediawiki-installation dsh group
[00:19:28] PROBLEM - Disk space on mw2171 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:19:28] PROBLEM - HHVM processes on mw2172 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:19:28] PROBLEM - nutcracker port on mw2172 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:19:28] PROBLEM - nutcracker process on mw2173 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:20:34] 10Operations, 10Mail: E-mail for people in different OIT LDAP object unit - https://phabricator.wikimedia.org/T159750#4165673 (10bbogaert) Hi All, Could we (Office IT) discuss this with someone in Ops? We are currently blocked on my tasks/projects until we get our alignment on LDAP gorups and the way our MX s...
[00:33:21] (03PS2) 10Dzahn: base: update version of gen_fingerprints script [puppet] - 10https://gerrit.wikimedia.org/r/429114
[00:33:37] (03CR) 10jerkins-bot: [V: 04-1] base: update version of gen_fingerprints script [puppet] - 10https://gerrit.wikimedia.org/r/429114 (owner: 10Dzahn)
[00:34:07] (03CR) 10Dzahn: "PS2: debugged and got it to work by removing the "\s" from the awk line." [puppet] - 10https://gerrit.wikimedia.org/r/429114 (owner: 10Dzahn)
[00:34:26] (03PS3) 10Dzahn: base: update version of gen_fingerprints script [puppet] - 10https://gerrit.wikimedia.org/r/429114
[00:36:14] (03PS4) 10Dzahn: base: update version of gen_fingerprints script [puppet] - 10https://gerrit.wikimedia.org/r/429114
[01:09:55] 10Operations, 10Design-Research: Edit optoutresearch@ mailing list recipients - https://phabricator.wikimedia.org/T100860#4165706 (10Dzahn) a:05Dzahn>03bbogaert
[01:18:02] 10Operations, 10Analytics, 10Analytics-Data-Quality, 10Analytics-Kanban, and 6 others: Proxies information gone from Zero portal. Opera mini pageviews geolocating to wrong country - https://phabricator.wikimedia.org/T187014#4165709 (10Tbayer) Thanks @dr0ptp4kt for the explanation! To add to the charts pos...
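[Editor's note] The fix Dzahn describes at 00:34:07 (removing "\s" from the awk line) is a portability issue worth spelling out: `\s` is a GNU regex extension, not part of POSIX awk, so mawk and POSIX-mode awk fail to match it. A minimal sketch of the portable alternative follows; the pattern and input are illustrative, not the actual gen_fingerprints script:

```shell
# "\s" is not a POSIX awk escape; only some awks (e.g. recent gawk) accept it.
# The portable spelling for "any whitespace" is the [[:space:]] bracket class.
printf '2048 SHA256:abcd host (RSA)\n' |
  awk '/^[0-9]+[[:space:]]+SHA256:/ { print $2 }'
# prints "SHA256:abcd"
```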
[01:23:38] RECOVERY - Apache HTTP on mw2172 is OK: HTTP OK: HTTP/1.1 200 OK - 10975 bytes in 0.073 second response time
[01:24:18] PROBLEM - Disk space on labtestnet2001 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=75%)
[01:24:57] RECOVERY - Apache HTTP on mw2173 is OK: HTTP OK: HTTP/1.1 200 OK - 10975 bytes in 0.073 second response time
[01:25:18] RECOVERY - Apache HTTP on mw2171 is OK: HTTP OK: HTTP/1.1 200 OK - 10975 bytes in 0.073 second response time
[01:36:37] RECOVERY - Check size of conntrack table on mw2172 is OK: OK: nf_conntrack is 0 % full
[01:36:38] RECOVERY - Check whether ferm is active by checking the default input chain on mw2172 is OK: OK ferm input default policy is set
[01:36:38] RECOVERY - configured eth on mw2172 is OK: OK - interfaces up
[01:36:47] RECOVERY - HHVM processes on mw2172 is OK: PROCS OK: 6 processes with command name hhvm
[01:36:57] RECOVERY - MD RAID on mw2172 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0
[01:37:07] RECOVERY - HHVM processes on mw2173 is OK: PROCS OK: 6 processes with command name hhvm
[01:37:07] RECOVERY - MD RAID on mw2173 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0
[01:37:08] RECOVERY - Disk space on mw2172 is OK: DISK OK
[01:37:38] RECOVERY - DPKG on mw2173 is OK: All packages OK
[01:37:38] RECOVERY - dhclient process on mw2173 is OK: PROCS OK: 0 processes with command name dhclient
[01:37:57] RECOVERY - Check size of conntrack table on mw2173 is OK: OK: nf_conntrack is 0 % full
[01:37:57] RECOVERY - configured eth on mw2173 is OK: OK - interfaces up
[01:37:58] RECOVERY - Check whether ferm is active by checking the default input chain on mw2173 is OK: OK ferm input default policy is set
[01:39:18] that's all fine and stuff ..yet it still said "Unable to run wmf-auto-reimage-host: Failed to puppet_first_run" which is another new issue
[01:40:48] RECOVERY - Check the NTP synchronisation status of timesyncd on mw2173 is OK: OK: synced at Sat 2018-04-28 01:40:44 UTC.
[01:41:16] 1 lets me login. 2 don't. none of them work with install-console.. sigh
[01:42:18] RECOVERY - Nginx local proxy to apache on mw2172 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 5.204 second response time
[01:42:37] RECOVERY - Check the NTP synchronisation status of timesyncd on mw2172 is OK: OK: synced at Sat 2018-04-28 01:42:30 UTC.
[01:42:48] RECOVERY - HHVM rendering on mw2172 is OK: HTTP OK: HTTP/1.1 200 OK - 79495 bytes in 6.517 second response time
[01:50:58] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[01:51:58] RECOVERY - Disk space on mw2171 is OK: DISK OK
[01:51:58] RECOVERY - HHVM processes on mw2171 is OK: PROCS OK: 6 processes with command name hhvm
[01:52:18] RECOVERY - DPKG on mw2171 is OK: All packages OK
[01:52:27] RECOVERY - dhclient process on mw2171 is OK: PROCS OK: 0 processes with command name dhclient
[01:52:27] RECOVERY - Check size of conntrack table on mw2171 is OK: OK: nf_conntrack is 0 % full
[01:52:47] RECOVERY - MD RAID on mw2171 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0
[01:55:38] RECOVERY - puppet last run on mw2171 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[01:56:57] RECOVERY - HHVM rendering on mw2173 is OK: HTTP OK: HTTP/1.1 200 OK - 79495 bytes in 5.664 second response time
[01:57:18] RECOVERY - puppet last run on mw2173 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[01:57:37] RECOVERY - Nginx local proxy to apache on mw2173 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 2.796 second response time
[01:57:47] RECOVERY - HHVM rendering on mw2171 is OK: HTTP OK: HTTP/1.1 200 OK - 79495 bytes in 7.951 second response time
[01:58:25] RECOVERY - Nginx local proxy to apache on mw2171 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 2.950 second response time
[02:02:44] it finished anyways besides claiming the run failed
[02:04:05] RECOVERY - puppet last run on mw2172 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[02:14:15] RECOVERY - Check the NTP synchronisation status of timesyncd on mw2171 is OK: OK: synced at Sat 2018-04-28 02:14:11 UTC.
[02:57:46] 10Operations, 10GlobalRename, 10MediaWiki-extensions-CentralAuth, 10Wikimedia-Site-requests, 10Wikimedia-log-errors: Please unblock stuck global renames at Meta-Wiki - https://phabricator.wikimedia.org/T193254#4165761 (101997kB) 05Resolved>03Open [[https://meta.wikimedia.org/wiki/Special:GlobalRename...
[03:08:46] RECOVERY - nutcracker process on mw2173 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker
[03:08:46] RECOVERY - nutcracker port on mw2173 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212
[03:09:16] RECOVERY - Check systemd state on mw2173 is OK: OK - running: The system is fully operational
[03:09:25] RECOVERY - nutcracker process on mw2171 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker
[03:09:35] RECOVERY - nutcracker port on mw2171 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212
[03:09:45] RECOVERY - nutcracker process on mw2172 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker
[03:09:55] RECOVERY - Check systemd state on mw2171 is OK: OK - running: The system is fully operational
[03:09:57] RECOVERY - Check systemd state on mw2172 is OK: OK - running: The system is fully operational
[03:10:05] RECOVERY - nutcracker port on mw2172 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212
[03:10:12] rebooted them
[03:10:25] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[03:25:46] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 792.27 seconds
[03:26:35] RECOVERY - mediawiki-installation DSH group on mw2171 is OK: OK
[04:04:56] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 211.44 seconds
[04:17:45] RECOVERY - mediawiki-installation DSH group on mw2172 is OK: OK
[05:11:33] 10Operations, 10GlobalRename, 10MediaWiki-extensions-CentralAuth, 10Wikimedia-Site-requests, 10Wikimedia-log-errors: Please unblock stuck global renames at Meta-Wiki - https://phabricator.wikimedia.org/T193254#4164229 (10Nirmos) Is this because of https://gerrit.wikimedia.org/r/419520 for T167246?
[05:19:51] !log reimaged snapshot1005 to stretch
[05:19:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:21:34] (03PS1) 10ArielGlenn: use php7 for all dumps-relate things on snapshot1005 [puppet] - 10https://gerrit.wikimedia.org/r/429539 (https://phabricator.wikimedia.org/T181029)
[05:23:43] (03PS2) 10ArielGlenn: use php7 for all dumps-related things on snapshot1005 [puppet] - 10https://gerrit.wikimedia.org/r/429539 (https://phabricator.wikimedia.org/T181029)
[05:24:30] (03CR) 10ArielGlenn: [C: 032] use php7 for all dumps-related things on snapshot1005 [puppet] - 10https://gerrit.wikimedia.org/r/429539 (https://phabricator.wikimedia.org/T181029) (owner: 10ArielGlenn)
[05:57:35] (03PS3) 10ArielGlenn: pull phabricator dumps from phab server to dumps web server [puppet] - 10https://gerrit.wikimedia.org/r/429197 (https://phabricator.wikimedia.org/T188726)
[05:58:10] (03CR) 10jerkins-bot: [V: 04-1] pull phabricator dumps from phab server to dumps web server [puppet] - 10https://gerrit.wikimedia.org/r/429197 (https://phabricator.wikimedia.org/T188726) (owner: 10ArielGlenn)
[06:03:36] (03PS4) 10ArielGlenn: pull phabricator dumps from phab server to dumps web server [puppet] - 10https://gerrit.wikimedia.org/r/429197 (https://phabricator.wikimedia.org/T188726)
[06:04:12] (03CR) 10jerkins-bot: [V: 04-1] pull phabricator dumps from phab server to dumps web server [puppet] - 10https://gerrit.wikimedia.org/r/429197 (https://phabricator.wikimedia.org/T188726) (owner: 10ArielGlenn)
[06:25:28] yeah fine. I'll deal with you later
[06:28:33] PROBLEM - puppet last run on labvirt1014 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/nova/policy.json]
[06:30:24] PROBLEM - puppet last run on dbmonitor2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/lib/nagios/plugins/check_conntrack]
[06:31:43] PROBLEM - puppet last run on analytics1048 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/rsyslog.d/10-puppet-agent.conf]
[06:56:44] RECOVERY - puppet last run on analytics1048 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[06:58:34] RECOVERY - puppet last run on labvirt1014 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[07:00:24] RECOVERY - puppet last run on dbmonitor2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:41:13] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1953 bytes in 0.102 second response time
[07:57:29] (03CR) 10Smalyshev: wdqs: add standard prometheus JVM monitoring to blazegraph (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/429382 (https://phabricator.wikimedia.org/T192759) (owner: 10Gehel)
[08:34:53] PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: 9.938e+04 ge 1.5e+04 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[08:35:54] RECOVERY - kubelet operational latencies on kubernetes1003 is OK: (C)1.5e+04 ge (W)1e+04 ge 4246 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[08:42:23] 10Puppet, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team, 10Services, and 2 others: deployment-prep has jobqueue issues - https://phabricator.wikimedia.org/T192473#4165878 (10EddieGP) >>! In T192473#4163154, aaron wrote: > I'm not so familiar with the Kafka system (only the basic concept). This...
[12:10:35] 10Operations, 10GlobalRename, 10MediaWiki-extensions-CentralAuth, 10Wikimedia-Site-requests, 10Wikimedia-log-errors: Please unblock stuck global renames at Meta-Wiki - https://phabricator.wikimedia.org/T193254#4165953 (10MarcoAurelio) a:05Tgr>03None Removing asignee to let people know that on this re...
[12:33:02] (03CR) 10Gehel: wdqs: add standard prometheus JVM monitoring to blazegraph (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/429382 (https://phabricator.wikimedia.org/T192759) (owner: 10Gehel)
[12:35:04] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 34 probes of 300 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[12:37:42] (03PS1) 10Gehel: wdqs: enable UseNUMA on blazegraph and updater [puppet] - 10https://gerrit.wikimedia.org/r/429552
[12:40:04] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 5 probes of 300 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[13:28:36] (03PS1) 10Ladsgroup: labs: enable wp10 model for ores extension in enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429554 (https://phabricator.wikimedia.org/T175757)
[13:29:25] (03CR) 10jerkins-bot: [V: 04-1] labs: enable wp10 model for ores extension in enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429554 (https://phabricator.wikimedia.org/T175757) (owner: 10Ladsgroup)
[13:37:12] (03PS2) 10Ladsgroup: labs: enable wp10 model for ores extension in enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429554 (https://phabricator.wikimedia.org/T175757)
[13:42:53] (03CR) 10Ladsgroup: [C: 032] "labs-only patch" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429554 (https://phabricator.wikimedia.org/T175757) (owner: 10Ladsgroup)
[13:44:13] (03Merged) 10jenkins-bot: labs: enable wp10 model for ores extension in enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429554 (https://phabricator.wikimedia.org/T175757) (owner: 10Ladsgroup)
[13:45:06] ^ rebased on tin
[13:49:41] (03CR) 10jenkins-bot: labs: enable wp10 model for ores extension in enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429554 (https://phabricator.wikimedia.org/T175757) (owner: 10Ladsgroup)
[15:17:34] PROBLEM - Host db2081 is DOWN: PING CRITICAL - Packet loss = 100%
[15:18:33] RECOVERY - Host db2081 is UP: PING OK - Packet loss = 0%, RTA = 36.09 ms
[15:20:43] PROBLEM - MariaDB Slave SQL: s8 on db2081 is CRITICAL: CRITICAL slave_sql_state could not connect
[15:21:29] PROBLEM - mysqld processes on db2081 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[15:21:33] PROBLEM - MariaDB Slave IO: s8 on db2081 is CRITICAL: CRITICAL slave_io_state could not connect
[15:21:56] <_joe_> uhm
[15:22:23] I'm on my way home
[15:22:34] Being codfw means it is a slave not in use
[15:22:52] Can someone silence it and I will take care of it in 15 minutes once I get to my laptop?
[15:23:07] <_joe_> doing it
[15:23:11] Thanks
[15:23:19] I'm in the metro
[15:23:34] Just silence it and forget about it
[15:24:45] <_joe_> mysql is not running there, FTR
[15:25:02] Yeah I guess it craahed
[15:25:06] Crashed
[15:26:07] It is codfw so I will take a quick look and create a task for it to be taken care of on Monday
[15:26:09] <_joe_> gave it 2 hours of downtime
[15:26:14] Cool thanks
[15:26:15] <_joe_> take your time
[15:26:21] I will take it from there
[15:27:46] I am here
[15:27:55] Hey
[15:28:04] a crash?
[15:28:10] I'm on my way home
[15:28:23] I told Joe to silence it and forget it, I will take care of it
[15:28:38] I guess it crashed
[15:28:46] I was going to take a quick look and create a task
[15:29:02] And we can look at it on Monday, it is a slave anyways
[15:30:29] no logs
[15:30:42] Uptime?
[15:31:34] /opt/wmf-mariadb10/bin/mysqld: Normal shutdown
[15:31:37] Just disable notifications and create a task. Not worth spending time on a Saturday for a not in use server I would say
[15:31:46] What's the server uptime?
[15:31:47] but it was the 180418
[15:33:38] it didn't crash
[15:33:42] it was rebooted
[15:33:56] There you go...
[15:33:57] reboot system boot 4.9.0-6-amd64 Sat Apr 28 15:18 still running
[15:34:21] well, normally that means it hard had issues
[15:34:28] Yeah, most likely
[15:34:41] Create a task and we can look into it on Monday
[15:34:43] will disable alerts create a ticket and
[15:34:46] ^yep
[15:34:58] <3
[15:35:04] I'm late for the party
[15:35:29] Nothing to see here! A slave not in use, go back to your Saturday!
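[Editor's note] The exchange above ("Uptime?", the "Normal shutdown" log line, the "reboot system boot ... still running" paste) is the standard triage for telling a host reboot apart from a daemon crash. A rough sketch of those checks, assuming a Linux host with wtmp and /proc; paths and the commented grep are illustrative:

```shell
# 1. Host reboot vs. process crash: wtmp records boots. The "reboot system
#    boot 4.9.0-6-amd64 Sat Apr 28 15:18 still running" line pasted above
#    is the shape of this output:
last -x reboot 2>/dev/null | head -n 3 || true
# 2. Cross-check with the kernel's own idea of how long it has been up:
boot_time=$(uptime -s 2>/dev/null || cut -d. -f1 /proc/uptime)
echo "boot reference: ${boot_time}"
# 3. If mysqld logged "Normal shutdown" right before the boot (as it did
#    here), the daemon was stopped cleanly by the reboot rather than
#    crashing. Log path is illustrative:
# grep -i 'normal shutdown' /opt/wmf-mariadb10/data/*.err | tail -n 1
```

A clean "Normal shutdown" plus a fresh boot record points at a hardware- or firmware-initiated reset, which is exactly the conclusion reached in the channel.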
[15:36:04] ack, thanks
[15:36:47] I will depool it anyway
[15:39:10] 10Operations, 10ops-codfw, 10DBA: db2081 crashed/rebooted, probably due to hardware failure - https://phabricator.wikimedia.org/T193325#4166043 (10jcrespo)
[15:40:40] (03PS1) 10Jcrespo: mariadb: Depool db2081, crashed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429573 (https://phabricator.wikimedia.org/T193325)
[15:44:46] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db2081, crashed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429573 (https://phabricator.wikimedia.org/T193325) (owner: 10Jcrespo)
[15:45:31] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: db2081 crashed/rebooted, probably due to hardware failure - https://phabricator.wikimedia.org/T193325#4166057 (10jcrespo) p:05Triage>03High
[15:46:02] (03Merged) 10jenkins-bot: mariadb: Depool db2081, crashed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429573 (https://phabricator.wikimedia.org/T193325) (owner: 10Jcrespo)
[15:49:00] 10Operations, 10ops-codfw, 10DBA: db2081 crashed/rebooted, probably due to hardware failure - https://phabricator.wikimedia.org/T193325#4166060 (10jcrespo)
[15:49:18] !log jynus@tin Synchronized wmf-config/db-codfw.php: Depool db2081, crashed (duration: 01m 00s)
[15:49:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:49:44] (03CR) 10jenkins-bot: mariadb: Depool db2081, crashed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429573 (https://phabricator.wikimedia.org/T193325) (owner: 10Jcrespo)
[15:52:20] (03PS1) 10Jcrespo: mariadb: Disable notifications for db2081, crashed [puppet] - 10https://gerrit.wikimedia.org/r/429575 (https://phabricator.wikimedia.org/T193325)
[15:52:58] (03CR) 10Jcrespo: [C: 032] mariadb: Disable notifications for db2081, crashed [puppet] - 10https://gerrit.wikimedia.org/r/429575 (https://phabricator.wikimedia.org/T193325) (owner: 10Jcrespo)
[15:56:40] 10Operations, 10ops-codfw, 10DBA: db2081 crashed/rebooted, probably due to hardware failure - https://phabricator.wikimedia.org/T193325#4166077 (10jcrespo)
[17:06:14] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1970 bytes in 0.104 second response time
[17:12:40] 10Operations, 10ops-codfw, 10DBA: db2081 crashed/rebooted, probably due to hardware failure - https://phabricator.wikimedia.org/T193325#4166113 (10Marostegui) a:03Papaul Same error we experienced at: T175973#3615656 ``` PWR2262: The Intel Management Engine has reported an internal system error. 2018-04-2...
[17:21:59] 10Operations, 10ops-codfw, 10DBA: db2081 crashed/rebooted, probably due to hardware failure - https://phabricator.wikimedia.org/T193325#4166120 (10Papaul) Okay will check that on Monday.
[18:38:14] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1971 bytes in 0.118 second response time
[23:15:05] PROBLEM - Host db1098 is DOWN: PING CRITICAL - Packet loss = 100%
[23:16:04] RECOVERY - Host db1098 is UP: PING OK - Packet loss = 0%, RTA = 1.60 ms
[23:18:15] PROBLEM - MariaDB Slave SQL: s7 on db1098 is CRITICAL: CRITICAL slave_sql_state could not connect
[23:18:24] PROBLEM - MariaDB Slave IO: s6 on db1098 is CRITICAL: CRITICAL slave_io_state could not connect
[23:18:38] PROBLEM - mysqld processes on db1098 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[23:18:38] PROBLEM - MariaDB Slave IO: s7 on db1098 is CRITICAL: CRITICAL slave_io_state could not connect
[23:19:04] PROBLEM - MariaDB Slave SQL: s6 on db1098 is CRITICAL: CRITICAL slave_sql_state could not connect
[23:19:53] * volans|off looking
[23:20:03] <_joe_> volans|off: I'm here
[23:21:09] <_joe_> we need to depool it I guess
[23:22:43] yes, but it would already been depooled by mediawiki, also has only weight 1 on the main cluster, is a recentcchanges slave
[23:22:49] I'm looking at the logs
[23:23:04] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[23:23:15] <_joe_> yes but as you know
[23:23:23] <_joe_> one db down means mediawiki has big issues
[23:23:36] <_joe_> call jaime or manuel
[23:23:45] <_joe_> I'll post a first patch
[23:23:48] if you want to do the patch to depool in the meanwhile
[23:23:59] that would be great, just comment all it's lines
[23:24:57] the host got rebooted btw
[23:25:04] uptime 9 min
[23:25:10] <_joe_> volans|off: and mysql crashed I guess
[23:25:19] cannot find any useful log so far
[23:25:34] PROBLEM - MariaDB Slave Lag: s6 on db1098 is CRITICAL: CRITICAL slave_sql_lag could not connect
[23:25:34] PROBLEM - MariaDB Slave Lag: s7 on db1098 is CRITICAL: CRITICAL slave_sql_lag could not connect
[23:25:43] (03PS1) 10Giuseppe Lavagetto: Depool db1098, crashed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429642
[23:26:01] <_joe_> volans|off: ^^ but I didn't add another host for recentchanges
[23:26:11] <_joe_> so I don't know if it would create further issues
[23:26:19] <_joe_> hence the need for the phone call
[23:26:48] <_joe_> or if you know if we can divert a slave, or if this is ok, no need to call
[23:26:50] volans|off: another one crashed like that this morning iirc, the error is only visible via the idrac web interface
[23:26:50] https://phabricator.wikimedia.org/T193325#4166113
[23:26:51] surely don't add another, it should be ok with just one for now
[23:26:54] (03CR) 10jerkins-bot: [V: 04-1] Depool db1098, crashed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429642 (owner: 10Giuseppe Lavagetto)
[23:27:20] <_joe_> damn linter
[23:31:32] (03PS2) 10Giuseppe Lavagetto: Depool db1098, crashed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429642
[23:31:58] <_joe_> volans|off: can you check the patch?
[23:32:01] sure
[23:32:55] <_joe_> It's well past 1 AM, I was about to go to bed
[23:33:22] _joe_: missing one, line 459, but I can take care of it if you have to go
[23:33:32] <_joe_> no don't worry
[23:34:22] (03PS3) 10Giuseppe Lavagetto: Depool db1098, crashed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429642
[23:34:45] <_joe_> done
[23:35:03] (03CR) 10Volans: [C: 031] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429642 (owner: 10Giuseppe Lavagetto)
[23:35:18] looks good, we can add a comment with a task later on ;)
[23:35:20] <_joe_> https://cdn.meme.am/cache/instances/folder46/65289046.jpg
[23:35:49] <_joe_> very fittingly, that jpeg is broken
[23:36:00] rotfl
[23:36:05] their db broken too? :D
[23:36:18] (03CR) 10Giuseppe Lavagetto: [C: 032] Depool db1098, crashed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429642 (owner: 10Giuseppe Lavagetto)
[23:36:45] <_joe_> volans|off: can you do the deploy? I have to go feed my cat or she'll wake up everyone in the house
[23:37:26] sure
[23:37:32] very useful error so far: 'The Intel Management Engine has reported an internal system error.'
[23:39:41] (03CR) 10jenkins-bot: Depool db1098, crashed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429642 (owner: 10Giuseppe Lavagetto)
[23:41:27] <_joe_> hey I'm back, are you already deploying?
[23:41:30] yes
[23:41:34] scap running
[23:42:06] !log volans@tin Synchronized wmf-config/db-eqiad.php: Depool db1098 (crashed) (duration: 01m 01s)
[23:42:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:42:19] done
[23:43:14] <_joe_> fatals are down
[23:43:27] <_joe_> since the minute you deployed
[23:43:39] <_joe_> did you try to reach the DBAs for confirmation?
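[Editor's note] The depool step reviewed and deployed above ("just comment all it's lines", then the "missing one, line 459" catch, then scap syncing wmf-config/db-eqiad.php) can be sketched mechanically. The two-line excerpt below is invented, not the real db-eqiad.php; it only shows the shape of the edit: every weight entry naming the crashed host gets a PHP `//` comment prefix so the load balancer stops sending it traffic:

```shell
# Invented excerpt standing in for wmf-config/db-eqiad.php entries
# (host names and weights are illustrative).
cat > /tmp/db-eqiad-excerpt.php <<'EOF'
    'db1096' => 100,
    'db1098' => 1,
EOF
# Comment out every line referencing the crashed host, preserving indent.
sed -i "s|^\([[:space:]]*\)'db1098'|\1// 'db1098'|" /tmp/db-eqiad-excerpt.php
cat /tmp/db-eqiad-excerpt.php
```

The real change also had to cover the host's special-group entries (it was a recentchanges slave), which is why a second pass caught a missed line before the file was synced with scap.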
[23:47:14] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[23:48:00] <_joe_> machine stats for db1098 show absolutely nothing unusual before the crash
[23:50:13] <_joe_> interestingly, the crash happened around the time at which puppet runs
[23:51:10] <_joe_> but no, crash started around 23:12, according to syslog
[23:51:32] <_joe_> but of course there is no trace of anything in the logs around that time
[23:55:12] 10Operations, 10DBA: db1098 crashed and got rebooted - https://phabricator.wikimedia.org/T193331#4166255 (10Volans)
[23:57:13] 10Operations, 10DBA: db1098 crashed and got rebooted - https://phabricator.wikimedia.org/T193331#4166265 (10Volans) p:05Triage>03High
[23:59:25] PROBLEM - HTTP availability for Nginx -SSL terminators- on einsteinium is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[23:59:41] <_joe_> that's a bit late to the party
[23:59:46] <_joe_> the issue is long over