[00:02:17] PROBLEM - HHVM rendering on mw2172 is CRITICAL: connect to address 10.192.32.60 and port 80: Connection refused
[00:02:17] PROBLEM - HHVM processes on mw2171 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:02:18] PROBLEM - nutcracker port on mw2171 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:02:18] PROBLEM - nutcracker process on mw2172 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:02:18] PROBLEM - puppet last run on mw2173 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:03:57] PROBLEM - HHVM rendering on mw2171 is CRITICAL: connect to address 10.192.32.59 and port 80: Connection refused
[00:03:58] PROBLEM - nutcracker process on mw2171 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:03:58] PROBLEM - puppet last run on mw2172 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:05:38] PROBLEM - Apache HTTP on mw2173 is CRITICAL: connect to address 10.192.32.61 and port 80: Connection refused
[00:05:38] PROBLEM - puppet last run on mw2171 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:05:38] PROBLEM - MD RAID on mw2173 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:07:27] PROBLEM - Apache HTTP on mw2172 is CRITICAL: connect to address 10.192.32.60 and port 80: Connection refused
[00:07:27] PROBLEM - MD RAID on mw2172 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:07:27] PROBLEM - Check size of conntrack table on mw2173 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:09:07] PROBLEM - Apache HTTP on mw2171 is CRITICAL: connect to address 10.192.32.59 and port 80: Connection refused
[00:09:07] PROBLEM - Nginx local proxy to apache on mw2173 is CRITICAL: connect to address 10.192.32.61 and port 443: Connection refused
[00:09:07] PROBLEM - MD RAID on mw2171 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:09:08] PROBLEM - Check systemd state on mw2173 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:09:08] PROBLEM - Check size of conntrack table on mw2172 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:10:47] PROBLEM - Nginx local proxy to apache on mw2172 is CRITICAL: connect to address 10.192.32.60 and port 443: Connection refused
[00:10:48] PROBLEM - Check size of conntrack table on mw2171 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:10:48] PROBLEM - Check systemd state on mw2172 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:10:48] PROBLEM - Check the NTP synchronisation status of timesyncd on mw2173 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:12:37] PROBLEM - Nginx local proxy to apache on mw2171 is CRITICAL: connect to address 10.192.32.59 and port 443: Connection refused
[00:12:37] PROBLEM - Check systemd state on mw2171 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:12:37] PROBLEM - Check the NTP synchronisation status of timesyncd on mw2172 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:12:37] PROBLEM - Check whether ferm is active by checking the default input chain on mw2173 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:12:37] PROBLEM - configured eth on mw2173 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:14:17] PROBLEM - Check the NTP synchronisation status of timesyncd on mw2171 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:14:17] PROBLEM - Check whether ferm is active by checking the default input chain on mw2172 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:14:17] PROBLEM - configured eth on mw2172 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:14:17] PROBLEM - DPKG on mw2173 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:14:17] PROBLEM - dhclient process on mw2173 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:14:58] sigh, yes, i got that
[00:15:49] always starts when i look away for 5 min, heh
[00:17:47] PROBLEM - DPKG on mw2171 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:17:47] PROBLEM - mediawiki-installation DSH group on mw2172 is CRITICAL: Host mw2172 is not in mediawiki-installation dsh group
[00:17:47] PROBLEM - dhclient process on mw2171 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:17:47] PROBLEM - HHVM processes on mw2173 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:17:47] PROBLEM - Disk space on mw2172 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:17:47] PROBLEM - nutcracker port on mw2173 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:19:28] PROBLEM - HHVM rendering on mw2173 is CRITICAL: connect to address 10.192.32.61 and port 80: Connection refused
[00:19:28] PROBLEM - mediawiki-installation DSH group on mw2171 is CRITICAL: Host mw2171 is not in mediawiki-installation dsh group
[00:19:28] PROBLEM - Disk space on mw2171 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:19:28] PROBLEM - HHVM processes on mw2172 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:19:28] PROBLEM - nutcracker port on mw2172 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:19:28] PROBLEM - nutcracker process on mw2173 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:20:34] 10Operations, 10Mail: E-mail for people in different OIT LDAP object unit - https://phabricator.wikimedia.org/T159750#4165673 (10bbogaert) Hi All, Could we (Office IT) discuss this with someone in Ops? We are currently blocked on my tasks/projects until we get our alignment on LDAP gorups and the way our MX s...
[00:33:21] (03PS2) 10Dzahn: base: update version of gen_fingerprints script [puppet] - 10https://gerrit.wikimedia.org/r/429114
[00:33:37] (03CR) 10jerkins-bot: [V: 04-1] base: update version of gen_fingerprints script [puppet] - 10https://gerrit.wikimedia.org/r/429114 (owner: 10Dzahn)
[00:34:07] (03CR) 10Dzahn: "PS2: debugged and got it to work by removing the "\s" from the awk line." [puppet] - 10https://gerrit.wikimedia.org/r/429114 (owner: 10Dzahn)
[00:34:26] (03PS3) 10Dzahn: base: update version of gen_fingerprints script [puppet] - 10https://gerrit.wikimedia.org/r/429114
[00:36:14] (03PS4) 10Dzahn: base: update version of gen_fingerprints script [puppet] - 10https://gerrit.wikimedia.org/r/429114
[01:09:55] 10Operations, 10Design-Research: Edit optoutresearch@ mailing list recipients - https://phabricator.wikimedia.org/T100860#4165706 (10Dzahn) a:05Dzahn>03bbogaert
[01:18:02] 10Operations, 10Analytics, 10Analytics-Data-Quality, 10Analytics-Kanban, and 6 others: Proxies information gone from Zero portal. Opera mini pageviews geolocating to wrong country - https://phabricator.wikimedia.org/T187014#4165709 (10Tbayer) Thanks @dr0ptp4kt for the explanation! To add to the charts pos...
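[Editor's note] The fix Dzahn describes at 00:34:07 (removing "\s" from the awk line) is a portability issue worth spelling out: `\s` is a GNU regex extension, not part of POSIX awk, so mawk and POSIX-mode awk fail to match it. A minimal sketch of the portable alternative follows; the pattern and input are illustrative, not the actual gen_fingerprints script:

```shell
# "\s" is not a POSIX awk escape; only some awks (e.g. recent gawk) accept it.
# The portable spelling for "any whitespace" is the [[:space:]] bracket class.
printf '2048 SHA256:abcd host (RSA)\n' |
  awk '/^[0-9]+[[:space:]]+SHA256:/ { print $2 }'
# prints "SHA256:abcd"
```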
[01:23:38] RECOVERY - Apache HTTP on mw2172 is OK: HTTP OK: HTTP/1.1 200 OK - 10975 bytes in 0.073 second response time
[01:24:18] PROBLEM - Disk space on labtestnet2001 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=75%)
[01:24:57] RECOVERY - Apache HTTP on mw2173 is OK: HTTP OK: HTTP/1.1 200 OK - 10975 bytes in 0.073 second response time
[01:25:18] RECOVERY - Apache HTTP on mw2171 is OK: HTTP OK: HTTP/1.1 200 OK - 10975 bytes in 0.073 second response time
[01:36:37] RECOVERY - Check size of conntrack table on mw2172 is OK: OK: nf_conntrack is 0 % full
[01:36:38] RECOVERY - Check whether ferm is active by checking the default input chain on mw2172 is OK: OK ferm input default policy is set
[01:36:38] RECOVERY - configured eth on mw2172 is OK: OK - interfaces up
[01:36:47] RECOVERY - HHVM processes on mw2172 is OK: PROCS OK: 6 processes with command name hhvm
[01:36:57] RECOVERY - MD RAID on mw2172 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0
[01:37:07] RECOVERY - HHVM processes on mw2173 is OK: PROCS OK: 6 processes with command name hhvm
[01:37:07] RECOVERY - MD RAID on mw2173 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0
[01:37:08] RECOVERY - Disk space on mw2172 is OK: DISK OK
[01:37:38] RECOVERY - DPKG on mw2173 is OK: All packages OK
[01:37:38] RECOVERY - dhclient process on mw2173 is OK: PROCS OK: 0 processes with command name dhclient
[01:37:57] RECOVERY - Check size of conntrack table on mw2173 is OK: OK: nf_conntrack is 0 % full
[01:37:57] RECOVERY - configured eth on mw2173 is OK: OK - interfaces up
[01:37:58] RECOVERY - Check whether ferm is active by checking the default input chain on mw2173 is OK: OK ferm input default policy is set
[01:39:18] that's all fine and stuff ..yet it still said "Unable to run wmf-auto-reimage-host: Failed to puppet_first_run" which is another new issue
[01:40:48] RECOVERY - Check the NTP synchronisation status of timesyncd on mw2173 is OK: OK: synced at Sat 2018-04-28 01:40:44 UTC.
[01:41:16] 1 lets me login. 2 don't. none of them work with install-console.. sigh
[01:42:18] RECOVERY - Nginx local proxy to apache on mw2172 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 5.204 second response time
[01:42:37] RECOVERY - Check the NTP synchronisation status of timesyncd on mw2172 is OK: OK: synced at Sat 2018-04-28 01:42:30 UTC.
[01:42:48] RECOVERY - HHVM rendering on mw2172 is OK: HTTP OK: HTTP/1.1 200 OK - 79495 bytes in 6.517 second response time
[01:50:58] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[01:51:58] RECOVERY - Disk space on mw2171 is OK: DISK OK
[01:51:58] RECOVERY - HHVM processes on mw2171 is OK: PROCS OK: 6 processes with command name hhvm
[01:52:18] RECOVERY - DPKG on mw2171 is OK: All packages OK
[01:52:27] RECOVERY - dhclient process on mw2171 is OK: PROCS OK: 0 processes with command name dhclient
[01:52:27] RECOVERY - Check size of conntrack table on mw2171 is OK: OK: nf_conntrack is 0 % full
[01:52:47] RECOVERY - MD RAID on mw2171 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0
[01:55:38] RECOVERY - puppet last run on mw2171 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[01:56:57] RECOVERY - HHVM rendering on mw2173 is OK: HTTP OK: HTTP/1.1 200 OK - 79495 bytes in 5.664 second response time
[01:57:18] RECOVERY - puppet last run on mw2173 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[01:57:37] RECOVERY - Nginx local proxy to apache on mw2173 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 2.796 second response time
[01:57:47] RECOVERY - HHVM rendering on mw2171 is OK: HTTP OK: HTTP/1.1 200 OK - 79495 bytes in 7.951 second response time
[01:58:25] RECOVERY - Nginx local proxy to apache on mw2171 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 2.950 second response time
[02:02:44] it finished anyways besides claiming the run failed
[02:04:05] RECOVERY - puppet last run on mw2172 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[02:14:15] RECOVERY - Check the NTP synchronisation status of timesyncd on mw2171 is OK: OK: synced at Sat 2018-04-28 02:14:11 UTC.
[02:57:46] 10Operations, 10GlobalRename, 10MediaWiki-extensions-CentralAuth, 10Wikimedia-Site-requests, 10Wikimedia-log-errors: Please unblock stuck global renames at Meta-Wiki - https://phabricator.wikimedia.org/T193254#4165761 (101997kB) 05Resolved>03Open [[https://meta.wikimedia.org/wiki/Special:GlobalRename...
[03:08:46] RECOVERY - nutcracker process on mw2173 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker
[03:08:46] RECOVERY - nutcracker port on mw2173 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212
[03:09:16] RECOVERY - Check systemd state on mw2173 is OK: OK - running: The system is fully operational
[03:09:25] RECOVERY - nutcracker process on mw2171 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker
[03:09:35] RECOVERY - nutcracker port on mw2171 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212
[03:09:45] RECOVERY - nutcracker process on mw2172 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker
[03:09:55] RECOVERY - Check systemd state on mw2171 is OK: OK - running: The system is fully operational
[03:09:57] RECOVERY - Check systemd state on mw2172 is OK: OK - running: The system is fully operational
[03:10:05] RECOVERY - nutcracker port on mw2172 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212
[03:10:12] rebooted them
[03:10:25] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[03:25:46] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 792.27 seconds
[03:26:35] RECOVERY - mediawiki-installation DSH group on mw2171 is OK: OK
[04:04:56] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 211.44 seconds
[04:17:45] RECOVERY - mediawiki-installation DSH group on mw2172 is OK: OK
[05:11:33] 10Operations, 10GlobalRename, 10MediaWiki-extensions-CentralAuth, 10Wikimedia-Site-requests, 10Wikimedia-log-errors: Please unblock stuck global renames at Meta-Wiki - https://phabricator.wikimedia.org/T193254#4164229 (10Nirmos) Is this because of https://gerrit.wikimedia.org/r/419520 for T167246?
[05:19:51] !log reimaged snapshot1005 to stretch
[05:19:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:21:34] (03PS1) 10ArielGlenn: use php7 for all dumps-relate things on snapshot1005 [puppet] - 10https://gerrit.wikimedia.org/r/429539 (https://phabricator.wikimedia.org/T181029)
[05:23:43] (03PS2) 10ArielGlenn: use php7 for all dumps-related things on snapshot1005 [puppet] - 10https://gerrit.wikimedia.org/r/429539 (https://phabricator.wikimedia.org/T181029)
[05:24:30] (03CR) 10ArielGlenn: [C: 032] use php7 for all dumps-related things on snapshot1005 [puppet] - 10https://gerrit.wikimedia.org/r/429539 (https://phabricator.wikimedia.org/T181029) (owner: 10ArielGlenn)
[05:57:35] (03PS3) 10ArielGlenn: pull phabricator dumps from phab server to dumps web server [puppet] - 10https://gerrit.wikimedia.org/r/429197 (https://phabricator.wikimedia.org/T188726)
[05:58:10] (03CR) 10jerkins-bot: [V: 04-1] pull phabricator dumps from phab server to dumps web server [puppet] - 10https://gerrit.wikimedia.org/r/429197 (https://phabricator.wikimedia.org/T188726) (owner: 10ArielGlenn)
[06:03:36] (03PS4) 10ArielGlenn: pull phabricator dumps from phab server to dumps web server [puppet] - 10https://gerrit.wikimedia.org/r/429197 (https://phabricator.wikimedia.org/T188726)
[06:04:12] (03CR) 10jerkins-bot: [V: 04-1] pull phabricator dumps from phab server to dumps web server [puppet] - 10https://gerrit.wikimedia.org/r/429197 (https://phabricator.wikimedia.org/T188726) (owner: 10ArielGlenn)
[06:25:28] yeah fine. I'll deal with you later
[06:28:33] PROBLEM - puppet last run on labvirt1014 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/nova/policy.json]
[06:30:24] PROBLEM - puppet last run on dbmonitor2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/lib/nagios/plugins/check_conntrack]
[06:31:43] PROBLEM - puppet last run on analytics1048 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/rsyslog.d/10-puppet-agent.conf]
[06:56:44] RECOVERY - puppet last run on analytics1048 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[06:58:34] RECOVERY - puppet last run on labvirt1014 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[07:00:24] RECOVERY - puppet last run on dbmonitor2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:41:13] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1953 bytes in 0.102 second response time
[07:57:29] (03CR) 10Smalyshev: wdqs: add standard prometheus JVM monitoring to blazegraph (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/429382 (https://phabricator.wikimedia.org/T192759) (owner: 10Gehel)
[08:34:53] PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: 9.938e+04 ge 1.5e+04 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[08:35:54] RECOVERY - kubelet operational latencies on kubernetes1003 is OK: (C)1.5e+04 ge (W)1e+04 ge 4246 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[08:42:23] 10Puppet, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team, 10Services, and 2 others: deployment-prep has jobqueue issues - https://phabricator.wikimedia.org/T192473#4165878 (10EddieGP) >>! In T192473#4163154, aaron wrote: > I'm not so familiar with the Kafka system (only the basic concept). This...
[12:10:35] 10Operations, 10GlobalRename, 10MediaWiki-extensions-CentralAuth, 10Wikimedia-Site-requests, 10Wikimedia-log-errors: Please unblock stuck global renames at Meta-Wiki - https://phabricator.wikimedia.org/T193254#4165953 (10MarcoAurelio) a:05Tgr>03None Removing asignee to let people know that on this re...
[12:33:02] (03CR) 10Gehel: wdqs: add standard prometheus JVM monitoring to blazegraph (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/429382 (https://phabricator.wikimedia.org/T192759) (owner: 10Gehel)
[12:35:04] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 34 probes of 300 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[12:37:42] (03PS1) 10Gehel: wdqs: enable UseNUMA on blazegraph and updater [puppet] - 10https://gerrit.wikimedia.org/r/429552
[12:40:04] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 5 probes of 300 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[13:28:36] (03PS1) 10Ladsgroup: labs: enable wp10 model for ores extension in enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429554 (https://phabricator.wikimedia.org/T175757)
[13:29:25] (03CR) 10jerkins-bot: [V: 04-1] labs: enable wp10 model for ores extension in enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429554 (https://phabricator.wikimedia.org/T175757) (owner: 10Ladsgroup)
[13:37:12] (03PS2) 10Ladsgroup: labs: enable wp10 model for ores extension in enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429554 (https://phabricator.wikimedia.org/T175757)
[13:42:53] (03CR) 10Ladsgroup: [C: 032] "labs-only patch" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429554 (https://phabricator.wikimedia.org/T175757) (owner: 10Ladsgroup)
[13:44:13] (03Merged) 10jenkins-bot: labs: enable wp10 model for ores extension in enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429554 (https://phabricator.wikimedia.org/T175757) (owner: 10Ladsgroup)
[13:45:06] ^ rebased on tin
[13:49:41] (03CR) 10jenkins-bot: labs: enable wp10 model for ores extension in enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429554 (https://phabricator.wikimedia.org/T175757) (owner: 10Ladsgroup)
[15:17:34] PROBLEM - Host db2081 is DOWN: PING CRITICAL - Packet loss = 100%
[15:18:33] RECOVERY - Host db2081 is UP: PING OK - Packet loss = 0%, RTA = 36.09 ms
[15:20:43] PROBLEM - MariaDB Slave SQL: s8 on db2081 is CRITICAL: CRITICAL slave_sql_state could not connect
[15:21:29] PROBLEM - mysqld processes on db2081 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[15:21:33] PROBLEM - MariaDB Slave IO: s8 on db2081 is CRITICAL: CRITICAL slave_io_state could not connect
[15:21:56] <_joe_> uhm
[15:22:23] I'm on my way home
[15:22:34] Being codfw means it is a slave not in use
[15:22:52] Can someone silence it and I will take care of it in 15 minutes once I get to my laptop?
[15:23:07] <_joe_> doing it
[15:23:11] Thanks
[15:23:19] I'm in the metro
[15:23:34] Just silence it and forget about it
[15:24:45] <_joe_> mysql is not running there, FTR
[15:25:02] Yeah I guess it craahed
[15:25:06] Crashed
[15:26:07] It is codfw so I will take a quick look and create a task for it to be taken care of on Monday
[15:26:09] <_joe_> gave it 2 hours of downtime
[15:26:14] Cool thanks
[15:26:15] <_joe_> take your time
[15:26:21] I will take it from there
[15:27:46] I am here
[15:27:55] Hey
[15:28:04] a crash?
[15:28:10] I'm on my way home
[15:28:23] I told Joe to silence it and forget it, I will take care of it
[15:28:38] I guess it crashed
[15:28:46] I was going to take a quick look and create a task
[15:29:02] And we can look at it on Monday, it is a slave anyways
[15:30:29] no logs
[15:30:42] Uptime?
[15:31:34] /opt/wmf-mariadb10/bin/mysqld: Normal shutdown
[15:31:37] Just disable notifications and create a task. Not worth spending time on a Saturday for a not in use server I would say
[15:31:46] What's the server uptime?
[15:31:47] but it was the 180418
[15:33:38] it didn't crash
[15:33:42] it was rebooted
[15:33:56] There you go...
[15:33:57] reboot system boot 4.9.0-6-amd64 Sat Apr 28 15:18 still running
[15:34:21] well, normally that means it hard had issues
[15:34:28] Yeah, most likely
[15:34:41] Create a task and we can look into it on Monday
[15:34:43] will disable alerts create a ticket and
[15:34:46] ^yep
[15:34:58] <3
[15:35:04] I'm late for the party
[15:35:29] Nothing to see here! A slave not in use, go back to your Saturday!
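[Editor's note] The exchange above ("Uptime?", the "Normal shutdown" log line, the "reboot system boot ... still running" paste) is the standard triage for telling a host reboot apart from a daemon crash. A rough sketch of those checks, assuming a Linux host with wtmp and /proc; paths and the commented grep are illustrative:

```shell
# 1. Host reboot vs. process crash: wtmp records boots. The "reboot system
#    boot 4.9.0-6-amd64 Sat Apr 28 15:18 still running" line pasted above
#    is the shape of this output:
last -x reboot 2>/dev/null | head -n 3 || true
# 2. Cross-check with the kernel's own idea of how long it has been up:
boot_time=$(uptime -s 2>/dev/null || cut -d. -f1 /proc/uptime)
echo "boot reference: ${boot_time}"
# 3. If mysqld logged "Normal shutdown" right before the boot (as it did
#    here), the daemon was stopped cleanly by the reboot rather than
#    crashing. Log path is illustrative:
# grep -i 'normal shutdown' /opt/wmf-mariadb10/data/*.err | tail -n 1
```

A clean "Normal shutdown" plus a fresh boot record points at a hardware- or firmware-initiated reset, which is exactly the conclusion reached in the channel.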
[15:36:04] ack, thanks
[15:36:47] I will depool it anyway
[15:39:10] 10Operations, 10ops-codfw, 10DBA: db2081 crashed/rebooted, probably due to hardware failure - https://phabricator.wikimedia.org/T193325#4166043 (10jcrespo)
[15:40:40] (03PS1) 10Jcrespo: mariadb: Depool db2081, crashed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429573 (https://phabricator.wikimedia.org/T193325)
[15:44:46] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db2081, crashed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429573 (https://phabricator.wikimedia.org/T193325) (owner: 10Jcrespo)
[15:45:31] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: db2081 crashed/rebooted, probably due to hardware failure - https://phabricator.wikimedia.org/T193325#4166057 (10jcrespo) p:05Triage>03High
[15:46:02] (03Merged) 10jenkins-bot: mariadb: Depool db2081, crashed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429573 (https://phabricator.wikimedia.org/T193325) (owner: 10Jcrespo)
[15:49:00] 10Operations, 10ops-codfw, 10DBA: db2081 crashed/rebooted, probably due to hardware failure - https://phabricator.wikimedia.org/T193325#4166060 (10jcrespo)
[15:49:18] !log jynus@tin Synchronized wmf-config/db-codfw.php: Depool db2081, crashed (duration: 01m 00s)
[15:49:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:49:44] (03CR) 10jenkins-bot: mariadb: Depool db2081, crashed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429573 (https://phabricator.wikimedia.org/T193325) (owner: 10Jcrespo)
[15:52:20] (03PS1) 10Jcrespo: mariadb: Disable notifications for db2081, crashed [puppet] - 10https://gerrit.wikimedia.org/r/429575 (https://phabricator.wikimedia.org/T193325)
[15:52:58] (03CR) 10Jcrespo: [C: 032] mariadb: Disable notifications for db2081, crashed [puppet] - 10https://gerrit.wikimedia.org/r/429575 (https://phabricator.wikimedia.org/T193325) (owner: 10Jcrespo)
[15:56:40] 10Operations, 10ops-codfw, 10DBA: db2081 crashed/rebooted, probably due to hardware failure - https://phabricator.wikimedia.org/T193325#4166077 (10jcrespo)
[17:06:14] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1970 bytes in 0.104 second response time
[17:12:40] 10Operations, 10ops-codfw, 10DBA: db2081 crashed/rebooted, probably due to hardware failure - https://phabricator.wikimedia.org/T193325#4166113 (10Marostegui) a:03Papaul Same error we experienced at: T175973#3615656 ``` PWR2262: The Intel Management Engine has reported an internal system error. 2018-04-2...
[17:21:59] 10Operations, 10ops-codfw, 10DBA: db2081 crashed/rebooted, probably due to hardware failure - https://phabricator.wikimedia.org/T193325#4166120 (10Papaul) Okay will check that on Monday.
[18:38:14] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1971 bytes in 0.118 second response time
[23:15:05] PROBLEM - Host db1098 is DOWN: PING CRITICAL - Packet loss = 100%
[23:16:04] RECOVERY - Host db1098 is UP: PING OK - Packet loss = 0%, RTA = 1.60 ms
[23:18:15] PROBLEM - MariaDB Slave SQL: s7 on db1098 is CRITICAL: CRITICAL slave_sql_state could not connect
[23:18:24] PROBLEM - MariaDB Slave IO: s6 on db1098 is CRITICAL: CRITICAL slave_io_state could not connect
[23:18:38] PROBLEM - mysqld processes on db1098 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[23:18:38] PROBLEM - MariaDB Slave IO: s7 on db1098 is CRITICAL: CRITICAL slave_io_state could not connect
[23:19:04] PROBLEM - MariaDB Slave SQL: s6 on db1098 is CRITICAL: CRITICAL slave_sql_state could not connect
[23:19:53] * volans|off looking
[23:20:03] <_joe_> volans|off: I'm here
[23:21:09] <_joe_> we need to depool it I guess
[23:22:43] yes, but it would already been depooled by mediawiki, also has only weight 1 on the main cluster, is a recentcchanges slave
[23:22:49] I'm looking at the logs
[23:23:04] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[23:23:15] <_joe_> yes but as you know
[23:23:23] <_joe_> one db down means mediawiki has big issues
[23:23:36] <_joe_> call jaime or manuel
[23:23:45] <_joe_> I'll post a first patch
[23:23:48] if you want to do the patch to depool in the meanwhile
[23:23:59] that would be great, just comment all it's lines
[23:24:57] the host got rebooted btw
[23:25:04] uptime 9 min
[23:25:10] <_joe_> volans|off: and mysql crashed I guess
[23:25:19] cannot find any useful log so far
[23:25:34] PROBLEM - MariaDB Slave Lag: s6 on db1098 is CRITICAL: CRITICAL slave_sql_lag could not connect
[23:25:34] PROBLEM - MariaDB Slave Lag: s7 on db1098 is CRITICAL: CRITICAL slave_sql_lag could not connect
[23:25:43] (03PS1) 10Giuseppe Lavagetto: Depool db1098, crashed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429642
[23:26:01] <_joe_> volans|off: ^^ but I didn't add another host for recentchanges
[23:26:11] <_joe_> so I don't know if it would create further issues
[23:26:19] <_joe_> hence the need for the phone call
[23:26:48] <_joe_> or if you know if we can divert a slave, or if this is ok, no need to call
[23:26:50] volans|off: another one crashed like that this morning iirc, the error is only visible via the idrac web interface
[23:26:50] https://phabricator.wikimedia.org/T193325#4166113
[23:26:51] surely don't add another, it should be ok with just one for now
[23:26:54] (03CR) 10jerkins-bot: [V: 04-1] Depool db1098, crashed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429642 (owner: 10Giuseppe Lavagetto)
[23:27:20] <_joe_> damn linter
[23:31:32] (03PS2) 10Giuseppe Lavagetto: Depool db1098, crashed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429642
[23:31:58] <_joe_> volans|off: can you check the patch?
[23:32:01] sure
[23:32:55] <_joe_> It's well past 1 AM, I was about to go to bed
[23:33:22] _joe_: missing one, line 459, but I can take care of it if you have to go
[23:33:32] <_joe_> no don't worry
[23:34:22] (03PS3) 10Giuseppe Lavagetto: Depool db1098, crashed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429642
[23:34:45] <_joe_> done
[23:35:03] (03CR) 10Volans: [C: 031] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429642 (owner: 10Giuseppe Lavagetto)
[23:35:18] looks good, we can add a comment with a task later on ;)
[23:35:20] <_joe_> https://cdn.meme.am/cache/instances/folder46/65289046.jpg
[23:35:49] <_joe_> very fittingly, that jpeg is broken
[23:36:00] rotfl
[23:36:05] their db broken too? :D
[23:36:18] (03CR) 10Giuseppe Lavagetto: [C: 032] Depool db1098, crashed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429642 (owner: 10Giuseppe Lavagetto)
[23:36:45] <_joe_> volans|off: can you do the deploy? I have to go feed my cat or she'll wake up everyone in the house
[23:37:26] sure
[23:37:32] very useful error so far: 'The Intel Management Engine has reported an internal system error.'
[23:39:41] (03CR) 10jenkins-bot: Depool db1098, crashed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429642 (owner: 10Giuseppe Lavagetto)
[23:41:27] <_joe_> hey I'm back, are you already deploying?
[23:41:30] yes
[23:41:34] scap running
[23:42:06] !log volans@tin Synchronized wmf-config/db-eqiad.php: Depool db1098 (crashed) (duration: 01m 01s)
[23:42:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:42:19] done
[23:43:14] <_joe_> fatals are down
[23:43:27] <_joe_> since the minute you deployed
[23:43:39] <_joe_> did you try to reach the DBAs for confirmation?
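[Editor's note] The depool step reviewed and deployed above ("just comment all it's lines", then the "missing one, line 459" catch, then scap syncing wmf-config/db-eqiad.php) can be sketched mechanically. The two-line excerpt below is invented, not the real db-eqiad.php; it only shows the shape of the edit: every weight entry naming the crashed host gets a PHP `//` comment prefix so the load balancer stops sending it traffic:

```shell
# Invented excerpt standing in for wmf-config/db-eqiad.php entries
# (host names and weights are illustrative).
cat > /tmp/db-eqiad-excerpt.php <<'EOF'
    'db1096' => 100,
    'db1098' => 1,
EOF
# Comment out every line referencing the crashed host, preserving indent.
sed -i "s|^\([[:space:]]*\)'db1098'|\1// 'db1098'|" /tmp/db-eqiad-excerpt.php
cat /tmp/db-eqiad-excerpt.php
```

The real change also had to cover the host's special-group entries (it was a recentchanges slave), which is why a second pass caught a missed line before the file was synced with scap.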
[23:47:14] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[23:48:00] <_joe_> machine stats for db1098 show absolutely nothing unusual before the crash
[23:50:13] <_joe_> interestingly, the crash happened around the time at which puppet runs
[23:51:10] <_joe_> but no, crash started around 23:12, according to syslog
[23:51:32] <_joe_> but of course there is no trace of anything in the logs around that time
[23:55:12] 10Operations, 10DBA: db1098 crashed and got rebooted - https://phabricator.wikimedia.org/T193331#4166255 (10Volans)
[23:57:13] 10Operations, 10DBA: db1098 crashed and got rebooted - https://phabricator.wikimedia.org/T193331#4166265 (10Volans) p:05Triage>03High
[23:59:25] PROBLEM - HTTP availability for Nginx -SSL terminators- on einsteinium is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[23:59:41] <_joe_> that's a bit late to the party
[23:59:46] <_joe_> the issue is long over