[00:04:43] !log wikitech-static-iad: mv /etc/acme/cert/wikitech-static-iad-signed.csr /etc/acme/cert/wikitech-static-iad.chained.crt ; wikitech-static-ord: copy wiki logo: /srv/mediawiki/images# wget https://wikitech-static-iad.wikimedia.org/w/images/labswiki.png
[00:04:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:12:54] PROBLEM - IPMI Temperature on sodium is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[00:13:30] (PS5) Dzahn: Fix typo on Gerrit downtime page [puppet] - https://gerrit.wikimedia.org/r/356477 (owner: Bartosz Dziewoński)
[00:14:32] (CR) Dzahn: [C: 2] Fix typo on Gerrit downtime page [puppet] - https://gerrit.wikimedia.org/r/356477 (owner: Bartosz Dziewoński)
[00:15:12] (PS5) Dzahn: Fix indentation of Gerrit downtime page [puppet] - https://gerrit.wikimedia.org/r/356478 (owner: Bartosz Dziewoński)
[00:16:20] (CR) Dzahn: [C: 2] Fix indentation of Gerrit downtime page [puppet] - https://gerrit.wikimedia.org/r/356478 (owner: Bartosz Dziewoński)
[00:21:55] (PS1) Dzahn: planet: remove "ja" and "ca" (empty), add link to new "el" [puppet] - https://gerrit.wikimedia.org/r/356977
[00:22:10] (PS1) Dzahn: remove "ja" and "ca" planet.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/356978
[00:23:46] (PS2) Dzahn: planet: remove "ja" and "ca" (empty), add link to new "el" [puppet] - https://gerrit.wikimedia.org/r/356977
[00:25:29] .. and out.. bye
[01:40:54] PROBLEM - IPMI Temperature on ocg1002 is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[01:42:54] PROBLEM - IPMI Temperature on sodium is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[02:24:03] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.2) (duration: 07m 47s)
[02:24:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:30:27] !log l10nupdate@tin ResourceLoader cache refresh completed at Sat Jun 3 02:30:27 UTC 2017 (duration 6m 24s)
[02:30:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:12:54] PROBLEM - IPMI Temperature on sodium is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[03:40:54] PROBLEM - IPMI Temperature on ocg1002 is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[04:11:14] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=952.00 Read Requests/Sec=579.70 Write Requests/Sec=2.40 KBytes Read/Sec=38664.00 KBytes_Written/Sec=61.20
[04:19:14] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=0.50 Read Requests/Sec=15.20 Write Requests/Sec=97.40 KBytes Read/Sec=128.00 KBytes_Written/Sec=673.60
[04:39:34] PROBLEM - Apache HTTP on mw1201 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.073 second response time
[04:40:34] RECOVERY - Apache HTTP on mw1201 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.107 second response time
[04:42:54] PROBLEM - IPMI Temperature on sodium is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[04:55:04] PROBLEM - Apache HTTP on mw1191 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:55:55] RECOVERY - Apache HTTP on mw1191 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.099 second response time
[04:57:04] PROBLEM - configured eth on db1089 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[04:57:14] PROBLEM - Disk space on db1089 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[04:57:14] PROBLEM - dhclient process on db1089 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:57:24] PROBLEM - Check size of conntrack table on db1089 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[04:57:24] PROBLEM - Check systemd state on db1089 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[04:57:24] PROBLEM - DPKG on db1089 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:57:27] PROBLEM - mysqld processes on db1089 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:57:34] PROBLEM - puppet last run on db1089 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:57:37] PROBLEM - MariaDB Slave IO: s1 on db1089 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:57:44] PROBLEM - Check whether ferm is active by checking the default input chain on db1089 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:57:44] PROBLEM - salt-minion processes on db1089 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[04:57:47] PROBLEM - MariaDB Slave SQL: s1 on db1089 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:57:57] PROBLEM - MariaDB disk space on db1089 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:59:05] PROBLEM - PyBal backends health check on lvs1009 is CRITICAL: PYBAL CRITICAL - apaches_80 - Could not depool server mw1177.eqiad.wmnet because of too many down!
[05:00:24] RECOVERY - DPKG on db1089 is OK: All packages OK
[05:00:24] RECOVERY - Check systemd state on db1089 is OK: OK - running: The system is fully operational
[05:00:24] RECOVERY - puppet last run on db1089 is OK: OK: Puppet is currently enabled, last run 10 minutes ago with 0 failures
[05:00:24] RECOVERY - Check size of conntrack table on db1089 is OK: OK: nf_conntrack is 0 % full
[05:00:34] RECOVERY - Check whether ferm is active by checking the default input chain on db1089 is OK: OK ferm input default policy is set
[05:00:44] RECOVERY - salt-minion processes on db1089 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[05:01:04] RECOVERY - configured eth on db1089 is OK: OK - interfaces up
[05:01:04] RECOVERY - PyBal backends health check on lvs1009 is OK: PYBAL OK - All pools are healthy
[05:01:14] RECOVERY - dhclient process on db1089 is OK: PROCS OK: 0 processes with command name dhclient
[05:02:34] ACKNOWLEDGEMENT - Host lvs1007 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T150256
[05:02:34] ACKNOWLEDGEMENT - Host lvs1008 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T150256
[05:02:44] what is going on?
[05:03:00] I get input/output error from even an uptime or a dmesg on there
[05:03:05] just got to the host
[05:03:20] yes
[05:03:22] it is broken
[05:03:26] going to depool it
[05:03:33] ok
[05:03:39] could not get to the host, but ACKed the 2 unrelated "host down" that distracted me too
[05:04:42] (PS1) Marostegui: db-eqiad.php: Depool db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/356984
[05:04:47] PROBLEM - MariaDB Slave Lag: s1 on db1089 is CRITICAL: CRITICAL slave_sql_lag could not connect
[05:04:48] can someone review it?
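For context on the 04:59 PyBal message: pool membership for the appserver backends is managed through conftool, and PyBal refuses to drop a backend once too many in the same pool are already down (its depool threshold). A minimal sketch of inspecting and changing a backend's pooled state with confctl; the selector syntax and host name here are illustrative assumptions, not commands taken from this incident:

    # Illustrative only: inspect and depool one appserver backend via conftool.
    sudo confctl select 'name=mw1177.eqiad.wmnet' get
    sudo confctl select 'name=mw1177.eqiad.wmnet' set/pooled=no
    # PyBal applies its depool threshold independently, so with "too many down"
    # it may keep routing to the backend until others in the pool recover.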
[05:05:02] ^
[05:06:16] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/356984 (owner: Marostegui)
[05:07:13] (Merged) jenkins-bot: db-eqiad.php: Depool db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/356984 (owner: Marostegui)
[05:07:21] (CR) jenkins-bot: db-eqiad.php: Depool db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/356984 (owner: Marostegui)
[05:08:17] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1059 - it is broken (duration: 00m 41s)
[05:08:21] ^ that is obviously db1089 i will edit SAL wiki ^
[05:08:25] i have depooled appservers but not db servers, but it's broken and the comment in there clearly says "broken ones should be removed"
[05:08:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:09:04] PROBLEM - Apache HTTP on mw1204 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.073 second response time
[05:09:14] PROBLEM - Nginx local proxy to apache on mw1204 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.154 second response time
[05:10:04] RECOVERY - Apache HTTP on mw1204 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.144 second response time
[05:10:14] RECOVERY - Nginx local proxy to apache on mw1204 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.180 second response time
[05:11:22] morning, you're late to the party
[05:12:35] I have created - T166933
[05:12:36] T166933: db1089 possibly HW issues - https://phabricator.wikimedia.org/T166933
[05:13:56] errors have decreased so I believe we are good
[05:14:04] let's see what db1089 ilo logs say
[05:16:06] [31587157.626167] sd 0:1:0:0: rejecting I/O to offline device
[05:16:19] yep
[05:18:23] Controller lockup detected .. scsi 0:1:0:0 Aborting command .. "scsi 0:1:0:0: FAILED to abort command" heh. the SSD is very dead
[05:18:26] I am going to reboot it I think
[05:18:32] I was going to suggest
[05:18:52] let's see if it still has controller issues after that
[05:18:54] or whatever it is
[05:19:06] depooled, disabled notifications
[05:19:09] https://phabricator.wikimedia.org/T166933#3312224
[05:19:12] everything working
[05:19:17] jynus: Yeah I depooled it :)
[05:19:20] I do not see anything else to do
[05:19:20] and downtimed it
[05:19:33] Let's give it a reboot and see...
[05:19:36] before breakfast
[05:19:51] haha
[05:20:09] wanted to log out but it just froze for me anyways
[05:20:35] !log Reboot db1089 - T166933
[05:20:41] you'll be disconnected anyways ;-)
[05:20:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:20:44] T166933: db1089 possibly HW issues - https://phabricator.wikimedia.org/T166933
[05:21:31] 1600 errors before automatic depool
[05:21:36] but that's all
[05:22:30] and probably most of those were retried
[05:22:39] The controller is dead
[05:22:43] ouch
[05:23:18] https://phabricator.wikimedia.org/T166933#3312227
[05:24:12] Oh wow, it continued booting up
[05:24:29] And the server is back up
[05:24:33] And looks healthy so far
[05:24:37] not going to start mysql yet
[05:25:03] going to check it a bit, and reboot it again
[05:26:53] really! wat
[05:27:33] did we have more cases where controllers "died" like this and was it always HP?
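The depool logged at 05:08 is a mediawiki-config change synced out with scap. A rough sketch of that workflow, assuming the 2017 deployment layout (staging copy on tin); the note about which entry to edit is inferred from the log, not taken from the actual Gerrit diff:

    # Sketch only: depool a replica via wmf-config/db-eqiad.php, on the
    # deployment host after change 356984 has been merged and pulled.
    cd /srv/mediawiki-staging
    # In wmf-config/db-eqiad.php, remove or zero-weight the db1089 entry in the
    # s1 replica list (the file's comment says "broken ones should be removed").
    scap sync-file wmf-config/db-eqiad.php "Depool db1089 - it is broken"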
[05:27:46] "A new network or storage device has been detected" [05:28:07] I think it could the the typical "I am going away for a moment" [05:29:04] A second reboot doesn't show anything yes [05:29:29] that happenend with a few of the higher db10XX [05:30:03] can we expect more erratic behavior from this host, jynus, or was it a one-off event? [05:30:06] in the past I mean [05:30:24] server is back with no issues [05:30:27] https://phabricator.wikimedia.org/T141601 [05:30:46] saw "sudo hpssacli controller all show" on another ticket but it shows no problems [05:30:53] "This looks like the last time, when the RAID controller got disconnected on another host." [05:31:06] amazing [05:31:09] "Smart Array P840 in Slot 1 " [05:32:02] I have started MySQL and all looks fine. I will not pool this server until Monday though [05:32:08] https://phabricator.wikimedia.org/T154031#2899292 [05:32:15] ticket where a controller card gets replaced on a HP https://phabricator.wikimedia.org/T150206 [05:32:30] +1, need no extra excitement on the weekend [05:32:49] haha [05:33:36] T154031: db2060 crashed (RAID controller) .. T140598: db2056 RAID controller (temporary) failure ... [05:33:36] T154031: db2060 crashed (RAID controller) - https://phabricator.wikimedia.org/T154031 [05:33:36] T140598: db2056 RAID controller (temporary) failure - https://phabricator.wikimedia.org/T140598 [05:34:02] yeah, that is the error I was referring to [05:34:22] it seems to happen 1 every around one year for each 50 nodes or so [05:34:48] wow thank you HP [05:35:02] *nod* it did seem kind of familiar and it's always HP and then they make p.apaul update the firmware [05:35:10] afair [05:35:24] altenatively [05:35:42] the new IPMI checks could be crashing the entire fleet [05:35:51] Anyways, as this is kinda solved, I am going to get some breakfast :) [05:35:56] jynus: that last sentence scared me!!! [05:35:58] enjoy! [05:37:45] I planned to deal with the laptop this morning (replace the internal hd instead of running off usb stick) [05:38:26] so now that things are back to quiet I'll get started on that... back in a few hours I guess, however long it takes to do the install, updates, transfer around all the data, etc etc [05:43:34] PROBLEM - Apache HTTP on mw1263 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.073 second response time [05:43:44] PROBLEM - HHVM rendering on mw1263 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.074 second response time [05:43:45] PROBLEM - Nginx local proxy to apache on mw1263 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.155 second response time [05:44:34] RECOVERY - Apache HTTP on mw1263 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.114 second response time [05:44:44] RECOVERY - HHVM rendering on mw1263 is OK: HTTP OK: HTTP/1.1 200 OK - 73606 bytes in 0.234 second response time [05:44:45] RECOVERY - Nginx local proxy to apache on mw1263 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.214 second response time [06:11:54] PROBLEM - IPMI Temperature on ms-be2014 is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds. [06:12:54] PROBLEM - IPMI Temperature on sodium is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds. 
[06:13:54] PROBLEM - Nginx local proxy to apache on mw1261 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 1.538 second response time
[06:14:04] PROBLEM - Apache HTTP on mw1261 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.079 second response time
[06:14:54] RECOVERY - Nginx local proxy to apache on mw1261 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.186 second response time
[06:15:04] RECOVERY - Apache HTTP on mw1261 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.105 second response time
[06:20:18] Jun 3 06:13:57 mw1261 systemd[1]: hhvm.service: main process exited, code=killed, status=11/SEGV
[06:22:09] same thing for mw1263
[06:22:32] and I don't see any coredump in /var/log/hhvm
[06:23:26] nor in /tmp
[06:23:50] moritzm: is there a task already opened for this specific appserver random SEGV with 3.18 ?
[06:24:04] (not sure if you already investigated and found that it is similar to luasandbox etc..)
[06:35:54] PROBLEM - nutcracker process on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:36:44] RECOVERY - nutcracker process on thumbor1001 is OK: PROCS OK: 1 process with UID = 115 (nutcracker), command name nutcracker
[06:39:31] nc_proxy.c:330 client connections 989 exceed limit 989 - Cc gilles,godog --^
[06:40:46] elukey: did it recover by itself or did you restart nutcracker?
[06:41:25] I guess it might have recovered by way of thumbor processes naturally dying
[06:42:52] nono didn't do anything
[06:43:27] filed a task for it https://phabricator.wikimedia.org/T166938
[06:44:15] I think it's related to https://phabricator.wikimedia.org/T105131, ulimits might need to be tuned
[06:44:20] will add the ref in the task
[06:45:27] interesting
[06:45:41] I am also seeing kernel: [835763.831338] rsvg-convert[35756]: segfault at 30 ip
[06:46:33] and the OOM killer acting
[06:46:36] that's expected
[06:46:37] on thumbor
[06:46:54] some images make things blow up, nothing we can do about it and that's why the limits are there
[06:47:01] ah yeah
[06:47:04] PROBLEM - Nginx local proxy to apache on mw1265 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.154 second response time
[06:47:34] PROBLEM - Apache HTTP on mw1265 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.073 second response time
[06:47:44] PROBLEM - Check the NTP synchronisation status of timesyncd on mw1265 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
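On the missing coredump at 06:22: whether a core file lands in /var/log/hhvm, /tmp, or anywhere at all depends on the kernel's core_pattern and the unit's core-size limit, so checking those usually explains it. A sketch, assuming a systemd host like the mw servers above; nothing here is from the actual investigation:

    journalctl -u hhvm.service | grep -i segv   # the crash systemd recorded at 06:13:57
    cat /proc/sys/kernel/core_pattern           # a literal path, or a pipe to a helper such as systemd-coredump
    systemctl show hhvm.service -p LimitCORE    # core size limit for the unit; 0 disables dumps entirely
    coredumpctl list hhvm 2>/dev/null || true   # only populated if systemd-coredump is the core_pattern helper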
[06:47:44] PROBLEM - HHVM rendering on mw1265 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.073 second response time
[06:48:05] RECOVERY - Nginx local proxy to apache on mw1265 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.188 second response time
[06:48:10] sigh
[06:48:14] PROBLEM - HHVM rendering on mw1264 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.074 second response time
[06:48:24] PROBLEM - Apache HTTP on mw1264 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.073 second response time
[06:48:35] RECOVERY - Apache HTTP on mw1265 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.107 second response time
[06:48:44] RECOVERY - HHVM rendering on mw1265 is OK: HTTP OK: HTTP/1.1 200 OK - 73289 bytes in 0.251 second response time
[06:49:04] PROBLEM - Nginx local proxy to apache on mw1264 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.154 second response time
[06:50:04] RECOVERY - Nginx local proxy to apache on mw1264 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.179 second response time
[06:50:15] RECOVERY - HHVM rendering on mw1264 is OK: HTTP OK: HTTP/1.1 200 OK - 73288 bytes in 0.233 second response time
[06:50:24] RECOVERY - Apache HTTP on mw1264 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.108 second response time
[06:51:36] gilles: ah we have "limit nofile 64000 64000" in /etc/init/nutcracker.override, but that one IIUC is only for upstart
[06:58:17] the same 1024/4096 is on all the mw servers
[06:58:33] I'll add a note to the task if it is the case to raise it in puppet
[07:00:39] makes sense, but would nutcracker create that many connections without demand? because a single thumbor process is single-threaded, it should need only as many connections as there are cores, max
[07:01:35] increasing the limit makes sense based on these other examples, but I worry it would mask an issue further. app servers don't really have a limit on how many requests they handle in parallel, having a high connection limit there makes sense
[07:02:37] sure sure, but 1024 is a bit low as a limit.. maybe we could think about a tunable that will adapt case by case
[07:04:02] in this case the limit was breached right after an OOM, so not really a normal use case scenario
[07:04:48] it doesn't seem a huge problem now, we can discuss it with Filippo on Monday.. I added all the info to the task so the discussion will re-start from there :)
[07:11:54] PROBLEM - IPMI Temperature on ms-be2014 is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[07:17:34] RECOVERY - Check the NTP synchronisation status of timesyncd on mw1265 is OK: OK: synced at Sat 2017-06-03 07:17:25 UTC.
[07:42:54] PROBLEM - IPMI Temperature on sodium is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[08:12:55] (PS2) Amire80: Set collation for Bashkir wikis to uppercase-ba [mediawiki-config] - https://gerrit.wikimedia.org/r/353099 (https://phabricator.wikimedia.org/T162823)
[08:16:25] (PS3) Amire80: Set collation for Bashkir wikis to uppercase-ba [mediawiki-config] - https://gerrit.wikimedia.org/r/353099 (https://phabricator.wikimedia.org/T162823)
[08:30:16] PROBLEM - IPMI Temperature on ms-be1021 is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
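On the nofile discussion at 06:51: the /etc/init/nutcracker.override file only takes effect where nutcracker is started by upstart, so on a systemd host the process can still be running with the default 1024/4096. A sketch of verifying what the running process actually got, plus an illustrative systemd drop-in that would raise it; the drop-in is an assumption and not the change that was eventually made in puppet:

    # What limit does the running nutcracker actually have?
    pid=$(pgrep -o -x nutcracker)
    grep 'Max open files' "/proc/${pid}/limits"

    # Illustrative drop-in mirroring the upstart override's 64000:
    sudo mkdir -p /etc/systemd/system/nutcracker.service.d
    printf '[Service]\nLimitNOFILE=64000\n' | sudo tee /etc/systemd/system/nutcracker.service.d/limits.conf
    sudo systemctl daemon-reload && sudo systemctl restart nutcracker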
[08:59:54] RECOVERY - IPMI Temperature on ms-be1021 is OK: Sensor Type(s) Temperature Status: OK
[09:12:54] PROBLEM - IPMI Temperature on sodium is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[10:25:44] PROBLEM - puppet last run on ms-be1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:25:44] PROBLEM - SSH on ms-be1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:25:54] PROBLEM - configured eth on ms-be1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:26:34] RECOVERY - puppet last run on ms-be1002 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[10:26:35] RECOVERY - SSH on ms-be1002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8 (protocol 2.0)
[10:26:44] RECOVERY - configured eth on ms-be1002 is OK: OK - interfaces up
[10:42:54] PROBLEM - IPMI Temperature on sodium is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[11:08:14] (PS1) Ema: check_ipmi_temp: set check timeout to 60 seconds [puppet] - https://gerrit.wikimedia.org/r/357010 (https://phabricator.wikimedia.org/T125205)
[12:12:54] PROBLEM - IPMI Temperature on sodium is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[12:40:54] PROBLEM - IPMI Temperature on ocg1002 is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[13:42:54] PROBLEM - IPMI Temperature on sodium is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[14:11:54] PROBLEM - IPMI Temperature on ms-be2014 is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[15:12:54] PROBLEM - IPMI Temperature on sodium is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[15:20:32] (PS3) Faidon Liambotis: ircecho: notify service on config change [puppet] - https://gerrit.wikimedia.org/r/356852 (owner: BryanDavis)
[15:20:34] (PS3) Faidon Liambotis: ircecho: notify service on config change [puppet] - https://gerrit.wikimedia.org/r/356852 (owner: BryanDavis)
[15:20:42] (CR) Faidon Liambotis: [C: 2] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/356852 (owner: BryanDavis)
[15:20:44] (CR) Faidon Liambotis: [C: 2] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/356852 (owner: BryanDavis)
[15:37:06] (PS1) Andrew Bogott: novastats: Add 'diskspace.py' script [puppet] - https://gerrit.wikimedia.org/r/357014 (https://phabricator.wikimedia.org/T163796)
[15:37:07] (PS1) Andrew Bogott: novastats: Add 'diskspace.py' script [puppet] - https://gerrit.wikimedia.org/r/357014 (https://phabricator.wikimedia.org/T163796)
[16:42:54] PROBLEM - IPMI Temperature on sodium is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[18:11:54] PROBLEM - IPMI Temperature on ms-be2014 is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[18:12:54] PROBLEM - IPMI Temperature on sodium is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[19:42:54] PROBLEM - IPMI Temperature on sodium is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[21:07:02] hey anyone around that can help reset 2fa?
[21:08:16] Chrissymad: welcome to the club i once was in need of the same
[21:10:02] nvm i figured it out :D
[21:12:54] PROBLEM - IPMI Temperature on sodium is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[21:19:34] PROBLEM - puppet last run on db1049 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[21:40:54] PROBLEM - IPMI Temperature on ocg1002 is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
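The 11:08 puppet change raises the NRPE timeout for the IPMI temperature check: the recurring "Socket timeout after 30 seconds" criticals above are the monitoring side giving up before the host's BMC answers. A sketch of reproducing the check by hand with the longer timeout; the plugin path and NRPE command name are assumptions based on the check output and the patch subject:

    # From the Icinga host, with the 60s timeout the patch proposes:
    /usr/lib/nagios/plugins/check_nrpe -H sodium -c check_ipmi_temp -t 60
    # On the target host, the underlying sensor read (a slow or wedged BMC is
    # what makes the default 30s NRPE timeout fire):
    sudo ipmi-sensors --sensor-types=Temperature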
[21:47:34] RECOVERY - puppet last run on db1049 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[21:48:44] PROBLEM - puppet last run on terbium is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[21:50:34] PROBLEM - configured eth on terbium is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[21:50:44] PROBLEM - Check whether ferm is active by checking the default input chain on terbium is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[21:50:44] PROBLEM - SSH on terbium is CRITICAL: Server answer
[21:50:44] PROBLEM - salt-minion processes on terbium is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[21:50:44] PROBLEM - DPKG on terbium is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[21:50:54] PROBLEM - nutcracker port on terbium is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[21:50:54] PROBLEM - Disk space on terbium is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[21:51:04] PROBLEM - HTTP-noc on terbium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:51:04] PROBLEM - dhclient process on terbium is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[21:51:04] PROBLEM - Check systemd state on terbium is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[21:51:14] PROBLEM - nutcracker process on terbium is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[21:51:15] PROBLEM - Check size of conntrack table on terbium is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[21:53:04] PROBLEM - Check the NTP synchronisation status of timesyncd on terbium is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[21:53:49] I forgot what terbium is used for.
[21:55:53] paladox: I've noticed IPMI or whatever errors being sent for 2 days now
[21:56:15] yep, that's temperature. But terbium seems to have a problem.
[21:56:36] ik
[21:57:24] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1932 bytes in 0.151 second response time
[21:57:37] paladox: wikitech.wikimedia.org/wiki/terbium
[21:58:57] thanks
[21:59:05] np
[22:06:24] PROBLEM - MegaRAID on terbium is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[22:06:25] ACKNOWLEDGEMENT - MegaRAID on terbium is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T166962
[22:06:29] Operations, ops-eqiad: Degraded RAID on terbium - https://phabricator.wikimedia.org/T166962#3313067 (ops-monitoring-bot)
[22:38:04] PROBLEM - IPMI Temperature on terbium is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[22:41:54] PROBLEM - IPMI Temperature on ms-be2014 is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[22:42:54] PROBLEM - IPMI Temperature on sodium is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
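"CHECK_NRPE: Error - Could not complete SSL handshake" across many checks at once, as on terbium above, usually means the NRPE daemon on the target is down, wedged, or too overloaded to answer rather than a certificate problem. A sketch of the first things one might check on the host, assuming the Debian nagios-nrpe-server packaging; commands are illustrative, not taken from the actual follow-up:

    systemctl status nagios-nrpe-server              # is the daemon running / restarting?
    ss -tlnp | grep ':5666'                          # is anything listening on the NRPE port?
    grep -E '^(allowed_hosts|server_address)' /etc/nagios/nrpe.cfg   # can the Icinga host still connect?
    uptime; dmesg | tail -n 20                       # handshake failures often just mean the box is overloaded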
[22:43:04] RECOVERY - HTTP-noc on terbium is OK: HTTP OK: HTTP/1.1 200 OK - 3829 bytes in 9.742 second response time
[22:43:04] RECOVERY - dhclient process on terbium is OK: PROCS OK: 0 processes with command name dhclient
[22:43:04] RECOVERY - Check systemd state on terbium is OK: OK - running: The system is fully operational
[22:43:14] RECOVERY - nutcracker process on terbium is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker
[22:43:14] RECOVERY - Check size of conntrack table on terbium is OK: OK: nf_conntrack is 0 % full
[22:43:34] RECOVERY - configured eth on terbium is OK: OK - interfaces up
[22:46:04] PROBLEM - dhclient process on terbium is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[22:46:04] PROBLEM - Check systemd state on terbium is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[22:46:14] PROBLEM - nutcracker process on terbium is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[22:46:15] PROBLEM - Check size of conntrack table on terbium is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[22:46:34] PROBLEM - configured eth on terbium is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[23:27:34] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[23:32:14] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[23:36:14] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[23:36:34] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]