[00:04:43] !log wikitech-static-iad: mv /etc/acme/cert/wikitech-static-iad-signed.csr /etc/acme/cert/wikitech-static-iad.chained.crt ; wikitech-static-ord: copy wiki logo: /srv/mediawiki/images# wget https://wikitech-static-iad.wikimedia.org/w/images/labswiki.png
[00:04:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:12:54] PROBLEM - IPMI Temperature on sodium is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[00:13:30] (PS5) Dzahn: Fix typo on Gerrit downtime page [puppet] - https://gerrit.wikimedia.org/r/356477 (owner: Bartosz Dziewoński)
[00:14:32] (CR) Dzahn: [C: 2] Fix typo on Gerrit downtime page [puppet] - https://gerrit.wikimedia.org/r/356477 (owner: Bartosz Dziewoński)
[00:15:12] (PS5) Dzahn: Fix indentation of Gerrit downtime page [puppet] - https://gerrit.wikimedia.org/r/356478 (owner: Bartosz Dziewoński)
[00:16:20] (CR) Dzahn: [C: 2] Fix indentation of Gerrit downtime page [puppet] - https://gerrit.wikimedia.org/r/356478 (owner: Bartosz Dziewoński)
[00:21:55] (PS1) Dzahn: planet: remove "ja" and "ca" (empty), add link to new "el" [puppet] - https://gerrit.wikimedia.org/r/356977
[00:22:10] (PS1) Dzahn: remove "ja" and "ca" planet.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/356978
[00:23:46] (PS2) Dzahn: planet: remove "ja" and "ca" (empty), add link to new "el" [puppet] - https://gerrit.wikimedia.org/r/356977
[00:25:29] .. and out.. bye
[01:40:54] PROBLEM - IPMI Temperature on ocg1002 is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[01:42:54] PROBLEM - IPMI Temperature on sodium is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[02:24:03] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.2) (duration: 07m 47s)
[02:24:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:30:27] !log l10nupdate@tin ResourceLoader cache refresh completed at Sat Jun 3 02:30:27 UTC 2017 (duration 6m 24s)
[02:30:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:12:54] PROBLEM - IPMI Temperature on sodium is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[03:40:54] PROBLEM - IPMI Temperature on ocg1002 is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[04:11:14] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=952.00 Read Requests/Sec=579.70 Write Requests/Sec=2.40 KBytes Read/Sec=38664.00 KBytes_Written/Sec=61.20
[04:19:14] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=0.50 Read Requests/Sec=15.20 Write Requests/Sec=97.40 KBytes Read/Sec=128.00 KBytes_Written/Sec=673.60
[04:39:34] PROBLEM - Apache HTTP on mw1201 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.073 second response time
[04:40:34] RECOVERY - Apache HTTP on mw1201 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.107 second response time
[04:42:54] PROBLEM - IPMI Temperature on sodium is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[04:55:04] PROBLEM - Apache HTTP on mw1191 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:55:55] RECOVERY - Apache HTTP on mw1191 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.099 second response time
[04:57:04] PROBLEM - configured eth on db1089 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[04:57:14] PROBLEM - Disk space on db1089 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[04:57:14] PROBLEM - dhclient process on db1089 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:57:24] PROBLEM - Check size of conntrack table on db1089 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[04:57:24] PROBLEM - Check systemd state on db1089 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[04:57:24] PROBLEM - DPKG on db1089 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:57:27] PROBLEM - mysqld processes on db1089 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:57:34] PROBLEM - puppet last run on db1089 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:57:37] PROBLEM - MariaDB Slave IO: s1 on db1089 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:57:44] PROBLEM - Check whether ferm is active by checking the default input chain on db1089 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:57:44] PROBLEM - salt-minion processes on db1089 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[04:57:47] PROBLEM - MariaDB Slave SQL: s1 on db1089 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:57:57] PROBLEM - MariaDB disk space on db1089 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:59:05] PROBLEM - PyBal backends health check on lvs1009 is CRITICAL: PYBAL CRITICAL - apaches_80 - Could not depool server mw1177.eqiad.wmnet because of too many down!
[05:00:24] RECOVERY - DPKG on db1089 is OK: All packages OK
[05:00:24] RECOVERY - Check systemd state on db1089 is OK: OK - running: The system is fully operational
[05:00:24] RECOVERY - puppet last run on db1089 is OK: OK: Puppet is currently enabled, last run 10 minutes ago with 0 failures
[05:00:24] RECOVERY - Check size of conntrack table on db1089 is OK: OK: nf_conntrack is 0 % full
[05:00:34] RECOVERY - Check whether ferm is active by checking the default input chain on db1089 is OK: OK ferm input default policy is set
[05:00:44] RECOVERY - salt-minion processes on db1089 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[05:01:04] RECOVERY - configured eth on db1089 is OK: OK - interfaces up
[05:01:04] RECOVERY - PyBal backends health check on lvs1009 is OK: PYBAL OK - All pools are healthy
[05:01:14] RECOVERY - dhclient process on db1089 is OK: PROCS OK: 0 processes with command name dhclient
[05:02:34] ACKNOWLEDGEMENT - Host lvs1007 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T150256
[05:02:34] ACKNOWLEDGEMENT - Host lvs1008 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T150256
[05:02:44] what is going on?
[05:03:00] I get input/output error from even an uptime or a dmesg on there
[05:03:05] just got to the host
[05:03:20] yes
[05:03:22] it is broken
[05:03:26] going to depool it
[05:03:33] ok
[05:03:39] could not get to the host, but ACKed the 2 unrelated "host down" that distracted me too
[05:04:42] (PS1) Marostegui: db-eqiad.php: Depool db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/356984
[05:04:47] PROBLEM - MariaDB Slave Lag: s1 on db1089 is CRITICAL: CRITICAL slave_sql_lag could not connect
[05:04:48] can someone review it?
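For context on the 04:59 PyBal message: pool membership for the appserver backends is managed through conftool, and PyBal refuses to drop a backend once too many in the same pool are already down (its depool threshold). A minimal sketch of inspecting and changing a backend's pooled state with confctl; the selector syntax and host name here are illustrative assumptions, not commands taken from this incident:

    # Illustrative only: inspect and depool one appserver backend via conftool.
    sudo confctl select 'name=mw1177.eqiad.wmnet' get
    sudo confctl select 'name=mw1177.eqiad.wmnet' set/pooled=no
    # PyBal applies its depool threshold independently, so with "too many down"
    # it may keep routing to the backend until others in the pool recover.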
[05:05:02] ^
[05:06:16] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/356984 (owner: Marostegui)
[05:07:13] (Merged) jenkins-bot: db-eqiad.php: Depool db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/356984 (owner: Marostegui)
[05:07:21] (CR) jenkins-bot: db-eqiad.php: Depool db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/356984 (owner: Marostegui)
[05:08:17] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1059 - it is broken (duration: 00m 41s)
[05:08:21] ^ that is obviously db1089 i will edit SAL wiki ^
[05:08:25] i have depooled appservers but not db servers, but it's broken and the comment in there clearly says "broken ones should be removed"
[05:08:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:09:04] PROBLEM - Apache HTTP on mw1204 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.073 second response time
[05:09:14] PROBLEM - Nginx local proxy to apache on mw1204 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.154 second response time
[05:10:04] RECOVERY - Apache HTTP on mw1204 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.144 second response time
[05:10:14] RECOVERY - Nginx local proxy to apache on mw1204 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.180 second response time
[05:11:22] morning, you're late to the party
[05:12:35] I have created - T166933
[05:12:36] T166933: db1089 possibly HW issues - https://phabricator.wikimedia.org/T166933
[05:13:56] errors have decreased so I believe we are good
[05:14:04] let's see what db1089 ilo logs say
[05:16:06] [31587157.626167] sd 0:1:0:0: rejecting I/O to offline device
[05:16:19] yep
[05:18:23] Controller lockup detected .. scsi 0:1:0:0 Aborting command .. "scsi 0:1:0:0: FAILED to abort command" heh. the SSD is very dead
[05:18:26] I am going to reboot it I think
[05:18:32] I was going to suggest
[05:18:52] let's see if it still has controller issues after that
[05:18:54] or whatever it is
[05:19:06] depooled, disabled notifications
[05:19:09] https://phabricator.wikimedia.org/T166933#3312224
[05:19:12] everything working
[05:19:17] jynus: Yeah I depooled it :)
[05:19:20] I do not see anything else to do
[05:19:20] and downtimed it
[05:19:33] Let's give it a reboot and see...
[05:19:36] before breakfast
[05:19:51] haha
[05:20:09] wanted to log out but it just froze for me anyways
[05:20:35] !log Reboot db1089 - T166933
[05:20:41] you'll be disconnected anyways ;-)
[05:20:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:20:44] T166933: db1089 possibly HW issues - https://phabricator.wikimedia.org/T166933
[05:21:31] 1600 errors before automatic depool
[05:21:36] but that's all
[05:22:30] and probably most of those were retried
[05:22:39] The controller is dead
[05:22:43] ouch
[05:23:18] https://phabricator.wikimedia.org/T166933#3312227
[05:24:12] Oh wow, it continued booting up
[05:24:29] And the server is back up
[05:24:33] And looks healthy so far
[05:24:37] not going to start mysql yet
[05:25:03] going to check it a bit, and reboot it again
[05:26:53] really! wat
[05:27:33] did we have more cases where controllers "died" like this and was it always HP?
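The depool logged at 05:08 is a mediawiki-config change synced out with scap. A rough sketch of that workflow, assuming the 2017 deployment layout (staging copy on tin); the note about which entry to edit is inferred from the log, not taken from the actual Gerrit diff:

    # Sketch only: depool a replica via wmf-config/db-eqiad.php, on the
    # deployment host after change 356984 has been merged and pulled.
    cd /srv/mediawiki-staging
    # In wmf-config/db-eqiad.php, remove or zero-weight the db1089 entry in the
    # s1 replica list (the file's comment says "broken ones should be removed").
    scap sync-file wmf-config/db-eqiad.php "Depool db1089 - it is broken"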
[05:27:46] "A new network or storage device has been detected" [05:28:07] I think it could the the typical "I am going away for a moment" [05:29:04] A second reboot doesn't show anything yes [05:29:29] that happenend with a few of the higher db10XX [05:30:03] can we expect more erratic behavior from this host, jynus, or was it a one-off event? [05:30:06] in the past I mean [05:30:24] server is back with no issues [05:30:27] https://phabricator.wikimedia.org/T141601 [05:30:46] saw "sudo hpssacli controller all show" on another ticket but it shows no problems [05:30:53] "This looks like the last time, when the RAID controller got disconnected on another host." [05:31:06] amazing [05:31:09] "Smart Array P840 in Slot 1 " [05:32:02] I have started MySQL and all looks fine. I will not pool this server until Monday though [05:32:08] https://phabricator.wikimedia.org/T154031#2899292 [05:32:15] ticket where a controller card gets replaced on a HP https://phabricator.wikimedia.org/T150206 [05:32:30] +1, need no extra excitement on the weekend [05:32:49] haha [05:33:36] T154031: db2060 crashed (RAID controller) .. T140598: db2056 RAID controller (temporary) failure ... [05:33:36] T154031: db2060 crashed (RAID controller) - https://phabricator.wikimedia.org/T154031 [05:33:36] T140598: db2056 RAID controller (temporary) failure - https://phabricator.wikimedia.org/T140598 [05:34:02] yeah, that is the error I was referring to [05:34:22] it seems to happen 1 every around one year for each 50 nodes or so [05:34:48] wow thank you HP [05:35:02] *nod* it did seem kind of familiar and it's always HP and then they make p.apaul update the firmware [05:35:10] afair [05:35:24] altenatively [05:35:42] the new IPMI checks could be crashing the entire fleet [05:35:51] Anyways, as this is kinda solved, I am going to get some breakfast :) [05:35:56] jynus: that last sentence scared me!!! [05:35:58] enjoy! [05:37:45] I planned to deal with the laptop this morning (replace the internal hd instead of running off usb stick) [05:38:26] so now that things are back to quiet I'll get started on that... back in a few hours I guess, however long it takes to do the install, updates, transfer around all the data, etc etc [05:43:34] PROBLEM - Apache HTTP on mw1263 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.073 second response time [05:43:44] PROBLEM - HHVM rendering on mw1263 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.074 second response time [05:43:45] PROBLEM - Nginx local proxy to apache on mw1263 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.155 second response time [05:44:34] RECOVERY - Apache HTTP on mw1263 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.114 second response time [05:44:44] RECOVERY - HHVM rendering on mw1263 is OK: HTTP OK: HTTP/1.1 200 OK - 73606 bytes in 0.234 second response time [05:44:45] RECOVERY - Nginx local proxy to apache on mw1263 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.214 second response time [06:11:54] PROBLEM - IPMI Temperature on ms-be2014 is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds. [06:12:54] PROBLEM - IPMI Temperature on sodium is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds. 
[06:13:54] PROBLEM - Nginx local proxy to apache on mw1261 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 1.538 second response time
[06:14:04] PROBLEM - Apache HTTP on mw1261 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.079 second response time
[06:14:54] RECOVERY - Nginx local proxy to apache on mw1261 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.186 second response time
[06:15:04] RECOVERY - Apache HTTP on mw1261 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.105 second response time
[06:20:18] Jun 3 06:13:57 mw1261 systemd[1]: hhvm.service: main process exited, code=killed, status=11/SEGV
[06:22:09] same thing for mw1263
[06:22:32] and I don't see any coredump in /var/log/hhvm
[06:23:26] nor in /tmp
[06:23:50] moritzm: is there a task already opened for this specific appserver random SEGV with 3.18 ?
[06:24:04] (not sure if you already investigated and found that it is similar to luasandbox etc..)
[06:35:54] PROBLEM - nutcracker process on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:36:44] RECOVERY - nutcracker process on thumbor1001 is OK: PROCS OK: 1 process with UID = 115 (nutcracker), command name nutcracker
[06:39:31] nc_proxy.c:330 client connections 989 exceed limit 989 - Cc gilles,godog --^
[06:40:46] elukey: did it recover by itself or did you restart nutcracker?
[06:41:25] I guess it might have recovered by way of thumbor processes naturally dying
[06:42:52] nono didn't do anything
[06:43:27] filed a task for it https://phabricator.wikimedia.org/T166938
[06:44:15] I think it's related to https://phabricator.wikimedia.org/T105131, ulimits might need to be tuned
[06:44:20] will add the ref in the task
[06:45:27] interesting
[06:45:41] I am also seeing kernel: [835763.831338] rsvg-convert[35756]: segfault at 30 ip
[06:46:33] and the OOM killer acting
[06:46:36] that's expected
[06:46:37] on thumbor
[06:46:54] some images make things blow up, nothing we can do about it and that's why the limits are there
[06:47:01] ah yeah
[06:47:04] PROBLEM - Nginx local proxy to apache on mw1265 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.154 second response time
[06:47:34] PROBLEM - Apache HTTP on mw1265 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.073 second response time
[06:47:44] PROBLEM - Check the NTP synchronisation status of timesyncd on mw1265 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
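On the missing coredump at 06:22: whether a core file lands in /var/log/hhvm, /tmp, or anywhere at all depends on the kernel's core_pattern and the unit's core-size limit, so checking those usually explains it. A sketch, assuming a systemd host like the mw servers above; nothing here is from the actual investigation:

    journalctl -u hhvm.service | grep -i segv   # the crash systemd recorded at 06:13:57
    cat /proc/sys/kernel/core_pattern           # a literal path, or a pipe to a helper such as systemd-coredump
    systemctl show hhvm.service -p LimitCORE    # core size limit for the unit; 0 disables dumps entirely
    coredumpctl list hhvm 2>/dev/null || true   # only populated if systemd-coredump is the core_pattern helper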
[06:47:44] PROBLEM - HHVM rendering on mw1265 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.073 second response time
[06:48:05] RECOVERY - Nginx local proxy to apache on mw1265 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.188 second response time
[06:48:10] sigh
[06:48:14] PROBLEM - HHVM rendering on mw1264 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.074 second response time
[06:48:24] PROBLEM - Apache HTTP on mw1264 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.073 second response time
[06:48:35] RECOVERY - Apache HTTP on mw1265 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.107 second response time
[06:48:44] RECOVERY - HHVM rendering on mw1265 is OK: HTTP OK: HTTP/1.1 200 OK - 73289 bytes in 0.251 second response time
[06:49:04] PROBLEM - Nginx local proxy to apache on mw1264 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.154 second response time
[06:50:04] RECOVERY - Nginx local proxy to apache on mw1264 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.179 second response time
[06:50:15] RECOVERY - HHVM rendering on mw1264 is OK: HTTP OK: HTTP/1.1 200 OK - 73288 bytes in 0.233 second response time
[06:50:24] RECOVERY - Apache HTTP on mw1264 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.108 second response time
[06:51:36] gilles: ah we have "limit nofile 64000 64000" in /etc/init/nutcracker.override, but that one IIUC is only for upstart
[06:58:17] the same 1024/4096 is on all the mw servers
[06:58:33] I'll add a note to the task if it is the case to raise it in puppet
[07:00:39] makes sense, but would nutcracker create that many connections without demand? because a single thumbor process is single-threaded, it should need only as many connections as there are cores, max
[07:01:35] increasing the limit makes sense based on these other examples, but I worry it would mask an issue further. app servers don't really have a limit on how many requests they handle in parallel, having a high connection limit there makes sense
[07:02:37] sure sure, but 1024 is a bit low as a limit.. maybe we could think about a tunable that will adapt case by case
[07:04:02] in this case the limit was breached right after an OOM, so not really a normal use case scenario
[07:04:48] it doesn't seem a huge problem now, we can discuss it with Filippo on Monday.. I added all the info to the task so the discussion will re-start from there :)
[07:11:54] PROBLEM - IPMI Temperature on ms-be2014 is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[07:17:34] RECOVERY - Check the NTP synchronisation status of timesyncd on mw1265 is OK: OK: synced at Sat 2017-06-03 07:17:25 UTC.
[07:42:54] PROBLEM - IPMI Temperature on sodium is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[08:12:55] (PS2) Amire80: Set collation for Bashkir wikis to uppercase-ba [mediawiki-config] - https://gerrit.wikimedia.org/r/353099 (https://phabricator.wikimedia.org/T162823)
[08:16:25] (PS3) Amire80: Set collation for Bashkir wikis to uppercase-ba [mediawiki-config] - https://gerrit.wikimedia.org/r/353099 (https://phabricator.wikimedia.org/T162823)
[08:30:16] PROBLEM - IPMI Temperature on ms-be1021 is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
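On the nofile discussion at 06:51: the /etc/init/nutcracker.override file only takes effect where nutcracker is started by upstart, so on a systemd host the process can still be running with the default 1024/4096. A sketch of verifying what the running process actually got, plus an illustrative systemd drop-in that would raise it; the drop-in is an assumption and not the change that was eventually made in puppet:

    # What limit does the running nutcracker actually have?
    pid=$(pgrep -o -x nutcracker)
    grep 'Max open files' "/proc/${pid}/limits"

    # Illustrative drop-in mirroring the upstart override's 64000:
    sudo mkdir -p /etc/systemd/system/nutcracker.service.d
    printf '[Service]\nLimitNOFILE=64000\n' | sudo tee /etc/systemd/system/nutcracker.service.d/limits.conf
    sudo systemctl daemon-reload && sudo systemctl restart nutcracker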
[08:59:54] RECOVERY - IPMI Temperature on ms-be1021 is OK: Sensor Type(s) Temperature Status: OK
[09:12:54] PROBLEM - IPMI Temperature on sodium is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[10:25:44] PROBLEM - puppet last run on ms-be1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:25:44] PROBLEM - SSH on ms-be1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:25:54] PROBLEM - configured eth on ms-be1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:26:34] RECOVERY - puppet last run on ms-be1002 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[10:26:35] RECOVERY - SSH on ms-be1002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8 (protocol 2.0)
[10:26:44] RECOVERY - configured eth on ms-be1002 is OK: OK - interfaces up
[10:42:54] PROBLEM - IPMI Temperature on sodium is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[11:08:14] (PS1) Ema: check_ipmi_temp: set check timeout to 60 seconds [puppet] - https://gerrit.wikimedia.org/r/357010 (https://phabricator.wikimedia.org/T125205)
[12:12:54] PROBLEM - IPMI Temperature on sodium is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[12:40:54] PROBLEM - IPMI Temperature on ocg1002 is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[13:42:54] PROBLEM - IPMI Temperature on sodium is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[14:11:54] PROBLEM - IPMI Temperature on ms-be2014 is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[15:12:54] PROBLEM - IPMI Temperature on sodium is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[15:20:32] (PS3) Faidon Liambotis: ircecho: notify service on config change [puppet] - https://gerrit.wikimedia.org/r/356852 (owner: BryanDavis)
[15:20:34] (PS3) Faidon Liambotis: ircecho: notify service on config change [puppet] - https://gerrit.wikimedia.org/r/356852 (owner: BryanDavis)
[15:20:42] (CR) Faidon Liambotis: [C: 2] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/356852 (owner: BryanDavis)
[15:20:44] (CR) Faidon Liambotis: [C: 2] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/356852 (owner: BryanDavis)
[15:37:06] (PS1) Andrew Bogott: novastats: Add 'diskspace.py' script [puppet] - https://gerrit.wikimedia.org/r/357014 (https://phabricator.wikimedia.org/T163796)
[15:37:07] (PS1) Andrew Bogott: novastats: Add 'diskspace.py' script [puppet] - https://gerrit.wikimedia.org/r/357014 (https://phabricator.wikimedia.org/T163796)
[16:42:54] PROBLEM - IPMI Temperature on sodium is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[18:11:54] PROBLEM - IPMI Temperature on ms-be2014 is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[18:12:54] PROBLEM - IPMI Temperature on sodium is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[19:42:54] PROBLEM - IPMI Temperature on sodium is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[21:07:02] hey anyone around that can help reset 2fa?
[21:08:16] Chrissymad: welcome to the club i once was in need of the same
[21:10:02] nvm i figured it out :D
[21:12:54] PROBLEM - IPMI Temperature on sodium is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[21:19:34] PROBLEM - puppet last run on db1049 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[21:40:54] PROBLEM - IPMI Temperature on ocg1002 is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
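The 11:08 puppet change raises the NRPE timeout for the IPMI temperature check: the recurring "Socket timeout after 30 seconds" criticals above are the monitoring side giving up before the host's BMC answers. A sketch of reproducing the check by hand with the longer timeout; the plugin path and NRPE command name are assumptions based on the check output and the patch subject:

    # From the Icinga host, with the 60s timeout the patch proposes:
    /usr/lib/nagios/plugins/check_nrpe -H sodium -c check_ipmi_temp -t 60
    # On the target host, the underlying sensor read (a slow or wedged BMC is
    # what makes the default 30s NRPE timeout fire):
    sudo ipmi-sensors --sensor-types=Temperature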
[21:47:34] RECOVERY - puppet last run on db1049 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[21:48:44] PROBLEM - puppet last run on terbium is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[21:50:34] PROBLEM - configured eth on terbium is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[21:50:44] PROBLEM - Check whether ferm is active by checking the default input chain on terbium is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[21:50:44] PROBLEM - SSH on terbium is CRITICAL: Server answer
[21:50:44] PROBLEM - salt-minion processes on terbium is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[21:50:44] PROBLEM - DPKG on terbium is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[21:50:54] PROBLEM - nutcracker port on terbium is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[21:50:54] PROBLEM - Disk space on terbium is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[21:51:04] PROBLEM - HTTP-noc on terbium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:51:04] PROBLEM - dhclient process on terbium is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[21:51:04] PROBLEM - Check systemd state on terbium is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[21:51:14] PROBLEM - nutcracker process on terbium is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[21:51:15] PROBLEM - Check size of conntrack table on terbium is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[21:53:04] PROBLEM - Check the NTP synchronisation status of timesyncd on terbium is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[21:53:49] I forgot what terbium is used for.
[21:55:53] paladox: I've noticed IPMI or whatever errors being sent for 2 days now
[21:56:15] yep, that's temperature. But terbium seems to have a problem.
[21:56:36] ik
[21:57:24] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1932 bytes in 0.151 second response time
[21:57:37] paladox: wikitech.wikimedia.org/wiki/terbium
[21:58:57] thanks
[21:59:05] np
[22:06:24] PROBLEM - MegaRAID on terbium is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[22:06:25] ACKNOWLEDGEMENT - MegaRAID on terbium is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T166962
[22:06:29] Operations, ops-eqiad: Degraded RAID on terbium - https://phabricator.wikimedia.org/T166962#3313067 (ops-monitoring-bot)
[22:38:04] PROBLEM - IPMI Temperature on terbium is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[22:41:54] PROBLEM - IPMI Temperature on ms-be2014 is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[22:42:54] PROBLEM - IPMI Temperature on sodium is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
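"CHECK_NRPE: Error - Could not complete SSL handshake" across many checks at once, as on terbium above, usually means the NRPE daemon on the target is down, wedged, or too overloaded to answer rather than a certificate problem. A sketch of the first things one might check on the host, assuming the Debian nagios-nrpe-server packaging; commands are illustrative, not taken from the actual follow-up:

    systemctl status nagios-nrpe-server              # is the daemon running / restarting?
    ss -tlnp | grep ':5666'                          # is anything listening on the NRPE port?
    grep -E '^(allowed_hosts|server_address)' /etc/nagios/nrpe.cfg   # can the Icinga host still connect?
    uptime; dmesg | tail -n 20                       # handshake failures often just mean the box is overloaded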
[22:43:04] RECOVERY - HTTP-noc on terbium is OK: HTTP OK: HTTP/1.1 200 OK - 3829 bytes in 9.742 second response time
[22:43:04] RECOVERY - dhclient process on terbium is OK: PROCS OK: 0 processes with command name dhclient
[22:43:04] RECOVERY - Check systemd state on terbium is OK: OK - running: The system is fully operational
[22:43:14] RECOVERY - nutcracker process on terbium is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker
[22:43:14] RECOVERY - Check size of conntrack table on terbium is OK: OK: nf_conntrack is 0 % full
[22:43:34] RECOVERY - configured eth on terbium is OK: OK - interfaces up
[22:46:04] PROBLEM - dhclient process on terbium is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[22:46:04] PROBLEM - Check systemd state on terbium is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[22:46:14] PROBLEM - nutcracker process on terbium is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[22:46:15] PROBLEM - Check size of conntrack table on terbium is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[22:46:34] PROBLEM - configured eth on terbium is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[23:27:34] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[23:32:14] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[23:36:14] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[23:36:34] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]