[01:36:30] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0]
[01:38:40] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[01:39:40] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[01:39:50] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0]
[01:45:30] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[01:46:50] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[01:47:40] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[01:48:40] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[02:19:47] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.4) (duration: 07m 37s)
[02:19:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:25:53] !log l10nupdate@tin ResourceLoader cache refresh completed at Sun Jun 11 02:25:53 UTC 2017 (duration 6m 6s)
[02:26:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:23:45] Operations, Labs, Striker, LDAP: Store Wikimedia unified account name (SUL) in LDAP directory - https://phabricator.wikimedia.org/T148048#3337599 (bd808) @faidon, @Volans, and I talked about this at the Vienna hackathon. Moving this data from #striker's local DB to LDAP would be a useful step in...
[04:15:40] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=5786.00 Read Requests/Sec=4644.80 Write Requests/Sec=13.00 KBytes Read/Sec=18748.40 KBytes_Written/Sec=4841.60
[04:22:40] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=6.40 Read Requests/Sec=1.30 Write Requests/Sec=2.20 KBytes Read/Sec=16.80 KBytes_Written/Sec=73.20
[06:47:40] PROBLEM - puppet last run on mw1260 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[scap]
[07:15:50] RECOVERY - puppet last run on mw1260 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[12:21:40] We had a performance alert yesterday and I redeployed the agent we use (it runs on AWS US East (N. Virginia)) and every run now complains about problems with the security certificate like in https://phabricator.wikimedia.org/T167572#3337802
[13:01:30] PROBLEM - Apache HTTP on mw1205 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:01:30] PROBLEM - HHVM rendering on mw1205 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:01:30] PROBLEM - Nginx local proxy to apache on mw1205 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:02:20] RECOVERY - Apache HTTP on mw1205 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 1.339 second response time
[13:02:20] RECOVERY - Nginx local proxy to apache on mw1205 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 615 bytes in 1.414 second response time
[13:02:20] RECOVERY - HHVM rendering on mw1205 is OK: HTTP OK: HTTP/1.1 200 OK - 74697 bytes in 2.220 second response time
[13:22:51] PROBLEM - salt-minion processes on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:23:40] RECOVERY - salt-minion processes on thumbor1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[14:14:54] !log executed cumin 'mw22[51-60].codfw.wmnet' 'find /var/log/hhvm/* -user root -exec chown www-data:www-data {} \;' to reduce cron-spam (new hosts added in March) - T146464
[14:15:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:15:04] T146464: hhvm root:adm owned log files cause failures for logrotate - https://phabricator.wikimedia.org/T146464
[14:17:40] Operations, User-Elukey: hhvm root:adm owned log files cause failures for logrotate - https://phabricator.wikimedia.org/T146464#3337980 (elukey) On mw2251 today: ``` root@mw2251:/var/log/hhvm# ls -lht total 24M -rw-r----- 1 www-data www-data 0 Jun 10 06:25 error.log -rw-r----- 1 www-data www-data...
[14:32:01] Operations, User-Elukey: hhvm root:adm owned log files cause failures for logrotate - https://phabricator.wikimedia.org/T146464#3337983 (elukey) A possible fix for this issue might be to add `FileCreateMode="0640" FileOwner="www-data" FileGroup="www-data"` to the HHVM rsyslog config. I suspect that until...
[14:43:30] PROBLEM - Apache HTTP on mw1205 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:43:30] PROBLEM - Nginx local proxy to apache on mw1205 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:43:30] PROBLEM - HHVM rendering on mw1205 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:44:20] RECOVERY - Apache HTTP on mw1205 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 1.190 second response time
[14:44:20] RECOVERY - Nginx local proxy to apache on mw1205 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 615 bytes in 1.254 second response time
[14:44:20] RECOVERY - HHVM rendering on mw1205 is OK: HTTP OK: HTTP/1.1 200 OK - 74769 bytes in 2.292 second response time
[15:11:30] PROBLEM - Apache HTTP on mw1205 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:11:30] PROBLEM - Nginx local proxy to apache on mw1205 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:11:30] PROBLEM - HHVM rendering on mw1205 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:12:20] RECOVERY - Nginx local proxy to apache on mw1205 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 615 bytes in 2.790 second response time
[15:12:20] RECOVERY - Apache HTTP on mw1205 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 2.751 second response time
[15:12:20] RECOVERY - HHVM rendering on mw1205 is OK: HTTP OK: HTTP/1.1 200 OK - 74707 bytes in 3.613 second response time
[15:29:00] PROBLEM - Apache HTTP on mw1192 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.076 second response time
[15:29:01] PROBLEM - Nginx local proxy to apache on mw1192 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.161 second response time
[15:30:00] RECOVERY - Apache HTTP on mw1192 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.541 second response time
[15:30:00] RECOVERY - Nginx local proxy to apache on mw1192 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.183 second response time
[15:32:31] this one was killed due to hhvm.service: main process exited, code=killed, status=11/SEGV
[15:32:38] moritzm: --^
[15:34:11] there is a stacktrace in /var/log/hhvm
[15:34:27] (not sure if it is already part of another bug report, mentioning it anyway)
[15:34:30] :)
[15:43:27] Operations, DBA, Wikimedia-Site-requests: Renaming Neoalpha: supervision needed - https://phabricator.wikimedia.org/T167597#3338004 (revi)
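The 14:32 comment above proposes having rsyslog create the HHVM log files with the right ownership in the first place, rather than chowning them afterwards. A minimal sketch of those omfile settings is shown below; the snippet path and the programname filter are assumptions for illustration, not the actual puppet-managed config on the appservers:

```
# /etc/rsyslog.d/20-hhvm.conf (hypothetical path, illustration only)
# Create HHVM log files as www-data:www-data with mode 0640 so logrotate
# never runs into root-owned files.
if $programname == 'hhvm' then {
    action(type="omfile"
           file="/var/log/hhvm/error.log"
           fileOwner="www-data"
           fileGroup="www-data"
           fileCreateMode="0640")
    stop
}
```

With ownership handled at file-creation time, a one-off chown like the 14:14 cumin run would presumably only be needed for files that already exist on the affected hosts.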
[16:35:54] (PS1) Halfak: Adds require ::icinga::plugins to ::ores::base [puppet] - https://gerrit.wikimedia.org/r/358240
[17:27:25] (PS2) Zppix: Adds require ::icinga::plugins to ::ores::base [puppet] - https://gerrit.wikimedia.org/r/358240 (https://phabricator.wikimedia.org/T167602) (owner: Halfak)
[18:01:50] PROBLEM - Host scb2005 is DOWN: PING CRITICAL - Packet loss = 100%
[19:30:00] PROBLEM - salt-minion processes on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:30:00] PROBLEM - dhclient process on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:30:00] PROBLEM - nutcracker process on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:31:50] RECOVERY - salt-minion processes on thumbor1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[19:31:50] RECOVERY - dhclient process on thumbor1002 is OK: PROCS OK: 0 processes with command name dhclient
[19:31:50] RECOVERY - nutcracker process on thumbor1002 is OK: PROCS OK: 1 process with UID = 115 (nutcracker), command name nutcracker
[20:04:52] hey are any security folks around at the moment?
[20:37:54] Reedy: got a min?
[20:47:15] Zppix: email security@wikimedia.org or file a bug in phabricator
[20:48:28] (CR) Krinkle: [C: 1] For HHVM set LANG=C.UTF-8 [puppet] - https://gerrit.wikimedia.org/r/353228 (https://phabricator.wikimedia.org/T107128) (owner: Tim Starling)
[21:04:00] PROBLEM - Check Varnish expiry mailbox lag on cp1099 is CRITICAL: CRITICAL: expiry mailbox lag is 2006254
[22:04:00] RECOVERY - Check Varnish expiry mailbox lag on cp1099 is OK: OK: expiry mailbox lag is 47
[22:06:20] Zppix: Hmm?
[22:11:47] (PS1) Framawiki: planet: cleanup en_config.erb [puppet] - https://gerrit.wikimedia.org/r/358301
[22:24:49] Reedy: nevermind ill just email security@wikimedia.org
[22:51:09] (PS1) Odder: Add high-density logos for the Basque Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/358303 (https://phabricator.wikimedia.org/T150618)
[22:53:38] (PS2) Odder: Add high-density logos for the Basque Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/358303 (https://phabricator.wikimedia.org/T150618)
[23:10:06] (CR) Dereckson: [C: 1] planet: cleanup en_config.erb [puppet] - https://gerrit.wikimedia.org/r/358301 (owner: Framawiki)
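The 15:32 and 15:34 remarks above report an HHVM main process killed with SIGSEGV (status=11/SEGV) and a stacktrace left in /var/log/hhvm. A rough sketch of how one might confirm the same diagnosis on an affected appserver follows; the hostname and time window are placeholders, not taken from the log:

```
# Placeholders: substitute the affected host and the relevant time range.
ssh mw1192.eqiad.wmnet

# systemd records why the unit's main process exited
# (look for "code=killed, status=11/SEGV").
systemctl status hhvm.service
journalctl -u hhvm.service --since "2017-06-11 15:00" --until "2017-06-11 16:00"

# The newest crash stacktrace should be near the top of the log directory.
ls -lt /var/log/hhvm/ | head
```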