[00:40:02] 06Operations, 10Phabricator: Set up Yubikey support in Phabricator - https://phabricator.wikimedia.org/T134672#2272805 (10mmodell)
[01:10:24] 06Operations, 10Phabricator: Set up Yubikey support in Phabricator - https://phabricator.wikimedia.org/T134672#2272828 (10Krenair) Source makes several references to YubiCloud...
[01:34:21] 06Operations, 10Traffic, 10Wikidata: Varnish seems to sometimes mangle uncompressed API results - https://phabricator.wikimedia.org/T133866#2272839 (10MZMcBride) Copying @Anomie's comment from T132159: > While investigating T132123, I discovered that responses with lengths near to multiples of 32768 bytes w...
[01:39:37] 10Ops-Access-Reviews, 05acl*operations-team: ops access request (T123158) - https://phabricator.wikimedia.org/T123159#2272843 (10ori)
[02:25:34] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.22) (duration: 09m 52s)
[02:25:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:43:00] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.23) (duration: 08m 19s)
[02:43:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:52:46] !log l10nupdate@tin ResourceLoader cache refresh completed at Sun May 8 02:52:46 UTC 2016 (duration 9m 47s)
[02:52:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:35:46] PROBLEM - puppet last run on db2051 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:37:26] PROBLEM - puppet last run on mw2054 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:38:56] PROBLEM - puppet last run on furud is CRITICAL: CRITICAL: Puppet has 1 failures
[04:02:36] RECOVERY - puppet last run on mw2054 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:02:46] RECOVERY - puppet last run on db2051 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:06:07] RECOVERY - puppet last run on furud is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:54:17] PROBLEM - puppet last run on cp3031 is CRITICAL: CRITICAL: puppet fail
[05:16:38] PROBLEM - puppet last run on mw1221 is CRITICAL: CRITICAL: Puppet has 1 failures
[05:21:18] RECOVERY - puppet last run on cp3031 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[05:31:58] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[05:35:04] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[05:38:44] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[05:41:54] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[05:42:23] RECOVERY - puppet last run on mw1221 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:07:44] !log restarting elasticsearch server elastic1015.eqiad.wmnet (T110236)
[06:07:45] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236
[06:07:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:30:46] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 679 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5973297 keys - replication_delay is 679
[06:30:47] PROBLEM - puppet last run on mw2036 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:06] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:16] PROBLEM - puppet last run on mw2023 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:57] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:58] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:17] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:56] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:38:27] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5944375 keys - replication_delay is 0
[06:55:57] RECOVERY - puppet last run on mw2036 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[06:56:18] RECOVERY - puppet last run on mw2023 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[06:56:58] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[06:57:06] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[06:57:26] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[06:58:06] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:07] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:59:39] PROBLEM - tools-home on tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Not Available - 530 bytes in 0.034 second response time
[07:07:33] RECOVERY - tools-home on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 825023 bytes in 3.707 second response time
[07:10:38] PROBLEM - puppet last run on mw2059 is CRITICAL: CRITICAL: puppet fail
[07:20:58] PROBLEM - puppet last run on db2047 is CRITICAL: CRITICAL: puppet fail
[07:37:55] RECOVERY - puppet last run on mw2059 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[07:44:16] !log restarting elasticsearch server elastic1016.eqiad.wmnet (T110236)
[07:44:17] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236
[07:44:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:47:55] RECOVERY - puppet last run on db2047 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[08:07:07] PROBLEM - Apache HTTP on mw1251 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.009 second response time
[08:08:58] RECOVERY - Apache HTTP on mw1251 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.029 second response time
[08:30:08] PROBLEM - Disk space on pollux is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:30:27] PROBLEM - DPKG on pollux is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:30:38] PROBLEM - salt-minion processes on pollux is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:31:18] PROBLEM - configured eth on pollux is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:31:18] PROBLEM - RAID on pollux is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:31:28] PROBLEM - dhclient process on pollux is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:31:48] PROBLEM - Check size of conntrack table on pollux is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:31:48] PROBLEM - puppet last run on pollux is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:32:12] PROBLEM - Corp OIT LDAP Mirror on pollux is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:33:48] PROBLEM - wikidata.org dispatch lag is higher than 300s on wikidata is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1695 bytes in 0.222 second response time
[08:34:38] RECOVERY - salt-minion processes on pollux is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[08:35:17] RECOVERY - configured eth on pollux is OK: OK - interfaces up
[08:35:17] RECOVERY - RAID on pollux is OK: OK: no RAID installed
[08:35:27] RECOVERY - dhclient process on pollux is OK: PROCS OK: 0 processes with command name dhclient
[08:35:38] RECOVERY - Check size of conntrack table on pollux is OK: OK: nf_conntrack is 0 % full
[08:35:47] RECOVERY - puppet last run on pollux is OK: OK: Puppet is currently enabled, last run 28 minutes ago with 0 failures
[08:36:03] RECOVERY - Corp OIT LDAP Mirror on pollux is OK: LDAP OK - 0.117 seconds response time
[08:36:03] RECOVERY - Disk space on pollux is OK: DISK OK
[08:36:18] RECOVERY - DPKG on pollux is OK: All packages OK
[08:39:47] RECOVERY - wikidata.org dispatch lag is higher than 300s on wikidata is OK: HTTP OK: HTTP/1.1 200 OK - 1683 bytes in 0.200 second response time
[08:42:07] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479
[08:43:58] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5950584 keys - replication_delay is 0
[08:46:03] 06Operations, 10Mail, 10MediaWiki-Email: Wiki-Mail sent but never delivered - https://phabricator.wikimedia.org/T134674#2272981 (10MarcoAurelio) @01tonythomas Thank you again. My account is not live or yahoo. I'll wait for instructions from #operations then.
[09:59:57] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[10:09:47] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[12:52:03] PROBLEM - Disk space on elastic1016 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 80426 MB (15% inode=99%)
[13:00:04] PROBLEM - Apache HTTP on mw1017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:04:18] RECOVERY - Apache HTTP on mw1017 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 1.714 second response time
[13:04:28] PROBLEM - HHVM rendering on mw1017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:08:17] PROBLEM - Apache HTTP on mw1017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:10:07] RECOVERY - HHVM rendering on mw1017 is OK: HTTP OK: HTTP/1.1 200 OK - 67988 bytes in 0.483 second response time
[13:10:07] RECOVERY - Apache HTTP on mw1017 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 2.497 second response time
[13:12:38] PROBLEM - puppet last run on mw2029 is CRITICAL: CRITICAL: puppet fail
[13:17:53] (03PS1) 10BBlack: debug_proxy: allow all WMF networks [puppet] - 10https://gerrit.wikimedia.org/r/287440
[13:18:05] (03CR) 10BBlack: [C: 032 V: 032] debug_proxy: allow all WMF networks [puppet] - 10https://gerrit.wikimedia.org/r/287440 (owner: 10BBlack)
[13:40:09] RECOVERY - puppet last run on mw2029 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[14:09:41] valhallasw`cloud: around?
[14:11:56] Danny_B: yes
[14:11:56] 06Operations, 06Commons, 10MediaWiki-File-management, 06Multimedia, 07Regression: image magick stripping colour profile of PNG files [probably regression] - https://phabricator.wikimedia.org/T113123#2273285 (10Danny_B)
[14:12:24] valhallasw`cloud: i was impatient... ;-) but see query anyway, pls...
[14:42:10] PROBLEM - puppet last run on elastic1022 is CRITICAL: CRITICAL: puppet fail
[15:07:11] RECOVERY - puppet last run on elastic1022 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[15:39:10] (03PS1) 10Yuvipanda: k8s: Verify pause container is correct on each build [puppet] - 10https://gerrit.wikimedia.org/r/287444
[15:39:37] (03PS2) 10Yuvipanda: k8s: Verify pause container is correct on each build [puppet] - 10https://gerrit.wikimedia.org/r/287444
[15:41:32] (03CR) 10Yuvipanda: [C: 032] k8s: Verify pause container is correct on each build [puppet] - 10https://gerrit.wikimedia.org/r/287444 (owner: 10Yuvipanda)
[15:44:22] aaaa fucking stupid 'y' bug in bash. may bash 'scripts' in general fucking rot in hell
[15:44:51] (different hell than the one I'm going to)
[15:46:07] maybe they're already rotting in hell, and I'm in hell, which is why I'm encountering them.
[15:54:15] (03PS1) 10Yuvipanda: k8s: Fixup to previous commit [puppet] - 10https://gerrit.wikimedia.org/r/287445
[15:56:04] (03PS2) 10Yuvipanda: k8s: Fixup to previous commit [puppet] - 10https://gerrit.wikimedia.org/r/287445
[16:01:58] (03CR) 10Yuvipanda: [C: 032] k8s: Fixup to previous commit [puppet] - 10https://gerrit.wikimedia.org/r/287445 (owner: 10Yuvipanda)
[17:48:54] https://en.wikipedia.org/w/index.php?title=MediaWiki:Wdsearch-autodesc.js&action=raw&ctype=text/javascript doesn't seem to load
[17:49:42] i can't access whole enwiki
[17:51:57] hm, weird browser hiccup
[18:05:15] PROBLEM - aqs endpoints health on aqs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:09:54] PROBLEM - aqs endpoints health on aqs1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:10:55] RECOVERY - aqs endpoints health on aqs1001 is OK: All endpoints are healthy
[18:11:36] RECOVERY - aqs endpoints health on aqs1003 is OK: All endpoints are healthy
[18:58:30] (03PS1) 10Yuvipanda: k8s: Allow pod infra image to be overriden with hiera [puppet] - 10https://gerrit.wikimedia.org/r/287501 (https://phabricator.wikimedia.org/T133873)
[18:58:51] (03CR) 10jenkins-bot: [V: 04-1] k8s: Allow pod infra image to be overriden with hiera [puppet] - 10https://gerrit.wikimedia.org/r/287501 (https://phabricator.wikimedia.org/T133873) (owner: 10Yuvipanda)
[19:01:35] (03PS2) 10Yuvipanda: k8s: Allow pod infra image to be overriden with hiera [puppet] - 10https://gerrit.wikimedia.org/r/287501 (https://phabricator.wikimedia.org/T133873)
[19:09:44] (03CR) 10Yuvipanda: [C: 032] k8s: Allow pod infra image to be overriden with hiera [puppet] - 10https://gerrit.wikimedia.org/r/287501 (https://phabricator.wikimedia.org/T133873) (owner: 10Yuvipanda)
[19:10:32] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 711 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5997619 keys - replication_delay is 711
[19:20:01] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5971742 keys - replication_delay is 0
[19:39:23] Stupid 8 lagged databases making v.wp in read mode only once every 15 minutes... gah...
[19:39:27] sv*
[19:42:04] svwiki is on s2
[19:42:45] I don't see any significant lag
[19:43:34] AWB gets stuck once every 15 min. Sometimes for "slave database cathing up wit master" "Sv.wp is in read only mode" and 8 lagged databases.." something-something...
[19:43:37] a bit annoying...
[19:43:47] with*
[19:43:49] 8 lagged databases?
[19:43:57] That's what it said.
[19:43:59] next time it happens, give me the exact error message
[19:44:10] Sure thing
[20:09:42] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 643 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5978193 keys - replication_delay is 643
[20:16:11] PROBLEM - puppet last run on dbstore1001 is CRITICAL: CRITICAL: Puppet has 1 failures
[20:34:45] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5975869 keys - replication_delay is 0
[20:42:24] RECOVERY - puppet last run on dbstore1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[21:59:39] PROBLEM - Host pc2006 is DOWN: PING CRITICAL - Packet loss = 100%
[21:59:50] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 626 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5987493 keys - replication_delay is 626
[22:17:20] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5981316 keys - replication_delay is 0
[22:26:48] Hi, it seems that the grrrit-wm bot has quit IRC.
[22:26:48] * grrrit-wm has quit (Remote host closed the connection)
[22:26:53] and hasn't rejoined.