[00:16:26] PROBLEM - Puppet freshness on cp1065 is CRITICAL: Last successful Puppet run was Fri 20 Dec 2013 09:08:06 PM UTC [00:21:32] (PS2) Sn1per: Fix en.wiktionary favicon for GCI 2013 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/103183 [00:26:14] (PS2) Ori.livneh: reprepro: import from elasticsearch/logstash apt [operations/puppet] - https://gerrit.wikimedia.org/r/103112 (owner: Faidon Liambotis) [00:28:29] (CR) Ori.livneh: [C: -1] "I agree that we should use upstream's logstash." (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/103112 (owner: Faidon Liambotis) [01:11:36] (CR) Physikerwelt: "any updates on the node modules?" [operations/puppet] - https://gerrit.wikimedia.org/r/90733 (owner: Physikerwelt) [01:25:01] (PS1) Yurik: Added carrier 436-04 to zero [operations/puppet] - https://gerrit.wikimedia.org/r/103198 [01:30:49] (PS1) Yurik: Zero: Keep things DRY - removed duplicate IDs [operations/puppet] - https://gerrit.wikimedia.org/r/103199 [01:47:46] PROBLEM - Host streber is DOWN: PING CRITICAL - Packet loss = 100% [01:50:16] RECOVERY - Host streber is UP: PING OK - Packet loss = 0%, RTA = 35.39 ms [01:52:26] PROBLEM - Puppet freshness on analytics1021 is CRITICAL: Last successful Puppet run was Wed 18 Dec 2013 10:28:53 PM UTC [01:52:26] PROBLEM - Puppet freshness on analytics1022 is CRITICAL: Last successful Puppet run was Wed 18 Dec 2013 10:29:11 PM UTC [02:08:39] !log LocalisationUpdate completed (1.23wmf7) at Sun Dec 22 02:08:39 UTC 2013 [02:09:00] Logged the message, Master [02:15:27] !log LocalisationUpdate completed (1.23wmf8) at Sun Dec 22 02:15:27 UTC 2013 [02:15:43] Logged the message, Master [02:28:45] !log LocalisationUpdate ResourceLoader cache refresh completed at Sun Dec 22 02:28:45 UTC 2013 [02:28:59] Logged the message, Master [03:17:26] PROBLEM - Puppet freshness on cp1065 is CRITICAL: Last successful Puppet run was Fri 20 Dec 2013 09:08:06 PM UTC [03:24:31]
(PS3) Aaron Schulz: [WIP] Make scap transport CDB files via JSON [operations/puppet] - https://gerrit.wikimedia.org/r/103080 [03:28:12] (PS1) Aaron Schulz: Remove slow, per-MW, syntax check [operations/puppet] - https://gerrit.wikimedia.org/r/103202 [03:28:37] (PS2) Aaron Schulz: Remove slow, per-MW, syntax check [operations/puppet] - https://gerrit.wikimedia.org/r/103202 [03:46:49] (CR) TTO: "Hello there! Please do not include "for GCI 2013" in the commit message. The change itself has nothing to do with GCI, so this should not " [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/103183 (owner: Sn1per) [04:02:01] (PS3) Sn1per: Fix en.wiktionary favicon for GCI 2013 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/103183 [04:02:13] (PS4) Sn1per: Fix en.wiktionary favicon [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/103183 [04:48:17] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:49:07] RECOVERY - RAID on searchidx1001 is OK: OK: optimal, 1 logical, 4 physical [04:53:26] PROBLEM - Puppet freshness on analytics1021 is CRITICAL: Last successful Puppet run was Wed 18 Dec 2013 10:28:53 PM UTC [04:53:26] PROBLEM - Puppet freshness on analytics1022 is CRITICAL: Last successful Puppet run was Wed 18 Dec 2013 10:29:11 PM UTC [04:59:22] (CR) TTO: "Thanks, a lot better now." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/103183 (owner: Sn1per) [05:08:38] (PS1) Murfel: Update favicon spcom.ico [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/103203 [05:57:16] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:59:16] RECOVERY - RAID on searchidx1001 is OK: OK: optimal, 1 logical, 4 physical [06:18:26] PROBLEM - Puppet freshness on cp1065 is CRITICAL: Last successful Puppet run was Fri 20 Dec 2013 09:08:06 PM UTC [06:32:50] (PS3) Faidon Liambotis: reprepro: import from elasticsearch/logstash apt [operations/puppet] - https://gerrit.wikimedia.org/r/103112 [06:37:04] (CR) Faidon Liambotis: "(thanks Ori :)" [operations/puppet] - https://gerrit.wikimedia.org/r/103112 (owner: Faidon Liambotis) [07:04:09] (CR) Ori.livneh: [C: 1] reprepro: import from elasticsearch/logstash apt [operations/puppet] - https://gerrit.wikimedia.org/r/103112 (owner: Faidon Liambotis) [07:07:34] (CR) Qgil: [C: -1] "The size of the icon seems to be 48x55, not 48x48." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/103203 (owner: Murfel) [07:17:00] (CR) Qgil: "I'm not sure whether it matters the order of the layers. So far all the favicons had the 16x16 icons as first layer, then the 24x24 as sec" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/103183 (owner: Sn1per) [07:54:16] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:54:26] PROBLEM - Puppet freshness on analytics1021 is CRITICAL: Last successful Puppet run was Wed 18 Dec 2013 10:28:53 PM UTC [07:54:26] PROBLEM - Puppet freshness on analytics1022 is CRITICAL: Last successful Puppet run was Wed 18 Dec 2013 10:29:11 PM UTC [07:55:06] RECOVERY - RAID on searchidx1001 is OK: OK: optimal, 1 logical, 4 physical [08:12:16] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:14:16] RECOVERY - RAID on searchidx1001 is OK: OK: optimal, 1 logical, 4 physical [08:15:14] (PS3) Aaron Schulz: Remove slow per-branch syntax check [operations/puppet] - https://gerrit.wikimedia.org/r/103202 [08:19:16] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:20:07] RECOVERY - RAID on searchidx1001 is OK: OK: optimal, 1 logical, 4 physical [08:53:06] PROBLEM - MySQL Slave Delay on db69 is CRITICAL: CRIT replication delay 312 seconds [08:53:56] PROBLEM - MySQL Replication Heartbeat on db69 is CRITICAL: CRIT replication delay 351 seconds [08:57:56] RECOVERY - MySQL Replication Heartbeat on db69 is OK: OK replication delay -0 seconds [08:58:06] RECOVERY - MySQL Slave Delay on db69 is OK: OK replication delay 0 seconds [09:19:26] PROBLEM - Puppet freshness on cp1065 is CRITICAL: Last successful Puppet run was Fri 20 Dec 2013 09:08:06 PM UTC [09:28:26] PROBLEM - Puppet freshness on searchidx1001 is CRITICAL: Last successful Puppet run was Sun 22 Dec 2013 06:27:39 AM UTC [09:46:36] RECOVERY - Puppet freshness on searchidx1001 is OK: puppet ran at Sun Dec 22 09:46:34 UTC 2013 [10:00:41] (CR) Odder: [C: 1] "Oh, this isn't a first attempt -- we discussed the logo at length last night (UTC) :-) Well done!" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/103183 (owner: Sn1per) [10:07:33] (PS5) Odder: Fix en.wiktionary favicon [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/103183 (owner: Sn1per) [10:08:33] (CR) Odder: [C: 1] "In case the order of layers matters (I don't think it does, but still), I moved the 16px layer to the bottom." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/103183 (owner: Sn1per) [10:15:25] (PS6) Matanya: ldap : lint cleanup [operations/puppet] - https://gerrit.wikimedia.org/r/102629 [10:55:26] PROBLEM - Puppet freshness on analytics1021 is CRITICAL: Last successful Puppet run was Wed 18 Dec 2013 10:28:53 PM UTC [10:55:26] PROBLEM - Puppet freshness on analytics1022 is CRITICAL: Last successful Puppet run was Wed 18 Dec 2013 10:29:11 PM UTC [11:10:16] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:11:17] RECOVERY - RAID on searchidx1001 is OK: OK: optimal, 1 logical, 4 physical [11:51:16] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:52:16] RECOVERY - RAID on searchidx1001 is OK: OK: optimal, 1 logical, 4 physical [12:20:26] PROBLEM - Puppet freshness on cp1065 is CRITICAL: Last successful Puppet run was Fri 20 Dec 2013 09:08:06 PM UTC [12:47:26] PROBLEM - Puppet freshness on searchidx1001 is CRITICAL: Last successful Puppet run was Sun 22 Dec 2013 09:46:34 AM UTC [13:56:26] PROBLEM - Puppet freshness on analytics1021 is CRITICAL: Last successful Puppet run was Wed 18 Dec 2013 10:28:53 PM UTC [13:56:26] PROBLEM - Puppet freshness on analytics1022 is CRITICAL: Last successful Puppet run was Wed 18 Dec 2013 10:29:11 PM UTC [14:05:06] PROBLEM - MySQL Processlist on db1040 is CRITICAL: CRIT 0 unauthenticated, 0 locked, 0 copy to table, 75 statistics [14:07:06] RECOVERY - MySQL Processlist on db1040 is OK: OK 1 unauthenticated, 0 locked, 0 copy to table, 18 statistics [14:20:46] RECOVERY - Puppet freshness on searchidx1001 is OK: puppet ran at Sun Dec 22 14:20:39 UTC 2013 [15:21:26] PROBLEM - Puppet freshness on cp1065 is CRITICAL: Last successful Puppet run was Fri 20 Dec 2013 09:08:06 PM UTC [15:34:45] PROBLEM - Puppet freshness on mw1050 is CRITICAL: Last successful Puppet run was Sun 22 Dec 2013 03:29:34 PM UTC [15:36:45] PROBLEM - Puppet freshness on mw1050 is CRITICAL: Last successful Puppet run was Sun 22 Dec 2013 03:29:34 PM UTC [15:38:45] PROBLEM - Puppet freshness on mw1050 is CRITICAL: Last successful Puppet run was Sun 22 Dec 2013 03:29:34 PM UTC [15:40:45] PROBLEM - Puppet freshness on mw1050 is CRITICAL: Last successful Puppet run was Sun 22 Dec 2013 03:29:34 PM UTC [15:42:46] PROBLEM - Puppet freshness on mw1050 is CRITICAL: Last successful Puppet run was Sun 22 Dec 2013 03:29:34 PM UTC [15:44:45] PROBLEM - Puppet freshness on mw1050 is CRITICAL: Last successful Puppet run was Sun 22 Dec 2013 
03:29:34 PM UTC [15:46:45] PROBLEM - Puppet freshness on mw1050 is CRITICAL: Last successful Puppet run was Sun 22 Dec 2013 03:29:34 PM UTC [15:48:45] PROBLEM - Puppet freshness on mw1050 is CRITICAL: Last successful Puppet run was Sun 22 Dec 2013 03:29:34 PM UTC [15:50:45] PROBLEM - Puppet freshness on mw1050 is CRITICAL: Last successful Puppet run was Sun 22 Dec 2013 03:29:34 PM UTC [15:52:45] PROBLEM - Puppet freshness on mw1050 is CRITICAL: Last successful Puppet run was Sun 22 Dec 2013 03:29:34 PM UTC [15:54:46] PROBLEM - Puppet freshness on mw1050 is CRITICAL: Last successful Puppet run was Sun 22 Dec 2013 03:29:34 PM UTC [15:56:45] PROBLEM - Puppet freshness on mw1050 is CRITICAL: Last successful Puppet run was Sun 22 Dec 2013 03:29:34 PM UTC [15:58:45] PROBLEM - Puppet freshness on mw1050 is CRITICAL: Last successful Puppet run was Sun 22 Dec 2013 03:29:34 PM UTC [15:59:55] RECOVERY - Puppet freshness on mw1050 is OK: puppet ran at Sun Dec 22 15:59:49 UTC 2013 [16:01:45] PROBLEM - Puppet freshness on mw1050 is CRITICAL: Last successful Puppet run was Sun 22 Dec 2013 03:59:49 PM UTC [16:03:45] PROBLEM - Puppet freshness on mw1050 is CRITICAL: Last successful Puppet run was Sun 22 Dec 2013 03:59:49 PM UTC [16:29:16] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[16:30:18] RECOVERY - Puppet freshness on mw1050 is OK: puppet ran at Sun Dec 22 16:30:06 UTC 2013 [16:34:16] RECOVERY - RAID on searchidx1001 is OK: OK: optimal, 1 logical, 4 physical [16:34:52] (PS1) Murfel: Update favicon spcom.ico [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/103249 [16:44:21] (PS2) Dereckson: Use local Wiki.png for Central Kurdish Wikipedia [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/103184 (owner: Ebrahim) [16:48:16] (CR) Odder: [C: 1] Logo configuration for ckb.wikipedia [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/103184 (owner: Ebrahim) [16:52:16] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:53:16] PROBLEM - SSH on searchidx1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:53:16] PROBLEM - DPKG on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:54:17] RECOVERY - DPKG on searchidx1001 is OK: All packages OK [16:54:17] RECOVERY - SSH on searchidx1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [16:56:57] PROBLEM - Puppet freshness on analytics1021 is CRITICAL: Last successful Puppet run was Wed 18 Dec 2013 10:28:53 PM UTC [16:56:58] PROBLEM - Puppet freshness on analytics1022 is CRITICAL: Last successful Puppet run was Wed 18 Dec 2013 10:29:11 PM UTC [16:59:06] RECOVERY - RAID on searchidx1001 is OK: OK: optimal, 1 logical, 4 physical [17:19:54] (CR) Qgil: [C: -1] "This is a patch on top of your previous patch, which is still under review: https://gerrit.wikimedia.org/r/#/c/103203/1" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/103249 (owner: Murfel) [17:21:46] (CR) Qgil: [C: 1] Fix en.wiktionary favicon [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/103183 (owner: Sn1per) [18:21:56] PROBLEM - Puppet freshness on cp1065 is CRITICAL: Last successful Puppet run was Fri 20 Dec 2013 09:08:06 PM UTC
[18:58:26] RECOVERY - Puppet freshness on cp1065 is OK: puppet ran at Sun Dec 22 18:58:19 UTC 2013 [19:01:06] PROBLEM - Host cp1065 is DOWN: PING CRITICAL - Packet loss = 100% [19:13:16] PROBLEM - MySQL Slave Delay on db1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:13:16] PROBLEM - MySQL InnoDB on db1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:13:16] PROBLEM - MySQL Replication Heartbeat on db1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:13:26] PROBLEM - MySQL Recent Restart on db1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:14:06] PROBLEM - MySQL Slave Running on db1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:14:16] PROBLEM - DPKG on db1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:14:26] PROBLEM - puppet disabled on db1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:14:26] PROBLEM - RAID on db1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:14:36] PROBLEM - Disk space on db1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:14:46] PROBLEM - MySQL Idle Transactions on db1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:14:56] PROBLEM - SSH on db1050 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:14:57] PROBLEM - Full LVS Snapshot on db1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:15:06] PROBLEM - MySQL Processlist on db1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:15:06] PROBLEM - MySQL disk space on db1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:15:16] PROBLEM - mysqld processes on db1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[19:20:36] Request: GET http://he.wikipedia.org/wiki/%D7%9E%D7%99%D7%95%D7%97%D7%93:%D7%A9%D7%99%D7%A0%D7%95%D7%99%D7%99%D7%9D_%D7%90%D7%97%D7%A8%D7%95%D7%A0%D7%99%D7%9D, from 10.64.0.104 via cp1066 cp1066 ([10.64.0.103]:3128), Varnish XID 1809239396 [19:20:37] Forwarded for: 199.203.78.152, 91.198.174.104, 208.80.154.9, 10.64.0.104 [19:20:37] Error: 503, Service Unavailable at Sun, 22 Dec 2013 19:20:09 GMT [19:20:42] what is going on? [19:21:41] apergos ori-l ? [19:21:56] PROBLEM - Apache HTTP on mw1056 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:22:06] PROBLEM - Apache HTTP on mw1065 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:22:16] PROBLEM - Apache HTTP on mw1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:22:16] PROBLEM - Apache HTTP on mw1084 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:22:16] PROBLEM - Apache HTTP on mw1019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:22:26] PROBLEM - Apache HTTP on mw1108 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:22:26] PROBLEM - Apache HTTP on mw1080 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:22:26] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [19:22:26] PROBLEM - Apache HTTP on mw1172 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:22:26] PROBLEM - Apache HTTP on mw1078 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:22:27] PROBLEM - Apache HTTP on mw1098 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:22:27] PROBLEM - Apache HTTP on mw1066 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:22:28] PROBLEM - Apache HTTP on mw1026 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:22:28] PROBLEM - Apache HTTP on mw1054 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:22:29] PROBLEM - Apache HTTP on mw1072 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:22:29] PROBLEM - Apache HTTP on mw1031 
is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:22:30] PROBLEM - Apache HTTP on mw1046 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:22:46] PROBLEM - Apache HTTP on mw1109 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:22:46] PROBLEM - Apache HTTP on mw1060 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:22:46] crap [19:22:56] PROBLEM - Apache HTTP on mw1049 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:23:16] RECOVERY - Apache HTTP on mw1172 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.055 second response time [19:23:16] RECOVERY - Apache HTTP on mw1078 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.057 second response time [19:23:16] RECOVERY - Apache HTTP on mw1031 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.830 second response time [19:23:16] RECOVERY - Apache HTTP on mw1026 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.373 second response time [19:23:26] RECOVERY - Apache HTTP on mw1098 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.814 second response time [19:23:26] RECOVERY - Apache HTTP on mw1054 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.919 second response time [19:23:26] RECOVERY - Apache HTTP on mw1072 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.638 second response time [19:23:26] PROBLEM - Apache HTTP on mw1188 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:23:36] RECOVERY - Apache HTTP on mw1060 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.075 second response time [19:23:56] PROBLEM - Apache HTTP on mw1062 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:23:57] PROBLEM - Apache HTTP on mw1058 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:23:57] PROBLEM - Apache HTTP on mw1038 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:23:57] PROBLEM - Apache HTTP on mw1107 is CRITICAL: CRITICAL - 
Socket timeout after 10 seconds [19:23:57] PROBLEM - Apache HTTP on mw1083 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:24:06] PROBLEM - Apache HTTP on mw1178 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:24:07] PROBLEM - Apache HTTP on mw1161 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:24:07] PROBLEM - Apache HTTP on mw1018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:24:07] PROBLEM - Apache HTTP on mw1079 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:24:07] PROBLEM - Apache HTTP on mw1113 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:24:07] PROBLEM - Apache HTTP on mw1048 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:24:08] PROBLEM - Apache HTTP on mw1085 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:24:16] RECOVERY - Apache HTTP on mw1080 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.070 second response time [19:24:16] PROBLEM - Apache HTTP on mw1091 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:24:16] PROBLEM - Apache HTTP on mw1218 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:24:17] PROBLEM - Apache HTTP on mw1086 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:24:17] PROBLEM - Apache HTTP on mw1052 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:24:17] PROBLEM - Apache HTTP on mw1097 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:24:17] PROBLEM - Apache HTTP on mw1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:24:18] PROBLEM - Apache HTTP on mw1029 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:24:18] PROBLEM - Apache HTTP on mw1096 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:24:19] PROBLEM - Apache HTTP on mw1102 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:24:19] PROBLEM - Apache HTTP on mw1034 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:24:20] PROBLEM - Apache HTTP on mw1075 is 
CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:24:20] PROBLEM - Apache HTTP on mw1076 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:24:21] RECOVERY - Apache HTTP on mw1046 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.083 second response time [19:24:29] PROBLEM - Apache HTTP on mw1050 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:24:36] RECOVERY - Apache HTTP on mw1109 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.060 second response time [19:24:46] RECOVERY - Apache HTTP on mw1056 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.067 second response time [19:24:46] RECOVERY - Apache HTTP on mw1049 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.779 second response time [19:24:56] RECOVERY - Apache HTTP on mw1062 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.895 second response time [19:24:56] RECOVERY - Apache HTTP on mw1161 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.047 second response time [19:24:56] RECOVERY - Apache HTTP on mw1065 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.077 second response time [19:24:57] RECOVERY - Apache HTTP on mw1083 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.111 second response time [19:24:57] RECOVERY - Apache HTTP on mw1085 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.729 second response time [19:24:57] PROBLEM - Apache HTTP on mw1217 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:24:57] RECOVERY - Apache HTTP on mw1048 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.620 second response time [19:25:06] PROBLEM - Apache HTTP on mw1041 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:25:06] PROBLEM - Apache HTTP on mw1059 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:25:06] RECOVERY - Apache HTTP on mw1086 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.057 second 
response time [19:25:06] RECOVERY - Apache HTTP on mw1034 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.051 second response time [19:25:06] RECOVERY - Apache HTTP on mw1019 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.063 second response time [19:25:07] RECOVERY - Apache HTTP on mw1075 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.057 second response time [19:25:07] RECOVERY - Apache HTTP on mw1102 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.061 second response time [19:25:08] RECOVERY - Apache HTTP on mw1096 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.722 second response time [19:25:08] RECOVERY - Apache HTTP on mw1052 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.034 second response time [19:25:17] RECOVERY - Apache HTTP on mw1097 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.279 second response time [19:25:17] RECOVERY - Apache HTTP on mw1108 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.993 second response time [19:25:17] PROBLEM - Apache HTTP on mw1198 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:25:17] PROBLEM - Apache HTTP on mw1081 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:25:17] RECOVERY - Apache HTTP on mw1066 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.041 second response time [19:25:56] RECOVERY - Apache HTTP on mw1217 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.747 second response time [19:25:56] RECOVERY - Apache HTTP on mw1038 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.838 second response time [19:25:57] RECOVERY - Apache HTTP on mw1059 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.098 second response time [19:25:57] PROBLEM - Apache HTTP on mw1094 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:25:57] RECOVERY - Apache HTTP on mw1178 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 
808 bytes in 4.279 second response time [19:26:06] RECOVERY - Apache HTTP on mw1198 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.049 second response time [19:26:07] PROBLEM - Apache HTTP on mw1057 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:26:07] RECOVERY - Apache HTTP on mw1218 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.419 second response time [19:26:07] RECOVERY - Apache HTTP on mw1081 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.049 second response time [19:26:07] RECOVERY - Apache HTTP on mw1023 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.065 second response time [19:26:07] RECOVERY - Apache HTTP on mw1076 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.076 second response time [19:26:16] RECOVERY - Apache HTTP on mw1091 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.098 second response time [19:26:16] RECOVERY - Apache HTTP on mw1029 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.606 second response time [19:26:16] RECOVERY - Apache HTTP on mw1084 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.921 second response time [19:26:16] RECOVERY - Apache HTTP on mw1188 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.055 second response time [19:26:26] RECOVERY - Apache HTTP on mw1050 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.661 second response time [19:26:36] PROBLEM - NTP on db1050 is CRITICAL: NTP CRITICAL: No response from NTP server [19:26:46] RECOVERY - Apache HTTP on mw1107 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.059 second response time [19:26:46] RECOVERY - Apache HTTP on mw1094 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.069 second response time [19:26:56] RECOVERY - Apache HTTP on mw1079 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.070 second response time [19:27:06] RECOVERY - Apache HTTP on 
mw1113 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.374 second response time [19:27:16] !log powercycling db1050, inaccessible [19:27:32] Logged the message, Master [19:27:52] very not excited about that [19:28:06] RECOVERY - Apache HTTP on mw1057 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.112 second response time [19:28:16] RECOVERY - Apache HTTP on mw1022 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.234 second response time [19:28:43] need springle-away [19:28:55] Error: Request: GET http://www.mediawiki.org/wiki/Special:Watchlist, from 10.64.32.105 via cp1066 cp1066 ([10.64.0.103]:3128), Varnish XID 1809798691 Forwarded for: 88.13.89.62, 91.198.174.104, 208.80.154.9, 10.64.32.105 Error: 503, Service Unavailable at Sun, 22 Dec 2013 19:27:16 GMT [19:28:56] RECOVERY - Apache HTTP on mw1041 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.063 second response time [19:28:56] RECOVERY - Apache HTTP on mw1058 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.859 second response time [19:28:56] RECOVERY - Apache HTTP on mw1018 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.095 second response time [19:29:28] now loads fine [19:29:46] PROBLEM - Host db1050 is DOWN: PING CRITICAL - Packet loss = 100% [19:29:58] apergos: should i depool it? [19:30:02] db1050, that is.
[19:30:33] yes [19:31:15] we'll live without it, this does mean that vslow queries will hit some other host which sucks [19:31:21] (PS1) Ori.livneh: Depool db1050; unresponsive [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/103257 [19:31:22] but we can live with it for a half day [19:31:46] not sure that change got backported yet anyways [19:31:54] (CR) Ori.livneh: [C: 2 V: 2] Depool db1050; unresponsive [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/103257 (owner: Ori.livneh) [19:32:12] !log ori updated /a/common to {{Gerrit|If33894260}}: Depool db1050; unresponsive [19:32:29] Logged the message, Master [19:32:59] !log ori synchronized wmf-config/db-eqiad.php 'Depool db1050' [19:33:15] Logged the message, Master [19:33:36] if there are mw hosts not responding I want to hear about it [19:34:28] !log db1050 boot: "[ 65.625950] device-mapper: table: 252:3: snapshot: Snapshot cow pairing for exception table handover failed" [19:34:46] Logged the message, Master [19:34:56] RECOVERY - Host db1050 is UP: PING OK - Packet loss = 0%, RTA = 0.79 ms [19:35:22] !log db1050 boot: (during fsck) hung at "The disk drive for /a is not ready yet or not present. Continue waiting, S to skip.."
etc [19:35:26] RECOVERY - NTP on db1050 is OK: NTP OK: Offset -7.605552673e-05 secs [19:35:39] Logged the message, Master [19:35:47] apergos: [19:35:48] mutante: !log db1050 back up after skipping mount of failed /a [19:35:49] mutante: root@db1050:/a# mount [19:35:49] mutante: Nov 8 18:49:08 db1050 kernel: [ 65.435387] ACPI Error: No handler for Region [IPMI] (ffff880ffbc55240) [IPMI] (20110623/evregion-373) [19:35:51] mutante: Nov 8 18:49:08 db1050 kernel: [ 65.721927] device-mapper: table: 252:3: snapshot: Snapshot cow pairing for exception table handover failed [19:35:53] mutante: Nov 8 18:49:08 db1050 kernel: [ 65.731665] device-mapper: ioctl: error adding target to table [19:36:30] the snapshot message might not be something special [19:36:32] the hang is though [19:37:06] all right I'm going to skip this so it's up and can be poked at [19:37:16] RECOVERY - puppet disabled on db1050 is OK: OK [19:37:16] RECOVERY - RAID on db1050 is OK: OK: optimal, 1 logical, 2 physical [19:37:26] RECOVERY - Disk space on db1050 is OK: DISK OK [19:37:36] RECOVERY - MySQL Idle Transactions on db1050 is OK: OK longest blocking idle transaction sleeps for seconds [19:37:46] RECOVERY - SSH on db1050 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [19:37:46] RECOVERY - Full LVS Snapshot on db1050 is OK: OK no full LVM snapshot volumes [19:37:56] RECOVERY - MySQL Slave Running on db1050 is OK: OK replication [19:37:56] RECOVERY - MySQL disk space on db1050 is OK: DISK OK [19:38:06] RECOVERY - MySQL Slave Delay on db1050 is OK: OK replication delay 0 seconds [19:38:06] RECOVERY - DPKG on db1050 is OK: All packages OK [19:38:06] RECOVERY - MySQL Replication Heartbeat on db1050 is OK: OK replication delay 0 seconds [19:38:16] RECOVERY - MySQL Recent Restart on db1050 is OK: OK seconds since restart [19:39:27] that's gotta be lies [19:39:37] there's no mysql running :-D [19:47:46] well not getting a whole lot of traction, lvscan claims there is /dev/tank/data active, but
can't xfs_check it or fdisk it, either get 'invalid argument' or 'can't read first 512 bytes' [19:47:55] I might be missing something obvious (it's late for me) [19:48:59] ori-l: unless you have a bright idea I'll create a ticket and let it be [19:49:07] worst case we can rebuild it from one of the others [19:49:33] no bright ideas [19:49:39] ok, rt ticket coming [19:52:04] it looks better now apergos [19:52:28] well I didn't fix nothin [19:53:02] recovered by itself, anyways the one db can get fixed up later [19:57:56] PROBLEM - Puppet freshness on analytics1021 is CRITICAL: Last successful Puppet run was Wed 18 Dec 2013 10:28:53 PM UTC [19:57:56] PROBLEM - Puppet freshness on analytics1022 is CRITICAL: Last successful Puppet run was Wed 18 Dec 2013 10:29:11 PM UTC [19:58:17] ori-l: I expect it's not you but maybe you know: who disabled puppet on those? ^^ [19:58:44] the last two nights I meant to ask when sf folks were in the channel but was not around to do so [19:59:08] no idea [19:59:17] ok, well it was worth a shot [20:01:20] i think we might be suffering from the same problem reported here: http://www.redhat.com/archives/linux-lvm/2012-February/msg00043.html [20:01:40] the bug thread is [20:02:17] the guilty lvm rule cited in comment #7 (/lib/udev/rules.d/85-lvm2.rules) is present on db1050 [20:03:41] well vgscan returns right away with the correct answer so I dunno [20:04:11] vgchange seems to be ok [20:04:22] maybe I should have just tried that, seemed too simple [20:05:50] xfs_check said nothing [20:06:43] well I have /a mounted so I guess I'll update the ticket (mysql is going to need some error recovery but that should be it) [20:08:36] ori-l: if you want to add a pointer to the bug on the ticket anyways, it's https://rt.wikimedia.org/Ticket/Display.html?id=6542 [20:29:26] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [22:58:56] PROBLEM - Puppet freshness on analytics1021 is CRITICAL: Last successful Puppet run was Wed 18
Dec 2013 10:28:53 PM UTC [22:58:56] PROBLEM - Puppet freshness on analytics1022 is CRITICAL: Last successful Puppet run was Wed 18 Dec 2013 10:29:11 PM UTC
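
Note on reading the alert flood: the PROBLEM/RECOVERY entries in this log are easier to follow once each service's first PROBLEM is paired with its next RECOVERY on the same host. The following is a minimal Python sketch of that pairing, assuming the "[HH:MM:SS] PROBLEM - <service> on <host> is <STATE>:" entry shape seen above and timestamps that stay within one day. The names `ALERT_RE` and `outage_durations` are hypothetical and not part of any Wikimedia or Nagios tooling; host-level entries such as "Host db1050 is DOWN" are deliberately not matched.

```python
import re
from datetime import datetime

# Matches service alerts such as
# "[19:21:56] PROBLEM - Apache HTTP on mw1056 is CRITICAL: ..."
# The service name may not contain "[", "]" or ":" so that a match can
# never run across several entries that share one physical line.
ALERT_RE = re.compile(
    r"\[(\d{2}:\d{2}:\d{2})\] (PROBLEM|RECOVERY) - "
    r"([^\[\]:]+?) on (\S+) is (?:CRITICAL|WARNING|UNKNOWN|OK):"
)


def outage_durations(log_text):
    """Pair each service's first PROBLEM with its next RECOVERY.

    Returns a list of (service, host, duration) tuples in the order the
    recoveries appear. Assumes every timestamp falls on the same day.
    """
    open_since = {}   # (service, host) -> time of the first unresolved PROBLEM
    durations = []
    for stamp, kind, service, host in ALERT_RE.findall(log_text):
        when = datetime.strptime(stamp, "%H:%M:%S")
        key = (service, host)
        if kind == "PROBLEM":
            # Keep the earliest PROBLEM; repeated notifications are ignored.
            open_since.setdefault(key, when)
        elif key in open_since:
            durations.append((service, host, when - open_since.pop(key)))
    return durations
```

Run over the db69 entries from 08:53 to 08:58, this sketch would report a 4 minute outage for "MySQL Replication Heartbeat" and a 5 minute outage for "MySQL Slave Delay". Alerts that never recover within the log, such as the analytics1021 Puppet freshness checks, are simply left unpaired.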