[00:00:28] Nemo_bis: IMGTHING_0001.png
[00:01:22] :)
[01:04:48] PROBLEM - RAID on analytics1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:05:45] RECOVERY - RAID on analytics1009 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[01:09:05] PROBLEM - RAID on analytics1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:09:45] PROBLEM - RAID on analytics1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:10:05] RECOVERY - RAID on analytics1010 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[01:10:45] RECOVERY - RAID on analytics1009 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[01:21:45] RECOVERY - check_job_queue on hume is OK: JOBQUEUE OK - all job queues below 10,000
[01:24:55] PROBLEM - check_job_queue on hume is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:02:10] !log LocalisationUpdate failed: git pull of extensions failed
[02:02:36] Logged the message, Master
[03:01:12] RECOVERY - check_job_queue on fenari is OK: JOBQUEUE OK - all job queues below 10,000
[03:04:12] PROBLEM - check_job_queue on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:06:12] RECOVERY - check_job_queue on fenari is OK: JOBQUEUE OK - all job queues below 10,000
[03:09:12] PROBLEM - check_job_queue on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:20:52] RECOVERY - check_job_queue on hume is OK: JOBQUEUE OK - all job queues below 10,000
[03:23:52] PROBLEM - check_job_queue on hume is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:37:22] PROBLEM - MySQL disk space on db1044 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[03:37:32] PROBLEM - SSH on db1044 is CRITICAL: Server answer:
[03:37:32] PROBLEM - Disk space on db1044 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[03:37:32] PROBLEM - DPKG on db1044 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[03:37:52] RECOVERY - check_job_queue on hume is OK: JOBQUEUE OK - all job queues below 10,000
[03:38:02] PROBLEM - RAID on db1044 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[03:40:52] PROBLEM - check_job_queue on hume is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:41:32] RECOVERY - SSH on db1044 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0)
[04:15:33] another localization update fail?
[04:18:19] RECOVERY - MySQL disk space on db1044 is OK: DISK OK
[04:19:09] PROBLEM - RAID on db1044 is CRITICAL: NRPE: Call to popen() failed
[04:21:19] PROBLEM - MySQL disk space on db1044 is CRITICAL: NRPE: Call to popen() failed
[04:24:09] RECOVERY - RAID on db1044 is OK: OK: State is Optimal, checked 1 logical drive(s), 2 physical drive(s)
[04:24:19] RECOVERY - MySQL disk space on db1044 is OK: DISK OK
[04:24:39] RECOVERY - Disk space on db1044 is OK: DISK OK
[04:24:39] RECOVERY - DPKG on db1044 is OK: All packages OK
[04:34:46] greg-g: that's usually the kind of thing that is broken permanently until someone intervenes
[04:34:49] i think
[04:35:35] probably
[04:51:09] RECOVERY - check_job_queue on fenari is OK: JOBQUEUE OK - all job queues below 10,000
[04:54:09] PROBLEM - check_job_queue on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:57:49] RECOVERY - check_job_queue on hume is OK: JOBQUEUE OK - all job queues below 10,000
[05:00:59] PROBLEM - check_job_queue on hume is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
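The !log entry at 02:02 above ("LocalisationUpdate failed: git pull of extensions failed") and the remark at 04:34 that such failures usually stay broken until someone intervenes point at a manual check of the extensions checkout. A minimal sketch of that check, using only generic git; the path below is a placeholder, since the log does not say where LocalisationUpdate keeps its working copy:

    # Placeholder path -- the log does not name the checkout location
    cd /path/to/extensions
    # A dirty working copy or an unreachable remote are the usual reasons a cron'd git pull fails
    git status --short
    git fetch origin 2>&1 | tail -5
    # If the fetch succeeds, a fast-forward-only pull avoids silently creating merge commits
    git pull --ff-only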
[05:02:49] PROBLEM - Host wikidata-lb.pmtpa.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:860:ed1a::12
[05:02:59] RECOVERY - Host wikidata-lb.pmtpa.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 35.44 ms
[05:10:04] RECOVERY - check_job_queue on fenari is OK: JOBQUEUE OK - all job queues below 10,000
[05:13:14] PROBLEM - check_job_queue on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:17:45] RECOVERY - check_job_queue on hume is OK: JOBQUEUE OK - all job queues below 10,000
[05:20:54] PROBLEM - check_job_queue on hume is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:30:04] RECOVERY - check_job_queue on fenari is OK: JOBQUEUE OK - all job queues below 10,000
[05:33:14] PROBLEM - check_job_queue on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:58:34] PROBLEM - Disk space on wtp1018 is CRITICAL: DISK CRITICAL - free space: / 356 MB (3% inode=76%):
[05:58:54] RECOVERY - check_job_queue on hume is OK: JOBQUEUE OK - all job queues below 10,000
[06:01:54] PROBLEM - check_job_queue on hume is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:02:54] RECOVERY - check_job_queue on hume is OK: JOBQUEUE OK - all job queues below 10,000
[06:05:48] PROBLEM - check_job_queue on hume is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:09:38] PROBLEM - Disk space on wtp1018 is CRITICAL: DISK CRITICAL - free space: / 355 MB (3% inode=76%):
[06:34:48] PROBLEM - HTTP on kaulen is CRITICAL: Connection refused
[06:47:39] PROBLEM - Disk space on wtp1018 is CRITICAL: DISK CRITICAL - free space: / 316 MB (3% inode=76%):
[07:24:42] RECOVERY - HTTP on kaulen is OK: HTTP OK: HTTP/1.1 302 Found - 489 bytes in 0.055 second response time
[07:26:22] !log kaulen's apache was down (killed by oom killer) & refused to start. copied over old config per #5011, restarted apache, & disabled puppet for now.
[07:26:35] Logged the message, Master
[07:28:09] ^ mutante
[07:28:12] * ori-l sleeps
[07:55:22] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:56:12] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 1 logical drive(s), 4 physical drive(s)
[08:11:37] RECOVERY - check_job_queue on hume is OK: JOBQUEUE OK - all job queues below 10,000
[08:14:47] PROBLEM - check_job_queue on hume is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:45:44] Does this really exist? https://wikitech.wikimedia.org/wiki/Codereview-proxy.wikimedia.org
[08:54:13] it did, I know Reedy was looking to kill it a while back, but i'm not sure if that ever happened
[09:32:43] RECOVERY - check_job_queue on hume is OK: JOBQUEUE OK - all job queues below 10,000
[09:35:44] PROBLEM - check_job_queue on hume is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:44:27] Nemo_bis: Yes, it had to exist so the apaches could talk to svn
[09:44:42] It's actually seemingly unaccessible now (possibly apache changes on kaulen)
[09:44:51] It's on mutante's "to really kill" list
[09:47:27] inaccessible
[09:47:33] you're welcome
[09:47:49] * ori-l is annoying
[09:49:16] ori-l: especially as you do this while sleeping, apparently
[09:49:54] * Nemo_bis notes down sleep phases "2 h: REM phase with spelling corrections over IRC"
[09:50:17] Reedy: can you add an {{old}}? (I'm still locked out of my account)
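The !log entry at 07:26 above compresses a whole recovery into one sentence: confirm the OOM kill, work out why apache refused to start, put back a known-good config, and keep puppet from reverting it. A hedged sketch of what those steps typically look like on a Debian/Ubuntu apache2 host; the config path is a placeholder and the ticket referenced as "#5011" is left as-is:

    # Confirm that the kernel's OOM killer took apache down
    dmesg | grep -i 'out of memory\|killed process'
    # Find out why it refuses to start again; a syntax check names the offending file
    apache2ctl configtest
    # Put back a known-good config (placeholder path), then restart
    cp /root/known-good-apache.conf /etc/apache2/sites-enabled/   # placeholder path
    service apache2 restart
    # Stop puppet from immediately re-applying the broken config
    puppet agent --disable   # on older agents: puppetd --disable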
[09:52:08] http://www.thefreedictionary.com/Unaccessible
[09:53:37] Unacceptable
[09:59:08] illacceptable
[09:59:19] ill communication
[09:59:49] grrrit-wm seems sick too
[10:00:04] mark___: thar be bugs filed
[10:01:01] and gerrit is possibly the sick part
[10:01:47] or it might be the mail setup that gerrit communicates with
[10:07:50] !log Send traffic from Japan and Korea to ulsfo, https://gerrit.wikimedia.org/r/#/c/93309/
[10:08:08] Logged the message, Master
[10:22:34] mark: https://bugzilla.wikimedia.org/show_bug.cgi?id=56528
[10:24:08] as I have no idea how it works, I'm not gonna fix it on my sunday :)
[10:44:47] PROBLEM - NTP peers on dobson is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:45:38] RECOVERY - NTP peers on dobson is OK: NTP OK: Offset -0.000613 secs
[10:52:37] PROBLEM - Disk space on wtp1018 is CRITICAL: DISK CRITICAL - free space: / 356 MB (3% inode=76%):
[10:58:37] RECOVERY - check_job_queue on hume is OK: JOBQUEUE OK - all job queues below 10,000
[11:01:47] PROBLEM - check_job_queue on hume is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:10:01] * Vito gets tons of wikimedia errors on it.wiki
[11:56:37] RECOVERY - check_job_queue on hume is OK: JOBQUEUE OK - all job queues below 10,000
[11:56:47] PROBLEM - NTP peers on dobson is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:57:38] RECOVERY - NTP peers on dobson is OK: NTP OK: Offset -0.000143 secs
[11:59:47] PROBLEM - check_job_queue on hume is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:08:14] PROBLEM - Host mediawiki-lb.pmtpa.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:860:ed1a::8
[12:09:14] RECOVERY - Host mediawiki-lb.pmtpa.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 35.84 ms
[12:50:52] !log Activated BFD between cr2-eqiad and cr1-sdtpa link
[12:51:13] Logged the message, Master
[12:53:34] RECOVERY - check_job_queue on hume is OK: JOBQUEUE OK - all job queues below 10,000
[12:56:44] PROBLEM - check_job_queue on hume is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:09:35] PROBLEM - Disk space on wtp1018 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=76%):
[14:58:10] PROBLEM - Host wikipedia-lb.pmtpa.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:860:ed1a::1
[14:58:50] RECOVERY - Host wikipedia-lb.pmtpa.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 26.94 ms
[15:09:33] PROBLEM - Disk space on wtp1018 is CRITICAL: DISK CRITICAL - free space: / 326 MB (3% inode=76%):
[15:31:33] RECOVERY - check_job_queue on hume is OK: JOBQUEUE OK - all job queues below 10,000
[15:34:44] PROBLEM - check_job_queue on hume is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:45:33] PROBLEM - Puppet freshness on lvs1002 is CRITICAL: No successful Puppet run in the last 10 hours
[15:46:33] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours
[15:53:53] RECOVERY - check_job_queue on fenari is OK: JOBQUEUE OK - all job queues below 10,000
[15:57:03] PROBLEM - check_job_queue on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:09:40] PROBLEM - Disk space on wtp1018 is CRITICAL: DISK CRITICAL - free space: / 323 MB (3% inode=76%):
[16:15:11] Vito: you mean mediawiki errors?
[16:15:23] yep jeremyb
[16:15:31] though it then recovered
[16:15:40] Vito: well next time you should say what they are
[16:16:19] I'll do that if I see anyone near here :D
[16:23:29] jeremyb: Request: POST http://it.wikipedia.org/wiki/Speciale:FiltroAntiAbusi/123, from 208.80.152.85 via sq61.wikimedia.org (squid/2.7.STABLE9) to ()
[16:23:29] Error: ERR_CANNOT_FORWARD, errno [No Error] at Sun, 03 Nov 2013 16:23:12 GMT
[16:24:00] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:25:00] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 1 logical drive(s), 4 physical drive(s)
[16:29:22] Vito: and how long did it take to get that error?
[16:29:41] Vito: you only get it on that one page?
[16:29:50] sounds like maybe it's been like that for a while
[16:30:08] I got it three times on that page
[16:30:37] honestly I cannot remember if it was the same one I got ~2 hours ago while saving common wikitext
[16:33:03] hah, Nowiki inseriti da VisualEditor
[17:09:36] PROBLEM - Disk space on wtp1018 is CRITICAL: DISK CRITICAL - free space: / 320 MB (3% inode=76%):
[17:10:05] gwicke_away: RoanKattouw_away: ^^^
[17:10:12] James_F|Away: ^
[17:10:39] actually seems to be steady about there for a while. still should be fixed though
[17:10:39] is that parsoid?
[17:10:47] i think so
[17:11:06] no idea which are actually in use or not though
[17:11:48] ori-l: do we have metrics on how long it takes to process a move/delete? end-to-end as the user experiences it
[17:11:57] (see #-tech)
[17:12:01] bbl
[17:12:02] i don't think so, no
[17:12:17] do we have a metrics wishlist? :)
[17:26:35] jeremyb: yes but I don't think it's really used https://www.mediawiki.org/wiki/Analytics/Dreams
[17:27:06] (And I'm not complaining; I'd much rather have the existing stuff at stats.wikimedia.org kept functional. ;) )
[17:58:36] (PS5) Ori.livneh: Change output format of Gerrit review count gsql to JSON_SINGLE [operations/puppet] - https://gerrit.wikimedia.org/r/84743 (owner: QChris)
[18:00:45] (CR) Ori.livneh: [C: 2] "There's no sense in blocking this patch, so I'll merge it, but please look to replace this setup with something less precarious." [operations/puppet] - https://gerrit.wikimedia.org/r/84743 (owner: QChris)
[20:11:05] jeremyb:
[20:11:07] Bah.
[20:11:24] jeremyb: "Nowiki inseriti da VisualEditor" sounds like a local AbuseFilter error…
[20:14:25] James_F: no, re wtp. or did i get parsoid people wrong?
[20:14:55] 03 17:09:36 <+icinga-wm_> PROBLEM - Disk space on wtp1018 is CRITICAL: DISK CRITICAL - free space: / 320 MB (3% inode=76%):
[20:14:59] 03 17:10:39 < jeremyb> actually seems to be steady about there for a while. still should be fixed though
[20:15:01] jeremyb: Oh. Yeah, that looks bad. gwicke_away or RoanKattouw_away would be the people, yes. Are the rest of the WTPs OK?
[20:15:10] James_F: idk, can look
[20:16:13] James_F: some had a dramatic drop. https://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&m=disk_free&s=by+name&c=Parsoid+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4
[20:17:07] jeremyb: Oh dear. Not good.
[20:17:53] James_F: 1018 1008 1023 are the worst
[20:17:56] https://ganglia.wikimedia.org/latest/?r=4hr&cs=&ce=&m=disk_free&s=ascending&c=Parsoid+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4
[20:18:01] that last link is sorted
[20:18:05] jeremyb: *Probably* some local disc caching by Parsoid that isn't being cleaned up.
[20:18:07] * James_F nods.
[20:18:47] James_F: is wtp now only eqiad?
[20:18:55] jeremyb: Yes.
[20:18:58] (PS3) Andrew Bogott: Switch to using uwsgi for the proxy api [operations/puppet] - https://gerrit.wikimedia.org/r/92664
[20:19:20] k, good :)
[20:19:40] jeremyb: Clearly the latest deploy made things a lot worse, but it's not been great for a while.
[20:30:49] PROBLEM - MySQL Slave Running on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:30:49] PROBLEM - MySQL Recent Restart on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:32:39] RECOVERY - MySQL Slave Running on db1047 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error:
[20:32:39] RECOVERY - MySQL Recent Restart on db1047 is OK: OK 2178543 seconds since restart
[20:50:09] RECOVERY - check_job_queue on hume is OK: JOBQUEUE OK - all job queues below 10,000
[20:52:59] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:54:49] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 1 logical drive(s), 4 physical drive(s)
[20:56:19] PROBLEM - check_job_queue on hume is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:04:21] (PS1) Andrew Bogott: Move the nfs-noid upstart job into an 'nfs' directory. [operations/puppet] - https://gerrit.wikimedia.org/r/93407
[21:10:10] (CR) Andrew Bogott: [C: 2] Move the nfs-noid upstart job into an 'nfs' directory. [operations/puppet] - https://gerrit.wikimedia.org/r/93407 (owner: Andrew Bogott)
[21:54:45] PROBLEM - NTP peers on dobson is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:55:35] RECOVERY - NTP peers on dobson is OK: NTP OK: Offset 0.000509 secs
[22:12:34] PROBLEM - Host srv267 is DOWN: CRITICAL - Time to live exceeded (10.0.8.17)
[22:12:34] PROBLEM - Host srv290 is DOWN: CRITICAL - Time to live exceeded (10.0.8.40)
[22:12:34] PROBLEM - Host db45 is DOWN: CRITICAL - Time to live exceeded (10.0.6.55)
[22:12:44] PROBLEM - Host db48 is DOWN: PING CRITICAL - Packet loss = 100%
[22:12:44] PROBLEM - Host emery is DOWN: PING CRITICAL - Packet loss = 100%
[22:12:44] PROBLEM - Host db52 is DOWN: PING CRITICAL - Packet loss = 100%
[22:12:44] PROBLEM - Host formey is DOWN: PING CRITICAL - Packet loss = 100%
[22:12:44] PROBLEM - Host mw3 is DOWN: PING CRITICAL - Packet loss = 100%
[22:12:45] PROBLEM - Host virt10 is DOWN: PING CRITICAL - Packet loss = 100%
[22:13:05] RECOVERY - Host srv267 is UP: PING WARNING - Packet loss = 73%, RTA = 37.52 ms
[22:13:14] RECOVERY - Host db45 is UP: PING OK - Packet loss = 0%, RTA = 26.60 ms
[22:13:14] RECOVERY - Host formey is UP: PING OK - Packet loss = 0%, RTA = 26.64 ms
[22:13:14] RECOVERY - Host emery is UP: PING OK - Packet loss = 0%, RTA = 26.57 ms
[22:13:14] RECOVERY - Host srv290 is UP: PING OK - Packet loss = 0%, RTA = 26.58 ms
[22:13:14] RECOVERY - Host db52 is UP: PING OK - Packet loss = 0%, RTA = 26.61 ms
[22:13:15] RECOVERY - Host db48 is UP: PING OK - Packet loss = 0%, RTA = 26.62 ms
[22:13:15] RECOVERY - Host mw3 is UP: PING OK - Packet loss = 0%, RTA = 26.61 ms
[22:13:24] RECOVERY - Host virt10 is UP: PING OK - Packet loss = 0%, RTA = 26.69 ms
[22:16:28] i wonder what that was about...
[22:20:29] odd
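The burst of host-DOWN alerts at 22:12 reports "Time to live exceeded" rather than plain timeouts, and everything recovers within a minute; the log leaves the "i wonder what that was about" question open. As an aside: TTL exceeded on a ping usually means the packet was looping between routers until its hop count ran out, not that the host was down, so the usual confirmation is a traceroute during the event. A hedged sketch, using addresses taken straight from the alerts above:

    # "Time to live exceeded" on ping generally points at a routing loop, not a dead host
    ping -c 3 10.0.8.17
    # During the event, a traceroute would show the same pair of hops repeating
    traceroute -n 10.0.8.17
    # mtr gives the same view repeatedly, handy for a loop that only lasts seconds
    mtr -n -r -c 10 10.0.8.17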
[22:26:24] PROBLEM - Packetloss_Average on emery is CRITICAL: CRITICAL: packet_loss_average is 9.89072303279 (gt 8.0)
[22:30:24] RECOVERY - Packetloss_Average on emery is OK: OK: packet_loss_average is 2.2758216
[22:54:51] jeremyb, pmtpa asking nicely to be put to death
[23:02:24] PROBLEM - MySQL Slave Running on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:04:14] RECOVERY - MySQL Slave Running on db1047 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error:
[23:22:28] PROBLEM - MySQL Slave Running on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:22:48] PROBLEM - MySQL Recent Restart on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:26:18] RECOVERY - MySQL Slave Running on db1047 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error:
[23:27:38] RECOVERY - MySQL Recent Restart on db1047 is OK: OK 2189043 seconds since restart
[23:32:48] PROBLEM - MySQL Recent Restart on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:35:38] RECOVERY - MySQL Recent Restart on db1047 is OK: OK 2189523 seconds since restart
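Coming back to the wtp1018 disk-space thread from earlier in the day (17:09 through 20:19): the working theory there was "local disc caching by Parsoid that isn't being cleaned up", with wtp1018, wtp1008 and wtp1023 worst off in Ganglia's disk_free view. A minimal sketch of how one might confirm that on an affected host; the log never names the offending directory, so the du output itself has to identify it:

    # How bad is it right now on the root filesystem?
    df -h /
    # Largest directories, staying on one filesystem (-x); whatever Parsoid is
    # caching locally should surface near the top of this list.
    du -xh --max-depth=2 / 2>/dev/null | sort -rh | head -20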