[00:16:50] New patchset: Wpmirrordev; "fix typos in Makefile, mwxml2sql.c, and sqlfilter.c" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/65706 [00:16:57] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [00:17:57] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours [00:20:59] PROBLEM - Puppet freshness on cp1029 is CRITICAL: No successful Puppet run in the last 10 hours [00:50:07] PROBLEM - Puppet freshness on stat1002 is CRITICAL: No successful Puppet run in the last 10 hours [01:01:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:02:07] RECOVERY - NTP on ssl3003 is OK: NTP OK: Offset -0.0005847215652 secs [01:02:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [01:02:17] RECOVERY - NTP on ssl3002 is OK: NTP OK: Offset -0.001499414444 secs [01:26:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:27:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.131 second response time [01:32:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:33:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [01:56:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:57:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.133 second response time [01:57:43] PROBLEM - Disk space on mc15 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[01:58:26] RECOVERY - Disk space on mc15 is OK: DISK OK [02:01:26] !log LocalisationUpdate completed (1.22wmf4) at Tue May 28 02:01:26 UTC 2013 [02:01:42] Logged the message, Master [02:11:34] !log LocalisationUpdate completed (1.22wmf5) at Tue May 28 02:11:34 UTC 2013 [02:11:44] Logged the message, Master [02:17:19] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue May 28 02:17:19 UTC 2013 [02:17:29] Logged the message, Master [03:22:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:23:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time [03:56:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:57:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [04:10:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:11:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [04:12:02] PROBLEM - Puppet freshness on stat1 is CRITICAL: No successful Puppet run in the last 10 hours [04:17:03] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [04:17:03] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [04:17:03] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [04:19:03] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours [04:48:25] PROBLEM - DPKG on mw1171 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:48:35] PROBLEM - RAID on mw1171 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[04:48:35] PROBLEM - Disk space on mw1171 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:50:15] RECOVERY - DPKG on mw1171 is OK: All packages OK [04:50:27] RECOVERY - RAID on mw1171 is OK: OK: no RAID installed [04:50:27] RECOVERY - Disk space on mw1171 is OK: DISK OK [04:53:45] PROBLEM - SSH on mw1171 is CRITICAL: Server answer: [05:00:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:02:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [05:02:55] PROBLEM - NTP on ssl3003 is CRITICAL: NTP CRITICAL: No response from NTP server [05:03:46] PROBLEM - NTP on ssl3002 is CRITICAL: NTP CRITICAL: No response from NTP server [05:26:44] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/65689 [05:28:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:31:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time [06:33:36] PROBLEM - Host search23 is DOWN: PING CRITICAL - Packet loss = 100% [06:36:38] RECOVERY - Host search23 is UP: PING OK - Packet loss = 0%, RTA = 26.76 ms [06:38:48] PROBLEM - search indices - check lucene status page on search23 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:41:38] RECOVERY - search indices - check lucene status page on search23 is OK: HTTP OK: HTTP/1.1 200 OK - 269 bytes in 0.055 second response time [06:50:48] PROBLEM - NTP on search23 is CRITICAL: NTP CRITICAL: Offset unknown [06:55:48] RECOVERY - NTP on search23 is OK: NTP OK: Offset -0.001485109329 secs [07:01:29] !log search23 apparently rebooted itself, nothing of use in the logs nor in atop [07:01:38] Logged the message, Master [07:18:19] PROBLEM - SSH on lvs1001 is CRITICAL: Server answer: [07:20:19] RECOVERY - SSH on lvs1001 is OK: SSH OK - OpenSSH_5.9p1 
Debian-5ubuntu1.1 (protocol 2.0) [07:52:17] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:53:16] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s) [08:02:06] RECOVERY - NTP on ssl3003 is OK: NTP OK: Offset 0.001659393311 secs [08:16:17] PROBLEM - Puppet freshness on db1032 is CRITICAL: No successful Puppet run in the last 10 hours [08:22:17] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [08:22:17] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [08:22:17] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours [08:25:18] PROBLEM - Puppet freshness on pdf2 is CRITICAL: No successful Puppet run in the last 10 hours [08:25:18] PROBLEM - Puppet freshness on pdf1 is CRITICAL: No successful Puppet run in the last 10 hours [08:25:35] New review: Petrb; "I didn't notice you submit this, anyway I think we should merge this with the existing sql tool I wr..." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/64847 [08:29:37] New review: Petrb; "https://gerrit.wikimedia.org/r/#/c/65634/" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64847 [08:33:28] RECOVERY - NTP on ssl3002 is OK: NTP OK: Offset -0.0001240968704 secs [09:10:02] PROBLEM - Puppet freshness on db45 is CRITICAL: No successful Puppet run in the last 10 hours [09:11:57] paravoid,ping [09:43:02] PROBLEM - Apache HTTP on mw1171 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:56:21] New patchset: Krinkle; "Fix various path inflexibilities and inconsistencies" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62923 [10:17:23] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [10:18:23] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours [10:21:23] PROBLEM - Puppet freshness on cp1029 is CRITICAL: No successful Puppet run in the last 10 hours [10:34:50] yurik_: pong [10:48:48] New patchset: Mark Bergsma; "Unset the IMS header when the LoggedOut cookie is present" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/65772 [10:49:02] New patchset: Faidon; "Varnish text: don't do backend fetches for logged out" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/65773 [10:49:14] haha [10:49:23] haha [10:49:32] I'll abandon mine [10:49:45] i figured I could save the regex match if the IMS header isn't even there [10:49:49] not that it really matters ;) [10:49:54] please confirm, let's not abandon both [10:49:57] :) [10:49:58] ok [10:50:30] Change abandoned: Faidon; "Lost the race. The other one is better too!" 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/65773 [10:50:47] PROBLEM - Puppet freshness on stat1002 is CRITICAL: No successful Puppet run in the last 10 hours [11:18:57] New patchset: Mark Bergsma; "Fix vcl_error function name (case)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/65774 [11:19:28] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/65772 [11:19:43] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/65774 [11:28:34] New review: Ori.livneh; "(1 comment)" [operations/puppet/cdh4] (master) C: 1; - https://gerrit.wikimedia.org/r/65267 [11:31:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:32:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [11:33:03] New patchset: Mark Bergsma; "Put request and error information in the error page" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/65779 [11:34:34] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/65779 [11:36:00] RECOVERY - Host amssq47 is UP: PING OK - Packet loss = 0%, RTA = 89.09 ms [11:39:00] PROBLEM - Host amssq47 is DOWN: PING CRITICAL - Packet loss = 100% [11:42:41] New patchset: Ottomata; "Adding ganglia monitoring of webrequest data loss in Kraken HDFS" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/65780 [11:43:55] New patchset: Ottomata; "Adding ganglia monitoring of webrequest data loss in Kraken HDFS" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/65780 [11:44:05] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/65780 [11:47:18] New patchset: Mark Bergsma; "Add XFF and X-Cache headers to error page" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/65781 [11:47:57] Change 
merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/65781 [11:48:32] RECOVERY - Host amssq47 is UP: PING OK - Packet loss = 0%, RTA = 89.51 ms [11:56:10] PROBLEM - Host amssq47 is DOWN: PING CRITICAL - Packet loss = 100% [12:02:49] New patchset: Mark Bergsma; "Use obj.http.X-Cache, only display headers when present" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/65784 [12:05:20] New patchset: Mark Bergsma; "Use obj.http.X-Cache, only display headers when present" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/65784 [12:08:52] New patchset: Mark Bergsma; "Improve information in the error page" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/65784 [12:09:34] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/65784 [12:11:20] RECOVERY - Host amssq47 is UP: PING OK - Packet loss = 0%, RTA = 89.10 ms [12:12:29] New patchset: Mark Bergsma; "Missing semicolon" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/65786 [12:15:01] PROBLEM - Host amssq47 is DOWN: PING CRITICAL - Packet loss = 100% [12:19:20] New review: Daniel Kinzler; "@aude: nasty edge case, nice catch! I think that's a known bug in Apache... don't know how to work a..." [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/65443 [12:22:54] notpeter, you around? i just need a brainbouncer [12:26:38] New review: Daniel Kinzler; "The rewrite looks fine, but the recursion guard does not. Or maybe I'm missing something?" [operations/apache-config] (master) C: -1; - https://gerrit.wikimedia.org/r/65443 [12:26:50] New review: Aude; "I would go ahead with this approach. " [operations/apache-config] (master) C: 1; - https://gerrit.wikimedia.org/r/65443 [12:31:31] mutante, you there? [12:32:50] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:33:31] let's trryyyyy maybe LeslieCarr? you around? 
i just need a quick brain bounce [12:33:40] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.122 second response time [12:34:16] New review: Aude; "see the problem with en.wikidata.org/wiki/foo being a rewrite to www.wikidata.org/wiki/Special:ItemB..." [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/65443 [12:46:54] RECOVERY - Host amssq47 is UP: PING OK - Packet loss = 0%, RTA = 89.08 ms [12:47:23] New patchset: Mark Bergsma; "Work around the fact that synthetic can only be used once" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/65790 [12:47:50] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/65786 [12:48:06] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/65790 [12:51:23] PROBLEM - Host amssq47 is DOWN: PING CRITICAL - Packet loss = 100% [12:53:37] New review: Aude; "how about http://dpaste.com/1202394/ ?" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/65443 [12:54:25] oh, gerrity is lovely [12:54:35] gerrit [12:55:15] New review: Aude; "dpaste.com/1202394/" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/65443 [12:57:57] hiya mark, you there? i got a really weird multicast problem [12:58:10] the logs in the webrequest multicast stream are pretty much all duplicated [12:58:10] oh? [12:58:19] but. 
only in the analytics cluster [12:58:30] here's how i'm checking: [12:59:49] https://gist.github.com/ottomata/5662603 [13:00:09] i'm firing up a udp2log instance consuming from the multicast stream and cating out [13:00:16] extracting the seq # and hostname [13:00:17] sorting [13:00:19] and then uniq -c [13:00:28] on analytics1026, i get duplicates [13:00:34] on stat1002, i don't see the duplicates [13:01:16] New review: Daniel Kinzler; "> see the problem with en.wikidata.org/wiki/foo being a " [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/65443 [13:01:38] brandon was saying how HTCP purges were duplicate too, but he theorized they might be sent twice to make them more resilient to lost packets [13:02:45] i'm looking all over the analytics cluster to see if there is some erroneous extra relay or something…maybe i did something stupid and started up something that piped everything back at the multicast group [13:02:50] i don't see anything yet though [13:03:19] if there's an extra relay, presumably they're coming from different source IPs [13:03:29] it's also possible the network is duplicating them [13:03:47] yeah, this only started happening after we set up the ACL and fixed the igmp problem [13:04:29] New patchset: Mark Bergsma; "Make the retry link work, with a protocol relative URL" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/65792 [13:05:20] hmm [13:05:29] perhaps the routers are no longer seeing each other's PIM packets [13:05:34] i'll check that in a bit [13:05:47] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/65792 [13:06:04] hmm, ok, how would I check for source IPs? 
i could check tcpdump but i'm not sure how to associate them there with the duplicate lines in the logs [13:06:34] RECOVERY - Host amssq47 is UP: PING OK - Packet loss = 0%, RTA = 89.30 ms [13:07:45] ok i'm looking at tcpdump and running my test [13:08:00] i only see source packets from gadolinium (the multicast relay host) [13:08:36] (sorry that makes sense, though I was going to be looking for the originating log frontend source IPs, got it) [13:08:46] also, not sure if this is reliable [13:08:57] because i'm not looking at the content [13:09:06] but tcpdump by default reports the packet length [13:09:15] and I see lots of duplicate packet lengths [13:09:26] PROBLEM - Host amssq47 is DOWN: PING CRITICAL - Packet loss = 100% [13:09:59] filter on a likely length [13:10:05] and see if those packets are coming from the same ip? [13:10:31] i'll do the same thing I did for seq + hostname with length and source... [13:10:42] actually save yourself the trouble :) [13:10:47] oh? [13:10:48] the routers don't see each other as PIM neighbors [13:10:53] that would do it [13:10:55] haha [13:10:57] yay! [13:10:59] solution! [13:11:20] i'll fix that in a bit [13:11:23] thanks [13:13:27] just on the analytics lan you mean? [13:14:43] hey mark, can you make today's meeting at 5pm (amsterdam time) for the varnishkafka demo? [13:15:59] * mark checks his calendar [13:16:22] ottomata: here now [13:16:23] yeah [13:16:24] I can make that [13:17:38] s'ok notpeter, we figured it out! i thought i was going nutso, but it turned out I wasn't! [13:18:17] paravoid: yes [13:18:23] so that wouldn't be brandon's problem [13:25:36] if it's a problem at all [13:34:59] ottomata: coool! [13:38:32] mark: thanks! [13:58:34] PROBLEM - DPKG on mc15 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:59:34] RECOVERY - DPKG on mc15 is OK: All packages OK [14:04:24] RECOVERY - Host amssq47 is UP: PING OK - Packet loss = 0%, RTA = 89.75 ms [14:09:56] ottomata: is it fixed now? 
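Editorial sketch of the duplicate check described at 13:00 above (extract the seq # and hostname, sort, then uniq -c). This is not the contents of the linked gist. It assumes each log line is whitespace-separated with the hostname in the first field and the sequence number in the second; that field layout, the function name count_duplicates, and the sample hostnames are assumptions made for illustration only.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def count_duplicates(lines: Iterable[str]) -> Dict[Tuple[str, str], int]:
    """Return {(hostname, seq): count} for every pair seen more than once.

    Assumed layout (hypothetical): field 0 is the sending hostname and
    field 1 is that host's sequence number.
    """
    counts: Counter = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or truncated lines
        counts[(fields[0], fields[1])] += 1
    # Equivalent of "sort | uniq -c", keeping only keys counted more than once.
    return {key: n for key, n in counts.items() if n > 1}


# Made-up sample lines; real input would be the consumed multicast stream.
sample = [
    "cp0001 100 first-request",
    "cp0001 100 first-request",
    "cp0002 100 other-host-same-seq",
    "cp0001 101 next-request",
]
duplicates = count_duplicates(sample)
```

On a host receiving every packet twice, nearly every (hostname, seq) pair would show a count of 2; on a healthy host the result would be empty. The same grouping could be applied to (packet length, source IP) pairs taken from tcpdump output, as suggested at 13:10.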
[14:11:00] * mark does another change [14:11:26] looks like it! [14:11:33] check again in a few mins ;) [14:11:55] http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&title=&vl=&x=&n=&hreg[]=analytics100%5B3456%5D&mreg[]=pkts_in&gtype=line&glegend=show&aggregate=1&embed=1&_=1369750294829 [14:12:18] PROBLEM - Puppet freshness on stat1 is CRITICAL: No successful Puppet run in the last 10 hours [14:12:19] hehe [14:12:43] cut in half, so looks right [14:14:25] good [14:14:28] appears to stay up [14:14:50] from { [14:14:50] destination-address { [14:14:50] 224.0.0.13/32; [14:14:50] } [14:14:50] protocol pim; [14:14:50] ttl 1; [14:14:53] } [14:14:55] then accept; [14:16:54] cool danke [14:19:18] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours [14:22:43] !log reimaging db1033 for testing [14:22:51] Logged the message, notpeter [14:25:38] PROBLEM - Host db1033 is DOWN: PING CRITICAL - Packet loss = 100% [14:29:16] New patchset: Mark Bergsma; "Restrict access to backend caches to Wikimedia IP ranges" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/65807 [14:30:20] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/65807 [14:30:48] RECOVERY - Host db1033 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms [14:32:58] PROBLEM - SSH on db1033 is CRITICAL: Connection refused [14:33:28] PROBLEM - Disk space on db1033 is CRITICAL: Connection refused by host [14:33:48] PROBLEM - RAID on db1033 is CRITICAL: Connection refused by host [14:33:49] PROBLEM - DPKG on db1033 is CRITICAL: Connection refused by host [14:35:00] New patchset: Mark Bergsma; "The wikimedia_nets ACL isn't always defined" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/65808 [14:35:51] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/65808 [14:36:50] !log Gracefulled Apache on gallium, looks like there might have been APC corruption 
[14:36:59] Logged the message, Mr. Obvious [14:39:57] RECOVERY - SSH on db1033 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [14:40:17] PROBLEM - Puppet freshness on mw1171 is CRITICAL: No successful Puppet run in the last 10 hours [14:42:47] PROBLEM - RAID on mc15 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:43:47] RECOVERY - RAID on mc15 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [14:44:09] New patchset: Mark Bergsma; "Remove duplicate import of std vmod" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/65811 [14:44:54] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/65811 [14:45:27] PROBLEM - NTP on db1033 is CRITICAL: NTP CRITICAL: Offset unknown [14:47:27] RECOVERY - Disk space on db1033 is OK: DISK OK [14:47:47] RECOVERY - RAID on db1033 is OK: OK: State is Optimal, checked 2 logical device(s) [14:47:47] RECOVERY - DPKG on db1033 is OK: All packages OK [14:50:27] RECOVERY - NTP on db1033 is OK: NTP OK: Offset -0.01446533203 secs [15:19:06] New review: Andrew Bogott; "Why not just install the latexml debian package? Seems to work for me..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61767 [15:21:17] New review: Physikerwelt; "Currently it can not be configured as web-service." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/61767 [15:24:42] New patchset: Andrew Bogott; "Remove apachebench on wtp1004, we don't need it any more" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63738 [15:24:58] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63738 [15:27:12] PROBLEM - Host wtp1008 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:52] RECOVERY - Host wtp1008 is UP: PING OK - Packet loss = 0%, RTA = 0.57 ms [16:07:03] New patchset: Mark Bergsma; "Handle the mobile redirect in a slightly nicer way" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/65820 [16:08:29] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/65820 [16:09:33] New review: Petrb; "(1 comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/65705 [16:10:23] Reedy: huh, wikidatawiki has more RC rows than enwiki [16:11:39] New patchset: Ottomata; "Fixing eventlogging stat1 rsync job puppetization" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/65821 [16:11:54] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/65821 [16:11:58] New review: Petrb; "also can you insert this package to development environment as well?" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/65705 [16:14:49] RECOVERY - Puppet freshness on stat1 is OK: puppet ran at Tue May 28 16:14:39 UTC 2013 [16:17:06] New review: Demon; "So this looks nice and versatile and will probably work for Solr as well :)" [operations/puppet/cdh4] (master) - https://gerrit.wikimedia.org/r/65408 [16:19:09] RECOVERY - Puppet freshness on stat1002 is OK: puppet ran at Tue May 28 16:19:04 UTC 2013 [16:19:23] Aaron|home: this is what I was trying to tell you guys [16:20:58] why is gerrit still so slow? 
[16:21:07] https://wikipulse.herokuapp.com/ [16:21:17] check wikidata and en wp any time of day or night... [16:21:27] ah no idea about gerrit :-( [16:23:21] taking 7s [16:25:42] ok java is using half the memory but [16:25:48] there's no runaway cpu thing happening [16:27:36] can't suss anything out of ganglia [16:29:05] <^demon|away> Java's using half the memory because that's what we've allocated to the JVM at launch. [16:31:34] greedy thing [16:38:34] what is git-upload-pack in a POST request? [16:38:45] e.g. POST /r/p/mediawiki/extensions/WikiForum.git/git-upload-pack [16:39:22] New review: Faidon; "Definitely. This needs to be its own module, especially if there are already plans to use it in anot..." [operations/puppet/cdh4] (master) C: -1; - https://gerrit.wikimedia.org/r/65408 [16:39:31] * Aaron|home hands ^demon|away https://gerrit.wikimedia.org/r/#/c/65159/2 [16:39:35] New patchset: Mark Bergsma; "Add all mobile UAs currently listed in the Squid configuration" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/65823 [16:39:37] joy [16:39:42] ^demon|away: I guess that is to you [16:39:44] mark: yikes [16:40:01] gerrit is slllow [16:40:14] * paravoid hands Aaron|home https://gerrit.wikimedia.org/r/#/c/62549/ [16:40:28] <^demon|away> apergos: server side of git-fetch-pack :) [16:42:45] New review: Demon; "I don't know the differences in the packaging without looking--for Solr we just used the debian pack..." [operations/puppet/cdh4] (master) - https://gerrit.wikimedia.org/r/65408 [16:43:19] New review: MaxSem; "I'm looking right now which of these strings can be safely dropped." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/65823 [16:43:36] thank you very much max [16:43:46] obviously i'm not too happy about that regex as it is ;) [16:44:52] paravoid: you think "nuke" is neutral? [16:44:54] apergos: okay to take ms-be4 down? 
sbernardin is or is on his way to the DC now [16:45:10] yes do it [16:45:17] k ....thx [16:45:36] * Aaron|home thinks jawiki would beg to differ [16:45:40] * Aaron|home looks up synonyms [16:45:53] !log ms-be4 going down for h/w replacement [16:46:02] Logged the message, Master [16:46:51] maybe 'erase', meh [16:47:02] I don't think anyone except you will be thinking about ceph scrubs ;) [16:50:43] PROBLEM - Host ms-be4 is DOWN: PING CRITICAL - Packet loss = 100% [16:55:38] <^demon|away> Aaron|home: deleteDeletedFiles? ;-) [17:00:26] New patchset: Andrew Bogott; "No longer run and deploy Parsoid on the Parsoid Varnish machines" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63735 [17:00:35] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63735 [17:04:22] nuke was what I liked but I don't care that much :) [17:06:06] New patchset: Matthias Mullie; "Revert "Disable AFTv5 talk page links"" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/65827 [17:17:52] New review: Akosiaris; "Looks fine to me, the tests are cool too. " [operations/puppet/cdh4] (master) C: 2; - https://gerrit.wikimedia.org/r/65267 [17:18:04] drdee, do we have raw user-agent data for page views somewhere? 
[17:24:20] Change merged: BBlack; [operations/software/varnish/vhtcpd] (master) - https://gerrit.wikimedia.org/r/65639 [17:24:34] Change merged: BBlack; [operations/software/varnish/vhtcpd] (master) - https://gerrit.wikimedia.org/r/65640 [17:24:50] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/65827 [17:26:00] New patchset: BBlack; "bump version to 0.0.4" [operations/software/varnish/vhtcpd] (master) - https://gerrit.wikimedia.org/r/65829 [17:26:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:27:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time [17:27:26] !log taking down ms-be4 for server replacement [17:27:35] Logged the message, Master [17:31:11] New patchset: BBlack; "Merge branch 'master' into debian" [operations/software/varnish/vhtcpd] (debian) - https://gerrit.wikimedia.org/r/65830 [17:31:11] New patchset: BBlack; "bump pkg version" [operations/software/varnish/vhtcpd] (debian) - https://gerrit.wikimedia.org/r/65831 [17:31:25] Change merged: BBlack; [operations/software/varnish/vhtcpd] (debian) - https://gerrit.wikimedia.org/r/65830 [17:31:41] Change merged: BBlack; [operations/software/varnish/vhtcpd] (debian) - https://gerrit.wikimedia.org/r/65831 [17:36:44] bblack: feel free to deploy vhtcpd on all varnish boxes if you're happy with it [17:37:46] hehe [17:38:20] Change abandoned: Mwalker; "Not the best solution to this problem -- symlinks in /a are apparently evil!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/65409 [17:38:38] mark: I had a crasher bug I was sorting out on Sunday from the hotel. I fixed it and was watching it, and then we lost internet access. 
And then I got home to find my access here down as well :) [17:38:58] haha [17:39:18] mark: so now I'm back online and the fix seems to have worked and it's not crashing, so I'm packaging up this version now to really deploy to cp1029 correctly and let it run as a proper daemon for a bit while I sort out pkg repo + puppet stuff [17:39:27] ok [17:40:08] it's not like the perl version was more stable ;) [17:41:02] is the perl version of anything every more stable? :) [17:41:26] s/every/ever/ [18:13:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:14:05] !log mlitn Started syncing Wikimedia installation... : Update ArticleFeedbackv5 & Echo to master [18:14:14] Logged the message, Master [18:14:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [18:15:14] drdee, do we have raw user-agent data for page views somewhere? [18:17:12] PROBLEM - Puppet freshness on db1032 is CRITICAL: No successful Puppet run in the last 10 hours [18:18:45] New patchset: Cmjohnson; "Changing site.pp and netboot.cfg for ms-be1 and ms-be4" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/65834 [18:18:46] OuKB: yes we do [18:19:30] do I need stat1 access for it? [18:20:22] i am sorry but i don't recognize your IRC handle,what's your name? [18:20:29] MaxSem [18:20:37] k [18:20:50] got locked into this one after a bunch of reconnects [18:21:24] stat1002 is preferable, submit RT ticket and then we can give you access [18:23:01] !log mlitn Finished syncing Wikimedia installation... 
: Update ArticleFeedbackv5 & Echo to master [18:23:10] Logged the message, Master [18:23:12] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [18:23:12] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [18:23:12] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours [18:26:09] PROBLEM - Puppet freshness on pdf1 is CRITICAL: No successful Puppet run in the last 10 hours [18:26:09] PROBLEM - Puppet freshness on pdf2 is CRITICAL: No successful Puppet run in the last 10 hours [18:27:41] PROBLEM - DPKG on mc15 is CRITICAL: Timeout while attempting connection [18:28:39] RECOVERY - DPKG on mc15 is OK: All packages OK [18:37:02] New patchset: Hashar; "beta: disable WikibaseRepo on wikidata" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/65838 [18:37:32] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/65838 [18:38:10] New patchset: Cmjohnson; "Changing site.pp and netboot.cfg for ms-be1 and ms-be4" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/65834 [18:39:17] Change merged: Cmjohnson; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/65834 [18:48:00] New patchset: Hashar; "Revert "beta: disable WikibaseRepo on wikidata"" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/65839 [18:48:38] New patchset: Hashar; "beta: reenable WikibaseRepo on wikidata" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/65839 [18:49:09] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/65839 [18:52:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:52:36] !log taking down ms-be1 for server replacement [18:52:45] Logged the message, Master [18:53:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: 
Status line output matched 400 - 336 bytes in 0.128 second response time [18:53:26] !log mlitn synchronized php-1.22wmf4/extensions/ArticleFeedbackv5 'Update ArticleFeedbackv5 to master' [18:53:35] Logged the message, Master [18:57:12] PROBLEM - Host ms-be1 is DOWN: PING CRITICAL - Packet loss = 100% [18:57:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:58:32] RECOVERY - Host ms-be4 is UP: PING OK - Packet loss = 0%, RTA = 28.35 ms [18:59:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [19:00:42] PROBLEM - swift-account-server on ms-be4 is CRITICAL: Timeout while attempting connection [19:00:52] PROBLEM - swift-object-auditor on ms-be4 is CRITICAL: Timeout while attempting connection [19:00:52] PROBLEM - swift-container-auditor on ms-be4 is CRITICAL: Timeout while attempting connection [19:00:52] PROBLEM - RAID on ms-be4 is CRITICAL: Timeout while attempting connection [19:01:12] PROBLEM - swift-container-replicator on ms-be4 is CRITICAL: Timeout while attempting connection [19:01:12] PROBLEM - swift-object-updater on ms-be4 is CRITICAL: Timeout while attempting connection [19:01:12] PROBLEM - swift-account-reaper on ms-be4 is CRITICAL: Timeout while attempting connection [19:01:22] PROBLEM - swift-container-updater on ms-be4 is CRITICAL: Timeout while attempting connection [19:01:23] PROBLEM - SSH on ms-be4 is CRITICAL: Connection timed out [19:01:32] PROBLEM - swift-container-server on ms-be4 is CRITICAL: Timeout while attempting connection [19:01:32] PROBLEM - DPKG on ms-be4 is CRITICAL: Timeout while attempting connection [19:01:32] PROBLEM - swift-account-replicator on ms-be4 is CRITICAL: Timeout while attempting connection [19:01:42] PROBLEM - swift-object-server on ms-be4 is CRITICAL: Timeout while attempting connection [19:01:42] PROBLEM - swift-object-replicator on ms-be4 is CRITICAL: Timeout while attempting 
connection [19:01:42] PROBLEM - Disk space on ms-be4 is CRITICAL: Timeout while attempting connection [19:01:42] PROBLEM - swift-account-auditor on ms-be4 is CRITICAL: Timeout while attempting connection [19:10:51] PROBLEM - Puppet freshness on db45 is CRITICAL: No successful Puppet run in the last 10 hours [19:13:21] PROBLEM - NTP on ms-be4 is CRITICAL: NTP CRITICAL: No response from NTP server [19:23:37] andrewbogott: any reason you +1d https://gerrit.wikimedia.org/r/#/c/63080/ rather than +2? were you looking for further input from someone specific? [19:24:38] ori-l: only because there were a ton of reviewers and I figured some of them might want a look. I can merge it now if you like. [19:25:06] yes plz :) [19:25:32] ^ superm401 [19:26:03] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63080 [19:26:09] Thanks, andrewbogott [19:26:21] RECOVERY - SSH on ms-be4 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [19:30:31] PROBLEM - Disk space on mc15 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[19:31:22] RECOVERY - Disk space on mc15 is OK: DISK OK [19:34:22] PROBLEM - Host wtp1008 is DOWN: PING CRITICAL - Packet loss = 100% [19:35:01] RECOVERY - Host wtp1008 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [19:41:06] PROBLEM - Host ms-be4 is DOWN: PING CRITICAL - Packet loss = 100% [19:44:01] New patchset: Jdlrobson; "Add schema for logging editing" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/65843 [19:44:20] thanks andrew [19:46:16] RECOVERY - Host ms-be4 is UP: PING OK - Packet loss = 0%, RTA = 26.59 ms [19:48:27] PROBLEM - SSH on ms-be4 is CRITICAL: Connection timed out [19:51:26] RECOVERY - SSH on ms-be4 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [19:58:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:00:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [20:05:37] RECOVERY - Disk space on ms-be4 is OK: DISK OK [20:05:48] RECOVERY - RAID on ms-be4 is OK: OK: State is Optimal, checked 1 logical device(s) [20:06:16] RECOVERY - swift-object-updater on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [20:06:16] RECOVERY - swift-container-replicator on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [20:06:16] RECOVERY - swift-container-updater on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [20:06:16] RECOVERY - swift-container-auditor on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [20:06:26] RECOVERY - swift-container-server on ms-be4 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [20:06:26] RECOVERY - DPKG on ms-be4 is OK: All packages OK [20:06:27] RECOVERY - swift-object-auditor on ms-be4 is OK: PROCS OK: 2 
processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [20:06:48] RECOVERY - swift-account-server on ms-be4 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [20:07:28] RECOVERY - swift-account-auditor on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [20:07:38] RECOVERY - swift-object-replicator on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [20:07:48] RECOVERY - swift-object-server on ms-be4 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [20:08:18] RECOVERY - swift-account-reaper on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [20:08:28] RECOVERY - swift-account-replicator on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [20:17:48] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [20:18:49] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours [20:21:48] PROBLEM - Puppet freshness on cp1029 is CRITICAL: No successful Puppet run in the last 10 hours [20:23:18] RECOVERY - NTP on ms-be4 is OK: NTP OK: Offset -0.02200067043 secs [20:38:13] New patchset: Dr0ptp4kt; "Instruct robots to stop indexing zero.wikipedia.org and its subdomains." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64629 [20:49:33] Hello guys, API for commons is returning 504 Errors [20:53:15] When doing what? [20:56:24] 10.64.32.41 apache2[17099]: [notice] child pid 7945 exit signal Bus error (7) [20:57:00] Eek [20:58:05] and a lot of that [20:58:12] from the same host [21:06:32] dear ops, apache on 10.64.32.41 is still melting down [21:07:26] preparing to scap... 
[21:09:05] New review: Jdlrobson; "As stated I don't like this but I'll let someone else decide since I've raised my concerns... :)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64629 [21:16:26] !log maxsem Started syncing Wikimedia installation... : Weekly mobile deployment [21:16:35] Logged the message, Master [21:17:08] cp: cannot create regular file `/a/common/wmf-config/ExtensionMessages-1.22wmf5.php': Permission denied [21:23:42] fixed [21:27:22] !log maxsem Finished syncing Wikimedia installation... : Weekly mobile deployment [21:27:30] Logged the message, Master [21:39:42] New patchset: Spage; "beta: default to new login and create acct forms" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/65851 [21:41:01] New patchset: Spage; "beta: default to new login and create acct forms" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/65851 [21:43:25] PROBLEM - DPKG on mc15 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:44:15] RECOVERY - DPKG on mc15 is OK: All packages OK [21:44:57] New review: Cmcmahon; "let's see this in action and get it tested" [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/65851 [21:51:38] New patchset: GWicke; "New Parsoid Varnish puppetization" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63890 [21:57:05] New review: GWicke; "Addressed some of the VCL issues, but am not yet sure about the last of them. See inline question." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/63890 [22:00:06] New review: Catrope; "(1 comment)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63890 [22:01:36] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/65851 [22:29:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:30:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.133 second response time [22:46:35] PROBLEM - NTP on ssl3002 is CRITICAL: NTP CRITICAL: No response from NTP server [22:50:43] New patchset: MaxSem; "Add all mobile UAs currently listed in the Squid configuration" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/65823 [22:51:35] PROBLEM - NTP on ssl3003 is CRITICAL: NTP CRITICAL: No response from NTP server [22:52:37] New review: MaxSem; "* Sorted alphabetically." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/65823 [23:49:44] New review: MZMcBride; "This appears to be related to bug 48856." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64629