[00:01:33] (PS2) Ori.livneh: New module: 'statsd' [operations/puppet] - https://gerrit.wikimedia.org/r/82201
[00:01:56] (CR) jenkins-bot: [V: -1] New module: 'statsd' [operations/puppet] - https://gerrit.wikimedia.org/r/82201 (owner: Ori.livneh)
[00:02:21] whoops
[00:02:53] there's a stray comma
[00:03:01] 144
[00:03:01] 135 $endpoint = 'tcp://vanadium.eqiad.wmnet:8600' $endpoint = 'tcp://vanadium.eqiad.wmnet:8600' 145
[00:03:04] 136 $statsd_host = '127.0.0.1' $statsd_host = 'professor.pmtpa.wmnet',
[00:03:48] (PS3) Ori.livneh: New module: 'statsd' [operations/puppet] - https://gerrit.wikimedia.org/r/82201
[00:04:03] adding ganglia support is as simple as plopping https://github.com/jbuchbinder/statsd-ganglia-backend/blob/master/statsd-ganglia-backend/index.js in /etc/statsd/backends/ganglia
[00:04:39] which i'll grant is not as nice as having it built-in, but i kept finding really basic errors in some of the other implementations
[00:04:46] (CR) Faidon Liambotis: [C: 2] New module: 'statsd' [operations/puppet] - https://gerrit.wikimedia.org/r/82201 (owner: Ori.livneh)
[00:05:24] merged
[00:05:33] thanks a lot for working on this
[00:05:40] it's a very simple tool to implement -- listen on udp, collect stats, perform some simple computations, flush every n seconds to carbon via tcp. there are lots of awful implementations.
[00:05:48] ooh, awesome. thanks a lot for reviewing. isn't it like 3 AM? :)
[00:05:50] feel free to add me as a reviewer/ping me for more
[00:05:55] it is :)
[00:06:01] (damn)
[00:06:23] i'll force a puppet run on professor
[00:07:26] (PS2) Faidon Liambotis: Notify navtiming service when its Python code changes [operations/puppet] - https://gerrit.wikimedia.org/r/81572 (owner: Ori.livneh)
[00:07:39] gah gerrit
[00:08:01] (PS3) Faidon Liambotis: Notify navtiming service when its Python code changes [operations/puppet] - https://gerrit.wikimedia.org/r/81572 (owner: Ori.livneh)
[00:08:13] (CR) Faidon Liambotis: [C: 2] Notify navtiming service when its Python code changes [operations/puppet] - https://gerrit.wikimedia.org/r/81572 (owner: Ori.livneh)
[00:08:23] (PS4) Faidon Liambotis: Notify navtiming service when its Python code changes [operations/puppet] - https://gerrit.wikimedia.org/r/81572 (owner: Ori.livneh)
[00:08:27] (CR) Faidon Liambotis: [C: 2] Notify navtiming service when its Python code changes [operations/puppet] - https://gerrit.wikimedia.org/r/81572 (owner: Ori.livneh)
[00:08:45] the simplest things
[00:08:59] Linux professor 2.6.32-41-server #94-Ubuntu SMP Fri Jul 6 18:15:07 UTC 2012 x86_64 GNU/Linux
[00:09:00] Ubuntu 10.04.4 LTS
[00:09:02] * ori-l cries.
[00:09:07] oh heh
[00:09:28] mind if i move it back to hafnium for now?
[00:09:43] nodejs: Installed: (none) Candidate: 0.4.9-wm2
[00:09:47] yeah that won't do it
[00:10:27] sigh, yes, let's move it
[00:12:28] paravoid: hi sir
[00:12:32] paravoid: I saw your ping
[00:12:35] hello!
[00:13:17] paravoid: let's hangout on g+ if you want, voice-only is fine too
[00:13:45] I'd prefer not right now
[00:14:02] (PS1) Ori.livneh: Remove misc::graphite::statsd from professor, use hafnium instead [operations/puppet] - https://gerrit.wikimedia.org/r/82203
[00:14:06] paravoid: then we can talk here
[00:14:44] (CR) Faidon Liambotis: [C: -1] "Careful of whitespace! :)" [operations/dns] - https://gerrit.wikimedia.org/r/81667 (owner: Hashar)
[00:14:45] paravoid: I have two questions for you before we go further
[00:15:44] (PS2) Faidon Liambotis: Remove misc::graphite::statsd from professor, use hafnium instead [operations/puppet] - https://gerrit.wikimedia.org/r/82203 (owner: Ori.livneh)
[00:15:48] sure
[00:17:04] (PS3) Faidon Liambotis: Remove misc::graphite::statsd from professor, use hafnium instead [operations/puppet] - https://gerrit.wikimedia.org/r/82203 (owner: Ori.livneh)
[00:17:27] * ori-l is chastened.
[00:17:28] thanks
[00:17:41] paravoid: 1) we need a place to put all the geoip related databases(the ones maxmind provide). because we need all of them in order to geolocate requests. in order to understand why this is necessary think about a squid line from 2010 March, it should only be geolocated with a geoip maxmind from around 2010 March. Therefore, by induction, we need this for every month(or better said every release of the GeoIP database).
[00:18:04] this was raised by Toby on the ops list this week
[00:18:48] (CR) Faidon Liambotis: [C: 2] Remove misc::graphite::statsd from professor, use hafnium instead [operations/puppet] - https://gerrit.wikimedia.org/r/82203 (owner: Ori.livneh)
[00:19:04] paravoid: 2) now, about dtrees, they are similar to 1) except they can be public, but we need to be able to store them somewhere. last time the discussion on gerrit left us in a position of thinking where is the best place to put them. Do you agree ? Let's continue this conversation from where it left off last time, in particular, a place where we should put them
[00:20:14] where we left it was that if dtrees are like geoip databases (which, aiui, they are) and frequently updated, then they don't belong in the package
[00:20:34] so, abandon that change and work with ottomata on how to manage dtrees & geoip and their history effectively?
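
Editor's sketch of the point made at 00:17:41, that a request logged in March 2010 should be geolocated with the GeoIP release from around that date. It is only an illustration under stated assumptions: the release dates are made up, and the names `RELEASES` and `release_for` are hypothetical, not anything agreed in the discussion.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical archive of GeoIP database release dates (illustrative values only).
RELEASES = [date(2010, 2, 2), date(2010, 3, 2), date(2010, 4, 6)]


def release_for(log_date, releases=RELEASES):
    """Return the latest release dated on or before log_date.

    Falls back to the earliest release when the log line predates the archive.
    """
    ordered = sorted(releases)
    # Index of the first release strictly after log_date.
    idx = bisect_right(ordered, log_date)
    return ordered[max(idx - 1, 0)]


# A request logged mid-March 2010 maps to the March 2010 release.
print(release_for(date(2010, 3, 15)))
```

Keeping one archived file per release, wherever it ends up being stored, is what makes a lookup like this possible.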
[00:21:14] ori-l: hey, why is https://gerrit.wikimedia.org/r/#/c/74837/ still unmerged?
[00:21:17] did I miss it?
[00:21:34] yes, they are like geoip databases, their frequency of update is the same as OpenDDR frequency of update. If you want to know that frequency please have a look here in the commit messages of OpenDDR https://github.com/OpenDDRdotORG/OpenDDR-Resources/commits/master
[00:22:16] average: my review comments on the patchset stand then :)
[00:22:24] paravoid: yes, indeed they do
[00:22:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:22:55] paravoid: I've been busy on wikimetrics btw, so me and ottomata haven't looked into finding a place for these dtree files
[00:22:58] it's silly to go through gerrit patchset -> review, merge -> build package -> redeploy package for every openddr update
[00:23:13] we have puppet volatile
[00:23:24] anyway, do you want to abandon that changeset and work with otto?
[00:23:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time
[00:23:34] I'd be happy to discuss ideas further with you and/or otto
[00:23:44] ottomata: you around ?
[00:23:52] you know it's sunday, right? :)
[00:23:56] paravoid: i think so (re: missing it), but it's been long enough that i should probably rebase it myself and make sure it rebases cleanly
[00:24:18] PROBLEM - Puppet freshness on ssl1 is CRITICAL: No successful Puppet run in the last 10 hours
[00:24:27] The change could not be rebased due to a path conflict during merge.
[00:24:39] yeah, i'll deal with it. you should get some sleep, tho :P
[00:24:41] paravoid: yes, I know
[00:24:59] I won't sleep until this gets merged
[00:25:05] no pressure
[00:25:32] paravoid: do you have a suggestion for a place to put our dtree and geoip files ? there is a difference between them. geoip needs private. dtree can be public.
[00:26:03] puppet volatile, probably
[00:26:18] or alternatively, hdfs? :)
[00:27:16] is hdfs distributed over all nodes in the hadoop cluster ?
[00:27:35] over all datanodes
[00:27:41] what does it matter?
[00:27:46] the mappers will need access to geoip and dtrees
[00:28:00] * average is not a hadoop expert
[00:28:26] I'm not either
[00:28:31] coordinate with otto
[00:28:40] ok
[00:28:44] paravoid: thank you
[00:30:18] PROBLEM - Puppet freshness on ssl1006 is CRITICAL: No successful Puppet run in the last 10 hours
[00:30:29] (PS2) Faidon Liambotis: missing A entries for LVS Ethernet interfaces [operations/dns] - https://gerrit.wikimedia.org/r/81667 (owner: Hashar)
[00:31:12] (CR) Faidon Liambotis: [C: 2] missing A entries for LVS Ethernet interfaces [operations/dns] - https://gerrit.wikimedia.org/r/81667 (owner: Hashar)
[00:31:30] Reedy: round?
[00:33:52] (PS2) Ori.livneh: Report MediaWiki exceptions & fatals to Ganglia [operations/puppet] - https://gerrit.wikimedia.org/r/74837
[00:34:12] lol
[00:34:17] you know I didn't mean it, right? :)
[00:35:08] (CR) Faidon Liambotis: [C: 2] Report MediaWiki exceptions & fatals to Ganglia [operations/puppet] - https://gerrit.wikimedia.org/r/74837 (owner: Ori.livneh)
[00:37:52] (PS3) Yuvipanda: Change Wikivoyage Logo and favicon [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/82119 (owner: Jalexander)
[00:38:09] PROBLEM - Puppet freshness on ssl1008 is CRITICAL: No successful Puppet run in the last 10 hours
[00:41:09] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours
[00:41:20] yes, obviously :P thanks
[00:45:14] (CR) Faidon Liambotis: [C: 2] Change Wikivoyage Logo and favicon [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/82119 (owner: Jalexander)
[00:45:27] (Merged) jenkins-bot: Change Wikivoyage Logo and favicon [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/82119 (owner: Jalexander)
[00:46:25] should i sync that?
[00:47:13] I'm about to
[00:47:17] kk
[00:47:36] thanks much
[00:48:39] !log faidon synchronized docroot/bits/favicon/wikivoyage.ico 'new wikivoyage favicon'
[00:48:45] Logged the message, Master
[00:49:09] !log faidon synchronized wmf-config/InitialiseSettings.php 'new wikivoyage logo'
[00:49:15] Logged the message, Master
[00:49:37] logo is up, favicon will need purges
[00:49:47] or not?
[00:49:55] seems to work for me
[00:49:56] (both)
[00:50:08] the old favicon was 32x32 btw
[00:50:08] thank you much, I'm sure some will have cached it but that will clear out
[00:50:09] PROBLEM - Puppet freshness on ssl1001 is CRITICAL: No successful Puppet run in the last 10 hours
[00:50:40] the new is 16x16
[00:50:51] hmmmm, it should be 32x32 … I thought I added that one
[00:50:55] it actually has both embeded
[00:51:03] * Jamesofur looks
[00:51:09] PROBLEM - Puppet freshness on amssq47 is CRITICAL: No successful Puppet run in the last 10 hours
[00:51:09] could be, firefox shows it smaller
[00:51:14] yeah
[00:51:38] on my side the file at least still shows as 32 in my local repository
[00:51:48] it's designed to allow different browsers to use what they want
[00:52:02] www also needs updating
[00:52:03] most browsers use 16x16 but some like Windows 8 use 32x32 for other areas
[00:52:05] yeah
[00:52:08] looking at that now
[00:52:11] (PS1) Ori.livneh: Correct class name (role::logging::mediawiki::ganglia, should be ::errors) [operations/puppet] - https://gerrit.wikimedia.org/r/82206
[00:52:12] that's updated on meta
[00:52:28] * paravoid liked the old logo better
[00:52:30] paravoid: sorry, I missed that.
[00:52:50] (CR) Faidon Liambotis: [C: 2] Correct class name (role::logging::mediawiki::ganglia, should be ::errors) [operations/puppet] - https://gerrit.wikimedia.org/r/82206 (owner: Ori.livneh)
[00:53:42] statsd on hafnium looks good
[00:53:50] yay
[00:53:59] personally I wanted the snake on a towell
[00:54:09] PROBLEM - Puppet freshness on ssl1003 is CRITICAL: No successful Puppet run in the last 10 hours
[00:54:09] PROBLEM - Puppet freshness on ssl1005 is CRITICAL: No successful Puppet run in the last 10 hours
[00:54:09] PROBLEM - Puppet freshness on ssl4 is CRITICAL: No successful Puppet run in the last 10 hours
[00:54:10] much more hitchhikers guide to the galaxy
[00:56:13] oh fuck me
[00:56:32] maybe you didn't merge it because you intuited on some deep level that it was a careless patch
[00:56:53] (PS1) Ori.livneh: Fix typo in template path (mwerrors.pyconf.erb) [operations/puppet] - https://gerrit.wikimedia.org/r/82207
[00:57:09] PROBLEM - Puppet freshness on ssl1007 is CRITICAL: No successful Puppet run in the last 10 hours
[00:57:09] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours
[00:57:14] (CR) Faidon Liambotis: [C: 2] Fix typo in template path (mwerrors.pyconf.erb) [operations/puppet] - https://gerrit.wikimedia.org/r/82207 (owner: Ori.livneh)
[00:57:42] thanks, i owe you one
[00:57:45] or rather: one more
[00:58:13] and to think I reviewed this
[00:59:37] it's always like that, there's some famous statistic (possibly apocryphal) about most car accidents happening in good weather within a short radius from one's home
[01:00:09] PROBLEM - Puppet freshness on ssl1002 is CRITICAL: No successful Puppet run in the last 10 hours
[01:00:09] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: No successful Puppet run in the last 10 hours
[01:00:19] RECOVERY - check_job_queue on hume is OK: JOBQUEUE OK - all job queues below 10,000
[01:00:45] 'neon puppet snmp trap' sounds like a william gibson novel
[01:03:09] PROBLEM - Puppet freshness on ssl1004 is CRITICAL: No successful Puppet run in the last 10 hours
[01:03:29] PROBLEM - check_job_queue on hume is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:04:04] vanadium looking good too, thanks again
[01:05:09] PROBLEM - Puppet freshness on ssl3003 is CRITICAL: No successful Puppet run in the last 10 hours
[01:06:09] PROBLEM - Puppet freshness on ssl1009 is CRITICAL: No successful Puppet run in the last 10 hours
[01:07:09] PROBLEM - Puppet freshness on ssl3 is CRITICAL: No successful Puppet run in the last 10 hours
[01:07:09] PROBLEM - Puppet freshness on ssl3002 is CRITICAL: No successful Puppet run in the last 10 hours
[01:11:27] PROBLEM - Puppet freshness on ssl2 is CRITICAL: No successful Puppet run in the last 10 hours
[01:28:47] RECOVERY - Ceph on ms-fe1004 is OK: Ceph HEALTH_OK
[01:29:28] RECOVERY - Ceph on ms-fe1001 is OK: Ceph HEALTH_OK
[01:29:38] RECOVERY - Ceph on ms-fe1003 is OK: Ceph HEALTH_OK
[02:04:38] (CR) TTO: [C: -1] "Needs a rebase. My patience is being tested by this patch." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/78637 (owner: TTO)
[02:07:47] !log LocalisationUpdate completed (1.22wmf14) at Mon Sep 2 02:07:46 UTC 2013
[02:07:54] Logged the message, Master
[02:11:02] PROBLEM - Puppet freshness on pdf1 is CRITICAL: No successful Puppet run in the last 10 hours
[02:13:51] !log LocalisationUpdate completed (1.22wmf15) at Mon Sep 2 02:13:50 UTC 2013
[02:13:57] Logged the message, Master
[02:25:02] !log LocalisationUpdate ResourceLoader cache refresh completed at Mon Sep 2 02:25:02 UTC 2013
[02:25:08] Logged the message, Master
[02:41:46] (PS7) Ori.livneh: Continuing to clean up InitialiseSettings.php [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/78637 (owner: TTO)
[03:10:35] (PS1) Springle: purge script bug [operations/puppet] - https://gerrit.wikimedia.org/r/82214
[03:11:13] (CR) Springle: [C: 2 V: 2] purge script bug [operations/puppet] - https://gerrit.wikimedia.org/r/82214 (owner: Springle)
[03:24:52] (CR) Ori.livneh: [C: 1] "> Needs a rebase. My patience is being tested by this patch." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/78637 (owner: TTO)
[03:47:08] PROBLEM - Puppet freshness on fenari is CRITICAL: No successful Puppet run in the last 10 hours
[03:47:08] PROBLEM - Puppet freshness on analytics1003 is CRITICAL: No successful Puppet run in the last 10 hours
[03:57:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:58:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.137 second response time
[04:04:38] RECOVERY - MySQL Slave Running on db35 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error:
[04:06:38] PROBLEM - MySQL Slave Delay on db35 is CRITICAL: CRIT replication delay 853506 seconds
[04:33:05] !log granted process/show/usage to watchdog on db masters
[04:33:12] Logged the message, Master
[04:42:26] PROBLEM - DPKG on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:46:46] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 1173852 seconds
[05:00:36] PROBLEM - DPKG on mw31 is CRITICAL: Timeout while attempting connection
[05:01:26] PROBLEM - Host mw31 is DOWN: PING CRITICAL - Packet loss = 100%
[05:02:36] RECOVERY - Host mw31 is UP: PING OK - Packet loss = 0%, RTA = 26.82 ms
[05:05:46] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds
[05:07:42] PROBLEM - MySQL Slave Running on db33 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Query caused different errors on master and slave. Error on maste
[05:09:42] RECOVERY - MySQL Slave Running on db33 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error:
[05:11:42] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 11340065 seconds
[05:16:42] PROBLEM - NTP on mw31 is CRITICAL: NTP CRITICAL: Offset unknown
[05:21:43] RECOVERY - NTP on mw31 is OK: NTP OK: Offset 0.0009467601776 secs
[06:02:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:03:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time
[06:08:51] PROBLEM - Puppet freshness on sq36 is CRITICAL: No successful Puppet run in the last 10 hours
[06:20:44] !log granted process/show/usage to watchdog on db masters
[07:05:24] (PS1) Springle: add percona processlist monitoring, and start migrating to PMP v1.0.4 [operations/puppet] - https://gerrit.wikimedia.org/r/82221
[07:11:22] (CR) TTO: "Thanks, Ori." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/78637 (owner: TTO)
[07:27:14] RECOVERY - check_job_queue on fenari is OK: JOBQUEUE OK - all job queues below 10,000
[07:28:54] PROBLEM - Puppet freshness on mw1126 is CRITICAL: No successful Puppet run in the last 10 hours
[07:30:24] PROBLEM - check_job_queue on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:03:00] RECOVERY - check_job_queue on hume is OK: JOBQUEUE OK - all job queues below 10,000
[08:06:10] PROBLEM - check_job_queue on hume is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:32:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:33:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time
[08:39:54] PROBLEM - Puppet freshness on analytics1011 is CRITICAL: No successful Puppet run in the last 10 hours
[08:39:54] PROBLEM - Puppet freshness on analytics1026 is CRITICAL: No successful Puppet run in the last 10 hours
[08:39:54] PROBLEM - Puppet freshness on analytics1027 is CRITICAL: No successful Puppet run in the last 10 hours
[09:12:23] PROBLEM - DPKG on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:22:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:23:13] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time
[09:28:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:29:13] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.134 second response time
[09:30:53] PROBLEM - Puppet freshness on virt0 is CRITICAL: No successful Puppet run in the last 10 hours
[09:50:17] PROBLEM - Puppet freshness on sodium is CRITICAL: No successful Puppet run in the last 10 hours
[10:25:14] PROBLEM - Puppet freshness on ssl1 is CRITICAL: No successful Puppet run in the last 10 hours
[10:31:14] PROBLEM - Puppet freshness on ssl1006 is CRITICAL: No successful Puppet run in the last 10 hours
[10:38:46] PROBLEM - Puppet freshness on ssl1008 is CRITICAL: No successful Puppet run in the last 10 hours
[10:41:46] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours
[10:43:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:44:14] (PS1) QChris: Specify push configurations for gerrit's replication in lists [operations/puppet] - https://gerrit.wikimedia.org/r/82231
[10:44:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.122 second response time
[10:46:06] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:46:56] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s)
[10:50:46] PROBLEM - Puppet freshness on ssl1001 is CRITICAL: No successful Puppet run in the last 10 hours
[10:51:46] PROBLEM - Puppet freshness on amssq47 is CRITICAL: No successful Puppet run in the last 10 hours
[10:52:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:53:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time
[10:54:46] PROBLEM - Puppet freshness on ssl1003 is CRITICAL: No successful Puppet run in the last 10 hours
[10:54:46] PROBLEM - Puppet freshness on ssl1005 is CRITICAL: No successful Puppet run in the last 10 hours
[10:54:46] PROBLEM - Puppet freshness on ssl4 is CRITICAL: No successful Puppet run in the last 10 hours
[10:57:46] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours
[10:57:46] PROBLEM - Puppet freshness on ssl1007 is CRITICAL: No successful Puppet run in the last 10 hours
[10:57:47] Any ideas on the timeout errors in https://bugzilla.wikimedia.org/show_bug.cgi?id=53400 ("MathRenderer::writeToDatabase Error: 1205 Lock wait timeout exceeded")?
[11:00:46] PROBLEM - Puppet freshness on ssl1002 is CRITICAL: No successful Puppet run in the last 10 hours
[11:00:46] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: No successful Puppet run in the last 10 hours
[11:02:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:03:46] PROBLEM - Puppet freshness on ssl1004 is CRITICAL: No successful Puppet run in the last 10 hours
[11:05:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 4.954 second response time
[11:05:46] PROBLEM - Puppet freshness on ssl3003 is CRITICAL: No successful Puppet run in the last 10 hours
[11:06:46] PROBLEM - Puppet freshness on ssl1009 is CRITICAL: No successful Puppet run in the last 10 hours
[11:07:46] PROBLEM - Puppet freshness on ssl3 is CRITICAL: No successful Puppet run in the last 10 hours
[11:07:46] PROBLEM - Puppet freshness on ssl3002 is CRITICAL: No successful Puppet run in the last 10 hours
[11:11:46] PROBLEM - Puppet freshness on ssl2 is CRITICAL: No successful Puppet run in the last 10 hours
[11:21:23] (PS1) Nemo bis: Add unstructured element translation to Commons [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/82232
[11:22:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:22:37] (PS2) Nemo bis: Add unstructured element translation to Commons [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/82232
[11:23:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time
[11:52:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:53:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time
[12:12:02] PROBLEM - Puppet freshness on pdf1 is CRITICAL: No successful Puppet run in the last 10 hours
[12:13:22] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:14:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time
[12:18:52] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:19:52] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s)
[12:22:22] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:23:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time
[12:23:31] (CR) Akosiaris: "After talking with Mark (the original author of position-of-the-moon, (sorry Ryan, I got misled by git blame), he expressed fears that inl" [operations/puppet] - https://gerrit.wikimedia.org/r/81898 (owner: Akosiaris)
[12:31:46] (PS3) Akosiaris: Replace position-of-the-moon [operations/puppet] - https://gerrit.wikimedia.org/r/81898
[12:33:45] i didn't express fears
[12:33:47] faidon did ;)
[12:34:13] all I expressed was... nostalgia! ;)
[12:36:42] mark: damn... I must start eating more fish oils... my memory fails me... :-)
[12:40:22] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:41:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time
[12:44:50] (CR) Akosiaris: [C: 2] Replace position-of-the-moon [operations/puppet] - https://gerrit.wikimedia.org/r/81898 (owner: Akosiaris)
[12:47:02] nah, wasn't fear either
[12:49:46] also, I chuckle at the word "author" for that script ;)
[12:51:25] perpetrator, perhaps
[12:51:47] lol. Next time I think 'll go for "aladeen"
[12:52:22] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:52:27] vague enough thanks to sasha baron cohen
[12:52:30] :P
[12:52:47] heh
[12:53:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time
[13:13:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:14:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time
[13:22:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:23:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time
[13:24:09] any reason stafford is at 2.7.11-1ubuntu2.3 puppet/puppetmaster version? I will upgrade it to 2.7.11-1ubuntu2.4... feel free to stop me.
[13:24:27] yes there's a reason
[13:24:28] stop
[13:24:31] faidon has all the details
[13:24:33] damn
[13:24:37] i was afraid of that
[13:24:46] CPU spikes after the upgrade
[13:24:54] and stafford gets overloaded
[13:26:06] they've added some patches for file path verification or something
[13:26:12] SAL.. 05:46 paravoid: downgraded puppetmaster to previous release, new minor update had significantly increased CPU requirements and DoSed stafford.
[13:26:22] yeah... so ... what do we do?
[13:26:23] yes, that would be it
[13:27:09] file a bug report against ubuntu would be one option
[13:27:16] but it needs to be tracked down properly
[13:28:07] including 1ubuntu2.3 in our repo would be another option (or in the meanwhile), the security update wasn't anything serious
[13:30:14] (PS1) Akosiaris: Add nfs[12] to backup scheme [operations/puppet] - https://gerrit.wikimedia.org/r/82242
[13:31:34] (PS1) Springle: return db1027 to the pool [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/82243
[13:32:44] (CR) Springle: [C: 2 V: 2] return db1027 to the pool [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/82243 (owner: Springle)
[13:35:08] !log springle synchronized wmf-config/db-eqiad.php 'reutrn db1027 to the pool'
[13:35:14] Logged the message, Master
[13:40:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:41:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 1.420 second response time
[13:41:36] !log aptitude hold puppet puppet-common puppetmaster puppetmaster-common puppetmaster-passenger on stafford till we solve the problem with CPU spiking present in puppet 2.7.11-1ubuntu2.4
[13:41:42] Logged the message, Master
[13:44:07] hm, I wonder if that will do it
[13:44:21] puppet has ensure => latest
[13:45:21] I think I broken jenkins
[13:45:28] PROBLEM - DPKG on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:45:33] paravoid: it didn't...
[13:45:35] grrrr
[13:45:44] fixing it ...
[13:46:17] :)
[13:46:24] thanks
[13:46:37] I kinda left it in a limbo situation
[13:46:45] (CR) jenkins-bot: [V: -1] Add nfs[12] to backup scheme [operations/puppet] - https://gerrit.wikimedia.org/r/82242 (owner: Akosiaris)
[13:47:03] akosiaris: sorry that jenkins failure is probably caused be me :(
[13:47:05] puppet upgraded itself and I was investigating for a while before I realized
[13:47:13] (PS2) Hashar: Add nfs[12] to backup scheme [operations/puppet] - https://gerrit.wikimedia.org/r/82242 (owner: Akosiaris)
[13:47:28] and when I finally found it, I really didn't have the time nor will to properly fix it, so I just hacked it around
[13:47:28] hashar: no worries
[13:47:31] (CR) Hashar: "Jenkins related failure, sorry." [operations/puppet] - https://gerrit.wikimedia.org/r/82242 (owner: Akosiaris)
[13:48:08] PROBLEM - Puppet freshness on analytics1003 is CRITICAL: No successful Puppet run in the last 10 hours
[13:48:08] PROBLEM - Puppet freshness on fenari is CRITICAL: No successful Puppet run in the last 10 hours
[13:48:43] paravoid: Yeah.. Puppet was failing. After the hold, it run and managed to upgrade itself... grrr I am looking into it... though i may have to undo the ensure=> latest part...
[13:52:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:54:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 1.709 second response time
[13:57:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:57:39] PROBLEM - RAID on db35 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:00:28] RECOVERY - RAID on db35 is OK: OK: 1 logical device(s) checked
[14:02:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time
[14:07:43] akosiaris: I'm wondering if we could have more magic into backup::host
[14:08:02] i.e. it there's a /var/opendj, it should back it up without us explicitly asking for it
[14:10:18] paravoid: auto-discovery of sets ? Sounds cool. What would the rules be ?
[14:11:01] (CR) Akosiaris: "recheck" [operations/puppet] - https://gerrit.wikimedia.org/r/82242 (owner: Akosiaris)
[14:11:56] akosiaris: its broken sorry :(
[14:12:20] ah.. i thought you fixed it... ok no worries.
[14:12:22] thanks
[14:22:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:23:04] I am screwed :/
[14:23:15] I got a process locked in write()
[14:23:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time
[14:23:24] and the receiver locked in poll() : (
[14:29:59] (CR) jenkins-bot: [V: -1] Add nfs[12] to backup scheme [operations/puppet] - https://gerrit.wikimedia.org/r/82242 (owner: Akosiaris)
[14:30:19] :(
[14:30:55] so yeah
[14:30:57] I broke jenkins
[14:31:12] !log broken Jenkins whenever a huge repository is being used (i.e. mediawiki/core or operations/puppet )
[14:31:18] Logged the message, Master
[14:35:07] !log Jenkins: restoring mediawiki-core jobs to previous state
[14:35:12] Logged the message, Master
[14:41:04] RECOVERY - MySQL Slave Delay on db35 is OK: OK replication delay 0 seconds
[14:41:54] RECOVERY - MySQL Replication Heartbeat on db35 is OK: OK replication delay -0 seconds
[14:43:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:44:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time
[14:52:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:53:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time
[14:54:16] akosiaris, around today?
[14:55:57] andrewbogott: yes
[14:56:18] Hello!
[14:56:36] I just rebased https://gerrit.wikimedia.org/r/#/c/77332/ and it had some conflicts with your recent nrpe work...
[14:57:04] Or, not so much conflicts, as, the rebase rendered your changes invisible :( So, I'm pretty sure I've reproduced everything you did, but I'll feel better if you have a look.
[14:57:53] ok i will. thanks for telling me
[14:58:59] Thank you! I'm happy for your thoughts about that patch in general… I feel OK about it but it's seriously invasive so I'm kind of scared to merge.
[14:59:34] andrewbogott: hey, btw, is ori's sysctl still on your todo?
[14:59:43] you were on it so I didn't want to step on your toes
[14:59:58] but if you've busy with other things I could have a look
[15:00:25] Oh… it should be, I forgot that I'd promised to merge it.
[15:00:32] :)
[15:00:45] I can rebase and merge it today.
[15:01:03] I'm not in any hurry, I just wanted to be sure we're not deadlocked or something :) [15:01:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:01:34] PROBLEM - SSH on searchidx1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:01:47] Well, I'm deadlocked by a combination of illness and too much vacation, but not otherwise blocked :) [15:02:10] also note 2527f1f96680781c3652d87f0a4ee7946a78c48f & 63766f418324d95860f8dc1add8095a2c7294de5 during your rebase [15:02:26] * andrewbogott looks [15:02:34] RECOVERY - SSH on searchidx1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [15:03:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [15:03:38] and as for the base module, have this the concept behind this in mind if you want https://gerrit.wikimedia.org/r/#/c/33066/ [15:03:47] or not, I'll leave it up to you :) [15:04:06] I was bad and never actually merged it, it has become stale since :( [15:04:41] (03PS3) 10Hashar: Add nfs[12] to backup scheme [operations/puppet] - 10https://gerrit.wikimedia.org/r/82242 (owner: 10Akosiaris) [15:07:33] akosiaris: your change finally got validated by jenkins, sorry :( https://gerrit.wikimedia.org/r/#/c/82242/ [15:07:43] yippi [15:07:49] hashar: thanks :-) [15:08:05] akosiaris: are you working on the new PHP version? [15:08:05] and I found out how to reproduce my bug :] [15:08:09] now I get to find the root cause oh damn [15:08:17] I had hashar's ticket flagged, should I unflag it? :) [15:08:48] paravoid: yes but i do not like the proposed patch. 
It requires ZEND_DEBUG to be defined [15:09:26] or else a HashTable structure does not have the "inconsistent" member and the compilation fails [15:11:20] (03CR) 10Akosiaris: [C: 032] Add nfs[12] to backup scheme [operations/puppet] - 10https://gerrit.wikimedia.org/r/82242 (owner: 10Akosiaris) [15:11:24] Bah, why would I get a conflict with a submodule commit in a rebase? It's not touched in the patch [15:12:41] Hm, maybe I'm confused [15:12:49] akosiaris: there might be a follow up to that patch though [15:13:47] (03PS10) 10Andrew Bogott: Clean up sysctl parameters. [operations/puppet] - 10https://gerrit.wikimedia.org/r/75087 (owner: 10Ori.livneh) [15:15:29] paravoid, want to give ^ one last look? [15:16:46] * andrewbogott really wishes that gerrit could show a proper diff between two patchsets [15:17:16] hashar: it must be a great followup to make me feel less uneasy... but I'd like to see it [15:17:47] andrewbogott: it can [15:18:10] oh wow, it's useless here though [15:18:14] oh, yeah? How so? When I change the 'reference version' I just get a diff of the whole damn tree... [15:18:22] Ah, yes, exactly :) [15:18:50] yeah that is a documented behaviour of gerrit [15:19:06] they say it's a git thing and won't be fixed [15:19:15] akosiaris: so yeah as you pointed out the patch I have linked was wrong. laruence@php.net committed another one http://git.php.net/?p=php-src.git;a=commit;h=8bd5e15ff7a57791956c4017ee8fb4a8ac0d8d2e [15:19:29] Yeah, I know… but wouldn't it be possible to show each patch as a diff against its parent and then diff /those/ two diffs? [15:19:36] akosiaris: that is at the bottom of the php bug https://bugs.php.net/bug.php?id=63055 . I should have been more careful :] [15:19:38] I think that would be useful but perhaps I'm deceived. [15:19:51] akosiaris: RT 5209 updated :-] [15:19:53] (03CR) 10Faidon Liambotis: [C: 031] Clean up sysctl parameters.
[operations/puppet] - 10https://gerrit.wikimedia.org/r/75087 (owner: 10Ori.livneh) [15:20:49] hashar: snif.... Ok I will look into it [15:22:16] thx :) [15:22:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:22:42] (03PS11) 10Andrew Bogott: Clean up sysctl parameters. [operations/puppet] - 10https://gerrit.wikimedia.org/r/75087 (owner: 10Ori.livneh) [15:22:45] ^ just a typo fix, no need to re-review [15:24:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [15:26:48] (03PS1) 10Akosiaris: Hold puppetmaster version to 2.7.11-1ubuntu2.3 [operations/puppet] - 10https://gerrit.wikimedia.org/r/82252 [15:52:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:53:13] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [15:53:46] paravoid, do you mind doing a puppet run on a ceph box after I merge this, just to make sure sysctl.conf remains sane? [15:55:26] sure [15:55:26] ok, merging now... [15:55:27] (03PS12) 10Andrew Bogott: Clean up sysctl parameters. [operations/puppet] - 10https://gerrit.wikimedia.org/r/75087 (owner: 10Ori.livneh) [15:55:27] heh, it already needed another rebase [15:57:32] ok, /now/ I am merging [15:57:49] (03CR) 10Andrew Bogott: [C: 032] Clean up sysctl parameters. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/75087 (owner: 10Ori.livneh) [15:59:36] (03PS1) 10Reedy: Enable Collection on Wikimania wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/82257 [15:59:51] (03CR) 10Reedy: [C: 032] Enable Collection on Wikimania wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/82257 (owner: 10Reedy) [16:00:24] !log reedy synchronized wmf-config/InitialiseSettings.php [16:00:30] Logged the message, Master [16:01:03] (03PS2) 10Akosiaris: Hold puppetmaster version to 2.7.11-1ubuntu2.3 [operations/puppet] - 10https://gerrit.wikimedia.org/r/82252 [16:09:47] PROBLEM - Puppet freshness on sq36 is CRITICAL: No successful Puppet run in the last 10 hours [16:10:34] (03Merged) 10jenkins-bot: Enable Collection on Wikimania wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/82257 (owner: 10Reedy) [16:13:30] !log Running cleanupTitles.php in screen on terbium [16:13:36] Logged the message, Master [16:13:53] root@ms-be1002:~# diff -uw sysctl-{before,after}-sorted [16:13:53] root@ms-be1002:~# [16:14:02] andrewbogott: ^ [16:14:18] excellent, thanks for testing. [16:19:39] !log restarted Zuul (paranoia) [16:19:45] Logged the message, Master [16:19:58] and I am off for now [16:20:15] will be back later this evening for some routine check [16:20:31] Today is a US holiday, so likely to be pretty dead around here. 
[16:21:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:22:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [16:22:50] (03CR) 10Akosiaris: [C: 032] Hold puppetmaster version to 2.7.11-1ubuntu2.3 [operations/puppet] - 10https://gerrit.wikimedia.org/r/82252 (owner: 10Akosiaris) [16:27:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:28:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time [16:35:48] (03PS1) 10Akosiaris: Fix wrong package specification of 38717f3 [operations/puppet] - 10https://gerrit.wikimedia.org/r/82261 [16:36:06] (03CR) 10jenkins-bot: [V: 04-1] Fix wrong package specification of 38717f3 [operations/puppet] - 10https://gerrit.wikimedia.org/r/82261 (owner: 10Akosiaris) [16:37:54] (03PS2) 10Akosiaris: Fix wrong package specification of 38717f3 [operations/puppet] - 10https://gerrit.wikimedia.org/r/82261 [16:39:19] (03CR) 10Akosiaris: [C: 032] Fix wrong package specification of 38717f3 [operations/puppet] - 10https://gerrit.wikimedia.org/r/82261 (owner: 10Akosiaris) [16:44:26] andrewbogott_afk: info: /File[/etc/sysctl.d/10-zeropage.conf]: Filebucketed /etc/sysctl.d/10-zeropage.conf to puppet with sum 8d7193abcc4dfedaf519dd03016a5e59 [16:44:26] info: /File[/etc/sysctl.d/10-ptrace.conf]: Filebucketed /etc/sysctl.d/10-ptrace.conf to puppet with sum 47f40494b2fc698e15549e0a4a79e81c [16:44:26] info: /File[/etc/sysctl.d/10-kernel-hardening.conf]: Filebucketed /etc/sysctl.d/10-kernel-hardening.conf to puppet with sum 5c1388f00011a287cdeba60208c674e1 [16:44:26] info: /File[/etc/sysctl.d/10-network-security.conf]: Filebucketed /etc/sysctl.d/10-network-security.conf to puppet with sum 4ac7258f5336e7eeaf448c05ab668d3c [16:44:26] info: /File[/etc/sysctl.d/10-console-messages.conf]: 
Filebucketed /etc/sysctl.d/10-console-messages.conf to puppet with sum 154f6f5c5810d10bb303fb6a8e907c6a [16:44:26] info: /File[/etc/sysctl.d/50-wikimedia-base.conf]: Filebucketed /etc/sysctl.d/50-wikimedia-base.conf to puppet with sum 5eaabd02e5a6e6318acd5f3e776582f9 [16:44:26] Just letting you know [16:44:48] since you just touched that. [16:46:02] that's normal [16:55:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:00:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 6.845 second response time [17:11:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:15:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 6.211 second response time [17:15:39] (03PS1) 10Faidon Liambotis: authdns: introduce an authdns::lint class [operations/puppet] - 10https://gerrit.wikimedia.org/r/82264 [17:16:36] (03PS2) 10Faidon Liambotis: authdns: introduce an authdns::lint class [operations/puppet] - 10https://gerrit.wikimedia.org/r/82264 [17:17:13] (03CR) 10Faidon Liambotis: "Antoine, could you please have a look at this, possibly hooking it up in the right CI role classes & configs?" 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/82264 (owner: 10Faidon Liambotis) [17:20:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:28:59] PROBLEM - Puppet freshness on mw1126 is CRITICAL: No successful Puppet run in the last 10 hours [17:29:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.133 second response time [17:52:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:53:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [18:13:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:14:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [18:22:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:23:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [18:40:15] PROBLEM - Puppet freshness on analytics1026 is CRITICAL: No successful Puppet run in the last 10 hours [18:40:15] PROBLEM - Puppet freshness on analytics1011 is CRITICAL: No successful Puppet run in the last 10 hours [18:40:15] PROBLEM - Puppet freshness on analytics1027 is CRITICAL: No successful Puppet run in the last 10 hours [18:52:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:53:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.132 second response time [19:09:15] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[19:10:05] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s) [19:15:18] ah I am happ [19:15:19] y [19:15:45] I found out the issue I had earlier today is finally related to the git version shipped in Precise :] [19:16:32] what was it? [19:19:47] I have no clue :-D [19:20:03] ori-l: some how git-http-backend was hanging out due to some deadlock whenever using git clone --reference [19:20:09] https://bugzilla.wikimedia.org/show_bug.cgi?id=53683 [19:21:09] so either I can attempt to bisect the issue [19:21:19] or I get the cluster upgraded to a more recent version of git :] [19:22:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:23:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.137 second response time [19:30:55] PROBLEM - Puppet freshness on virt0 is CRITICAL: No successful Puppet run in the last 10 hours [19:36:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:37:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [19:50:19] PROBLEM - Puppet freshness on sodium is CRITICAL: No successful Puppet run in the last 10 hours [19:52:30] PROBLEM - DPKG on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[19:52:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:54:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [20:22:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:23:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [20:26:08] PROBLEM - Puppet freshness on ssl1 is CRITICAL: No successful Puppet run in the last 10 hours [20:30:38] RECOVERY - check_job_queue on hume is OK: JOBQUEUE OK - all job queues below 10,000 [20:32:08] PROBLEM - Puppet freshness on ssl1006 is CRITICAL: No successful Puppet run in the last 10 hours [20:33:48] PROBLEM - check_job_queue on hume is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:34:08] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[20:34:58] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s) [20:38:59] PROBLEM - Puppet freshness on ssl1008 is CRITICAL: No successful Puppet run in the last 10 hours [20:41:59] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [20:50:59] PROBLEM - Puppet freshness on ssl1001 is CRITICAL: No successful Puppet run in the last 10 hours [20:51:59] PROBLEM - Puppet freshness on amssq47 is CRITICAL: No successful Puppet run in the last 10 hours [20:52:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:53:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.141 second response time [20:54:59] PROBLEM - Puppet freshness on ssl1003 is CRITICAL: No successful Puppet run in the last 10 hours [20:54:59] PROBLEM - Puppet freshness on ssl1005 is CRITICAL: No successful Puppet run in the last 10 hours [20:54:59] PROBLEM - Puppet freshness on ssl4 is CRITICAL: No successful Puppet run in the last 10 hours [20:57:59] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [20:57:59] PROBLEM - Puppet freshness on ssl1007 is CRITICAL: No successful Puppet run in the last 10 hours [21:00:59] PROBLEM - Puppet freshness on ssl1002 is CRITICAL: No successful Puppet run in the last 10 hours [21:00:59] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: No successful Puppet run in the last 10 hours [21:03:59] PROBLEM - Puppet freshness on ssl1004 is CRITICAL: No successful Puppet run in the last 10 hours [21:06:03] PROBLEM - Puppet freshness on ssl3003 is CRITICAL: No successful Puppet run in the last 10 hours [21:07:03] PROBLEM - Puppet freshness on ssl1009 is CRITICAL: No successful Puppet run in the last 10 hours [21:08:03] PROBLEM - Puppet freshness on ssl3 is CRITICAL: No successful Puppet run in the last 10 hours [21:08:03] PROBLEM - Puppet 
freshness on ssl3002 is CRITICAL: No successful Puppet run in the last 10 hours [21:12:03] PROBLEM - Puppet freshness on ssl2 is CRITICAL: No successful Puppet run in the last 10 hours [21:22:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:23:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 5.001 second response time [21:27:13] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:28:03] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s) [21:47:25] PROBLEM - DPKG on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:52:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:53:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [21:53:45] RECOVERY - check_job_queue on fenari is OK: JOBQUEUE OK - all job queues below 10,000 [21:56:55] PROBLEM - check_job_queue on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:57:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:58:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [22:04:15] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[22:06:15] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s) [22:08:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:10:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time [22:12:03] PROBLEM - Puppet freshness on pdf1 is CRITICAL: No successful Puppet run in the last 10 hours [22:22:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:23:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [22:40:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:41:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [22:52:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:53:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.131 second response time [23:22:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:23:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 3.486 second response time [23:48:51] PROBLEM - Puppet freshness on analytics1003 is CRITICAL: No successful Puppet run in the last 10 hours [23:48:51] PROBLEM - Puppet freshness on fenari is CRITICAL: No successful Puppet run in the last 10 hours [23:52:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:52:41] PROBLEM - DPKG on ms-be1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[23:53:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.135 second response time