[01:09:06] (03CR) 10Aude: "wording suggestion" (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/97190 (owner: 10Faidon Liambotis)
[01:40:11] PROBLEM - Varnish traffic logger on cp3012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:40:21] PROBLEM - Varnish HTTP mobile-backend on cp3012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:40:21] PROBLEM - Varnish HTCP daemon on cp3012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:41:11] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000
[02:00:53] aude: hrmmm, or "on". but from works too
[02:11:21] !log LocalisationUpdate completed (1.23wmf4) at Sun Nov 24 02:11:21 UTC 2013
[02:11:37] Logged the message, Master
[02:20:50] !log LocalisationUpdate completed (1.23wmf5) at Sun Nov 24 02:20:50 UTC 2013
[02:20:56] PROBLEM - Puppet freshness on rhodium is CRITICAL: No successful Puppet run for 1d 0h 2m 54s
[02:21:06] Logged the message, Master
[02:50:45] !log LocalisationUpdate ResourceLoader cache refresh completed at Sun Nov 24 02:50:44 UTC 2013
[02:51:00] Logged the message, Master
[03:39:17] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000
[03:43:56] PROBLEM - Disk space on praseodymium is CRITICAL: DISK CRITICAL - free space: /mnt/journal 4852 MB (3% inode=99%):
[04:00:55] (03CR) 10Jeremyb: "A few ways to make the generated conf more readable, please:" (032 comments) [operations/apache-config] - 10https://gerrit.wikimedia.org/r/96438 (owner: 10Tim Starling)
[04:10:59] PROBLEM - Puppet freshness on cp3012 is CRITICAL: No successful Puppet run for 1d 18h 52m 35s
[05:21:18] PROBLEM - Puppet freshness on rhodium is CRITICAL: No successful Puppet run for 1d 3h 3m 16s
[05:24:58] does anyone use the LXC plugin in vagrant?
[05:25:15] never tried it, no
[05:25:21] ori-l: hey Ori
[05:25:29] LXC is much faster than vbox
[05:25:49] but I'm not sure if vagrant has abstracted all the stuff from it yet
[05:26:00] it has different vagrant properties than the ones you would expect
[05:26:35] you can install it and give it a spin with vagrant plugin install vagrant-lxc
[05:26:39] the source is here https://github.com/fgrehm/vagrant-lxc
[05:26:57] ori-l: do you own an SSD?
[05:27:18] yep
[05:27:31] looks like I'm the only one who doesn't have one these days
[05:27:51] * average is looking to speed up his dev environment
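For reference, trying the plugin out looks roughly like the sketch below. The install command and repository come from the discussion above; "vagrant up --provider=lxc" is the plugin's documented way to select LXC over VirtualBox, assuming an existing Vagrant environment:

    # Install the LXC provider plugin (command from the discussion above)
    vagrant plugin install vagrant-lxc
    # Boot the same Vagrant environment on LXC instead of VirtualBox
    vagrant up --provider=lxc
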
[05:31:52] !log cp3012 wio through the roof; process table filled with '/bin/ps axwwo stat uid pid ppid vsz rss pcpu etime comm args', uid icinga
[05:32:09] Logged the message, Master
[05:37:15] 'ps aux' hangs
[05:43:13] !log cp3012: any attempt to access /proc/17722 immediately hangs ssh
[05:43:25] Logged the message, Master
[05:46:48] !log cp3012: kern.log, every 2s: XFS: possible memory allocation deadlock in kmem_alloc
[05:47:03] Logged the message, Master
[05:59:05] PROBLEM - MySQL Slave Delay on db49 is CRITICAL: CRIT replication delay 306 seconds
[05:59:05] PROBLEM - MySQL Replication Heartbeat on db49 is CRITICAL: CRIT replication delay 302 seconds
[06:01:05] PROBLEM - MySQL Slave Delay on db49 is CRITICAL: CRIT replication delay 337 seconds
[06:02:11] !log (cp3012) CPU wait, not wio.
[06:02:27] Logged the message, Master
[06:06:43] PROBLEM - Puppet freshness on db72 is CRITICAL: No successful Puppet run for 1d 20h 48m 22s
[06:08:43] PROBLEM - Puppet freshness on db72 is CRITICAL: No successful Puppet run for 0d 0h 2m 0s
[06:10:43] PROBLEM - Puppet freshness on db72 is CRITICAL: No successful Puppet run for 0d 0h 4m 0s
[06:12:43] PROBLEM - Puppet freshness on db72 is CRITICAL: No successful Puppet run for 0d 0h 6m 0s
[06:12:53] wtf?
[06:14:43] PROBLEM - Puppet freshness on db72 is CRITICAL: No successful Puppet run for 0d 0h 8m 0s
[06:16:43] PROBLEM - Puppet freshness on db72 is CRITICAL: No successful Puppet run for 0d 0h 10m 0s
[06:18:43] PROBLEM - Puppet freshness on db72 is CRITICAL: No successful Puppet run for 0d 0h 12m 0s
[06:20:43] PROBLEM - Puppet freshness on db72 is CRITICAL: No successful Puppet run for 0d 0h 14m 0s
[06:22:43] PROBLEM - Puppet freshness on db72 is CRITICAL: No successful Puppet run for 0d 0h 16m 0s
[06:24:43] PROBLEM - Puppet freshness on db72 is CRITICAL: No successful Puppet run for 0d 0h 18m 0s
[06:25:14] hey
[06:25:24] hey
[06:25:33] i emailed details to ops@
[06:25:39] yes, I just saw the mail
[06:25:51] cool, wasn't sure if it was the !logs or the email you saw
[06:26:19] also 27 mins ago: PROBLEM - MySQL Slave Delay on db49 is CRITICAL: CRIT replication delay 306 seconds
[06:26:30] it's tampa, don't worry about it
[06:26:43] PROBLEM - Puppet freshness on db72 is CRITICAL: No successful Puppet run for 0d 0h 20m 0s
[06:26:43] ah
[06:27:01] db72 is a Wikimedia Shard s4 Core Database server (dbcore).
[06:27:01] The last Puppet run was at Sun Nov 24 05:59:02 UTC 2013 (25 minutes ago).
[06:27:17] successful, too.
[06:27:34] !log force-rebooting cp3012, kernel deadlocked
[06:27:49] Logged the message, Master
[06:27:54] we've seen this issue a few times :(
[06:28:08] I got kernel backtraces, just in case
[06:28:18] but we can't debug further, that's one of *two* mobile caches
[06:28:28] need to get it back up asap
[06:28:43] PROBLEM - Puppet freshness on db72 is CRITICAL: No successful Puppet run for 0d 0h 22m 0s
[06:29:13] RECOVERY - Puppet freshness on db72 is OK: puppet ran at Sun Nov 24 06:29:05 UTC 2013
[06:29:13] PROBLEM - Host cp3012 is DOWN: PING CRITICAL - Packet loss = 100%
[06:30:33] RECOVERY - Puppet freshness on cp3012 is OK: puppet ran at Sun Nov 24 06:30:31 UTC 2013
[06:30:43] PROBLEM - Puppet freshness on db72 is CRITICAL: No successful Puppet run for 0d 0h 1m 28s
[06:30:43] RECOVERY - Host cp3012 is UP: PING OK - Packet loss = 0%, RTA = 95.33 ms
[06:30:43] RECOVERY - Varnish traffic logger on cp3012 is OK: PROCS OK: 2 processes with command name varnishncsa
[06:31:33] RECOVERY - Varnish HTCP daemon on cp3012 is OK: PROCS OK: 1 process with UID = 110 (vhtcpd), args vhtcpd
[06:32:43] PROBLEM - Puppet freshness on db72 is CRITICAL: No successful Puppet run for 0d 0h 2m 0s
[06:34:43] PROBLEM - Puppet freshness on db72 is CRITICAL: No successful Puppet run for 0d 0h 4m 0s
[06:36:21] ori-l: replied to your email
[06:36:25] ori-l: thanks again :)
[06:36:42] thanks for fixing it
[06:39:05] we have to figure out our error codes and clean them up
[06:39:17] there's a 500 spike at about that time
[06:39:19] PROBLEM - HTTP on virt0 is CRITICAL: Connection refused
[06:39:22] but not a 5xx
[06:39:35] wtf?
[06:39:59] wikitech down
[06:40:24] that'll teach you to stay off irc
[06:40:38] j/k, obviously
[06:41:01] wtf
[06:41:31] tampa link?
[06:41:41] no, the machine is fine, I'm logged in
[06:44:19] RECOVERY - HTTP on virt0 is OK: HTTP OK: HTTP/1.1 302 Found - 457 bytes in 0.071 second response time
[06:44:44] !log wikitech broken with commit 716887 (Cleanup puppetmaster's apache Listen ports); worked around manually with /etc/apache2/conf.d/ports-wikitech.conf
[06:45:00] Logged the message, Master
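The !log entry above names the workaround file but not its contents. A hypothetical reconstruction, assuming the puppetmaster ports cleanup dropped the Listen directives wikitech's vhost depended on; the ports themselves are guesses, not taken from the log:

    # /etc/apache2/conf.d/ports-wikitech.conf (hypothetical contents)
    # Restore the listeners removed by the puppetmaster ports cleanup
    Listen 80
    Listen 443
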
[06:47:32] ori-l: why did you show me db72's motd?
[06:47:53] because there are puppet freshness alerts with insane timestamps
[06:48:13] PROBLEM - Puppet freshness on db72 is CRITICAL: No successful Puppet run for 1d 20h 48m 22s
[06:48:16] immediately after:
[06:48:23] PROBLEM - Puppet freshness on db72 is CRITICAL: No successful Puppet run for 0d 0h 2m 0s
[06:49:00] did you just paste a puppet freshness check or something?
[06:49:25] 08:48 Ignore List:
[06:49:25] 08:48 1 #wikimedia-operations: ALL -regexp -pattern .*Puppet freshness.*
[06:49:28] :P
[06:50:05] heh
[06:50:15] sorry :)
[06:52:29] * average brewed some tea
[06:52:37] hey paravoid
[06:53:25] * average finally got around to making a small hadoop cluster @home so he can play with it & vagrant & puppet
[06:59:09] RECOVERY - Puppet freshness on db72 is OK: puppet ran at Sun Nov 24 06:59:08 UTC 2013
[07:06:27] RECOVERY - Disk space on praseodymium is OK: DISK OK
[07:13:15] hi! wikimedia-labs login node is unusable, could you please kill off some procs from it -- i think a user forgot to submit them to worker nodes properly
[07:14:29] which node?
[07:14:46] login node, tools-login.wmflabs.org
[07:21:26] (fixed, see #wikimedia-labs)
[07:21:30] okay, signing off now
[07:24:53] hey, can I ask a question? if I have a multi-vm Vagrantfile, how can I make a manifest for a specific VM? right now I have a manifest running on all the VMs configured in the Vagrantfile
[07:28:07] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , enwiki (272633), Total (333443)
[07:28:36] (03PS1) 10Ori.livneh: webperf/asset-check: fix var reference [operations/puppet] - 10https://gerrit.wikimedia.org/r/97362
[07:31:59] average: You need to use node definitions. See http://docs.puppetlabs.com/puppet/2.7/reference/lang_node_definitions.html
[07:32:55] bd808: reading
[07:33:22] Set a hostname for each vm using config.vm.hostname and then make a matching node definition to provision each one
[07:38:51] bd808: sounds like what I need, thanks!
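The Vagrantfile half of bd808's suggestion might look like the sketch below. The box, VM names, hostnames, and manifest path are illustrative (the hadoop naming just follows average's stated use case), not taken from the log:

    # Vagrantfile (sketch): give each VM its own hostname so that puppet
    # node definitions can target them individually.
    Vagrant.configure("2") do |config|
      config.vm.box = "precise64"    # placeholder box name

      config.vm.define "master" do |m|
        m.vm.hostname = "hadoop-master.local"
      end
      config.vm.define "worker" do |w|
        w.vm.hostname = "hadoop-worker.local"
      end

      # One puppet provisioner for all VMs; site.pp routes by hostname.
      config.vm.provision :puppet do |puppet|
        puppet.manifests_path = "manifests"
        puppet.manifest_file  = "site.pp"
      end
    end
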
[08:22:00] PROBLEM - Puppet freshness on rhodium is CRITICAL: No successful Puppet run for 1d 6h 3m 58s
[08:55:55] paravoid, jeremyb or anyone, is a bugzilla or RT ticket filed for the !log relay to @wikimediatech being broken?
[08:58:51] (03CR) 10Ori.livneh: [C: 032] webperf/asset-check: fix var reference [operations/puppet] - 10https://gerrit.wikimedia.org/r/97362 (owner: 10Ori.livneh)
[09:01:31] (03CR) 10TTO: [C: 04-1] "per Dereckson, need to wait" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97331 (owner: 10Dereckson)
[09:21:20] !log global job queue length graph is still dead https://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Miscellaneous%20pmtpa&h=hume.wikimedia.org&v=823574&m=Global_JobQueue_length&r=hour&z=default&jr=&js=&st=1365625056&z=large
[09:21:36] Logged the message, Master
[09:52:34] Nemo_bis: i don't think that's !log material
[09:52:47] what then
[09:53:03] well, i just fixed it
[09:53:06] I can't know if there's a ticket open already
[09:53:12] rendering the question moot
[09:53:13] :)
[09:53:27] better so then
[09:53:35] but RT or BZ probably
[09:54:22] simple things are usually either ignored for years or fixed immediately by chance :)
[09:55:51] !log hume: ln -s /usr/local/apache/common-local /a/common (previously clobbered by Iaf19f0e8c)
[09:56:08] Logged the message, Master
[09:56:48] (03PS1) 10Odder: Add an alias to NS_PROJECT on Old English Wiktionary [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97376
[09:57:06] RECOVERY - MySQL Slave Delay on db49 is OK: OK replication delay 0 seconds
[09:57:25] RECOVERY - MySQL Replication Heartbeat on db49 is OK: OK replication delay 0 seconds
[09:58:28] !log (...thereby fixing the global job queue length graph)
[09:58:44] Logged the message, Master
[09:58:48] https://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Miscellaneous%20pmtpa&h=hume.wikimedia.org&v=823574&m=Global_JobQueue_length&r=hour&z=default&jr=&js=&st=1365625056&z=large
[09:59:50] (03CR) 10Odder: "Can someone merge this today, please? The event in question will start tomorrow (November 25) at 09:00 UTC." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96718 (owner: 10Odder)
[10:02:05] (03PS2) 10Ori.livneh: (bug 57345) Raise account creation limit for cawiki GLAM event [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96718 (owner: 10Odder)
[10:03:05] (03CR) 10Ori.livneh: [C: 032] "(Fixed inline tabs in PS2; tabs should only be used for leading whitespace. Confusing, but that's the convention.)" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96718 (owner: 10Odder)
[10:03:45] !log ori updated /a/common to {{Gerrit|Icf844849e}}: (bug 57345) Raise account creation limit for cawiki GLAM event
[10:03:55] thanks ori-l
[10:04:00] Logged the message, Master
[10:04:51] !log ori synchronized wmf-config/throttle.php 'Icf844849e: (bug 57345) Raise account creation limit for cawiki GLAM event'
[10:05:06] Logged the message, Master
[10:05:12] twkozlowski: np
[10:06:00] Nemo_bis: getJobQueueLengths.php is kind of expensive, and we currently run it twice, once for hume, the other for the spoofed www.wikimedia.org
[10:06:18] can we remove one?
[10:08:43] ori-l: does that happen in the same function?
[10:09:36] what function?
[10:09:51] bd808|BUFFER: I used nodes in puppet
[10:10:07] bd808|BUFFER: so I put specifics in classes and then I include classes in nodes, right?
[10:12:15] I get some problems related to that
[10:12:39] in particular now I have a node that includes a class, and when I provision, the class's actions (is that how you call them?) don't run
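And the puppet half, continuing the same hypothetical names as the Vagrantfile sketch above. A common cause of the symptom average describes -- classes included but their resources (puppet's term for those "actions") never applied -- is a node definition whose name doesn't match the hostname the VM reports, so puppet silently falls through to the default node:

    # manifests/site.pp (sketch): node names must match the hostnames
    # set via config.vm.hostname above, or the classes never run.
    class hadoop::worker {
      package { 'openjdk-7-jre-headless':    # illustrative resource
        ensure => installed,
      }
    }

    node 'hadoop-worker.local' {
      include hadoop::worker
    }
    node default { }    # unmatched hosts land here and get nothing
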
[10:16:17] ori-l: * class misc::monitoring::jobqueue ; and yes it does, I didn't remember: I have no idea if it can be run just once, I just copied the approach from nagios::ganglia::monitor::enwiki
[10:17:32] well, you could stash the value in a shell var and then call gmetric twice
[10:17:42] but does it really need to be logged under two hostnames?
[10:18:31] it should be in gdash, probably
[10:18:35] if it isn't already
[10:18:43] since it can be computed more cheaply via graphite
[10:18:57] as the difference of queue insert and queue pop event counts
[10:19:14] sigh. can you tell i'm procrastinating? :/
[11:22:46] PROBLEM - Puppet freshness on rhodium is CRITICAL: No successful Puppet run for 1d 9h 4m 44s
[12:03:34] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000
[12:12:27] finally got this puppet thing to work
[14:23:32] PROBLEM - Puppet freshness on rhodium is CRITICAL: No successful Puppet run for 1d 12h 5m 30s
[17:24:01] PROBLEM - Puppet freshness on rhodium is CRITICAL: No successful Puppet run for 1d 15h 5m 59s
[20:24:41] PROBLEM - Puppet freshness on rhodium is CRITICAL: No successful Puppet run for 1d 18h 6m 39s
[20:33:40] (03CR) 10QChris: "> We could of course fix it in place, but the fact that it's so unreadable" [operations/puppet] - 10https://gerrit.wikimedia.org/r/97021 (owner: 10Faidon Liambotis)
[23:24:57] PROBLEM - Puppet freshness on rhodium is CRITICAL: No successful Puppet run for 1d 21h 6m 55s
[23:48:26] (03CR) 10Tim Starling: "> I'm guessing you made some scripts of your own to do the conversion?" (035 comments) [operations/apache-config] - 10https://gerrit.wikimedia.org/r/96438 (owner: 10Tim Starling)
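To close the loop on the job queue thread from 10:17-10:18 above, a sketch of both of ori-l's suggestions. The script path and spoofed IP are placeholders, and it is only an assumption that getJobQueueLengths.php prints a single total:

    # Stash the expensive value once, report it under both hostnames.
    len=$(php /a/common/maintenance/getJobQueueLengths.php)
    gmetric -n Global_JobQueue_length -t uint32 -v "$len"
    gmetric -n Global_JobQueue_length -t uint32 -v "$len" \
        -S "208.80.154.1:www.wikimedia.org"    # spoofed host; IP is a placeholder

The cheaper graphite computation would be a target along these lines; the metric paths are guesses, not the real Wikimedia ones:

    diffSeries(integral(stats.job-insert.count), integral(stats.job-pop.count))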