[07:49:15] PROBLEM - Puppet freshness on mw1 is CRITICAL: Puppet has not run in the last 10 hours
[07:49:15] PROBLEM - Puppet freshness on mw12 is CRITICAL: Puppet has not run in the last 10 hours
[07:49:15] PROBLEM - Puppet freshness on mw11 is CRITICAL: Puppet has not run in the last 10 hours
[07:49:15] PROBLEM - Puppet freshness on mw10 is CRITICAL: Puppet has not run in the last 10 hours
[07:49:15] PROBLEM - Puppet freshness on mw13 is CRITICAL: Puppet has not run in the last 10 hours
[07:49:16] PROBLEM - Puppet freshness on mw14 is CRITICAL: Puppet has not run in the last 10 hours
[07:49:16] PROBLEM - Puppet freshness on mw16 is CRITICAL: Puppet has not run in the last 10 hours
[07:49:17] PROBLEM - Puppet freshness on mw2 is CRITICAL: Puppet has not run in the last 10 hours
[07:49:17] PROBLEM - Puppet freshness on mw3 is CRITICAL: Puppet has not run in the last 10 hours
[07:49:18] PROBLEM - Puppet freshness on mw4 is CRITICAL: Puppet has not run in the last 10 hours
[07:49:18] PROBLEM - Puppet freshness on mw5 is CRITICAL: Puppet has not run in the last 10 hours
[07:49:19] PROBLEM - Puppet freshness on mw6 is CRITICAL: Puppet has not run in the last 10 hours
[07:49:19] PROBLEM - Puppet freshness on mw15 is CRITICAL: Puppet has not run in the last 10 hours
[07:49:20] PROBLEM - Puppet freshness on mw7 is CRITICAL: Puppet has not run in the last 10 hours
[07:49:20] PROBLEM - Puppet freshness on mw9 is CRITICAL: Puppet has not run in the last 10 hours
[07:49:21] PROBLEM - Puppet freshness on mw8 is CRITICAL: Puppet has not run in the last 10 hours
[07:52:07] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100%
[07:53:09] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.54 ms
[07:55:47] pooor mw boxes
[07:57:12] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused
[08:07:47] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000
[08:08:59] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000
[08:16:56] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.023 second response time
[08:42:28] New patchset: Hashar; "zuul role for labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24934
[08:43:20] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24934
[08:45:53] PROBLEM - Puppet freshness on aluminium is CRITICAL: Puppet has not run in the last 10 hours
[08:45:59] New patchset: Hashar; "zuul role for labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24934
[08:46:48] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24934
[09:06:53] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100%
[09:08:23] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[09:12:26] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused
[09:36:53] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.025 second response time
[10:34:20] New patchset: Lydia Pintscher; "update favicon to match new Wikidata logo" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24987
[11:18:34] New patchset: Mark Bergsma; "Add asw2-d3-sdtpa to Torrus" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24991
[11:19:29] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24991
[11:19:40] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24991
[11:26:16] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours
[11:26:16] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[11:26:16] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours
[11:26:16] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[11:26:16] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours
[11:26:17] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours
[11:31:57] New patchset: Mark Bergsma; "Make cadmium standard" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24992
[11:32:45] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24992
[11:32:57] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24992
[11:50:21] New patchset: Lcarr; "adding new switches and firewalls to torrus" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24993
[11:51:13] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24993
[11:51:21] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24993
[12:10:20] PROBLEM - Host cadmium is DOWN: PING CRITICAL - Packet loss = 100%
[12:13:58] !log Powercycled carbon, has been down for 4d
[12:14:09] Logged the message, Master
[12:14:59] RECOVERY - Host carbon is UP: PING OK - Packet loss = 0%, RTA = 26.64 ms
[12:27:44] RECOVERY - Host cadmium is UP: PING OK - Packet loss = 0%, RTA = 26.48 ms
[12:34:29] PROBLEM - Host cadmium is DOWN: PING CRITICAL - Packet loss = 100%
[12:44:21] New patchset: Lcarr; "adding new routers to rancid db" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24994
[12:45:16] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24994
[12:45:16] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24994
[12:46:02] RECOVERY - Host cadmium is UP: PING OK - Packet loss = 0%, RTA = 26.56 ms
[12:50:32] PROBLEM - SSH on cadmium is CRITICAL: Connection refused
[13:02:10] RECOVERY - SSH on cadmium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[13:07:07] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100%
[13:08:28] PROBLEM - NTP on cadmium is CRITICAL: NTP CRITICAL: No response from NTP server
[13:08:46] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms
[13:12:49] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused
[13:17:37] RECOVERY - NTP on cadmium is OK: NTP OK: Offset -0.00147664547 secs
[13:26:28] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours
[13:29:28] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.038 second response time
[13:30:44] New patchset: Demon; "Fix two unknown attributes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24998
[13:31:35] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24998
[13:33:00] New patchset: Mark Bergsma; "Replace explicit paths by variables" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24999
[13:33:47] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24999
[13:35:09] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24999
[13:39:13] hmm
[14:07:21] New patchset: Mark Bergsma; "Move deployment scripts to its own dir, serve recursively" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25001
[14:08:06] New patchset: Mark Bergsma; "Remove password scripts, generate them on the fly" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25002
[14:08:59] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/25001
[14:08:59] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/25002
[14:09:41] New patchset: Ottomata; "Adding Howie Fung and Stefan Petrea on stat1." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25003
[14:10:32] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25003
[14:14:29] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours
[14:17:21] morning all!
[14:17:38] i have a couple of easy reviews waiting, anyone avail to help me out?
[14:17:50] https://gerrit.wikimedia.org/r/#/c/24815/
[14:17:50] https://gerrit.wikimedia.org/r/#/c/25003/
[14:23:15] New patchset: Mark Bergsma; "Move deployment scripts to its own dir, serve recursively" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25001
[14:24:08] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25001
[14:26:00] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25001
[14:28:50] New patchset: Mark Bergsma; "Fix dependencies" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25007
[14:29:48] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25007
[14:29:58] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25007
[14:34:43] New patchset: Mark Bergsma; "Revert "Fix dependencies"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25010
[14:35:44] New patchset: Mark Bergsma; "Revert "Move deployment scripts to its own dir, serve recursively"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25011
[14:35:56] !log ms-be8 shutting down to check jumpers
[14:36:06] Logged the message, Master
[14:36:38] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25010
[14:36:41] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25011
[14:38:37] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25011
[14:39:39] !log ms-be7 shutting down to check jumpers
[14:39:49] Logged the message, Master
[14:40:18] lesliecarr: can you check to see if the port (0/14) on asw2-d3-sdtpa is up please
[14:40:53] k
[14:41:01] good morning cmjohnson1
[14:41:04] how did yesterday go ?
[14:41:22] with dell?
[14:41:27] yep
[14:41:35] 0/14 appears up - 0/15 (mc16) is down though
[14:41:36] they realized that we're right and it's h/w related
[14:41:42] heh
[14:41:43] of course
[14:42:05] PROBLEM - Host ms-be7 is DOWN: PING CRITICAL - Packet loss = 100%
[14:42:08] odd...i may have the fibers reversed...can you enable
[14:46:44] RECOVERY - swift-object-auditor on ms-be7 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[14:46:44] RECOVERY - swift-account-server on ms-be7 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[14:46:53] RECOVERY - swift-account-reaper on ms-be7 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[14:46:53] RECOVERY - Host ms-be7 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[14:47:02] RECOVERY - swift-container-replicator on ms-be7 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[14:47:02] RECOVERY - swift-object-replicator on ms-be7 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[14:47:11] RECOVERY - swift-container-server on ms-be7 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[14:47:20] RECOVERY - swift-account-replicator on ms-be7 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[14:47:38] RECOVERY - swift-object-server on ms-be7 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[14:47:47] RECOVERY - swift-object-updater on ms-be7 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater
[14:47:47] RECOVERY - swift-container-auditor on ms-be7 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[14:47:47] RECOVERY - swift-account-auditor on ms-be7 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[14:48:14] RECOVERY - swift-container-updater on ms-be7 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[14:54:22] cmjohnson1: 14 is still up and 15 is still down
[14:54:26] both are enabled
[14:55:05] ok...thx
[14:58:58] heya mark, puppet Q for you, if you are around
[14:59:24] peter g wants to turn on 1:1 logging for the banner impression logging for 24 hours
[14:59:49] I need to edit the udp2log config file and restart udp2log
[14:59:54] since this is a temporary thing
[14:59:57] would it be better to:
[15:00:10] just stop puppet, and edit the file manually
[15:00:12] or
[15:00:32] commit to puppet, get it reviewed, and then commit again tomorrow to disable it and get it reviewed again tomorrow?
[15:00:47] i'd rather you did the latter
[15:01:07] okeydokey, i'm cool with that, can you help me push it through?
[15:01:17] yep
[15:01:21] just for you :)
[15:01:26] you are a dear
[15:04:16] flattery will get you everywhere
[15:04:45] what's the timeline ?
[15:04:47] New patchset: Ottomata; "Turning on 1:1 banner impression logging for 24 hours." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25019
[15:04:53] right now!
[15:04:54] :)
[15:04:58] hehe, yup!
[15:05:19] was that you secretly wanting me to approve the dependency ?
[15:05:32] AH
[15:05:33] dohp
[15:05:33] I'm checking on this bug: https://bugzilla.wikimedia.org/show_bug.cgi?id=40428 "many API requests failing" http requests timing out and "siincludeAllDenied:Cannot view all servers info unless $wgShowHostnames is true"
[15:05:37] i was wondering if it would do that
[15:05:39] grrrrr
[15:05:40] ok good to know
[15:05:41] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25019
[15:05:42] i'm looking anyways
[15:05:50] not to create a new branch while having my old topic branch currently checked out
[15:05:51] thanks!
[15:05:57] i do need it approved too
[15:06:11] it's an unclear bug report that mushes 3 separate problems together :/ but I think Ops people are way more qualified to look at it than I am
[15:07:29] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25003
[15:08:06] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25019
[15:08:13] ottomata: prepared for merging ?
[15:08:16] yup
[15:08:17] can do
[15:08:21] um, while we're at it:
[15:08:21] https://gerrit.wikimedia.org/r/#/c/24815/
[15:08:23] :D
[15:08:23] ?
[15:09:08] Damianz: can you take a look at https://bugzilla.wikimedia.org/show_bug.cgi?id=40428 and ask questions that get more useful info out of the bug reporter? I'm not sure what questions to ask
[15:09:14] oh yay, yes because you specified a fully qualified variable
[15:09:25] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24815
[15:09:29] hehe, yeah i realized I had hard coded the username there, shoulda just used the var
[15:09:33] yay, thank you so much!
[15:09:50] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , frwiki (60439)
[15:09:54] I could after I find coffee, there's like more than 2 paragraphs of text there and I just got back from the swimming pool...err car park.
[15:09:59] ottomata: merged on sockpuppet
[15:10:10] thanks Damianz
[15:10:11] oh, thanks!
[15:10:37] hm
[15:10:42] i don't need it since you merged
[15:10:48] sockpuppet is rejecting my ssh public key
[15:11:08] i can log into other machines though, hmmmmmm, weird
[15:11:11] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , frwiki (60446)
[15:11:12] oh well, don't need it right now
[15:12:08] also, why do you hate open source so much ?
[15:12:11] anti-communist
[15:15:37] ssmollett: He's probably going to think I'm an ass now :P
[15:19:23] hahaha
[15:20:37] (Leslie do not tell anyone, but I am a secret spy working for Chris Dodd)
[15:20:45] !!
[15:20:50] i knew it
[15:21:18] (He told me if I put things in parens only you could read it, which is why I feel safe telling you)
[15:21:28] (it's true)
[15:21:40] i'm out of here for now
[15:21:50] someone someday is going to google search for chris dodd and get these logs and be very confused
[15:22:04] hahaha
[15:22:11] ok, thanks for your help Lesles
[15:22:15] :)
[15:37:26] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[16:19:08] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000
[16:20:20] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000
[16:21:16] \o/
[16:21:35] seems worth a !log even
[17:00:05] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours
[17:00:05] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours
[17:25:57] notpeter: ping
[17:26:39] paravoid: ping
[17:26:50] paravoid: can you merge and push https://gerrit.wikimedia.org/r/25039
[17:26:58] preilly: sup?
[17:27:15] notpeter: can you merge and push https://gerrit.wikimedia.org/r/25039
[17:27:16] !log adding otto to ops group in LDAP
[17:27:26] Logged the message, Master
[17:28:23] kk
[17:28:30] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25039
[17:29:02] pushing out now
[17:38:57] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , zhwiki (50246)
[17:40:06] New patchset: Ottomata; "Test commit." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25040
[17:40:27] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , zhwiki (50713)
[17:41:12] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25040
[17:49:54] PROBLEM - Puppet freshness on mw10 is CRITICAL: Puppet has not run in the last 10 hours
[17:49:54] PROBLEM - Puppet freshness on mw11 is CRITICAL: Puppet has not run in the last 10 hours
[17:49:54] PROBLEM - Puppet freshness on mw12 is CRITICAL: Puppet has not run in the last 10 hours
[17:49:54] PROBLEM - Puppet freshness on mw13 is CRITICAL: Puppet has not run in the last 10 hours
[17:49:54] PROBLEM - Puppet freshness on mw14 is CRITICAL: Puppet has not run in the last 10 hours
[17:49:55] PROBLEM - Puppet freshness on mw15 is CRITICAL: Puppet has not run in the last 10 hours
[17:49:55] PROBLEM - Puppet freshness on mw16 is CRITICAL: Puppet has not run in the last 10 hours
[17:49:56] PROBLEM - Puppet freshness on mw1 is CRITICAL: Puppet has not run in the last 10 hours
[17:49:56] PROBLEM - Puppet freshness on mw3 is CRITICAL: Puppet has not run in the last 10 hours
[17:49:57] PROBLEM - Puppet freshness on mw6 is CRITICAL: Puppet has not run in the last 10 hours
[17:49:57] PROBLEM - Puppet freshness on mw4 is CRITICAL: Puppet has not run in the last 10 hours
[17:49:58] PROBLEM - Puppet freshness on mw8 is CRITICAL: Puppet has not run in the last 10 hours
[17:49:58] PROBLEM - Puppet freshness on mw5 is CRITICAL: Puppet has not run in the last 10 hours
[17:49:59] PROBLEM - Puppet freshness on mw7 is CRITICAL: Puppet has not run in the last 10 hours
[17:49:59] PROBLEM - Puppet freshness on mw2 is CRITICAL: Puppet has not run in the last 10 hours
[17:50:00] PROBLEM - Puppet freshness on mw9 is CRITICAL: Puppet has not run in the last 10 hours
[17:56:57] !log temp stopping puppet on brewster
[17:57:08] Logged the message, notpeter
[17:57:36] !log restarted udp2log on locke to clean up ~20 defunct procs
[17:57:46] Logged the message, Master
[18:00:16] PROBLEM - Varnish traffic logger on cp1022 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[18:00:24] PROBLEM - Varnish traffic logger on cp1042 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[18:00:24] PROBLEM - Varnish traffic logger on cp1028 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[18:00:33] PROBLEM - Varnish traffic logger on cp1024 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[18:00:33] PROBLEM - Varnish traffic logger on cp1044 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[18:01:09] PROBLEM - Varnish traffic logger on cp1026 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[18:01:18] PROBLEM - Varnish traffic logger on cp1043 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[18:01:18] PROBLEM - Varnish traffic logger on cp1021 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[18:01:18] PROBLEM - Varnish traffic logger on cp1027 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[18:01:27] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[18:01:36] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[18:03:33] RECOVERY - Varnish traffic logger on cp1042 is OK: PROCS OK: 3 processes with command name varnishncsa
[18:03:33] RECOVERY - Varnish traffic logger on cp1028 is OK: PROCS OK: 3 processes with command name varnishncsa
[18:05:49] RECOVERY - Varnish traffic logger on cp1026 is OK: PROCS OK: 3 processes with command name varnishncsa
[18:11:00] New patchset: Cmcmahon; "should enable AFTv5 100% on beta only" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/25048
[18:43:07] notpeter: mc1 and mc15 are good to go
[18:43:15] cmjohnson1: awesome!!!
[18:47:06] PROBLEM - Puppet freshness on aluminium is CRITICAL: Puppet has not run in the last 10 hours
[18:49:16] where can i see the contents of $mobile_langlist
[18:49:27] variable used in DNS templates that is
[18:53:03] docs say "/home/wikipedia/conf/langlist" which does not exist either, btw
[18:59:10] instead ./common/langlist, htdocs/wikimedia/langlist and others
[19:03:58] New review: Jeremyb; "I don't know much about the ico format and can't say if this is entirely correct but at least GNU fi..." [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/24987
[19:04:23] New patchset: Pyoungmeister; "adding correct mac for mc1" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25059
[19:05:18] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25059
[19:06:05] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25059
[19:08:25] notpeter: adding correct mac for mc1 sounds like a verse from a gangster rap song
[19:09:07] preilly: yeah, I actually did that checking from a tec-9
[19:09:13] you can run linux on anything these days
[19:09:18] *checkin
[19:09:37] preilly: do you know where i can find "mobile_langlist"?
[19:10:12] mutante: not off the top of my head
[19:10:18] mutante: are you looking at DNS?
[19:10:40] preilly: yeah, i see $mobile_langlist being used in DNS templates, and i would like the actual list of languages though
[19:11:19] mutante: isn't it something like LANGLIST=$POWERDNSDIR/langlist
[19:11:57] i can't "find" it within pdns-templates ...
[19:12:28] mutante: did you look in /etc/powerdns/langlist ?
[19:12:43] ah, ok, /h/w/common/langlist on fenari , but mobile_langlist
[19:13:13] mutante: they shouldn't be different
[19:13:51] mutante: at least not that I know of
[19:15:39] preilly: thanks, looking closer gen-bind.conf too
[19:21:45] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000
[19:22:57] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000
[20:01:28] RECOVERY - Puppet freshness on aluminium is OK: puppet ran at Tue Sep 25 20:01:19 UTC 2012
[20:12:52] RECOVERY - Host ms-be8 is UP: PING OK - Packet loss = 0%, RTA = 1.66 ms
[20:41:22] PROBLEM - Host virt5 is DOWN: PING CRITICAL - Packet loss = 100%
[20:43:28] Change merged: Dzahn; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24987
[20:45:34] PROBLEM - Host ms-be8 is DOWN: PING CRITICAL - Packet loss = 100%
[20:46:06] when pulling from mediawiki-config:
[20:46:10] error: git checkout-index: unable to create file docroot/bits/favicon/black-globe.ico (Permission denied)
[20:47:04] RECOVERY - Host virt5 is UP: PING OK - Packet loss = 0%, RTA = 1.25 ms
[20:48:50] New patchset: Ryan Lane; "Set autoinstall settings for virt5-11" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25171
[20:49:42] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25171
[20:49:46] !log sync-common-file new favicon.ico for wikidata.org
[20:49:51] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25171
[20:50:08] Logged the message, Master
[20:51:15] New review: Dzahn; "pushed to cluster. done." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24987
[20:51:16] RECOVERY - Host ms-be8 is UP: PING OK - Packet loss = 0%, RTA = 1.32 ms
[20:51:35] PROBLEM - SSH on virt5 is CRITICAL: Connection refused
[20:54:43] PROBLEM - Host virt5 is DOWN: PING CRITICAL - Packet loss = 100%
[20:55:46] PROBLEM - SSH on ms-be8 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:57:16] RECOVERY - SSH on ms-be8 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[20:58:08] New patchset: Hashar; "role::gerrit:labs::jenkins" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25173
[20:59:02] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25173
[21:00:25] RECOVERY - Host virt5 is UP: PING OK - Packet loss = 0%, RTA = 0.57 ms
[21:02:33] !log stopping puppet on brewster
[21:02:43] Logged the message, notpeter
[21:16:37] RECOVERY - swift-object-updater on ms-be8 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater
[21:16:37] RECOVERY - swift-object-server on ms-be8 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[21:16:37] RECOVERY - swift-container-server on ms-be8 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[21:16:46] RECOVERY - swift-container-replicator on ms-be8 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[21:16:55] RECOVERY - swift-container-updater on ms-be8 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[21:18:18] New patchset: Hashar; "role::gerrit:labs::jenkins" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25173
[21:19:09] New review: Hashar; "rebased on I8c27b367" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/25173
[21:19:10] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25173
[21:19:37] RECOVERY - swift-account-reaper on ms-be8 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[21:20:04] RECOVERY - swift-account-server on ms-be8 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[21:20:06] New patchset: Hashar; "role::gerrit:labs::jenkins" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25173
[21:20:13] RECOVERY - swift-container-auditor on ms-be8 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[21:20:31] RECOVERY - swift-account-replicator on ms-be8 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[21:20:40] RECOVERY - swift-object-replicator on ms-be8 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[21:20:40] RECOVERY - swift-object-auditor on ms-be8 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[21:20:50] RECOVERY - swift-account-auditor on ms-be8 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[21:20:57] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25173
[21:22:46] PROBLEM - NTP on virt5 is CRITICAL: NTP CRITICAL: No response from NTP server
[21:23:04] RECOVERY - NTP on ms-be8 is OK: NTP OK: Offset -0.005920410156 secs
[21:26:58] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[21:26:58] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours
[21:26:58] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours
[21:26:58] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours
[21:26:58] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[21:26:59] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours
[21:30:35] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , frwiki (14740)
[21:34:38] faidon: plz poke at ms-be8
[21:34:48] sorry paravoid ^^
[21:36:26] !log replaced iptables ruleset for loudon
[21:36:36] Logged the message, Master
[21:46:00] PROBLEM - Host cp1029 is DOWN: PING CRITICAL - Packet loss = 100%
[21:46:45] RECOVERY - Host cp1029 is UP: PING OK - Packet loss = 0%, RTA = 26.52 ms
[21:50:02] !log apt-update and reboot loudon
[21:50:12] Logged the message, Master
[21:50:21] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , frwiki (31678)
[21:50:29] !log dropped apparently-unused IP alias on loudon, deprecated from old haproxy install
[21:50:39] Logged the message, Master
[21:57:51] New patchset: Hashar; "zuul role for labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24934
[21:58:44] New patchset: Hashar; "import zuul module from OpenStack" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24878
[21:59:38] New patchset: Hashar; "zuul role for labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24934
[22:00:35] New patchset: Hashar; "import zuul module from OpenStack" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24878
[22:01:35] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24934
[22:01:36] New patchset: Dzahn; "add missing ro_RO.UTF-8 locale to fix ro.planet runs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25186
[22:02:14] grmblblblllll
[22:02:20] I am never going to get anything done
[22:02:30] we have two classes providing 'apache' packages
[22:02:35] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24878
[22:02:35] New review: Dzahn; "for bug 39573" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/25186
[22:02:35] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25186
[22:02:51] class webserver::apache2 and webserver::apache and that makes puppet unhappy when we use both
[22:06:27] hashar: ugh.. and i have still this on singer "Duplicate definition: Class[Webserver::Php5] is already defined in file /var/lib/git/operations/puppet/manifests/misc/planet.pp at line 10; cannot redefine at /var/lib/git/operations/puppet/manifests/misc/secure.pp:6"
[22:08:16] New patchset: Hashar; "avoid duplicate apache2 definitions" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25188
[22:08:29] mutante: my solution is to use class generic::packages::apache2 see 25188
[22:09:12] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25188
[22:09:47] err: Could not retrieve catalog from remote server: Error 400 on SERVER: Duplicate definition: Apache_module[proxy] is already defined in file /etc/puppet/manifests/misc/contint.pp at line 238; cannot redefine at /etc/puppet/manifests/gerrit.pp:274 on node i-00000363.pmtpa.wmflabs
[22:09:50] yeahh mooaaar
[22:10:21] I don't even remember what I started to work on this morning :-]
[22:10:22] hashar: the whole webserver stuff has been changed at one point ...
[22:10:32] hashar: every day:)
[22:11:18] I give up on that one
[22:19:23] New patchset: Pyoungmeister; "adding macs for some of the eqiad mc servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25190
[22:20:14] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25190
[22:20:43] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25190
[22:28:07] !log generating locales on singer, add ro UTF-8 locale, run ro.planet
[22:28:18] Logged the message, Master
[22:44:15] !log stopping puppet on cp1044, experimenting with varnish lru / nuke params to solve allocation failures
[22:44:25] Logged the message, Master
[22:50:54] PROBLEM - Host ms-be8 is DOWN: PING CRITICAL - Packet loss = 100%
[22:51:57] RECOVERY - Host ms-be8 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[22:52:49] that would be me
[23:03:12] PROBLEM - Host virt5 is DOWN: PING CRITICAL - Packet loss = 100%
[23:08:54] RECOVERY - Host virt5 is UP: PING OK - Packet loss = 0%, RTA = 0.79 ms
[23:11:13] Ryan_Lane: I find this very effective for what you're doing:
[23:11:14] dd if=/dev/zero of=/dev/sda bs=512 count=1
[23:13:06] PROBLEM - Puppet freshness on cadmium is CRITICAL: Puppet has not run in the last 10 hours
[23:16:07] !log depooling cp1044 from lvs for testing
[23:16:17] Logged the message, Master
[23:16:33] PROBLEM - Host virt5 is DOWN:
PING CRITICAL - Packet loss = 100% [23:20:20] !log repooled cp1044 [23:20:30] Logged the message, Master [23:21:37] New patchset: Asher; "nuke_limt=300 stopped "out of storage" 503's in prod" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25193 [23:22:15] RECOVERY - Host virt5 is UP: PING OK - Packet loss = 0%, RTA = 0.39 ms [23:22:30] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25193 [23:25:49] !log ran "varnishadm param.set nuke_limit 300" on all mobile varnish front and back instances to match new default config [23:25:59] Logged the message, Master [23:27:12] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [23:37:41] New patchset: Dereckson; "(bug 36345) Restrict local upload for sysops on gu.wikisource" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/25196 [23:38:14] New patchset: Ryan Lane; "Fix virt numbering" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25197 [23:39:08] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25197 [23:41:38] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25197 [23:48:04] New patchset: Dzahn; "unicode author names in en planet config (bug 39724)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25199 [23:49:04] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25199