[00:04:13] Damianz: i just changed the link on wikitech. the one without /ng/ looks like the newer one, since it uses the host "professor" [00:04:30] as "profilehost [00:04:41] Cool [00:12:18] Are the files behind https://noc.wikimedia.org/conf/ in Git? [00:19:54] Probably in https://gerrit.wikimedia.org/r/gitweb?p=operations/mediawiki-config.git somewhere for anything not auto-generated. [00:33:55] RECOVERY - NTP on cp1021 is OK: NTP OK: Offset -0.04841589928 secs [00:41:53] Damianz, I mean the actual web page there, not the files it mentions [00:41:56] this: https://noc.wikimedia.org/conf/index.html [00:46:39] Oh [00:46:57] No idea then, I mis-understood your phrasing of the question, apologies [01:04:22] PROBLEM - Puppet freshness on ms-be8 is CRITICAL: Puppet has not run in the last 10 hours [01:27:35] PROBLEM - Puppet freshness on ms-be1 is CRITICAL: Puppet has not run in the last 10 hours [01:38:32] PROBLEM - Puppet freshness on ms-be2 is CRITICAL: Puppet has not run in the last 10 hours [01:41:32] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 234 seconds [01:42:44] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 261 seconds [01:48:26] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 604s [01:50:32] PROBLEM - Puppet freshness on srv193 is CRITICAL: Puppet has not run in the last 10 hours [01:54:35] PROBLEM - Puppet freshness on srv194 is CRITICAL: Puppet has not run in the last 10 hours [01:58:11] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 3 seconds [01:59:05] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 2s [01:59:32] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 8 seconds [02:18:35] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours [02:18:35] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [02:18:35] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours [02:18:35] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [02:24:35] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [02:40:17] RECOVERY - Varnish HTCP daemon on cp1021 is OK: PROCS OK: 1 process with UID = 997 (varnishhtcpd), args varnishhtcpd worker [02:53:38] PROBLEM - Puppet freshness on ms-fe2 is CRITICAL: Puppet has not run in the last 10 hours [03:03:41] PROBLEM - Puppet freshness on ms-be9 is CRITICAL: Puppet has not run in the last 10 hours [03:10:35] RECOVERY - Varnish traffic logger on cp1021 is OK: PROCS OK: 3 processes with command name varnishncsa [03:17:47] PROBLEM - MySQL Replication Heartbeat on db1020 is CRITICAL: CRIT replication delay 191 seconds [03:18:05] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 200 seconds [03:18:32] PROBLEM - MySQL Slave Delay on db1020 is CRITICAL: CRIT replication delay 226 seconds [03:18:41] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 229 seconds [03:29:37] Krenair: Don't think so. Not sure if they were ever in SVN either. [03:29:44] Probably worth a bug. 
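The question above about whether the noc.wikimedia.org/conf page lives in Git can be checked directly against the repository Damianz linked. A minimal sketch, assuming anonymous HTTP cloning of operations/mediawiki-config works from that Gerrit URL and that the page, if tracked, would sit under a docroot-style path (both are assumptions, not confirmed by the log):

    # Assumed clone URL; the gitweb link above points at the same repository.
    git clone https://gerrit.wikimedia.org/r/p/operations/mediawiki-config.git
    cd mediawiki-config
    # Look for anything resembling the conf index page; paths are guesses.
    git ls-files | grep -i -e 'noc' -e 'conf/index'
    git grep -li 'noc.wikimedia.org' -- '*.html' '*.php'

If nothing turns up, that matches the later answer ("Don't think so ... Probably worth a bug").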
[03:39:08] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [03:39:26] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [03:40:47] RECOVERY - MySQL Replication Heartbeat on db1020 is OK: OK replication delay 0 seconds [03:40:56] RECOVERY - MySQL Slave Delay on db1020 is OK: OK replication delay 0 seconds [03:51:08] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [03:51:08] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [04:43:33] PROBLEM - Puppet freshness on ms-be6 is CRITICAL: Puppet has not run in the last 10 hours [05:19:54] New review: Hashar; "Could probably get removed too. I think we will end up having to do some cleanup manually since I am..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/21675 [05:44:00] New patchset: Hashar; "simplify wrapper" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5778 [05:44:50] New review: Hashar; "I have simply rebased that change. Has been +1 by Aaron already." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/5778 [05:44:50] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5778 [05:47:55] New patchset: Hashar; "dedupe code: foreachwiki vs. foreachwikiindblist" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8434 [05:48:38] New review: Hashar; "Simply rebased the change. Removing code duplication is great to have." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/8434 [05:48:38] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/8434 [05:52:25] New patchset: Hashar; "(bug 38299) Computer Modern fonts for math rendering" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15550 [05:53:07] New review: Hashar; "PS3 is a rebase" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/15550 [05:53:07] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15550 [05:55:01] Change abandoned: Hashar; "cant remember what that change was for, I guess it is no more needed. If we ever have this issue aga..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16515 [05:56:38] Damianz: I already saw a bug for that, which might not state the vastness of the brekage though [06:00:21] New review: Hashar; "I guess we would also need several -dev packages to properly compile the npm packages. Isn't there a..." 
[operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/19397 [06:08:11] PROBLEM - Puppet freshness on ms-be10 is CRITICAL: Puppet has not run in the last 10 hours [06:08:11] PROBLEM - Puppet freshness on ms-be3 is CRITICAL: Puppet has not run in the last 10 hours [06:10:17] PROBLEM - Puppet freshness on ms-be4 is CRITICAL: Puppet has not run in the last 10 hours [06:54:21] PROBLEM - Puppet freshness on ms-be12 is CRITICAL: Puppet has not run in the last 10 hours [06:55:24] PROBLEM - Puppet freshness on ms-be11 is CRITICAL: Puppet has not run in the last 10 hours [07:36:21] PROBLEM - Puppet freshness on ms-be5 is CRITICAL: Puppet has not run in the last 10 hours [07:46:03] hello [07:47:04] morning [07:48:37] Damianz: looks like puppet freshness is fixed in Nagios \O/ [07:48:38] ;) [07:48:56] I made tidy [07:49:12] Still some stuff to fix that's running out of date puppet [07:49:19] I am wondering if we could make service checks depending upon host being up [07:49:32] I was thinking that, it's a little hard to tell atm [07:50:23] I'm attempting to rewrite the horrid bash/wget/c#/sed/grep/tr/awk thing for building configs atm [07:50:34] dohh [07:50:52] make sure to have that code somewhere in a public place [07:50:55] Hmm this is totally not the labs channel either heh [07:51:02] we could create a git repo for you if needed [07:51:05] oh yeah [07:51:23] well ops are all sleeping anyway (or too busy right now :D ) [07:51:31] Well I'm going to write it, github it, annoy chad to make a gerrit repo, bug petan to review it pointing out it's cleaner then hopefully it can go 'live' as we fix nagios long term. [08:03:02] PROBLEM - Puppet freshness on ms-fe4 is CRITICAL: Puppet has not run in the last 10 hours [08:19:59] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: Puppet has not run in the last 10 hours [08:19:59] PROBLEM - Puppet freshness on ms-be1002 is CRITICAL: Puppet has not run in the last 10 hours [08:19:59] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours [08:19:59] PROBLEM - Puppet freshness on ms-be1001 is CRITICAL: Puppet has not run in the last 10 hours [08:19:59] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [08:20:00] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours [08:20:00] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours [08:20:01] PROBLEM - Puppet freshness on ms-fe1001 is CRITICAL: Puppet has not run in the last 10 hours [08:20:01] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [08:20:02] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [08:20:02] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [08:20:03] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [08:20:03] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [08:25:37] New patchset: Hashar; "manpages for our misc scripts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16606 [08:26:30] New review: Hashar; "PS3: copy man files to /usr/local/share/man/man1" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/16606 [08:26:30] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16606 [08:36:26] New patchset: Hashar; "manpages for our misc scripts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16606 [08:37:10] New review: Hashar; "PS4: rebased (misc-script is now in manifests/misc/deployment.pp)." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/16606 [08:37:10] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16606 [08:46:18] New patchset: Hashar; "manpages for our misc scripts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16606 [08:47:09] New review: Hashar; "PS5: ignore asciidoc files not starting with a letter (such as the template _annotated.txt) and dele..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/16606 [08:47:09] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16606 [09:27:29] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [09:29:20] New review: Hashar; "I still want beta to be able override robots by editing [[Mediawiki:robots.txt]] so we can eventuall..." [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/21602 [09:32:23] Change abandoned: DamianZaremba; "Needs a role class adding to allow changing the var in labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21307 [09:52:58] New patchset: Hashar; "manpages for our misc scripts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16606 [09:53:51] New review: Hashar; "PS6: fix all issue from PS5. Thanks a lot for the review Dereckson!" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/16606 [09:53:52] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16606 [10:11:44] New patchset: ArielGlenn; "wansecurity host able to rsync other and archives" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21868 [10:12:29] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21868 [10:12:49] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21868 [10:52:08] New patchset: ArielGlenn; "tool for checking which media files are used by which projects This is a one-off but I don't want to lose it in case I need it again" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/21871 [10:53:13] New review: ArielGlenn; "No, it's not great code. Yes, it needs to live somewhere. Yes, I need to really move production th..." 
[operations/dumps] (ariel); V: 1 C: 2; - https://gerrit.wikimedia.org/r/21871 [10:53:13] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/21871 [11:05:38] PROBLEM - Puppet freshness on ms-be8 is CRITICAL: Puppet has not run in the last 10 hours [11:28:40] PROBLEM - Puppet freshness on ms-be1 is CRITICAL: Puppet has not run in the last 10 hours [11:39:37] PROBLEM - Puppet freshness on ms-be2 is CRITICAL: Puppet has not run in the last 10 hours [11:51:37] PROBLEM - Puppet freshness on srv193 is CRITICAL: Puppet has not run in the last 10 hours [11:55:40] PROBLEM - Puppet freshness on srv194 is CRITICAL: Puppet has not run in the last 10 hours [12:19:40] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours [12:19:40] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [12:19:40] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours [12:19:40] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [12:26:00] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [12:38:00] PROBLEM - Puppet freshness on palladium is CRITICAL: Puppet has not run in the last 10 hours [12:42:20] !log enabling Jenkins on TitleBacklist extension (job: Ext-TitleBlacklist ) [12:42:30] Logged the message, Master [12:48:53] !log reloading Jenkins to fix up Ext-TitleBlacklist misconfiguration [12:49:03] Logged the message, Master [12:54:57] PROBLEM - Puppet freshness on ms-fe2 is CRITICAL: Puppet has not run in the last 10 hours [12:59:13] New patchset: Mark Bergsma; "Migrate pmtpa /home to the NetApp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21877 [12:59:58] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21877 [13:05:00] PROBLEM - Puppet freshness on ms-be9 is CRITICAL: Puppet has not run in the last 10 hours [13:05:16] THUNDER [13:05:20] erm... wrong channel [13:05:30] !log Shutdown fenari and hume [13:05:40] Logged the message, Master [13:07:11] quick question, can somebody invite me for the #wikimedia-staff channel? 
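On the earlier question (around 07:49) of making Nagios service checks depend on the host being up: one common way to approximate that is a servicedependency that gates a noisy check such as "Puppet freshness" on a basic reachability service like "SSH" for the same host. A sketch only, not the production configuration; the host and service names are borrowed from alerts elsewhere in this log:

    # Hypothetical Nagios object definition; ms-be1 / "SSH" / "Puppet freshness"
    # are names taken from alerts in this log, not from the real config.
    define servicedependency{
        host_name                       ms-be1
        service_description             SSH
        dependent_host_name             ms-be1
        dependent_service_description   Puppet freshness
        execution_failure_criteria      c,u    ; skip the check when SSH is CRITICAL/UNKNOWN
        notification_failure_criteria   w,c,u  ; and suppress notifications too
    }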
[13:07:32] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21877 [13:08:09] PROBLEM - Host hume is DOWN: CRITICAL - Host Unreachable (208.80.152.190) [13:10:14] New patchset: Mark Bergsma; "Remove admins::dctech from applicationserver role class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21878 [13:10:33] PROBLEM - Host fenari is DOWN: CRITICAL - Host Unreachable (208.80.152.165) [13:11:01] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21878 [13:12:58] RECOVERY - Puppet freshness on srv193 is OK: puppet ran at Wed Aug 29 13:12:50 UTC 2012 [13:18:21] RECOVERY - Puppet freshness on srv194 is OK: puppet ran at Wed Aug 29 13:18:04 UTC 2012 [13:25:43] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [13:26:00] stupid check [13:26:13] it can't even run the check script [13:28:57] !log Changed /home mount from nfs1 to nas1-a on srv193 and spence [13:29:06] Logged the message, Master [13:37:05] !log Mounted /home read-only on nfs1, started final rsync to nas1-a [13:37:16] Logged the message, Master [13:38:02] exciting [13:52:25] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , zhwiki (19630) [13:52:34] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [13:52:34] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [14:32:56] mark: lemme know when you want to review frackpuppet [14:34:17] <^demon> Fenari still down for usage? [14:35:41] yes, fenari still down [14:35:47] <^demon> mmk. I'll find something else to do :) [14:36:17] there's bast1001 [14:38:29] copy is taking rather long [14:38:53] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5778 [14:39:53] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8434 [14:40:25] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15550 [14:42:06] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17470 [14:42:48] mark: are we going to refactor syslog as part of the nfs1/2 deprecation? [14:43:04] I'm asking because of some syslog overlap with labs, not sure if you saw hashar's mail [14:43:54] no [14:44:06] thta is [14:44:12] i'm not depreciating nfs1/nfs2 now [14:44:17] what happens to them later, I dunno yet [14:44:53] PROBLEM - Puppet freshness on ms-be6 is CRITICAL: Puppet has not run in the last 10 hours [14:44:54] paravoid: hello :-) [14:46:28] New patchset: Faidon; "(bug 38946) hebrew fonts for SVG rendering" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21588 [14:46:48] hashar: hi :) [14:47:12] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21588 [14:47:28] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21588 [14:48:05] i certainly want that to change anyway [14:48:09] paravoid: maybe we should rewrite the syslog-ng stuff to be based on rsyslog ;) [14:48:23] logging onto nfs /home is rather suboptimal [14:48:43] indeed [14:48:45] and it's mostly the reason this copy is taking so long [14:48:46] hashar: yes we should [14:49:04] is rsyslog better than syslog-ng? 
[14:49:17] I generally prefer it but I have no problem using either [14:49:20] as long as we stick to one [14:49:24] we currently use both [14:49:33] for very different purposes [14:49:40] years ago I evaluated both, ended up choosing syslog-ng because it was easier to administrate [14:49:41] but i don't care much either [14:49:42] but still, it's... syslog [14:49:53] but rsyslog has more feature AFAIK [14:50:11] rsyslog has some interesting stuff, such as RELP [14:50:19] reliable logging/retransmissions etc. [14:50:44] not sure if we'd want to use that but it always feeled more modern to me [14:50:52] but I think syslog-ng has catched up lately too [14:51:57] the obvious choice would be to pick whatever Ubuntu uses by default and assume it will be better maintained by them because it is deployed on more machines [14:52:04] which is rsyslog [14:52:17] fwiw opensuse went rsyslog recently too [14:52:22] so we need to migrate our syslog-ng conf to rsyslog [14:54:44] re: https://gerrit.wikimedia.org/r/#/c/17973/ what's our policy on shell access? discuss it on an ops meeting? [15:03:06] discuss it in the rt ticket [15:03:45] there is none :) I'll ask Roan to open one [15:03:50] it's for shell access to singer [15:04:26] hiyyya paravoid, when you have a sec, could you see if this is better than the other one? [15:04:28] https://gerrit.wikimedia.org/r/#/c/21749/ [15:05:00] there needs to be one [15:05:05] hi andrew [15:05:25] ottomata: looks "good" to me. mark will probably like to comment on that too. [15:05:28] New review: Mark Bergsma; "Where is the RT ticket for this request?" [operations/puppet] (production); V: 0 C: -2; - https://gerrit.wikimedia.org/r/17973 [15:06:18] New review: Faidon; "I don't see an RT ticket for that and I think access requests need approval. Could you open one and ..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/17973 [15:06:25] argh [15:09:21] mark, I think you are probably busy with the netapp stuff right now, but when you get a sec: [15:09:21] https://gerrit.wikimedia.org/r/#/c/21749/ [15:09:24] hah [15:10:39] jeremyb: thanks for the swift 1.7 heads up. I happened to see that beforehand entirely accidentally. [15:10:55] sure [15:11:14] i think i chose the right people to send it to [15:14:00] fucking swift logs are huge [15:15:44] mark: that's because they're not sampled or anything [15:15:54] and because we log at both the proxy and object server [15:16:03] you could store them in swift ;) [15:16:12] so for each request you have 2-3 lines [15:16:25] should turn that off soon [15:23:49] New review: Dereckson; "Test procedure followed:" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/16606 [15:29:39] New patchset: Demon; "Overhauling gerrit manifest to be a role class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13484 [15:30:28] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/13484 [15:32:33] Change abandoned: Demon; "Actually, this isn't needed at all. The labs instance was running precise, which suffers from this b..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21848 [15:40:40] Change abandoned: Demon; "This needs discussion before I go forward with it. And I want my omnibus gerrit change in first." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11451 [15:48:58] !log Final rsync finished. 
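For the syslog-ng to rsyslog migration discussed around 14:48, the client side can be very small. A hedged sketch, with "logserver.example.org" as a placeholder rather than a real host, and assuming the rsyslog-relp package is installed if RELP is wanted:

    # /etc/rsyslog.d/remote.conf -- sketch only
    # Plain forwarding: "@@" is TCP, a single "@" would be UDP.
    *.*  @@logserver.example.org:514

    # Reliable delivery via RELP, as mentioned above (needs the omrelp module):
    $ModLoad omrelp
    *.*  :omrelp:logserver.example.org:2514

The server side would load imtcp/imrelp and write per-host files somewhere better than the NFS /home that, per the discussion above, the current syslog-ng setup logs onto.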
[15:49:01] !log Powering up hume [15:49:07] Logged the message, Master [15:49:17] Logged the message, Master [15:52:33] New patchset: Mark Bergsma; "No longer enable nfs server on nfs1/nfs2" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21893 [15:53:11] RECOVERY - Host hume is UP: PING OK - Packet loss = 0%, RTA = 0.17 ms [15:53:20] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21893 [15:53:27] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21893 [15:54:43] \o/ [15:58:14] you gotta be kidding me [15:58:18] fstab on hume: [15:58:19] albert:/a/old-static/mnt/albert-staticnfsbg,soft,udp,rsize=8192,wsize=8192,timeo=1400 [15:59:10] :-D [15:59:19] well that's going to come up [15:59:56] fenari is currently in fsck [16:00:04] !log Powering up fenari [16:00:07] :P [16:00:15] RECOVERY - Host fenari is UP: PING OK - Packet loss = 0%, RTA = 0.53 ms [16:00:18] Logged the message, Master [16:02:43] jeremyb: paravoid merged some of your changes 8434 5778 :-))) [16:03:04] i kinda half saw, danke [16:03:20] 4 digit merges are at a premium now [16:03:32] PROBLEM - SSH on fenari is CRITICAL: Connection refused [16:04:04] Change merged: Demon; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/16036 [16:04:16] !log Fixed serial console on hume [16:04:17] PROBLEM - HTTP on fenari is CRITICAL: Connection refused [16:04:25] * jeremyb pokes Krinkle|detached for !g 21393 [16:04:26] Logged the message, Master [16:05:02] RECOVERY - SSH on fenari is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [16:05:17] now for 8120 ;) [16:06:25] and 8344 [16:06:28] by hashar [16:06:29] mark: I think serial consoles are broken pretty much across the fleet [16:07:01] why do you think that? [16:07:10] that's not my impression [16:07:18] we have bios console redirection enabled [16:07:22] err, bye* /me is still half asleep [16:07:26] RECOVERY - HTTP on fenari is OK: HTTP OK HTTP/1.1 200 OK - 4416 bytes in 0.020 seconds [16:07:29] and grub/linux configured to log to serial [16:07:31] yeah, grub2 barfs on that [16:07:36] i know [16:07:39] but that's just during boot [16:08:12] no, I think we have the setting for post-boot console redirection enabled [16:08:18] i know what you mean [16:08:38] looking at mysql grants for the payments database, wikiuser has Repl_client_priv -- does mediawiki do some fancy slave status checks? or is that likely an error? [16:09:23] PROBLEM - Puppet freshness on ms-be10 is CRITICAL: Puppet has not run in the last 10 hours [16:09:23] PROBLEM - Puppet freshness on ms-be3 is CRITICAL: Puppet has not run in the last 10 hours [16:10:35] !log enabling srv194 and srv281 in pybal for apache/precise live test [16:10:43] yay :) [16:10:45] Logged the message, notpeter [16:11:29] PROBLEM - Puppet freshness on ms-be4 is CRITICAL: Puppet has not run in the last 10 hours [16:11:40] Jeff_Green: maybe related to https://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=dbrepllag&sishowalldb= ? [16:12:05] * jeremyb runs away [16:12:12] hahahah [16:12:18] come back come back! [16:12:21] j/k. that makes sense [16:12:32] I'm pretty sure MW checks for slave replication lag [16:12:33] (literally -> subway ;) [16:12:55] so you prefer snoop to dtrace, paravoid? 
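Regarding Jeff_Green's question at 16:08 about wikiuser having Repl_client_priv: MediaWiki's load balancer does check slave lag, which is presumably what the grant is for, and the API URL jeremyb linked exposes the same numbers. Two ways to look at it from a shell (host and credentials below are placeholders):

    # What the lag check boils down to; SHOW SLAVE STATUS needs the
    # REPLICATION CLIENT (or SUPER) privilege.
    mysql -h db1025.example -e 'SHOW SLAVE STATUS\G' | grep -i seconds_behind_master

    # The same information via the public API, as linked above:
    curl -s 'https://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=dbrepllag&sishowalldb=&format=json'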
[16:14:03] I prefer not using Solaris [16:14:04] :P [16:15:19] that goes without saying [16:18:01] more seriously, I've never used DTrace [16:18:23] by the time it got released I was using Linux already, so... [16:18:35] oh [16:18:53] well now readers have two ways they can see the get requests over there [16:19:02] :-) [16:19:08] three actually [16:19:10] not that it matters, I was just curious [16:19:24] two listed on the page [16:19:34] oh for get, right [16:19:46] you snooped for NFS traffic [16:19:49] I snooped for HTTP traffic [16:19:50] uh huh [16:19:58] and you also dtraced [16:20:03] yup [16:20:26] did you have any luck with the hardware faults btw? [16:20:27] don't think that means I knw a bunch about it though, it's waaay too complicated [16:20:51] oh I mentioned that yesterday, I updated the ms-be-somthing rt ticket [16:20:57] about changing the disks [16:21:17] looking [16:21:19] short summary, I think we need to get dell to swap out some fcrap [16:21:38] yeah [16:26:33] New patchset: Dereckson; "(bug 39767) Break lines in review Gerrit comments" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21895 [16:27:19] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21895 [16:28:37] !log removing srv281 from apaches pool [16:28:47] Logged the message, notpeter [16:28:48] oh? [16:28:52] test failed? [16:29:19] something isn't right about it. ganglia doesn't show and resource usage and nothing is showing up in the logs on fluorine [16:29:29] cmjohnson1, RobH: do you have plan regarding swift/c2100 boxes? want to talk about it? [16:29:45] srv194 looks totally normal, though, which is great [16:29:50] paravoid: Plan for what? [16:29:58] *any resource usage [16:30:04] what to do? just replace hardware? or escalate to Dell? [16:30:06] paravoid: as in repair plans? [16:30:16] ms-be6 had its memory broken and now its controller broken too [16:30:19] is the hardware broken? ahh [16:30:23] ms-be10 has its controller broken as well [16:30:31] paravoid: first step is to escalate it to Dell [16:30:34] ok, well, you drop a ticket in pmtpa queue and chris will test the hardware and get it replaced under warranty [16:30:49] but you need to ensure he can take it offline to do so, so if its pooled link instructions and such [16:30:58] if its just offline and he can pull and work, note that in ticket. [16:31:05] we have a ticket, #3282 [16:31:21] yep... i read the comments last night from apergos [16:31:24] ...... [16:31:31] argh [16:31:39] i wish you guys would make different tickets for different issues [16:31:45] we tie those to Dell returns and the like [16:31:50] one ticket per problem in future please. [16:32:15] paravoid: so since its tampa hardware you work with chris to get it fixed [16:32:18] if it was ashburn, would be me [16:32:44] I'm talking about the general problem "C2100s have a big failure rate" [16:32:51] prolly contoller, could be also mb [16:32:51] that's not something that's related to tampa or swift or the ms-be tickets [16:33:11] paravoid: we arent ordering them anymore is my understanding [16:33:20] some folks had problems with these c2100s so bad that even after replacing mb and controller and other pieces, it still didn't resolve and they eventually got send new systems [16:33:20] but we cannot just stop using what we have, so we have to just get warranty repair.
[16:33:21] they don't make them anymore either [16:33:46] so if we do enough repair work and it still doesnt work, then as apergos points out dell will eventually replace it [16:33:49] anyways it will be slog through swapping crap out with dell [16:34:01] I'm basically asking if you have an escalation plan or if we're going to deal with the issues as they come [16:34:02] yep [16:34:02] but they wont do it on any given system until we have a repair history on that specific system [16:34:07] paravoid: as they come. [16:34:12] not sure how else we would handle them [16:34:16] we have less than 20 c2100s. [16:34:24] well, wait, less than 30. [16:34:29] (i forgot anaylitics ;) [16:35:16] paravoid: What would you suggest otherwise? [16:35:24] * RobH is open to suggestions [16:35:33] cuz i hate the c2100s as well. [16:35:39] * Damianz offers thermite to paravoid [16:35:52] sledgehammers [16:35:56] so much more satisfying [16:35:57] Damianz: the folks below our datacenter floor do not appreciate that. [16:36:18] I'm sure dell would replace their kit if it's under warrenty :D [16:36:21] however the good news is that eventually people did get working systems [16:36:22] they are in tampa, the broken servers ;] [16:36:32] Damianz: yep, so we have to handle as they break is my stance [16:36:48] paravoid: since we are talking ms-be?... there are a few bad drives in ms-be7 and 8 that i have replacements for... do you wanna swap them out? can you see which drives are not mounted? [16:36:52] robh: yes [16:37:12] the latest batch of c2100's have come littered w/ hdd problems. [16:37:19] 2 or 3 per box [16:37:27] and apparently controllers now too [16:37:30] it's easy to see which drives are causing complaints, there's a bunch of whingin in the logs about em [16:37:48] ms-be6 had its memory broken, you replaced it, and now it has its controller broken it seems [16:37:54] never made it into production [16:38:10] cmjohnson1: looking @ ms-be7/8 [16:38:30] so the rumor mill is that dell doesn't actually make these, they are manufactured elsewhere [16:38:53] before I spread that rumor and it's logged (well after) I should find the email where I read it [16:39:01] paravoid: I don't wanna make it sound like we won't fix them, we totally will, just not sure how else to handle it than one problem at a time =] [16:39:36] RobH: not challenging your ways, just wondering about your plans [16:39:40] robh: i think he is trying to say that these systems are very unstable and may be better served as something else [16:39:53] I'm not saying that, not yet at least :) [16:39:56] no worries, i just didnt wanna come across as rude, irc is hard to convey tone =] [16:40:08] Need that sarcasm font back [16:40:12] http://lists.linbit.com/pipermail/drbd-user/2011-December/017361.html here's where they say it was outsourced [16:40:20] Damianz: that font gets me into troule [16:40:21] trouble [16:41:00] I think we have to do a few of them as they break, but then we might be able to lean on Dell after that at the first sign of trouble with any others [16:42:20] I don't know how reliable the email source is, just pointing it out [16:48:15] paravoid: it was just ms-be7 sag... but i see it mounted /dev/sdg1 on /srv/swift-storage/sdg1 type xfs (rw,noatime,nodiratime,nobarrier,logbufs=8) [16:49:00] yep, seems mounted [16:49:02] complains too [16:49:12] the other issues were w/ ms-be6 and ms-be10 but w/ a bad controller it is hard to say what is good and bad at this time [16:49:58] i have a disk for it but I am not sure which slot it is [16:50:16] for
what? ms-be7 you mean? [16:50:22] yes [16:51:03] so, how do you usually handle this? [16:53:40] New patchset: MaxSem; "Log API errors caused by the WLM app in a separate file" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21753 [16:53:40] well for the c2100's the drive slots do not match the drive order... last time, ben tried to force mount the bad disk and a red led showed up on the system. [16:54:09] the system can't even see the disk anymore I'm afraid [16:54:11] how many c2100s are there? [16:54:19] 24 i think [16:54:30] no more [16:54:33] analytics bought them too [16:54:41] and I have no idea why [16:54:48] oohh bummer :-( [16:54:51] debugging each of them individually is going to be a huge time suck [16:55:10] * robla wonders if we really can't make this a general escalation [16:55:29] we should totally contact dell about getting them all replaced [16:55:37] if we see general issues repeatedly with all of them [16:55:48] PROBLEM - Puppet freshness on ms-be12 is CRITICAL: Puppet has not run in the last 10 hours [16:56:08] cmjohnson1: so, it should be the only disk which is idle or even powered off [16:56:18] this is why we want to do the first couple and then say "see these other ones? fix them all" [16:56:39] (assuming we have general issues with several others by then) [16:56:51] PROBLEM - Puppet freshness on ms-be11 is CRITICAL: Puppet has not run in the last 10 hours [16:57:49] paravoid: based on that it looks like disk7... zero activity going on [16:58:15] i am nearly certain they are hot swappable... robh can you confirm [16:59:34] ok, off to a meeting, back in a while. [17:01:41] !log swapping out disk7 (dev/sdg) on ms-be7 [17:01:51] Logged the message, Master [17:06:42] so, the controller says it can't see a disk in slot 4 (counting from 0) [17:06:52] all the others are accounted for [17:06:58] based on their serial number [17:07:33] sda/b are the SSDs, sdc is 0, sdd is 1, sde is 2, sdf is 3, sdg is 4 (and so on) [17:07:53] cmjohnson1: the 3.5" disks are hot swap in the c2100s [17:07:55] so Linux's counting matches the controllers [17:07:56] the SSDs are not. [17:08:14] !g 21393 [17:08:14] https://gerrit.wikimedia.org/r/#q,21393,n,z [17:08:31] jeremyb: yes? [17:09:06] !log returning srv281 to apaches pool [17:09:16] Logged the message, notpeter [17:09:46] cmjohnson1: and now I see the new disk, a Toshiba (vs. WDs) [17:10:10] cmjohnson1: do you see a locate led perhaps?
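For cmjohnson1's question above ("can you see which drives are not mounted?"), a rough sketch; the /srv/swift-storage layout follows the mount output quoted at 16:48, and the log grep is just one way to spot the drives "causing complaints":

    # List expected swift storage directories that are not actually mounted.
    for d in /srv/swift-storage/*; do
        mountpoint -q "$d" || echo "not mounted: $d"
    done

    # Disks throwing errors tend to show up in the kernel log.
    dmesg | grep -iE 'sd[a-z]+[0-9]*.*(error|fail)' | tail -n 20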
[17:10:12] PROBLEM - check_apache2 on payments1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2 [17:11:11] paravoid: no locate led [17:11:56] !log removing srv281 once again and reimaging [17:12:06] Logged the message, notpeter [17:13:17] that's a pity [17:13:30] I'm sure I sent the command right, probably the hardware doesn't have a led [17:14:27] paravoid: try again [17:14:37] wasn't looking [17:14:49] ok..it is blinking [17:15:18] PROBLEM - check_apache2 on payments1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2 [17:15:28] oh, cool [17:15:39] that is the disk i replaced [17:15:47] yep [17:16:22] good, now we can communicate better [17:16:33] I installed the proprietary tool for that btw [17:16:35] it's called "sas2ircu" [17:16:48] so you can do sas2ircu list to see controllers (there's just one, "0") [17:16:55] then sas2ircu 0 display to show all the disks [17:17:03] that disk is enclosure 2 bay 4 [17:17:10] so then you do sas2ircu 0 locate 2:4 on [17:18:07] and you can see serial numbers of linux block devices with either "hdparm -i" or "smartctl -d ata -i" [17:19:20] New patchset: Pyoungmeister; "setting srv281 as an apache with the new role class :/" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21899 [17:19:36] that is cool [17:19:50] :-) [17:20:06] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21899 [17:20:15] PROBLEM - check_apache2 on payments1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2 [17:20:15] notpeter: that role class didn't exist when I re-imaged it iirc [17:20:38] paravoid: gotcha. I had assumed that it was using it [17:20:41] I don't know why... [17:20:50] but hey! explains why it was acting a bit weird! [17:20:51] PROBLEM - SSH on srv281 is CRITICAL: Connection refused [17:21:00] PROBLEM - Apache HTTP on srv281 is CRITICAL: Connection refused [17:21:39] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21899 [17:25:12] PROBLEM - check_apache2 on payments1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2 [17:28:30] RECOVERY - SSH on srv281 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [17:30:18] RECOVERY - check_apache2 on payments1001 is OK: PROCS OK: 9 processes with command name apache2 [17:36:18] New patchset: MaxSem; "WLM updater script" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17964 [17:37:05] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/17964 [17:37:48] RECOVERY - Apache HTTP on srv281 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.004 seconds [17:37:48] PROBLEM - Puppet freshness on ms-be5 is CRITICAL: Puppet has not run in the last 10 hours [17:40:03] PROBLEM - NTP on srv281 is CRITICAL: NTP CRITICAL: Offset unknown [17:49:56] !g 21393 | Krinkle [17:49:56] Krinkle: https://gerrit.wikimedia.org/r/#q,21393,n,z [17:50:02] that's all [17:50:16] Yes, I saw that. 
(I get two e-mails for it as well) [17:50:25] np, I thought you wanted to chat about it [17:50:42] PROBLEM - Apache HTTP on srv281 is CRITICAL: Connection refused [17:50:49] no, just thought maybe it was ready for you to do something else with it [17:50:51] (maybe you know how if I should and if so how I can configure my shell to work with this) [17:51:07] I don't have merge rights there [17:51:09] oh wait I do [17:51:13] but its my own commit [17:51:51] i don't even understand the difference [17:51:52] bye [17:55:12] RECOVERY - Apache HTTP on srv281 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.025 second response time [17:56:20] !log returning srv281 to apaches pool [17:56:30] Logged the message, notpeter [17:58:24] notpeter: hahaha, you think so eh? [17:58:31] that server will redie. [17:58:34] its cursed. [17:59:19] I can vouch for that [17:59:26] it's definitely cursed [17:59:40] let's just tell Chris to throw the box into the hurricane and be done with it [17:59:57] no no, search32 is ahed of 281 for that... [18:00:04] !log reinstalling cp1022 and up [18:00:20] Logged the message, Master [18:01:35] nah, it's working now [18:01:53] srv281, I mena, not search32. search32 will never work for nore than 48 hours. ever. [18:02:06] PROBLEM - Host cp1022 is DOWN: PING CRITICAL - Packet loss = 100% [18:02:41] notpeter: search32 :( that reminds me... i need to call DELL about that again! [18:03:05] do they have special explosives for destroying problematic hosts? [18:03:27] cuz I think that would be the most helpful thing... [18:03:28] RECOVERY - Varnish HTTP upload-backend on cp1022 is OK: HTTP OK HTTP/1.1 200 OK - 632 bytes in 2.412 seconds [18:03:36] RECOVERY - Host cp1022 is UP: PING OK - Packet loss = 0%, RTA = 26.74 ms [18:03:43] no... but I am creating a 'lessons learned" so I have notes for when ms-be6 and 10 never work again [18:04:08] okay, dinner time, be back in a few hours. [18:04:21] PROBLEM - Puppet freshness on ms-fe4 is CRITICAL: Puppet has not run in the last 10 hours [18:04:57] RECOVERY - NTP on srv281 is OK: NTP OK: Offset -0.0473690033 secs [18:06:20] New patchset: Demon; "Updating all wikipedias to 1.20wmf10" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21905 [18:06:38] Change merged: Demon; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21905 [18:07:13] PROBLEM - Host cp1022 is DOWN: PING CRITICAL - Packet loss = 100% [18:09:12] haha, @Windows Defender making sure that doubleclick can serve ads [18:09:36] PROBLEM - Host cp1023 is DOWN: PING CRITICAL - Packet loss = 100% [18:11:15] this sounds somehow familiar "Maybe Microsoft will get their act together with the next OS." [18:12:54] RECOVERY - Host cp1022 is UP: PING OK - Packet loss = 0%, RTA = 26.50 ms [18:15:18] RECOVERY - Host cp1023 is UP: PING OK - Packet loss = 0%, RTA = 26.71 ms [18:16:21] PROBLEM - SSH on cp1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:16:30] PROBLEM - Varnish traffic logger on cp1022 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:17:15] PROBLEM - Varnish HTTP upload-backend on cp1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:17:15] PROBLEM - Varnish HTCP daemon on cp1022 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
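Pulling together the disk-locating steps paravoid spelled out around 17:16-17:18 into one place (enclosure:bay 2:4 is the example from that exchange; substitute the values sas2ircu reports for the disk in question):

    sas2ircu list                  # list controllers; here there is only one, "0"
    sas2ircu 0 display             # show every disk with enclosure/bay and serial number
    sas2ircu 0 locate 2:4 on       # blink the locate LED on enclosure 2, bay 4
    sas2ircu 0 locate 2:4 off      # turn it back off when done

    # Map a Linux block device to a physical serial number, per the same exchange:
    hdparm -i /dev/sdg | grep -i serial
    smartctl -d ata -i /dev/sdg | grep -i 'serial number'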
[18:17:24] PROBLEM - Varnish HTTP upload-frontend on cp1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:18:27] PROBLEM - Varnish HTTP upload-frontend on cp1023 is CRITICAL: Connection refused [18:18:36] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: Connection refused by host [18:18:36] PROBLEM - SSH on cp1023 is CRITICAL: Connection refused [18:19:13] PROBLEM - Varnish HTCP daemon on cp1023 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:19:13] PROBLEM - Varnish HTTP upload-backend on cp1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:19:30] RECOVERY - SSH on cp1022 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [18:21:18] PROBLEM - Puppet freshness on ms-be1002 is CRITICAL: Puppet has not run in the last 10 hours [18:21:18] PROBLEM - Puppet freshness on ms-be1001 is CRITICAL: Puppet has not run in the last 10 hours [18:21:18] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: Puppet has not run in the last 10 hours [18:21:18] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours [18:21:18] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours [18:21:19] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [18:21:19] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours [18:21:20] PROBLEM - Puppet freshness on ms-fe1001 is CRITICAL: Puppet has not run in the last 10 hours [18:21:20] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [18:21:21] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [18:21:21] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [18:21:22] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [18:21:22] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [18:21:45] RECOVERY - SSH on cp1023 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [18:36:00] PROBLEM - NTP on cp1022 is CRITICAL: NTP CRITICAL: No response from NTP server [18:37:02] Krinkle: sorry, left so abrubtly, train was going underground and if i typed anything longer, would have been too late ;) [18:37:19] Krinkle: what's `echo $'docroot/foundation/lev\303\251e_de_fonds.html' | sed -n l` look like for you? [18:37:30] PROBLEM - NTP on cp1023 is CRITICAL: NTP CRITICAL: No response from NTP server [18:38:06] jeremyb: Note sure, I know the individual parts, not sure what it does exactly or why I'd need it [18:38:14] !log deleting global role groups from LDAP. They aren't needed in keystone's LDAP DIT [18:38:25] Logged the message, Master [18:38:29] Krinkle: i mean if you just right that, what is the output? [18:38:47] Krinkle: this is mac? [18:38:54] jeremyb: "docroot/foundation/lev\303\251e_de_fonds.html" [18:38:55] jeremyb: yes [18:39:07] errr, wtf [18:39:13] oh, wait, nvm [18:39:16] that's good ;) [18:39:39] Krinkle: so the same `git rm` i ran should have worked for you. you just didn't know what to run [19:07:28] New patchset: Pyoungmeister; "adding by-site ganglia clusters to applicationserver role class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21907 [19:07:56] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2563* [19:08:13] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21907 [19:08:50] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21907 [19:10:37] Does anyone know if poolcounter has to run on an external IP ? [19:10:46] it runs on tarin.w.o in tampa [19:17:05] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2563* [19:18:04] RobH: I wouldn't think so but you'd have to ask someone who actually knows PC. By which I mean Tim [19:18:16] apergos: I replied to https://wikitech.wikimedia.org/view/Swift/Open_Issues_Aug_-_Sept_2012/Cruft_on_ms7#skins [19:18:29] RoanKattouw: I dont think it needs to be either, looking over the wikitech docs it doesnt appear to be [19:18:40] so I am going to go ahead and allocate the servers, but not set them up or install the os quite yet. [19:18:45] thanks RoanKattouw [19:19:23] figures, I can't load the page cause I'm busy grabbing some media data from fenari :-/ [19:19:30] stupid slow internet connection [19:19:45] lol [19:20:02] ah it loaded, well there's an exception! usually dns craps out on me [19:20:10] yeah I can give you the referer, not in here though [19:20:18] Sure [19:25:22] New review: Ryan Lane; "Fix site.pp for formey and we'll be good." [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/13484 [19:28:02] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [19:34:38] RobH, should we mention the inconsistent NIC naming between OS and bios on the C2100s in that thread? [19:34:39] New patchset: Pyoungmeister; "fixing incorrect cluster definition in applicationserver role class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21947 [19:35:14] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2400 [19:35:26] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21947 [19:35:37] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21947 [19:38:29] ottomata: that was due to us putting in a secondary nic [19:38:38] its annoying but happens on most systems [19:38:42] hmm, ok cool [19:38:43] just checking [19:38:45] takes the add on nic as 'primary' in the os [19:38:53] yep, thx for checkin =] [20:12:30] New patchset: Aaron Schulz; "Enumerate math/timeline containers for copy scripts." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21952 [20:29:45] !log changing apache config and wiki location on virt0 [20:29:56] Logged the message, Master [20:31:04] New patchset: Ryan Lane; "Change location of wiki to w, rather than version" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21953 [20:31:51] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21953 [20:31:56] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21953 [20:34:57] can someone please review https://gerrit.wikimedia.org/r/17964 ? [20:54:56] New patchset: Demon; "Overhauling gerrit manifest to be a role class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13484 [20:55:42] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/13484 [21:07:11] PROBLEM - Puppet freshness on ms-be8 is CRITICAL: Puppet has not run in the last 10 hours [21:17:53] New patchset: Demon; "Overhauling gerrit manifest to be a role class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13484 [21:18:42] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/13484 [21:21:26] mw8: ssh: connect to host mw8 port 22: Connection timed out [21:21:27] srv266: ssh: connect to host srv266 port 22: Connection timed out [21:21:36] ^demon: This is SNAFU with sync? [21:21:47] (the rest went fine) [21:21:53] <^demon> Yeah, those have been down. [21:21:55] <^demon> No big deal [21:21:55] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13484 [21:21:59] ok.. [21:22:23] I guess they should be commented out (or removed) from some kind of registry, which hasn't happened yet? [21:25:20] PROBLEM - SSH on virt1001 is CRITICAL: Connection refused [21:30:08] PROBLEM - Puppet freshness on ms-be1 is CRITICAL: Puppet has not run in the last 10 hours [21:31:56] PROBLEM - Host virt1001 is DOWN: PING CRITICAL - Packet loss = 100% [21:35:22] Krinkle... srv266 is down and powered down at the moment [21:38:36] I figured, but it "should" not be trying to access it. since it is known (and for a white apparently, as in hours or days) I suppose it could be taken out of roulation. [21:38:43] while* [21:39:09] RECOVERY - SSH on virt1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [21:39:15] New patchset: Demon; "Fixing up gerrit apache config so it'll work on labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21961 [21:39:18] RECOVERY - Host virt1001 is UP: PING OK - Packet loss = 0%, RTA = 26.53 ms [21:39:36] PROBLEM - Host srv281 is DOWN: PING CRITICAL - Packet loss = 100% [21:39:47] Krinkle: They could be commented out from the list in theory, but the consensus is that the annoyance of having downed boxes on the list is less than the danger of having up-and-running boxes not on the list (which is what would happen if someone forgot to uncomment them) [21:40:03] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/21961 [21:40:05] right [21:40:32] RoanKattouw: but nagios is not complaining about those two, right ? [21:40:39] nevermind [21:40:42] I'm sure it is [21:40:56] And pybal will have depooled them so they're not serving traffic [21:40:57] PROBLEM - Puppet freshness on ms-be2 is CRITICAL: Puppet has not run in the last 10 hours [21:41:09] New patchset: Demon; "Fixing up gerrit apache config so it'll work on labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21961 [21:41:23] Hm.. sync's list is not the same as the list where depooling happens ? [21:41:36] maybe that is the source of the solution? [21:41:42] Yeah, we know :S [21:42:01] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21961 [21:42:08] Ryan's magical shiny git-deploy system will fix this [21:42:17] I found it strange that we wouldn't uncomment them since if deploy tries, like, what if loadbalancer tried, so there is a list. Its just that we have two list.
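On the accented-filename exchange at 18:37: "sed -n l" prints the line with non-ASCII bytes escaped, so both sides can confirm the shell is producing exactly the bytes the repository has before running git rm. A small sketch using the filename from that exchange:

    # Show the raw bytes the shell expands (\303\251 is "é" in UTF-8 octal).
    echo $'docroot/foundation/lev\303\251e_de_fonds.html' | sed -n l

    # If that matches what git tracks, the removal works the same on Linux and Mac:
    git ls-files -- 'docroot/foundation/lev*'
    git rm $'docroot/foundation/lev\303\251e_de_fonds.html'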
[21:42:23] BTW Nagios does report srv266 down, but it's been acknowledged by ops: http://nagios.wikimedia.org/nagios/cgi-bin/status.cgi?host=srv266 [21:42:39] ok [21:42:44] Yeah but one list is maintained by hand whereas the other is maintained automatically [21:42:57] and one from the other? [21:43:08] No [21:43:16] xD [21:43:29] * AaronSchulz would like a srv666 [21:43:33] The dsh list used for deploys is maintained by hand and includes machines whose role isn't to serve traffic [21:44:06] The list of machines currently serving traffic is in constant flux because machines get depooled if they misbehave, then repooled later [21:44:27] right, but there's still a way to "fix" that. So I guess Ryan's git-deploy will merge the lists so that uncommentting takes it out of both, and deplooling sets a certain flag. [21:44:40] But the latter has a lot of flapping potential (i.e. machines can more or less randomly drop off for a little while), so it's not safe to use for deployments cause you might miss boxes [21:44:42] it would suck to deploy a config change to make something "read only"... [21:44:58] Ryan's git-deploy system would use a queue to make sure that hosts that aren't up at the time will process the update when they come up, IIRC [21:45:07] it might become "read only except srv266" [21:45:17] ok [21:45:17] which...probably not useful [21:45:23] Yeah, Apaches being out of sync used to be a big problem [21:45:37] Now that we have a script that syncs the machine before starting the Apache process, this is less of an issue [21:45:59] so what about mw8 and srv266. when will they get the latest code? or do they scap themselves on startup ? [21:46:08] As you can imagine, this was especially interesting when some dead Apaches were brought back to life a few days after the monobook->vector switchover [21:46:11] beat me by the second [21:46:32] right [21:46:58] Hm.. then why have a queue for when a server comes back up if it already updates itself when it comes back up anyway? [21:47:00] Some of them were even running 1.15 instead of 1.16, it was a mess [21:47:18] Because the auto-update only happens if the machine (or the Apache process) died [21:47:20] krinkle: sorry in the middle of something... srv266 has not worked right in a long time... not sure when that will be back on line. ...mw8 has a DIMM issue and should be back next week [21:47:34] If its network went down and back up for some reason, or in a variety of other cases, it won't happen [21:47:57] Those aren't common and the most common case is the machine dying hard, but still, it's good to have a more watertight system [21:48:05] cool [21:48:43] cmjohnson1: Yeah no worries, we have downed Apaches all the time, and Krinkle hadn't done a deployment before so he didn't know how to interpret the warnings he got [21:49:43] hm.. new shiny button, what does this do [21:49:56] ok... do you want me to ping either one of you when mw8 comes back to life? [21:50:00] [fwd from #jquery-dev] FLOWING CHOCOLATE http://cheezburger.com/6527494144 [21:50:12] ok, that wasn't intentional. :D [21:50:29] that is a terrible icon design [21:51:02] cmjohnson1: Nah, it'll be fine. srv* and mw* boxes going down and being brought back up is routine, everything is in place to make those boxes fix themselves once you bring them back up [21:51:42] Krinkle: Wait, you forwarded a link to a picture of a bird drinking from a chocolate fountain BY ACCIDENT?
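Given the mw8/srv266 timeouts during the earlier sync and RoanKattouw's point that the dsh deploy list is maintained by hand, a small pre-sync reachability check is easy to sketch. The group file path below is an assumption, not taken from the log:

    # Assumed location of the hand-maintained dsh group file.
    DSH_GROUP=/etc/dsh/group/mediawiki-installation
    while read -r host; do
        case "$host" in ''|'#'*) continue ;; esac    # skip blanks and comments
        ssh -o ConnectTimeout=5 -o BatchMode=yes "$host" true >/dev/null 2>&1 \
            || echo "unreachable: $host"
    done < "$DSH_GROUP"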
[21:51:46] :D [21:52:32] there's this icon that goes like >>> so apparently the last link you clicked, that message is forwarded to the current channel or something. not sure how it works. but yeah, I did want to share that link :D [21:52:57] lol [21:53:59] New patchset: Demon; "Fixing up gerrit apache config so it'll work on labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21961 [21:54:49] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21961 [22:06:29] can someone please review https://gerrit.wikimedia.org/r/17964 ? [22:20:51] ACKNOWLEDGEMENT - Varnish HTCP daemon on cp1022 is CRITICAL: Connection refused by host daniel_zahn fresh install [22:20:51] ACKNOWLEDGEMENT - Varnish HTTP upload-backend on cp1022 is CRITICAL: Connection refused daniel_zahn fresh install [22:20:51] ACKNOWLEDGEMENT - Varnish HTTP upload-frontend on cp1022 is CRITICAL: Connection refused daniel_zahn fresh install [22:20:51] ACKNOWLEDGEMENT - Varnish traffic logger on cp1022 is CRITICAL: Connection refused by host daniel_zahn fresh install [22:21:00] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours [22:21:00] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [22:21:00] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [22:21:00] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours [22:21:21] ACKNOWLEDGEMENT - Varnish HTCP daemon on cp1023 is CRITICAL: Connection refused by host daniel_zahn fresh install [22:21:21] ACKNOWLEDGEMENT - Varnish HTTP upload-backend on cp1023 is CRITICAL: Connection refused daniel_zahn fresh install [22:21:36] ACKNOWLEDGEMENT - Varnish HTTP upload-frontend on cp1023 is CRITICAL: Connection refused daniel_zahn fresh install [22:21:36] ACKNOWLEDGEMENT - Varnish traffic logger on cp1023 is CRITICAL: Connection refused by host daniel_zahn fresh install [22:27:00] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [22:39:00] PROBLEM - Puppet freshness on palladium is CRITICAL: Puppet has not run in the last 10 hours [22:55:38] PROBLEM - Puppet freshness on ms-fe2 is CRITICAL: Puppet has not run in the last 10 hours [23:05:32] PROBLEM - Puppet freshness on ms-be9 is CRITICAL: Puppet has not run in the last 10 hours [23:31:14] what is wgMathDirectory and why is it undefined and causing cron errors on fenari and hume? [23:31:20] RoanKattouw: ^^ [23:31:27] I ask you because one of them is running as your user [23:33:04] "This feature was removed completely in version 1.18." [23:34:44] it was moved to an extension [23:34:55] what cron job? [23:36:10] /usr/local/bin/mwscript extensions/ArticleFeedback/populateAFStatistics.php --wiki=metawiki --op=highslows,problems --rating_sets=1 > /dev/null [23:36:33] /usr/local/bin/mwscript extensions/ContributionReporting/PopulateFundraisingStatistics.php foundationwiki --op updatedays > /tmp/PopulateFundraisingStatistics-updatedays.log [23:36:48] I think those are the only two [23:36:52] they're spewing cronspam [23:38:18] also php /home/wikipedia/common/multiversion/MWScript.php extensions/TorBlock/loadExitNodes.php aawiki 2>&1 [23:38:23] mutante: can you look a cron change over for MaxSem ? [23:39:19] tfinc: in gerrit? 
[23:39:26] https://gerrit.wikimedia.org/r/#/c/17964/ [23:39:35] mutante: yes --^ [23:43:27] RoanKattouw: initialisation of wgMathDirectory was moved from InitialiseSettings.php to CommonSettings.php because it was being overridden by the extension setup file, maybe that's what broke your script [23:44:10] tfinc, Max: ehm.. yea ..given the existing reviews from Asher and Ryan i would really like them to comment again. [23:45:13] mutante: Asher is on vacation and I don't see Ryan_Lane around [23:45:20] who else could we ask to get this done today? [23:49:04] mutante, Asher's comment was about importing of .sql files, which is simply absent now [23:49:46] MaxSem: you could always try asking TimStarling nicely to take a look at it ;) [23:53:32] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [23:53:32] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [23:54:13] strong words from asher [23:54:43] where? [23:55:35] TimStarling: I need 5mins of your time when you're avaialble [23:55:40] "https://gerrit.wikimedia.org/r/#/c/17964/" [23:55:49] "nothing like this will ever be ok from a host on our cluster." [23:55:58] it looks OK to me [23:56:14] I don't have time to do a really detailed review though [23:56:51] oh I've seen that [23:56:55] it's horrible [23:57:31] did you see how it "drops table" then recreates it from a dump that it fetches from toolserver.org?
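To chase the wgMathDirectory cron spam mentioned at 23:31, one starting point is simply to see where the variable is (or is no longer) set, given Tim's note that it moved from InitialiseSettings.php to CommonSettings.php. A sketch; the exact file layout in mediawiki-config may differ, and the mwscript/eval.php usage is illustrative rather than a documented procedure:

    # In a checkout of operations/mediawiki-config:
    git grep -n 'wgMathDirectory'

    # On a host that runs the cron jobs, check what the setting resolves to for the
    # wiki named in the crontab lines quoted above:
    echo 'global $wgMathDirectory; var_dump( $wgMathDirectory );' \
        | mwscript eval.php --wiki=foundationwiki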