[00:04:13] Damianz: i just changed the link on wikitech. the one without /ng/ looks like the newer one, since it uses the host "professor" [00:04:30] as "profilehost [00:04:41] Cool [00:12:18] Are the files behind https://noc.wikimedia.org/conf/ in Git? [00:19:54] Probably in https://gerrit.wikimedia.org/r/gitweb?p=operations/mediawiki-config.git somewhere for anything not auto-generated. [00:33:55] RECOVERY - NTP on cp1021 is OK: NTP OK: Offset -0.04841589928 secs [00:41:53] Damianz, I mean the actual web page there, not the files it mentions [00:41:56] this: https://noc.wikimedia.org/conf/index.html [00:46:39] Oh [00:46:57] No idea then, I mis-understood your phrasing of the question, apologies [01:04:22] PROBLEM - Puppet freshness on ms-be8 is CRITICAL: Puppet has not run in the last 10 hours [01:27:35] PROBLEM - Puppet freshness on ms-be1 is CRITICAL: Puppet has not run in the last 10 hours [01:38:32] PROBLEM - Puppet freshness on ms-be2 is CRITICAL: Puppet has not run in the last 10 hours [01:41:32] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 234 seconds [01:42:44] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 261 seconds [01:48:26] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 604s [01:50:32] PROBLEM - Puppet freshness on srv193 is CRITICAL: Puppet has not run in the last 10 hours [01:54:35] PROBLEM - Puppet freshness on srv194 is CRITICAL: Puppet has not run in the last 10 hours [01:58:11] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 3 seconds [01:59:05] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 2s [01:59:32] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 8 seconds [02:18:35] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours [02:18:35] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [02:18:35] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours [02:18:35] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [02:24:35] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [02:40:17] RECOVERY - Varnish HTCP daemon on cp1021 is OK: PROCS OK: 1 process with UID = 997 (varnishhtcpd), args varnishhtcpd worker [02:53:38] PROBLEM - Puppet freshness on ms-fe2 is CRITICAL: Puppet has not run in the last 10 hours [03:03:41] PROBLEM - Puppet freshness on ms-be9 is CRITICAL: Puppet has not run in the last 10 hours [03:10:35] RECOVERY - Varnish traffic logger on cp1021 is OK: PROCS OK: 3 processes with command name varnishncsa [03:17:47] PROBLEM - MySQL Replication Heartbeat on db1020 is CRITICAL: CRIT replication delay 191 seconds [03:18:05] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 200 seconds [03:18:32] PROBLEM - MySQL Slave Delay on db1020 is CRITICAL: CRIT replication delay 226 seconds [03:18:41] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 229 seconds [03:29:37] Krenair: Don't think so. Not sure if they were ever in SVN either. [03:29:44] Probably worth a bug. 
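The question above about whether the noc.wikimedia.org/conf page lives in Git can be checked directly against the repository Damianz linked. A minimal sketch, assuming anonymous HTTP cloning of operations/mediawiki-config works from that Gerrit URL and that the page, if tracked, would sit under a docroot-style path (both are assumptions, not confirmed by the log):

    # Assumed clone URL; the gitweb link above points at the same repository.
    git clone https://gerrit.wikimedia.org/r/p/operations/mediawiki-config.git
    cd mediawiki-config
    # Look for anything resembling the conf index page; paths are guesses.
    git ls-files | grep -i -e 'noc' -e 'conf/index'
    git grep -li 'noc.wikimedia.org' -- '*.html' '*.php'

If nothing turns up, that matches the later answer ("Don't think so ... Probably worth a bug").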
[03:39:08] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [03:39:26] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [03:40:47] RECOVERY - MySQL Replication Heartbeat on db1020 is OK: OK replication delay 0 seconds [03:40:56] RECOVERY - MySQL Slave Delay on db1020 is OK: OK replication delay 0 seconds [03:51:08] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [03:51:08] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [04:43:33] PROBLEM - Puppet freshness on ms-be6 is CRITICAL: Puppet has not run in the last 10 hours [05:19:54] New review: Hashar; "Could probably get removed too. I think we will end up having to do some cleanup manually since I am..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/21675 [05:44:00] New patchset: Hashar; "simplify wrapper" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5778 [05:44:50] New review: Hashar; "I have simply rebased that change. Has been +1 by Aaron already." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/5778 [05:44:50] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5778 [05:47:55] New patchset: Hashar; "dedupe code: foreachwiki vs. foreachwikiindblist" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8434 [05:48:38] New review: Hashar; "Simply rebased the change. Removing code duplication is great to have." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/8434 [05:48:38] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/8434 [05:52:25] New patchset: Hashar; "(bug 38299) Computer Modern fonts for math rendering" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15550 [05:53:07] New review: Hashar; "PS3 is a rebase" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/15550 [05:53:07] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15550 [05:55:01] Change abandoned: Hashar; "cant remember what that change was for, I guess it is no more needed. If we ever have this issue aga..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16515 [05:56:38] Damianz: I already saw a bug for that, which might not state the vastness of the brekage though [06:00:21] New review: Hashar; "I guess we would also need several -dev packages to properly compile the npm packages. Isn't there a..." 
[operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/19397 [06:08:11] PROBLEM - Puppet freshness on ms-be10 is CRITICAL: Puppet has not run in the last 10 hours [06:08:11] PROBLEM - Puppet freshness on ms-be3 is CRITICAL: Puppet has not run in the last 10 hours [06:10:17] PROBLEM - Puppet freshness on ms-be4 is CRITICAL: Puppet has not run in the last 10 hours [06:54:21] PROBLEM - Puppet freshness on ms-be12 is CRITICAL: Puppet has not run in the last 10 hours [06:55:24] PROBLEM - Puppet freshness on ms-be11 is CRITICAL: Puppet has not run in the last 10 hours [07:36:21] PROBLEM - Puppet freshness on ms-be5 is CRITICAL: Puppet has not run in the last 10 hours [07:46:03] hello [07:47:04] morning [07:48:37] Damianz: looks like puppet freshness is fixed in Nagios \O/ [07:48:38] ;) [07:48:56] I made tidy [07:49:12] Still some stuff to fix that's running out of date puppet [07:49:19] I am wondering if we could make service checks depending upon host being up [07:49:32] I was thinking that, it's a little hard to tell atm [07:50:23] I'm attempting to rewrite the horrid bash/wget/c#/sed/grep/tr/awk thing for building configs atm [07:50:34] dohh [07:50:52] make sure to have that code somewhere in a public place [07:50:55] Hmm this is totally not the labs channel either heh [07:51:02] we could create a git repo for you if needed [07:51:05] oh yeah [07:51:23] well ops are all sleeping anyway (or too busy right now :D ) [07:51:31] Well I'm going to write it, github it, annoy chad to make a gerrit repo, bug petan to review it pointing out it's cleaner then hopefully it can go 'live' as we fix nagios long term. [08:03:02] PROBLEM - Puppet freshness on ms-fe4 is CRITICAL: Puppet has not run in the last 10 hours [08:19:59] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: Puppet has not run in the last 10 hours [08:19:59] PROBLEM - Puppet freshness on ms-be1002 is CRITICAL: Puppet has not run in the last 10 hours [08:19:59] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours [08:19:59] PROBLEM - Puppet freshness on ms-be1001 is CRITICAL: Puppet has not run in the last 10 hours [08:19:59] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [08:20:00] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours [08:20:00] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours [08:20:01] PROBLEM - Puppet freshness on ms-fe1001 is CRITICAL: Puppet has not run in the last 10 hours [08:20:01] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [08:20:02] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [08:20:02] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [08:20:03] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [08:20:03] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [08:25:37] New patchset: Hashar; "manpages for our misc scripts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16606 [08:26:30] New review: Hashar; "PS3: copy man files to /usr/local/share/man/man1" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/16606 [08:26:30] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16606 [08:36:26] New patchset: Hashar; "manpages for our misc scripts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16606 [08:37:10] New review: Hashar; "PS4: rebased (misc-script is now in manifests/misc/deployment.pp)." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/16606 [08:37:10] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16606 [08:46:18] New patchset: Hashar; "manpages for our misc scripts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16606 [08:47:09] New review: Hashar; "PS5: ignore asciidoc files not starting with a letter (such as the template _annotated.txt) and dele..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/16606 [08:47:09] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16606 [09:27:29] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [09:29:20] New review: Hashar; "I still want beta to be able override robots by editing [[Mediawiki:robots.txt]] so we can eventuall..." [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/21602 [09:32:23] Change abandoned: DamianZaremba; "Needs a role class adding to allow changing the var in labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21307 [09:52:58] New patchset: Hashar; "manpages for our misc scripts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16606 [09:53:51] New review: Hashar; "PS6: fix all issue from PS5. Thanks a lot for the review Dereckson!" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/16606 [09:53:52] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16606 [10:11:44] New patchset: ArielGlenn; "wansecurity host able to rsync other and archives" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21868 [10:12:29] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21868 [10:12:49] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21868 [10:52:08] New patchset: ArielGlenn; "tool for checking which media files are used by which projects This is a one-off but I don't want to lose it in case I need it again" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/21871 [10:53:13] New review: ArielGlenn; "No, it's not great code. Yes, it needs to live somewhere. Yes, I need to really move production th..." 
[operations/dumps] (ariel); V: 1 C: 2; - https://gerrit.wikimedia.org/r/21871 [10:53:13] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/21871 [11:05:38] PROBLEM - Puppet freshness on ms-be8 is CRITICAL: Puppet has not run in the last 10 hours [11:28:40] PROBLEM - Puppet freshness on ms-be1 is CRITICAL: Puppet has not run in the last 10 hours [11:39:37] PROBLEM - Puppet freshness on ms-be2 is CRITICAL: Puppet has not run in the last 10 hours [11:51:37] PROBLEM - Puppet freshness on srv193 is CRITICAL: Puppet has not run in the last 10 hours [11:55:40] PROBLEM - Puppet freshness on srv194 is CRITICAL: Puppet has not run in the last 10 hours [12:19:40] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours [12:19:40] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [12:19:40] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours [12:19:40] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [12:26:00] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [12:38:00] PROBLEM - Puppet freshness on palladium is CRITICAL: Puppet has not run in the last 10 hours [12:42:20] !log enabling Jenkins on TitleBacklist extension (job: Ext-TitleBlacklist ) [12:42:30] Logged the message, Master [12:48:53] !log reloading Jenkins to fix up Ext-TitleBlacklist misconfiguration [12:49:03] Logged the message, Master [12:54:57] PROBLEM - Puppet freshness on ms-fe2 is CRITICAL: Puppet has not run in the last 10 hours [12:59:13] New patchset: Mark Bergsma; "Migrate pmtpa /home to the NetApp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21877 [12:59:58] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21877 [13:05:00] PROBLEM - Puppet freshness on ms-be9 is CRITICAL: Puppet has not run in the last 10 hours [13:05:16] THUNDER [13:05:20] erm... wrong channel [13:05:30] !log Shutdown fenari and hume [13:05:40] Logged the message, Master [13:07:11] quick question, can somebody invite me for the #wikimedia-staff channel? 
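On the earlier question (around 07:49) of making Nagios service checks depend on the host being up: one common way to approximate that is a servicedependency that gates a noisy check such as "Puppet freshness" on a basic reachability service like "SSH" for the same host. A sketch only, not the production configuration; the host and service names are borrowed from alerts elsewhere in this log:

    # Hypothetical Nagios object definition; ms-be1 / "SSH" / "Puppet freshness"
    # are names taken from alerts in this log, not from the real config.
    define servicedependency{
        host_name                       ms-be1
        service_description             SSH
        dependent_host_name             ms-be1
        dependent_service_description   Puppet freshness
        execution_failure_criteria      c,u    ; skip the check when SSH is CRITICAL/UNKNOWN
        notification_failure_criteria   w,c,u  ; and suppress notifications too
    }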
[13:07:32] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21877 [13:08:09] PROBLEM - Host hume is DOWN: CRITICAL - Host Unreachable (208.80.152.190) [13:10:14] New patchset: Mark Bergsma; "Remove admins::dctech from applicationserver role class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21878 [13:10:33] PROBLEM - Host fenari is DOWN: CRITICAL - Host Unreachable (208.80.152.165) [13:11:01] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21878 [13:12:58] RECOVERY - Puppet freshness on srv193 is OK: puppet ran at Wed Aug 29 13:12:50 UTC 2012 [13:18:21] RECOVERY - Puppet freshness on srv194 is OK: puppet ran at Wed Aug 29 13:18:04 UTC 2012 [13:25:43] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [13:26:00] stupid check [13:26:13] it can't even run the check script [13:28:57] !log Changed /home mount from nfs1 to nas1-a on srv193 and spence [13:29:06] Logged the message, Master [13:37:05] !log Mounted /home read-only on nfs1, started final rsync to nas1-a [13:37:16] Logged the message, Master [13:38:02] exciting [13:52:25] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , zhwiki (19630) [13:52:34] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [13:52:34] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [14:32:56] mark: lemme know when you want to review frackpuppet [14:34:17] <^demon> Fenari still down for usage? [14:35:41] yes, fenari still down [14:35:47] <^demon> mmk. I'll find something else to do :) [14:36:17] there's bast1001 [14:38:29] copy is taking rather long [14:38:53] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5778 [14:39:53] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8434 [14:40:25] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15550 [14:42:06] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17470 [14:42:48] mark: are we going to refactor syslog as part of the nfs1/2 deprecation? [14:43:04] I'm asking because of some syslog overlap with labs, not sure if you saw hashar's mail [14:43:54] no [14:44:06] thta is [14:44:12] i'm not depreciating nfs1/nfs2 now [14:44:17] what happens to them later, I dunno yet [14:44:53] PROBLEM - Puppet freshness on ms-be6 is CRITICAL: Puppet has not run in the last 10 hours [14:44:54] paravoid: hello :-) [14:46:28] New patchset: Faidon; "(bug 38946) hebrew fonts for SVG rendering" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21588 [14:46:48] hashar: hi :) [14:47:12] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21588 [14:47:28] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21588 [14:48:05] i certainly want that to change anyway [14:48:09] paravoid: maybe we should rewrite the syslog-ng stuff to be based on rsyslog ;) [14:48:23] logging onto nfs /home is rather suboptimal [14:48:43] indeed [14:48:45] and it's mostly the reason this copy is taking so long [14:48:46] hashar: yes we should [14:49:04] is rsyslog better than syslog-ng? 
[14:49:17] I generally prefer it but I have no problem using either [14:49:20] as long as we stick to one [14:49:24] we currently use both [14:49:33] for very different purposes [14:49:40] years ago I evaluated both, ended up choosing syslog-ng because it was easier to administrate [14:49:41] but i don't care much either [14:49:42] but still, it's... syslog [14:49:53] but rsyslog has more feature AFAIK [14:50:11] rsyslog has some interesting stuff, such as RELP [14:50:19] reliable logging/retransmissions etc. [14:50:44] not sure if we'd want to use that but it always feeled more modern to me [14:50:52] but I think syslog-ng has catched up lately too [14:51:57] the obvious choice would be to pick whatever Ubuntu uses by default and assume it will be better maintained by them because it is deployed on more machines [14:52:04] which is rsyslog [14:52:17] fwiw opensuse went rsyslog recently too [14:52:22] so we need to migrate our syslog-ng conf to rsyslog [14:54:44] re: https://gerrit.wikimedia.org/r/#/c/17973/ what's our policy on shell access? discuss it on an ops meeting? [15:03:06] discuss it in the rt ticket [15:03:45] there is none :) I'll ask Roan to open one [15:03:50] it's for shell access to singer [15:04:26] hiyyya paravoid, when you have a sec, could you see if this is better than the other one? [15:04:28] https://gerrit.wikimedia.org/r/#/c/21749/ [15:05:00] there needs to be one [15:05:05] hi andrew [15:05:25] ottomata: looks "good" to me. mark will probably like to comment on that too. [15:05:28] New review: Mark Bergsma; "Where is the RT ticket for this request?" [operations/puppet] (production); V: 0 C: -2; - https://gerrit.wikimedia.org/r/17973 [15:06:18] New review: Faidon; "I don't see an RT ticket for that and I think access requests need approval. Could you open one and ..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/17973 [15:06:25] argh [15:09:21] mark, I think you are probably busy with the netapp stuff right now, but when you get a sec: [15:09:21] https://gerrit.wikimedia.org/r/#/c/21749/ [15:09:24] hah [15:10:39] jeremyb: thanks for the swift 1.7 heads up. I happened to see that beforehand entirely accidentally. [15:10:55] sure [15:11:14] i think i chose the right people to send it to [15:14:00] fucking swift logs are huge [15:15:44] mark: that's because they're not sampled or anything [15:15:54] and because we log at both the proxy and object server [15:16:03] you could store them in swift ;) [15:16:12] so for each request you have 2-3 lines [15:16:25] should turn that off soon [15:23:49] New review: Dereckson; "Test procedure followed:" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/16606 [15:29:39] New patchset: Demon; "Overhauling gerrit manifest to be a role class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13484 [15:30:28] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/13484 [15:32:33] Change abandoned: Demon; "Actually, this isn't needed at all. The labs instance was running precise, which suffers from this b..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21848 [15:40:40] Change abandoned: Demon; "This needs discussion before I go forward with it. And I want my omnibus gerrit change in first." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11451 [15:48:58] !log Final rsync finished. 
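For the syslog-ng to rsyslog migration discussed around 14:48, the client side can be very small. A hedged sketch, with "logserver.example.org" as a placeholder rather than a real host, and assuming the rsyslog-relp package is installed if RELP is wanted:

    # /etc/rsyslog.d/remote.conf -- sketch only
    # Plain forwarding: "@@" is TCP, a single "@" would be UDP.
    *.*  @@logserver.example.org:514

    # Reliable delivery via RELP, as mentioned above (needs the omrelp module):
    $ModLoad omrelp
    *.*  :omrelp:logserver.example.org:2514

The server side would load imtcp/imrelp and write per-host files somewhere better than the NFS /home that, per the discussion above, the current syslog-ng setup logs onto.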
[15:49:01] !log Powering up hume [15:49:07] Logged the message, Master [15:49:17] Logged the message, Master [15:52:33] New patchset: Mark Bergsma; "No longer enable nfs server on nfs1/nfs2" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21893 [15:53:11] RECOVERY - Host hume is UP: PING OK - Packet loss = 0%, RTA = 0.17 ms [15:53:20] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21893 [15:53:27] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21893 [15:54:43] \o/ [15:58:14] you gotta be kidding me [15:58:18] fstab on hume: [15:58:19] albert:/a/old-static/mnt/albert-staticnfsbg,soft,udp,rsize=8192,wsize=8192,timeo=1400 [15:59:10] :-D [15:59:19] well that's going to come up [15:59:56] fenari is currently in fsck [16:00:04] !log Powering up fenari [16:00:07] :P [16:00:15] RECOVERY - Host fenari is UP: PING OK - Packet loss = 0%, RTA = 0.53 ms [16:00:18] Logged the message, Master [16:02:43] jeremyb: paravoid merged some of your changes 8434 5778 :-))) [16:03:04] i kinda half saw, danke [16:03:20] 4 digit merges are at a premium now [16:03:32] PROBLEM - SSH on fenari is CRITICAL: Connection refused [16:04:04] Change merged: Demon; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/16036 [16:04:16] !log Fixed serial console on hume [16:04:17] PROBLEM - HTTP on fenari is CRITICAL: Connection refused [16:04:25] * jeremyb pokes Krinkle|detached for !g 21393 [16:04:26] Logged the message, Master [16:05:02] RECOVERY - SSH on fenari is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [16:05:17] now for 8120 ;) [16:06:25] and 8344 [16:06:28] by hashar [16:06:29] mark: I think serial consoles are broken pretty much across the fleet [16:07:01] why do you think that? [16:07:10] that's not my impression [16:07:18] we have bios console redirection enabled [16:07:22] err, bye* /me is still half asleep [16:07:26] RECOVERY - HTTP on fenari is OK: HTTP OK HTTP/1.1 200 OK - 4416 bytes in 0.020 seconds [16:07:29] and grub/linux configured to log to serial [16:07:31] yeah, grub2 barfs on that [16:07:36] i know [16:07:39] but that's just during boot [16:08:12] no, I think we have the setting for post-boot console redirection enabled [16:08:18] i know what you mean [16:08:38] looking at mysql grants for the payments database, wikiuser has Repl_client_priv -- does mediawiki do some fancy slave status checks? or is that likely an error? [16:09:23] PROBLEM - Puppet freshness on ms-be10 is CRITICAL: Puppet has not run in the last 10 hours [16:09:23] PROBLEM - Puppet freshness on ms-be3 is CRITICAL: Puppet has not run in the last 10 hours [16:10:35] !log enabling srv194 and srv281 in pybal for apache/precise live test [16:10:43] yay :) [16:10:45] Logged the message, notpeter [16:11:29] PROBLEM - Puppet freshness on ms-be4 is CRITICAL: Puppet has not run in the last 10 hours [16:11:40] Jeff_Green: maybe related to https://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=dbrepllag&sishowalldb= ? [16:12:05] * jeremyb runs away [16:12:12] hahahah [16:12:18] come back come back! [16:12:21] j/k. that makes sense [16:12:32] I'm pretty sure MW checks for slave replication lag [16:12:33] (literally -> subway ;) [16:12:55] so you prefer snoop to dtrace, paravoid? 
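Regarding Jeff_Green's question at 16:08 about wikiuser having Repl_client_priv: MediaWiki's load balancer does check slave lag, which is presumably what the grant is for, and the API URL jeremyb linked exposes the same numbers. Two ways to look at it from a shell (host and credentials below are placeholders):

    # What the lag check boils down to; SHOW SLAVE STATUS needs the
    # REPLICATION CLIENT (or SUPER) privilege.
    mysql -h db1025.example -e 'SHOW SLAVE STATUS\G' | grep -i seconds_behind_master

    # The same information via the public API, as linked above:
    curl -s 'https://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=dbrepllag&sishowalldb=&format=json'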
[16:14:03] I prefer not using Solaris [16:14:04] :P [16:15:19] that goes without saying [16:18:01] more seriously, I've never used DTrace [16:18:23] by the time it got released I was using Linux already, so... [16:18:35] oh [16:18:53] well now readers have two ways they can see the get requests over there [16:19:02] :-) [16:19:08] three actually [16:19:10] not that it matters, I was just curious [16:19:24] two listed on the page [16:19:34] oh for get, right [16:19:46] you snooped for NFS traffic [16:19:49] I snooped for HTTP traffic [16:19:50] uh huh [16:19:58] and you also dtraced [16:20:03] yup [16:20:26] did you have any luck with the hardware faults btw? [16:20:27] don't think that means I knw a bunch about it though, it's waaay too complicated [16:20:51] oh I mentioned that yesterday, I updated the ms-be-somthing rt ticket [16:20:57] about changing the disks [16:21:17] looking [16:21:19] short summary, I think we need to get dell to swap out some fcrap [16:21:38] yeah [16:26:33] New patchset: Dereckson; "(bug 39767) Break lines in review Gerrit comments" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21895 [16:27:19] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21895 [16:28:37] !log removing srv281 from apaches pool [16:28:47] Logged the message, notpeter [16:28:48] oh? [16:28:52] test failed? [16:29:19] something isn't right about it. ganglia doesn't show and resource usage and nothing is showing up in the logs on fluorine [16:29:29] cmjohnson1, RobH: do you have plan regarding swift/c2100 boxes? want to talk about it? [16:29:45] srv194 looks totally normal, though, which is great [16:29:50] paravoid: Plan for what? [16:29:58] *any resource usage [16:30:04] what to do? just replace hardware? or escalate to Dell? [16:30:06] paravoid: as in repair plans? [16:30:16] ms-be6 had its memory broken and now its controller broken too [16:30:19] is the hardware broken? ahh [16:30:23] ms-be10 has its controller broken as well [16:30:31] paravoid: first step is to escalate it to Dell [16:30:34] ok, well, you drop a ticket in pmtpa queue and chris will test the hardware and get it replaced under warranty [16:30:49] but you need to ensure he can take it offline to do so, so if its pooled link instructions and such [16:30:58] if its just offline and he can pull and work, note that in ticket. [16:31:05] we have a ticket, #3282 [16:31:21] yep... i read the comments last night from apergos [16:31:24] ...... [16:31:31] argh [16:31:39] i wish you guys would make different tickets for different issues [16:31:45] we tie those to Dell returns and the like [16:31:50] one ticket per problem in future please. [16:32:15] paravoid: so since its tampa hardware you work with chris to get it fixed [16:32:18] if it was ashburn, would be me [16:32:44] I'm talking about the general problem "C2100s have a big failure rate" [16:32:51] prolly contoller, could be also mb [16:32:51] that's not something that's related to tampa or swift or the ms-be tickets [16:33:11] paravoid: we arent ordering them anymore is my understanding [16:33:20] some folks had problems with these c2100s so bad that even after replacing mb and controller and other pieces, it still didn't resolve and they eventually got send new systems [16:33:20] but we cannot just stop using what we have, so we have to just get warranty repair.
[16:33:21] they don't make them anymore either [16:33:46] so if we do enough repair work and it still doesnt work, then as apergos points out dell will eventually replace it [16:33:49] anyways it will be slog through swapping crap out with dell [16:34:01] I'm basically asking if you have an escalation plan or if we're going to deal with the issues as they come [16:34:02] yep [16:34:02] but they wont do it on any given system until we have a repair history on that specific system [16:34:07] paravoid: as they come. [16:34:12] not sure how else we would handle them [16:34:16] we have less than 20 c2100s. [16:34:24] well, wait, less than 30. [16:34:29] (i forgot anaylitics ;) [16:35:16] paravoid: What would you suggest otherwise? [16:35:24] * RobH is open to suggestions [16:35:33] cuz i hate the c2100s as well. [16:35:39] * Damianz offers thermite to paravoid [16:35:52] sledgehammers [16:35:56] so much more satisfying [16:35:57] Damianz: the folks below our datacenter floor do not appreciate that. [16:36:18] I'm sure dell would replace their kit if it's under warrenty :D [16:36:21] however the good news is that eventually people did get working systems [16:36:22] they are in tampa, the broken servers ;] [16:36:32] Damianz: yep, so we have to handle as they break is my stance [16:36:48] paravoid: since we are talking ms-be?... there are a few bad drives in ms-be7 and 8 that i have replacements for... do you wanna swap them out? can you see which drives are not mounted? [16:36:52] robh: yes [16:37:12] the latest batch of c2100's have come littered w/ hdd problems. [16:37:19] 2 or 3 per box [16:37:27] and apparently controllers now too [16:37:30] it's easy to see which drives are causing complaints, there's a bunch of whingin in the logs about em [16:37:48] ms-be6 had its memory broken, you replaced it, and now it has its controller broken it seems [16:37:54] never made it into production [16:38:10] cmjohnson1: looking @ ms-be7/8 [16:38:30] so the rumor mill is that dell doesn't actually make these, they are manufactured elsewhere [16:38:53] before I spread that rumor and it's logged (well after) I should find the email where I read it [16:39:01] paravoid: I don't wanna make it sound like we won't fix them, we totally will, just not sure how else to handle it than one problem at a time =] [16:39:36] RobH: not challenging your ways, just wondering about your plans [16:39:40] robh: i think he is trying to say that these systems are very unstable and may be better served as something else [16:39:53] I'm not saying that, not yet at least :) [16:39:56] no worries, i just didnt wanna come across as rude, irc is hard to convey tone =] [16:40:08] Need that sarcasm font back [16:40:12] http://lists.linbit.com/pipermail/drbd-user/2011-December/017361.html here's where they say it was outsourced [16:40:20] Damianz: that font gets me into troule [16:40:21] trouble [16:41:00] I think we have to do a few of them as they break, but then we might be able to lean on Dell after that at the first sign of trouble with any others [16:42:20] I don't know how reliable the email source is, just pointing it out [16:48:15] paravoid: it was just ms-be7 sag... but i see it mounted /dev/sdg1 on /srv/swift-storage/sdg1 type xfs (rw,noatime,nodiratime,nobarrier,logbufs=8) [16:49:00] yep, seems mounted [16:49:02] complains too [16:49:12] the other issues were w/ ms-be6 and ms-be10 but w/ a bad controller it is hard to say what is good and bad at this time [16:49:58] i have a disk for it but I am not sure which slot it is [16:50:16] for
what? ms-be7 you mean? [16:50:22] yes [16:51:03] so, how do you usually handle this? [16:53:40] New patchset: MaxSem; "Log API errors caused by the WLM app in a separate file" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21753 [16:53:40] well for the c2100's the drive slots do not match the drive order... last time, ben tried to force mount the bad disk and a red led showed up on the system. [16:54:09] the system can't even see the disk anymore I'm afraid [16:54:11] how many c2100s are there? [16:54:19] 24 i think [16:54:30] no more [16:54:33] analytics bought them too [16:54:41] and I have no idea why [16:54:48] oohh bummer :-( [16:54:51] debugging each of them individually is going to be a huge time suck [16:55:10] * robla wonders if we really can't make this a general escalation [16:55:29] we should totally contact dell about getting them all replaced [16:55:37] if we see general issues repeatedly with all of them [16:55:48] PROBLEM - Puppet freshness on ms-be12 is CRITICAL: Puppet has not run in the last 10 hours [16:56:08] cmjohnson1: so, it should be the only disk which is idle or even powered off [16:56:18] this is why we want to do the first couple and then say "see these other ones? fix them all" [16:56:39] (assuming we have general issues with several others by then) [16:56:51] PROBLEM - Puppet freshness on ms-be11 is CRITICAL: Puppet has not run in the last 10 hours [16:57:49] paravoid: based on that it looks like disk7... zero activity going on [16:58:15] i am nearly certain they are hot swappable... robh can you confirm [16:59:34] ok, off to a meeting, back in a while. [17:01:41] !log swapping out disk7 (dev/sdg) on ms-be7 [17:01:51] Logged the message, Master [17:06:42] so, the controller says it can't see a disk in slot 4 (counting from 0) [17:06:52] all the others are accounted for [17:06:58] based on their serial number [17:07:33] sda/b are the SSDs, sdc is 0, sdd is 1, sde is 2, sdf is 3, sdg is 4 (and so on) [17:07:53] cmjohnson1: the 3.5" disks are hot swap in the c2100s [17:07:55] so Linux's counting matches the controllers [17:07:56] the SSDs are not. [17:08:14] !g 21393 [17:08:14] https://gerrit.wikimedia.org/r/#q,21393,n,z [17:08:31] jeremyb: yes? [17:09:06] !log returning srv281 to apaches pool [17:09:16] Logged the message, notpeter [17:09:46] cmjohnson1: and now I see the new disk, a Toshiba (vs. WDs) [17:10:10] cmjohnson1: do you see a locate led perhaps?
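For cmjohnson1's question above ("can you see which drives are not mounted?"), a rough sketch; the /srv/swift-storage layout follows the mount output quoted at 16:48, and the log grep is just one way to spot the drives "causing complaints":

    # List expected swift storage directories that are not actually mounted.
    for d in /srv/swift-storage/*; do
        mountpoint -q "$d" || echo "not mounted: $d"
    done

    # Disks throwing errors tend to show up in the kernel log.
    dmesg | grep -iE 'sd[a-z]+[0-9]*.*(error|fail)' | tail -n 20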
[17:10:12] PROBLEM - check_apache2 on payments1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2 [17:11:11] paravoid: no locate led [17:11:56] !log removing srv281 once again and reimaging [17:12:06] Logged the message, notpeter [17:13:17] that's a pity [17:13:30] I'm sure I sent the command right, probably the hardware doesn't have a led [17:14:27] paravoid: try again [17:14:37] wasn't looking [17:14:49] ok..it is blinking [17:15:18] PROBLEM - check_apache2 on payments1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2 [17:15:28] oh, cool [17:15:39] that is the disk i replaced [17:15:47] yep [17:16:22] good, now we can communicate better [17:16:33] I installed the proprietary tool for that btw [17:16:35] it's called "sas2ircu" [17:16:48] so you can do sas2ircu list to see controllers (there's just one, "0") [17:16:55] then sas2ircu 0 display to show all the disks [17:17:03] that disk is enclosure 2 bay 4 [17:17:10] so then you do sas2ircu 0 locate 2:4 on [17:18:07] and you can see serial numbers of linux block devices with either "hdparm -i" or "smartctl -d ata -i" [17:19:20] New patchset: Pyoungmeister; "setting srv281 as an apache with the new role class :/" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21899 [17:19:36] that is cool [17:19:50] :-) [17:20:06] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21899 [17:20:15] PROBLEM - check_apache2 on payments1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2 [17:20:15] notpeter: that role class didn't exist when I re-imaged it iirc [17:20:38] paravoid: gotcha. I had assumed that it was using it [17:20:41] I don't know why... [17:20:50] but hey! explains why it was acting a bit weird! [17:20:51] PROBLEM - SSH on srv281 is CRITICAL: Connection refused [17:21:00] PROBLEM - Apache HTTP on srv281 is CRITICAL: Connection refused [17:21:39] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21899 [17:25:12] PROBLEM - check_apache2 on payments1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2 [17:28:30] RECOVERY - SSH on srv281 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [17:30:18] RECOVERY - check_apache2 on payments1001 is OK: PROCS OK: 9 processes with command name apache2 [17:36:18] New patchset: MaxSem; "WLM updater script" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17964 [17:37:05] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/17964 [17:37:48] RECOVERY - Apache HTTP on srv281 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.004 seconds [17:37:48] PROBLEM - Puppet freshness on ms-be5 is CRITICAL: Puppet has not run in the last 10 hours [17:40:03] PROBLEM - NTP on srv281 is CRITICAL: NTP CRITICAL: Offset unknown [17:49:56] !g 21393 | Krinkle [17:49:56] Krinkle: https://gerrit.wikimedia.org/r/#q,21393,n,z [17:50:02] that's all [17:50:16] Yes, I saw that. 
(I get two e-mails for it as well) [17:50:25] np, I thought you wanted to chat about it [17:50:42] PROBLEM - Apache HTTP on srv281 is CRITICAL: Connection refused [17:50:49] no, just thought maybe it was ready for you to do something else with it [17:50:51] (maybe you know how if I should and if so how I can configure my shell to work with this) [17:51:07] I don't have merge rights there [17:51:09] oh wait I do [17:51:13] but its my own commit [17:51:51] i don't even understand the difference [17:51:52] bye [17:55:12] RECOVERY - Apache HTTP on srv281 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.025 second response time [17:56:20] !log returning srv281 to apaches pool [17:56:30] Logged the message, notpeter [17:58:24] notpeter: hahaha, you think so eh? [17:58:31] that server will redie. [17:58:34] its cursed. [17:59:19] I can vouch for that [17:59:26] it's definitely cursed [17:59:40] let's just tell Chris to throw the box into the hurricane and be done with it [17:59:57] no no, search32 is ahed of 281 for that... [18:00:04] !log reinstalling cp1022 and up [18:00:20] Logged the message, Master [18:01:35] nah, it's working now [18:01:53] srv281, I mena, not search32. search32 will never work for nore than 48 hours. ever. [18:02:06] PROBLEM - Host cp1022 is DOWN: PING CRITICAL - Packet loss = 100% [18:02:41] notpeter: search32 :( that reminds me... i need to call DELL about that again! [18:03:05] do they have special explosives for destroying problematic hosts? [18:03:27] cuz I think that would be the most helpful thing... [18:03:28] RECOVERY - Varnish HTTP upload-backend on cp1022 is OK: HTTP OK HTTP/1.1 200 OK - 632 bytes in 2.412 seconds [18:03:36] RECOVERY - Host cp1022 is UP: PING OK - Packet loss = 0%, RTA = 26.74 ms [18:03:43] no... but I am creating a 'lessons learned" so I have notes for when ms-be6 and 10 never work again [18:04:08] okay, dinner time, be back in a few hours. [18:04:21] PROBLEM - Puppet freshness on ms-fe4 is CRITICAL: Puppet has not run in the last 10 hours [18:04:57] RECOVERY - NTP on srv281 is OK: NTP OK: Offset -0.0473690033 secs [18:06:20] New patchset: Demon; "Updating all wikipedias to 1.20wmf10" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21905 [18:06:38] Change merged: Demon; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21905 [18:07:13] PROBLEM - Host cp1022 is DOWN: PING CRITICAL - Packet loss = 100% [18:09:12] haha, @Windows Defender making sure that doubleclick can serve ads [18:09:36] PROBLEM - Host cp1023 is DOWN: PING CRITICAL - Packet loss = 100% [18:11:15] this sounds somehow familiar "Maybe Microsoft will get their act together with the next OS." [18:12:54] RECOVERY - Host cp1022 is UP: PING OK - Packet loss = 0%, RTA = 26.50 ms [18:15:18] RECOVERY - Host cp1023 is UP: PING OK - Packet loss = 0%, RTA = 26.71 ms [18:16:21] PROBLEM - SSH on cp1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:16:30] PROBLEM - Varnish traffic logger on cp1022 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:17:15] PROBLEM - Varnish HTTP upload-backend on cp1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:17:15] PROBLEM - Varnish HTCP daemon on cp1022 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
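Pulling together the disk-locating steps paravoid spelled out around 17:16-17:18 into one place (enclosure:bay 2:4 is the example from that exchange; substitute the values sas2ircu reports for the disk in question):

    sas2ircu list                  # list controllers; here there is only one, "0"
    sas2ircu 0 display             # show every disk with enclosure/bay and serial number
    sas2ircu 0 locate 2:4 on       # blink the locate LED on enclosure 2, bay 4
    sas2ircu 0 locate 2:4 off      # turn it back off when done

    # Map a Linux block device to a physical serial number, per the same exchange:
    hdparm -i /dev/sdg | grep -i serial
    smartctl -d ata -i /dev/sdg | grep -i 'serial number'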
[18:17:24] PROBLEM - Varnish HTTP upload-frontend on cp1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:18:27] PROBLEM - Varnish HTTP upload-frontend on cp1023 is CRITICAL: Connection refused [18:18:36] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: Connection refused by host [18:18:36] PROBLEM - SSH on cp1023 is CRITICAL: Connection refused [18:19:13] PROBLEM - Varnish HTCP daemon on cp1023 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:19:13] PROBLEM - Varnish HTTP upload-backend on cp1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:19:30] RECOVERY - SSH on cp1022 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [18:21:18] PROBLEM - Puppet freshness on ms-be1002 is CRITICAL: Puppet has not run in the last 10 hours [18:21:18] PROBLEM - Puppet freshness on ms-be1001 is CRITICAL: Puppet has not run in the last 10 hours [18:21:18] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: Puppet has not run in the last 10 hours [18:21:18] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours [18:21:18] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours [18:21:19] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [18:21:19] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours [18:21:20] PROBLEM - Puppet freshness on ms-fe1001 is CRITICAL: Puppet has not run in the last 10 hours [18:21:20] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [18:21:21] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [18:21:21] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [18:21:22] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [18:21:22] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [18:21:45] RECOVERY - SSH on cp1023 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [18:36:00] PROBLEM - NTP on cp1022 is CRITICAL: NTP CRITICAL: No response from NTP server [18:37:02] Krinkle: sorry, left so abrubtly, train was going underground and if i typed anything longer, would have been too late ;) [18:37:19] Krinkle: what's `echo $'docroot/foundation/lev\303\251e_de_fonds.html' | sed -n l` look like for you? [18:37:30] PROBLEM - NTP on cp1023 is CRITICAL: NTP CRITICAL: No response from NTP server [18:38:06] jeremyb: Note sure, I know the individual parts, not sure what it does exactly or why I'd need it [18:38:14] !log deleting global role groups from LDAP. They aren't needed in keystone's LDAP DIT [18:38:25] Logged the message, Master [18:38:29] Krinkle: i mean if you just right that, what is the output? [18:38:47] Krinkle: this is mac? [18:38:54] jeremyb: "docroot/foundation/lev\303\251e_de_fonds.html" [18:38:55] jeremyb: yes [18:39:07] errr, wtf [18:39:13] oh, wait, nvm [18:39:16] that's good ;) [18:39:39] Krinkle: so the same `git rm` i ran should have worked for you. you just didn't know what to run [19:07:28] New patchset: Pyoungmeister; "adding by-site ganglia clusters to applicationserver role class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21907 [19:07:56] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2563* [19:08:13] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21907 [19:08:50] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21907 [19:10:37] Does anyone know if poolcounter has to run on an external IP ? [19:10:46] it runs on tarin.w.o in tampa [19:17:05] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2563* [19:18:04] RobH: I wouldn't think so but you'd have to ask someone who actually knows PC. By which I mean Tim [19:18:16] apergos: I replied to https://wikitech.wikimedia.org/view/Swift/Open_Issues_Aug_-_Sept_2012/Cruft_on_ms7#skins [19:18:29] RoanKattouw: I dont think it needs to be either, looking over the wikitech docs it doesnt appear to be [19:18:40] so I am going to go ahead and allocate the servers, but not set them up or install the os quite yet. [19:18:45] thanks RoanKattouw [19:19:23] figures, I can't load the page cause I'm busy grabbing some media data from fenari :-/ [19:19:30] stupid slow internet connection [19:19:45] lol [19:20:02] ah it loaded, well there's an exception! usually dns craps out on me [19:20:10] yeah I can give you the referer, not in here though [19:20:18] Sure [19:25:22] New review: Ryan Lane; "Fix site.pp for formey and we'll be good." [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/13484 [19:28:02] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [19:34:38] RobH, should we mention the inconsistent NIC naming between OS and bios on the C2100s in that thread? [19:34:39] New patchset: Pyoungmeister; "fixing incorrect cluster definition in applicationserver role class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21947 [19:35:14] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2400 [19:35:26] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21947 [19:35:37] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21947 [19:38:29] ottomata: that was due to us putting in a secondary nic [19:38:38] its annoying but happens on most systems [19:38:42] hmm, ok cool [19:38:43] just checking [19:38:45] takes the add on nic as 'primary' in the os [19:38:53] yep, thx for checkin =] [20:12:30] New patchset: Aaron Schulz; "Enumerate math/timeline containers for copy scripts." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21952 [20:29:45] !log changing apache config and wiki location on virt0 [20:29:56] Logged the message, Master [20:31:04] New patchset: Ryan Lane; "Change location of wiki to w, rather than version" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21953 [20:31:51] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21953 [20:31:56] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21953 [20:34:57] can someone please review https://gerrit.wikimedia.org/r/17964 ? [20:54:56] New patchset: Demon; "Overhauling gerrit manifest to be a role class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13484 [20:55:42] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/13484 [21:07:11] PROBLEM - Puppet freshness on ms-be8 is CRITICAL: Puppet has not run in the last 10 hours [21:17:53] New patchset: Demon; "Overhauling gerrit manifest to be a role class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13484 [21:18:42] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/13484 [21:21:26] mw8: ssh: connect to host mw8 port 22: Connection timed out [21:21:27] srv266: ssh: connect to host srv266 port 22: Connection timed out [21:21:36] ^demon: This is SNAFU with sync? [21:21:47] (the rest went fine) [21:21:53] <^demon> Yeah, those have been down. [21:21:55] <^demon> No big deal [21:21:55] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13484 [21:21:59] ok.. [21:22:23] I guess they should be commented out (or removed) from some kind of registry, which hasn't happened yet? [21:25:20] PROBLEM - SSH on virt1001 is CRITICAL: Connection refused [21:30:08] PROBLEM - Puppet freshness on ms-be1 is CRITICAL: Puppet has not run in the last 10 hours [21:31:56] PROBLEM - Host virt1001 is DOWN: PING CRITICAL - Packet loss = 100% [21:35:22] Krinkle... srv266 is down and powered down at the moment [21:38:36] I figured, but it "should" not be trying to access it. since it is known (and for a white apparently, as in hours or days) I suppose it could be taken out of roulation. [21:38:43] while* [21:39:09] RECOVERY - SSH on virt1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [21:39:15] New patchset: Demon; "Fixing up gerrit apache config so it'll work on labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21961 [21:39:18] RECOVERY - Host virt1001 is UP: PING OK - Packet loss = 0%, RTA = 26.53 ms [21:39:36] PROBLEM - Host srv281 is DOWN: PING CRITICAL - Packet loss = 100% [21:39:47] Krinkle: They could be commented out from the list in theory, but the consensus is that the annoyance of having downed boxes on the list is less than the danger of having up-and-running boxes not on the list (which is what would happen if someone forgot to uncomment them) [21:40:03] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/21961 [21:40:05] right [21:40:32] RoanKattouw: but nagios is not complaining about those two, right ? [21:40:39] nevermind [21:40:42] I'm sure it is [21:40:56] And pybal will have depooled them so they're not serving traffic [21:40:57] PROBLEM - Puppet freshness on ms-be2 is CRITICAL: Puppet has not run in the last 10 hours [21:41:09] New patchset: Demon; "Fixing up gerrit apache config so it'll work on labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21961 [21:41:23] Hm.. sync's list is not the same as the list where depooling happens ? [21:41:36] maybe that is the source of the solution? [21:41:42] Yeah, we know :S [21:42:01] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21961 [21:42:08] Ryan's magical shiny git-deploy system will fix this [21:42:17] I found it strange that we wouldn't uncomment them since if deploy tries, like, what if loadbalancer tried, so there is a list. Its just that we have two list.
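On the accented-filename exchange at 18:37: "sed -n l" prints the line with non-ASCII bytes escaped, so both sides can confirm the shell is producing exactly the bytes the repository has before running git rm. A small sketch using the filename from that exchange:

    # Show the raw bytes the shell expands (\303\251 is "é" in UTF-8 octal).
    echo $'docroot/foundation/lev\303\251e_de_fonds.html' | sed -n l

    # If that matches what git tracks, the removal works the same on Linux and Mac:
    git ls-files -- 'docroot/foundation/lev*'
    git rm $'docroot/foundation/lev\303\251e_de_fonds.html'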
[21:42:23] BTW Nagios does report srv266 down, but it's been acknowledged by ops: http://nagios.wikimedia.org/nagios/cgi-bin/status.cgi?host=srv266 [21:42:39] ok [21:42:44] Yeah but one list is maintained by hand whereas the other is maintained automatically [21:42:57] and one from the other? [21:43:08] No [21:43:16] xD [21:43:29] * AaronSchulz would like a srv666 [21:43:33] The dsh list used for deploys is maintained by hand and includes machines whose role isn't to serve traffic [21:44:06] The list of machines currently serving traffic is in constant flux because machines get depooled if they misbehave, then repooled later [21:44:27] right, but there's still a way to "fix" that. So I guess Ryan's git-deploy will merge the lists so that uncommentting takes it out of both, and deplooling sets a certain flag. [21:44:40] But the latter has a lot of flapping potential (i.e. machines can more or less randomly drop off for a little while), so it's not safe to use for deployments cause you might miss boxes [21:44:42] it would suck to deploy a config change to make something "read only"... [21:44:58] Ryan's git-deploy system would use a queue to make sure that hosts that aren't up at the time will process the update when they come up, IIRC [21:45:07] it might become "read only except srv266" [21:45:17] ok [21:45:17] which...probably not useful [21:45:23] Yeah, Apaches being out of sync used to be a big problem [21:45:37] Now that we have a script that syncs the machine before starting the Apache process, this is less of an issue [21:45:59] so what about mw8 and srv266. when will they get the latest code? or do they scap themselves on startup ? [21:46:08] As you can imagine, this was especially interesting when some dead Apaches were brought back to life a few days after the monobook->vector switchover [21:46:11] beat me by the second [21:46:32] right [21:46:58] Hm.. then why have a queue for when a server comes back up if it already updates itself when it comes back up anyway? [21:47:00] Some of them were even running 1.15 instead of 1.16, it was a mess [21:47:18] Because the auto-update only happens if the machine (or the Apache process) died [21:47:20] krinkle: sorry in the middle of something... srv266 has not worked right in a long time... not sure when that will be back on line. ...mw8 has a DIMM issue and should be back next week [21:47:34] If its network went down and back up for some reason, or in a variety of other cases, it won't happen [21:47:57] Those aren't common and the most common case is the machine dying hard, but still, it's good to have a more watertight system [21:48:05] cool [21:48:43] cmjohnson1: Yeah no worries, we have downed Apaches all the time, and Krinkle hadn't done a deployment before so he didn't know how to interpret the warnings he got [21:49:43] hm.. new shiny button, what does this do [21:49:56] ok... do you want me to ping either one of you when mw8 comes back to life? [21:50:00] [fwd from #jquery-dev] FLOWING CHOCOLATE http://cheezburger.com/6527494144 [21:50:12] ok, that wasn't intentional. :D [21:50:29] that is a terrible icon design [21:51:02] cmjohnson1: Nah, it'll be fine. srv* and mw* boxes going down and being brought back up is routine, everything is in place to make those boxes fix themselves once you bring them back up [21:51:42] Krinkle: Wait, you forwarded a link to a picture of a bird drinking from a chocolate fountain BY ACCIDENT?
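Given the mw8/srv266 timeouts during the earlier sync and RoanKattouw's point that the dsh deploy list is maintained by hand, a small pre-sync reachability check is easy to sketch. The group file path below is an assumption, not taken from the log:

    # Assumed location of the hand-maintained dsh group file.
    DSH_GROUP=/etc/dsh/group/mediawiki-installation
    while read -r host; do
        case "$host" in ''|'#'*) continue ;; esac    # skip blanks and comments
        ssh -o ConnectTimeout=5 -o BatchMode=yes "$host" true >/dev/null 2>&1 \
            || echo "unreachable: $host"
    done < "$DSH_GROUP"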
[21:51:46] :D [21:52:32] there's this icon that goes like >>> so apparently the last link you clicked, that message is forwarded to the current channel or something. not sure how it works. but yeah, I did want to share that link :D [21:52:57] lol [21:53:59] New patchset: Demon; "Fixing up gerrit apache config so it'll work on labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21961 [21:54:49] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21961 [22:06:29] can someone please review https://gerrit.wikimedia.org/r/17964 ? [22:20:51] ACKNOWLEDGEMENT - Varnish HTCP daemon on cp1022 is CRITICAL: Connection refused by host daniel_zahn fresh install [22:20:51] ACKNOWLEDGEMENT - Varnish HTTP upload-backend on cp1022 is CRITICAL: Connection refused daniel_zahn fresh install [22:20:51] ACKNOWLEDGEMENT - Varnish HTTP upload-frontend on cp1022 is CRITICAL: Connection refused daniel_zahn fresh install [22:20:51] ACKNOWLEDGEMENT - Varnish traffic logger on cp1022 is CRITICAL: Connection refused by host daniel_zahn fresh install [22:21:00] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours [22:21:00] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [22:21:00] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [22:21:00] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours [22:21:21] ACKNOWLEDGEMENT - Varnish HTCP daemon on cp1023 is CRITICAL: Connection refused by host daniel_zahn fresh install [22:21:21] ACKNOWLEDGEMENT - Varnish HTTP upload-backend on cp1023 is CRITICAL: Connection refused daniel_zahn fresh install [22:21:36] ACKNOWLEDGEMENT - Varnish HTTP upload-frontend on cp1023 is CRITICAL: Connection refused daniel_zahn fresh install [22:21:36] ACKNOWLEDGEMENT - Varnish traffic logger on cp1023 is CRITICAL: Connection refused by host daniel_zahn fresh install [22:27:00] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [22:39:00] PROBLEM - Puppet freshness on palladium is CRITICAL: Puppet has not run in the last 10 hours [22:55:38] PROBLEM - Puppet freshness on ms-fe2 is CRITICAL: Puppet has not run in the last 10 hours [23:05:32] PROBLEM - Puppet freshness on ms-be9 is CRITICAL: Puppet has not run in the last 10 hours [23:31:14] what is wgMathDirectory and why is it undefined and causing cron errors on fenari and hume? [23:31:20] RoanKattouw: ^^ [23:31:27] I ask you because one of them is running as your user [23:33:04] "This feature was removed completely in version 1.18." [23:34:44] it was moved to an extension [23:34:55] what cron job? [23:36:10] /usr/local/bin/mwscript extensions/ArticleFeedback/populateAFStatistics.php --wiki=metawiki --op=highslows,problems --rating_sets=1 > /dev/null [23:36:33] /usr/local/bin/mwscript extensions/ContributionReporting/PopulateFundraisingStatistics.php foundationwiki --op updatedays > /tmp/PopulateFundraisingStatistics-updatedays.log [23:36:48] I think those are the only two [23:36:52] they're spewing cronspam [23:38:18] also php /home/wikipedia/common/multiversion/MWScript.php extensions/TorBlock/loadExitNodes.php aawiki 2>&1 [23:38:23] mutante: can you look a cron change over for MaxSem ? [23:39:19] tfinc: in gerrit? 
[23:39:26] https://gerrit.wikimedia.org/r/#/c/17964/ [23:39:35] mutante: yes --^ [23:43:27] RoanKattouw: initialisation of wgMathDirectory was moved from InitialiseSettings.php to CommonSettings.php because it was being overridden by the extension setup file, maybe that's what broke your script [23:44:10] tfinc, Max: ehm.. yea ..given the existing reviews from Asher and Ryan i would really like them to comment again. [23:45:13] mutante: Asher is on vacation and I don't see Ryan_Lane around [23:45:20] who else could we ask to get this done today? [23:49:04] mutante, Asher's comment was about importing of .sql files, which is simply absent now [23:49:46] MaxSem: you could always try asking TimStarling nicely to take a look at it ;) [23:53:32] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [23:53:32] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [23:54:13] strong words from asher [23:54:43] where? [23:55:35] TimStarling: I need 5mins of your time when you're avaialble [23:55:40] "https://gerrit.wikimedia.org/r/#/c/17964/" [23:55:49] "nothing like this will ever be ok from a host on our cluster." [23:55:58] it looks OK to me [23:56:14] I don't have time to do a really detailed review though [23:56:51] oh I've seen that [23:56:55] it's horrible [23:57:31] did you see how it "drops table" then recreates it from a dump that it fetches from toolserver.org?
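To chase the wgMathDirectory cron spam mentioned at 23:31, one starting point is simply to see where the variable is (or is no longer) set, given Tim's note that it moved from InitialiseSettings.php to CommonSettings.php. A sketch; the exact file layout in mediawiki-config may differ, and the mwscript/eval.php usage is illustrative rather than a documented procedure:

    # In a checkout of operations/mediawiki-config:
    git grep -n 'wgMathDirectory'

    # On a host that runs the cron jobs, check what the setting resolves to for the
    # wiki named in the crontab lines quoted above:
    echo 'global $wgMathDirectory; var_dump( $wgMathDirectory );' \
        | mwscript eval.php --wiki=foundationwiki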