[00:00:02] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:11:45] New patchset: Ryan Lane; "Adding php-luasandbox to labsconsole" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22856 [00:12:17] !log Added rev_sha1 to revision table on liquidthreads_labswikimedia [00:12:23] !log added scribunto to labsconsole [00:12:26] Logged the message, Mr. Obvious [00:12:34] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22856 [00:12:35] Logged the message, Master [00:14:08] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.029 seconds [00:15:23] mutante, also, for the OTRS 3.1 upgrade, we're going to need someone from ops as an ops contact - Jeff's going to be busy with the fundraiser, would you be able to do it or know who else might be able to? [00:16:10] I know this ultimately depends on what CT says, but I might as well find out who can do it first [00:16:59] Thehelpfulone: i have no OTRS experience, so probably not [00:17:56] hmm I think I don't know if OTRS experience itself is needed - Reedy mentioned on the bug https://bugzilla.wikimedia.org/show_bug.cgi?id=22622#c29 that it's something to do with perl [00:18:39] also, your userpage https://www.mediawiki.org/wiki/User:Mutante needs an update :) [00:19:03] true, thanks [00:19:21] !log added SpamBlacklist to labsconsole [00:19:30] Logged the message, Master [00:20:08] New patchset: Krinkle; "misc deployment scripts: Minor clean up" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22858 [00:20:18] Thehelpfulone: wow, we have " the Inventor of OTRS" working on the upgrade and he signed an NDA? that sounds like it might work:) [00:20:47] Thehelpfulone: i can just say repeat what Sam said, tell us what you need ..via RT [00:20:51] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/22858 [00:20:54] I know right? yesterday almost all hope was lost when Philippe said there's no engineering resources for it and he's not returned the NDA! [00:21:33] mutante, meh, the issue with RT is that it's not public - I can't tell what the updates are, and neither can anyone else outside of ops [00:22:19] Ryan_Lane, speaking of OTRS, is the plan to do the test server on labs first? I see there's an OTRS project - can we get Martin access to this? [00:22:33] we can't, really [00:22:37] private data [00:22:37] (presuming he needs/wants it)? [00:22:43] Ryan_Lane, he signed an NDA? [00:22:51] no private data in labs right now [00:23:06] so it needs to happen on a production server [00:23:08] oh I see [00:23:27] then I presume to get him access to the production server it needs an RT ticket? 
;-) [00:23:29] we have plans for private data in labs, but it's a ways out [00:23:41] I think there's some people working this right now [00:23:48] Thehelpfulone: i know, but that is not my personal decision and as long as that is the tool we use in our team thats the way to make sure other ops actually see it [00:23:49] and I believe we have tickets in [00:24:03] I'd like to make most of RT public [00:24:08] we need to do LDAP auth first [00:24:15] then we need to open up specific queues [00:24:24] * mutante nods [00:24:50] Ryan_Lane, I agree, but last time I discussed with CT he said it's not a "high priority" which is ops speak for not happening anytime soon :( [00:25:00] unfortunately, yep [00:25:06] apparently there's private data across lots of different queues though? [00:25:08] I would love to have a pile of public rt queues [00:25:20] how much work would it take to actually make RT public? [00:25:30] is implementing LDAP authentication difficult? [00:25:58] it seems to be a pain in rt [00:26:01] it's undocumented [00:26:11] a pain in the rt? :-) [00:26:15] heh [00:26:44] is LDAP the only way we can do authentication? doesn't RT have it's own system like Bugzilla does? [00:26:55] we don't want to manage accounts [00:27:04] labsconsole does that for us already [00:27:44] doing a quick google search, what about http://requesttracker.wikia.com/wiki/ExternalAuthentication#2._RT::Authen::ExternalAuth [00:28:01] there's some stuff at http://wiki-archive.bestpractical.com/view/LdapSiteConfigSettings too [00:28:18] yeah, look at the docs, though :) [00:28:26] also, the ubuntu install of rt is…. different [00:29:38] http://requesttracker.wikia.com/wiki/ExternalAuth#CPAN_installation - are those the wrong docs? [00:33:08] Ryan_Lane, see above - what method of installation do you plan to use? [00:33:52] hopefully not cpan [00:34:02] I've looked at these docs before, though [00:34:07] it's not a terribly simple process [00:34:47] At least it's not RT... oh god that app sucks [00:35:46] puppet feature "rudimentary CPAN support", "Added by Jim Blomo about 5 years ago. " :/ [00:36:48] is there anything better than RT? [00:37:07] "better" [00:37:47] every bug system sucks [00:37:49] i dunno, how is Mantis? [00:37:50] every single one [00:37:57] there's no such thing as a good one [00:38:22] RT works well with email input etc, which is useful for ops [00:38:32] we need that for procurement [00:38:46] the NASA likes RT :) [00:39:40] http://bestpractical.com/rt/praise.html [00:40:19] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [00:40:19] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours [00:40:19] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours [00:41:04] I'll pick a random one - what about bugnet? [00:41:23] Ryan_Lane: we should write our own. Obviously. [00:41:41] heh, well http://www.thegeekstuff.com/2010/08/bug-tracking-system/ things bugzilla is the best [00:41:54] again, it's "best" [00:42:02] thinks* [00:42:05] again, no such thing as a good one [00:43:43] it can suck and still suck less than all others [00:44:28] http://www.youtube.com/watch?v=d85p7JZXNy8 [00:44:58] Advantages are they're free, unlike kayako which sucks and is stupidly expensive. 
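For context on the RT::Authen::ExternalAuth approach linked above: the wiring normally goes in RT_SiteConfig.pm. A minimal sketch, assuming the Ubuntu request-tracker3.8 package layout; the LDAP server, base DN and attribute names below are placeholders, not the real labsconsole/LDAP values.

```bash
# Hedged sketch of RT::Authen::ExternalAuth LDAP wiring; every value is a placeholder.
cat >> /etc/request-tracker3.8/RT_SiteConfig.pm <<'EOF'
Set($ExternalAuthPriority, ['wmf_ldap']);
Set($ExternalInfoPriority, ['wmf_ldap']);
Set($ExternalSettings, {
    'wmf_ldap' => {
        'type'            => 'ldap',
        'server'          => 'ldap.example.org',             # placeholder
        'base'            => 'ou=people,dc=example,dc=org',  # placeholder
        'filter'          => '(objectClass=inetOrgPerson)',
        'attr_match_list' => [ 'Name', 'EmailAddress' ],
        'attr_map'        => {
            'Name'         => 'uid',
            'EmailAddress' => 'mail',
            'RealName'     => 'cn',
        },
    },
});
EOF
```

The attr_match_list / attr_map split is what lets RT match an incoming login against an LDAP attribute and then pull the rest of the account details from the directory, so accounts never have to be managed in RT itself.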
[00:46:01] non-free should not even be an option, or i would have said Atlassian JIRA (toolserver uses it, heh:P) /me hides [00:46:01] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:46:19] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [00:46:19] The world does not need more java [00:56:42] Change merged: Krinkle; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21393 [00:58:37] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.566 seconds [01:04:28] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000 [01:05:22] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [01:09:50] !log upgrading OATHAuth to 795cef09cab6ecb0e9ded35df06b2877ccc22c1a on labsconsole [01:09:59] Logged the message, Master [01:14:50] New patchset: Jgreen; "second attempt to mount netapp to locke for fundraising banner log archiving" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22872 [01:15:35] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22872 [01:16:32] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22872 [01:20:59] !log enabling ConfirmEdit with FancyCaptcha on labsconsole [01:21:10] Logged the message, Master [01:22:14] New patchset: Jgreen; "fixed include" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22873 [01:22:59] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22873 [01:27:08] !log enabling $wgEmailConfirmToEdit on labsconsole [01:27:17] Logged the message, Master [01:33:43] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:39:27] !log renaming mailing list chaptercommittee-l to affcom , rebuilding archives... [01:39:36] Logged the message, Master [01:41:22] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 240 seconds [01:42:18] New patchset: Dzahn; "mail alias and HTTP redirect for list rename: chaptercommittee-l -> affcom" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22874 [01:43:05] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22874 [01:43:07] New patchset: Dzahn; "mail alias and HTTP redirect for list rename: chaptercommittee-l -> affcom" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22874 [01:43:53] New review: Dzahn; "for RT-3477" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/22874 [01:43:53] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22874 [01:45:52] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.763 seconds [01:46:46] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , zhwiki (33037) [01:47:41] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , zhwiki (32244) [01:57:27] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 5 seconds [01:58:03] PROBLEM - Puppet freshness on cp1025 is CRITICAL: Puppet has not run in the last 10 hours [02:10:48] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000 [02:10:57] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [02:10:57] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [02:11:42] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [02:21:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:29:11] New patchset: Jgreen; "adding account file_mover to aluminium/grosley/storage3" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22876 [02:32:44] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22876 [02:33:45] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.399 seconds [02:36:00] PROBLEM - Puppet freshness on search1001 is CRITICAL: Puppet has not run in the last 10 hours [02:37:30] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , zhwiki (34761) [02:38:24] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , zhwiki (33571) [02:42:08] New patchset: Jgreen; "Revert "adding account file_mover to aluminium/grosley/storage3" . . . because our account creation classes are too broken to use--they create inconsistent GIDs across hosts, thwarting sane use of nfs/netapp." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/22877 [02:46:50] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22877 [02:54:20] New patchset: Jgreen; "removing nfs mount from locke" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22878 [02:59:37] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22878 [03:04:48] RECOVERY - Puppet freshness on search1001 is OK: puppet ran at Thu Sep 6 03:04:19 UTC 2012 [03:08:42] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000 [03:09:36] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [03:32:54] RECOVERY - Puppet freshness on mw74 is OK: puppet ran at Thu Sep 6 03:32:32 UTC 2012 [03:48:30] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , enwiki (11976), zhwiki (51853) [03:51:57] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , zhwiki (51286) [05:21:30] PROBLEM - Host srv266 is DOWN: PING CRITICAL - Packet loss = 100% [06:15:18] PROBLEM - Puppet freshness on cp1022 is CRITICAL: Puppet has not run in the last 10 hours [06:41:15] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours [06:41:15] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [06:41:15] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours [06:41:15] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours [06:41:15] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [06:41:16] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [06:41:16] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [06:41:17] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [06:41:17] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [07:50:36] PROBLEM - BGP status on csw2-esams is CRITICAL: CRITICAL: No response from remote host 91.198.174.244, [08:28:46] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000 [08:29:49] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [08:37:01] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [08:38:22] PROBLEM - LVS HTTPS IPv6 on foundation-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:39:52] RECOVERY - LVS HTTPS IPv6 on foundation-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 61471 bytes in 7.374 seconds [08:43:34] hmmm that didn't page [08:43:35] weird [10:41:12] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [10:41:12] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours [10:41:12] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours [10:47:12] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [10:58:04] paging is broken we think. or have you been getting pages recently? 
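On the file_mover revert above (the account classes handing out inconsistent GIDs across hosts): a quick way to see the mismatch is to compare what each host actually assigned. A rough sketch, assuming ssh access to the three hosts named in the commit message:

```bash
# If the uid/gid pairs differ between hosts, files shared over NFS/netapp
# will not map to the same owner everywhere -- the problem the revert describes.
for h in aluminium grosley storage3; do
    printf '%s: ' "$h"
    ssh "$h" id file_mover
done
```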
[10:58:19] I haven't [10:58:49] yeah I didn't think about it til Rob said something yesterday but I haven't either [11:01:13] poor apergos [11:01:16] still jetlagged [11:01:24] 4 am again [11:01:32] yeah, I noticed [11:01:37] I even drank caffeine yesterday afternoon [11:01:39] no difference [11:01:47] still fell asleep at 10pm [11:45:13] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , enwiki (106422), zhwiki (49409) [11:46:07] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , enwiki (105850), zhwiki (49296) [11:59:19] PROBLEM - Puppet freshness on cp1025 is CRITICAL: Puppet has not run in the last 10 hours [12:12:22] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [12:12:22] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [12:16:43] RECOVERY - Puppet freshness on cp1025 is OK: puppet ran at Thu Sep 6 12:16:20 UTC 2012 [12:16:53] so [12:16:58] so? [12:17:09] i wanted to talk to you about what to send where on the eqiad upload varnishes [12:17:16] i mirrored squid's current config now [12:17:23] that is, thumbs/temp/originals to swift, rest to ms7 [12:17:31] correct [12:17:37] what is temp anyway? [12:17:48] some MW temp cache space or something [12:17:50] don't remember exactly [12:17:55] PROBLEM - BGP status on csw2-esams is CRITICAL: (Service Check Timed Out) [12:19:07] RECOVERY - Varnish HTTP upload-backend on cp1025 is OK: HTTP OK HTTP/1.1 200 OK - 634 bytes in 0.053 seconds [12:19:14] while you guys are both in here, do either of you remember ben testing with an r510 (or some dell with an h700/h800) for swift? [12:19:33] yes and he couldn't make it but Asher said that he was wrong. [12:19:43] do we knw what he did? [12:19:52] RECOVERY - Varnish HTCP daemon on cp1025 is OK: PROCS OK: 1 process with UID = 997 (varnishhtcpd), args varnishhtcpd worker [12:19:54] I hunted around on wikitech but couldn't find notes there [12:19:56] i don't remember that [12:20:03] but we have R720XDs in esams for swift [12:20:21] what controllers do those have? [12:20:33] don't remember, but I had them check they could do JBOD [12:20:37] before purchase [12:20:37] ok [12:21:09] can't buy R510s anymore [12:21:28] I guess asher might know the back story, I'll see what he remembers [12:21:30] thanks [12:21:40] I have a mail from Asher that CT forwarded me [12:21:53] that said that he made R510s with JBOD just fine [12:21:55] you both in SF now? [12:21:59] only one [12:22:14] I'm just on SF time :-) [12:22:17] kind of [12:22:18] hehe [12:22:27] that's what I'm on: "kind of" [12:22:41] my sleep schedule is completely fucked up due to the previous two-three weeks [12:22:48] PROBLEM - BGP status on csw2-esams is CRITICAL: (Service Check Timed Out) [12:22:51] it won't get better [12:23:08] heh [12:23:26] the deployment windows every day of the week at 20:00 localtime surely didn't help :-) [12:23:32] no it didn't [12:23:43] mark: so, what did you want to ask about the Varnishes? [12:24:04] just what you think I should send where [12:24:11] but you've already answered it I guess [12:27:43] okay [12:27:52] you said you fixed it already? 
[12:27:56] two days ago [12:27:59] heh [12:28:02] am about to test it [12:28:16] since you were not online then I thought you were flying actually ;) [12:29:08] oh yeah, I saw your ping, ponged you a few hours after [12:30:20] so, we have more pending changes [12:30:31] I'll try to keep varnish configs up-to-date [12:31:49] ok [12:33:18] RECOVERY - Puppet freshness on cp1022 is OK: puppet ran at Thu Sep 6 12:32:54 UTC 2012 [12:35:15] RECOVERY - Varnish HTCP daemon on cp1022 is OK: PROCS OK: 1 process with UID = 997 (varnishhtcpd), args varnishhtcpd worker [12:35:42] RECOVERY - Varnish HTTP upload-backend on cp1022 is OK: HTTP OK HTTP/1.1 200 OK - 634 bytes in 0.053 seconds [12:35:51] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours [12:35:51] PROBLEM - Puppet freshness on virt1000 is CRITICAL: Puppet has not run in the last 10 hours [12:36:45] RECOVERY - NTP on cp1025 is OK: NTP OK: Offset -0.05371642113 secs [12:37:39] PROBLEM - BGP status on csw2-esams is CRITICAL: (Service Check Timed Out) [12:42:54] PROBLEM - Host cp1021 is DOWN: PING CRITICAL - Packet loss = 100% [12:43:30] RECOVERY - Host cp1021 is UP: PING OK - Packet loss = 0%, RTA = 26.52 ms [12:45:45] PROBLEM - Host cp1024 is DOWN: PING CRITICAL - Packet loss = 100% [12:45:54] PROBLEM - Host cp1023 is DOWN: PING CRITICAL - Packet loss = 100% [12:46:30] RECOVERY - Host cp1023 is UP: PING OK - Packet loss = 0%, RTA = 26.65 ms [12:46:39] RECOVERY - Host cp1024 is UP: PING OK - Packet loss = 0%, RTA = 26.54 ms [12:47:33] PROBLEM - Varnish HTTP upload-frontend on cp1021 is CRITICAL: Connection refused [12:48:00] PROBLEM - Varnish traffic logger on cp1021 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa [12:49:03] PROBLEM - Host cp1025 is DOWN: PING CRITICAL - Packet loss = 100% [12:49:21] PROBLEM - Host cp1026 is DOWN: PING CRITICAL - Packet loss = 100% [12:49:39] RECOVERY - Host cp1025 is UP: PING OK - Packet loss = 0%, RTA = 26.52 ms [12:49:48] PROBLEM - Varnish HTTP upload-frontend on cp1024 is CRITICAL: Connection refused [12:49:48] RECOVERY - Host cp1026 is UP: PING OK - Packet loss = 0%, RTA = 26.52 ms [12:50:33] PROBLEM - Varnish traffic logger on cp1024 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa [12:51:00] PROBLEM - Host cp1028 is DOWN: PING CRITICAL - Packet loss = 100% [12:51:09] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa [12:51:09] PROBLEM - Host cp1027 is DOWN: PING CRITICAL - Packet loss = 100% [12:51:45] RECOVERY - Host cp1027 is UP: PING OK - Packet loss = 0%, RTA = 26.51 ms [12:52:03] RECOVERY - Host cp1028 is UP: PING OK - Packet loss = 0%, RTA = 27.23 ms [12:52:12] RECOVERY - Varnish HTTP upload-frontend on cp1021 is OK: HTTP OK HTTP/1.1 200 OK - 641 bytes in 0.053 seconds [12:53:24] PROBLEM - Varnish traffic logger on cp1026 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa [12:54:09] RECOVERY - Varnish traffic logger on cp1021 is OK: PROCS OK: 3 processes with command name varnishncsa [12:54:36] PROBLEM - Varnish HTTP upload-frontend on cp1026 is CRITICAL: Connection refused [12:55:21] PROBLEM - Varnish HTTP upload-frontend on cp1027 is CRITICAL: Connection refused [12:55:48] PROBLEM - Varnish traffic logger on cp1027 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa [12:56:06] RECOVERY - Varnish HTTP upload-frontend on cp1024 is OK: HTTP OK HTTP/1.1 200 OK - 641 bytes in 0.053 seconds [12:56:15] PROBLEM - Varnish 
HTTP upload-frontend on cp1028 is CRITICAL: Connection refused [12:56:24] PROBLEM - Varnish traffic logger on cp1028 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa [12:56:42] RECOVERY - Varnish HTTP upload-frontend on cp1025 is OK: HTTP OK HTTP/1.1 200 OK - 643 bytes in 0.053 seconds [12:58:21] RECOVERY - Varnish HTTP upload-frontend on cp1027 is OK: HTTP OK HTTP/1.1 200 OK - 641 bytes in 0.053 seconds [12:58:57] RECOVERY - Varnish HTTP upload-frontend on cp1022 is OK: HTTP OK HTTP/1.1 200 OK - 643 bytes in 0.053 seconds [13:02:17] New patchset: Mark Bergsma; "Don't start the loggers until Varnish is running" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22888 [13:03:50] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22888 [13:04:47] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22888 [13:06:54] RECOVERY - Varnish HTTP upload-frontend on cp1028 is OK: HTTP OK HTTP/1.1 200 OK - 641 bytes in 0.053 seconds [13:07:21] RECOVERY - Varnish traffic logger on cp1028 is OK: PROCS OK: 3 processes with command name varnishncsa [13:09:36] PROBLEM - Host cp1021 is DOWN: PING CRITICAL - Packet loss = 100% [13:10:21] RECOVERY - Host cp1021 is UP: PING OK - Packet loss = 0%, RTA = 26.60 ms [13:16:03] RECOVERY - Varnish HTTP upload-frontend on cp1026 is OK: HTTP OK HTTP/1.1 200 OK - 643 bytes in 0.053 seconds [13:16:39] RECOVERY - Varnish traffic logger on cp1026 is OK: PROCS OK: 3 processes with command name varnishncsa [13:18:54] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 3 processes with command name varnishncsa [13:25:48] RECOVERY - Varnish traffic logger on cp1024 is OK: PROCS OK: 3 processes with command name varnishncsa [13:26:33] RECOVERY - Varnish traffic logger on cp1022 is OK: PROCS OK: 3 processes with command name varnishncsa [13:29:08] RECOVERY - Varnish traffic logger on cp1025 is OK: PROCS OK: 3 processes with command name varnishncsa [13:29:08] RECOVERY - Varnish traffic logger on cp1027 is OK: PROCS OK: 3 processes with command name varnishncsa [13:48:49] heading in to the office [13:52:14] streaming in varnish seems to work now [13:52:20] oh really? [13:52:22] how cool! [13:52:27] i just checked with a > 1 GB ogv on commons [13:52:39] since it's bigger than 64, the streaming code kicked in and delivered it to me just fine [13:52:45] i'll have to check concurrency better [13:52:54] :-) [13:52:57] yay for ditching squid [13:53:05] and the varnish instance that caches it went from completely empty to 1.3 GB on one disk cache hehe [13:53:11] /dev/sda3 139G 1.3G 138G 1% /srv/sda3 [13:53:12] /dev/sdb3 139G 36M 139G 1% /srv/sdb3 [13:53:24] how does it do disk cache? files? [13:53:26] one single big file? [13:53:29] single big file [13:53:41] divided in silos [13:53:49] does it make since to give it a block device directly? [13:54:03] giving a block device doesn't help, like with squid [13:54:07] sinc ethen the kernel doesn't cache it [13:54:11] and varnish kind of relies on that [13:54:20] for squid it doesn't matter, as it does its own memory caching [13:54:28] right [13:54:30] then it's nice that the kernel isn't also caching it [13:54:37] so we just use a single file on an otherwise empty xfs fs [13:54:56] have you booked tickets for VUG btw? [13:54:59] no [13:55:24] but you will? 
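For reference, the "single big file on an otherwise empty xfs fs" layout described above corresponds to varnish's persistent storage backend, one silo file per cache filesystem. A sketch only: the flags below are illustrative varnish 3.x options, not the actual production command line, though /srv/sda3 and /srv/sdb3 match the df output quoted above.

```bash
# One persistent silo per dedicated xfs filesystem; sizes and VCL path are made up.
varnishd -a :3128 -T 127.0.0.1:6083 \
  -f /etc/varnish/upload-backend.vcl \
  -s main1=persistent,/srv/sda3/varnish.persist,130G \
  -s main2=persistent,/srv/sdb3/varnish.persist,130G
```

Keeping the silo on a normal filesystem rather than handing varnish a raw block device is what lets the kernel page cache do the memory-level caching varnish relies on, as explained in the exchange above.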
[13:55:27] i don't know [13:55:31] i don't really want to go [13:55:35] haha [13:55:59] it's gonna be a busy period, I don't like too much travel as I can't get work done [14:12:36] New patchset: Pyoungmeister; "adding udp2log-log4j.jar to classpath to fully support udp2log logging" [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/22898 [14:13:19] Change abandoned: Pyoungmeister; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22734 [14:13:37] Change merged: Pyoungmeister; [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/22898 [14:22:52] well I can load a single large file from two clients at least, while it's being fetched from swift [14:23:03] !log reimaging search1001, 1002, 1003 [14:23:14] Logged the message, notpeter [14:26:15] * apergos lurks for the rest of the swift conversation [14:26:52] PROBLEM - Host search1002 is DOWN: PING CRITICAL - Packet loss = 100% [14:27:11] PROBLEM - Host search1001 is DOWN: PING CRITICAL - Packet loss = 100% [14:27:44] it's not a swift conversation [14:28:04] New patchset: Pyoungmeister; "lucene: a couple of small tweaks to get udp2log results up and running" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22899 [14:28:57] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22899 [14:29:34] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22899 [14:31:40] PROBLEM - SSH on search1003 is CRITICAL: Connection refused [14:31:58] PROBLEM - Lucene disk space on search1003 is CRITICAL: Connection refused by host [14:32:34] RECOVERY - Host search1002 is UP: PING OK - Packet loss = 0%, RTA = 26.49 ms [14:32:52] RECOVERY - Host search1001 is UP: PING OK - Packet loss = 0%, RTA = 26.51 ms [14:35:10] PROBLEM - NTP on search1001 is CRITICAL: NTP CRITICAL: No response from NTP server [14:36:31] PROBLEM - Lucene disk space on search1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:36:31] PROBLEM - Lucene disk space on search1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:37:16] PROBLEM - SSH on search1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:37:16] PROBLEM - SSH on search1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:37:52] PROBLEM - Lucene on search1003 is CRITICAL: Connection timed out [14:40:16] RECOVERY - SSH on search1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [14:40:16] RECOVERY - SSH on search1002 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [14:40:16] RECOVERY - SSH on search1003 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [14:42:44] New patchset: Jgreen; "adding netapp mount back to locke (again)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22901 [14:43:36] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22901 [14:43:46] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22901 [14:48:22] PROBLEM - Lucene on search1001 is CRITICAL: Connection refused [14:55:08] !log ms-be6 going down to remove ssd drives [14:55:17] Logged the message, Master [14:56:28] RECOVERY - Lucene disk space on search1001 is OK: DISK OK [14:59:37] RECOVERY - Lucene disk space on search1002 is OK: DISK OK [15:02:19] PROBLEM - NTP on search1003 is CRITICAL: NTP CRITICAL: No response from NTP server [15:05:37] RECOVERY - Lucene disk space on search1003 is OK: DISK OK [15:06:13] RECOVERY - Host ms-be6 is UP: PING OK - Packet loss = 0%, RTA = 0.60 ms [15:10:43] cmjohnson1: yay you are here [15:11:43] New patchset: Ottomata; "Ungh, RT 3460 is for halfak, not aaron shulz." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22902 [15:12:34] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22902 [15:12:40] RECOVERY - NTP on search1001 is OK: NTP OK: Offset -0.01202392578 secs [15:12:43] notpeter, could you merge that one? [15:12:43] https://gerrit.wikimedia.org/r/22902 [15:13:06] apergos: yes...almost finished w/os [15:13:15] sweet [15:14:28] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22902 [15:14:33] thank you! [15:14:33] doneski [15:14:35] yup! [15:21:58] RECOVERY - NTP on search1003 is OK: NTP OK: Offset -0.006657481194 secs [15:24:04] RECOVERY - SSH on ms-be6 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [15:24:16] apergos: ms-be6 is all yours [15:24:40] yay [15:24:46] ready for puppet cert? [15:24:56] yes [15:26:26] ok doing the first puppet run now [15:26:36] let's see what happens shall we? [15:27:30] * cmjohnson1 is optimistic [15:27:38] * apergos is pessimistic [15:27:43] that should balance us out :-D [15:27:45] * ^demon is realistic [15:28:34] RECOVERY - Lucene on search1003 is OK: TCP OK - 0.027 second response time on port 8123 [15:28:50] doo dee doo dee doo [15:29:19] RECOVERY - Lucene on search1002 is OK: TCP OK - 0.030 second response time on port 8123 [15:29:19] PROBLEM - NTP on ms-be6 is CRITICAL: NTP CRITICAL: Offset unknown [15:29:28] RECOVERY - Lucene on search1001 is OK: TCP OK - 0.027 second response time on port 8123 [15:30:05] cmjohnson1: so. both srv281 and srv266 are dead again... [15:30:11] test complete! [15:30:12] :) [15:30:38] notpeter [15:30:55] yep [15:31:01] *sigh* [15:31:16] RECOVERY - swift-container-server on ms-be6 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [15:31:16] also search32 is dead again as well [15:31:34] RECOVERY - swift-object-server on ms-be6 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [15:31:43] RECOVERY - swift-account-server on ms-be6 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [15:31:43] RECOVERY - swift-container-auditor on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [15:31:43] RECOVERY - swift-account-reaper on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [15:31:45] but hey! search32 seems to be staying up! [15:31:48] cmjohnson1: really? 
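The "ready for puppet cert?" / "first puppet run" exchange above is the usual new-host sequence with the puppet 2.x tools; a sketch, assuming the cert is signed on the puppetmaster and that ms-be6's FQDN is ms-be6.pmtpa.wmnet (a guess):

```bash
# On the puppetmaster: list pending certificate requests, then sign the new host.
puppetca --list
puppetca --sign ms-be6.pmtpa.wmnet   # FQDN is an assumption

# On ms-be6 itself: run the agent once in the foreground so any failures are visible.
puppetd --test --verbose
```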
[15:31:52] RECOVERY - swift-object-auditor on ms-be6 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [15:32:10] RECOVERY - swift-account-auditor on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [15:32:22] i see an amber led...didn't check it yet...just assumed it was bad [15:32:28] RECOVERY - swift-container-replicator on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [15:32:28] RECOVERY - swift-container-updater on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [15:32:37] RECOVERY - swift-object-replicator on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [15:32:37] RECOVERY - swift-object-updater on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [15:33:02] aaaaaaand it sucks to be us. [15:33:14] lemme see exactly which ones it whined about and in which way [15:33:27] notpeter: same b.s. [15:33:28] Record: 2 [15:33:29] Date/Time: 09/05/2012 19:31:42 [15:33:29] Source: system [15:33:29] Severity: Critical [15:33:29] Description: Multi-bit memory errors detected on a memory device at location DIMM_A2. [15:33:30] ------------------------------------------------------------------------------- [15:34:08] blerg [15:34:08] ok [15:37:34] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000 [15:37:57] sdj, sdg, show errors at the driver level [15:38:37] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [15:38:48] let's replace the drives and cycle it several times and see what happens...maybe somewhere along the way we killed the drives [15:39:20] apergos: can you try and force mount them? [15:39:32] rather manual mount [15:39:43] not done looking [15:39:57] gimme a couple more minutes, we had more mount errors than this [15:40:16] those two were the ones with errors at the driver level, not the xfs level [15:43:50] c d l and i show as empty [15:44:43] eww uck the device names don't match the mount dirs [15:44:49] anyways, it's just for testing [15:45:51] so in the end the devices not mounted are sdj,g,l,k [15:47:40] well ...backplane is next [15:47:52] hi w [15:47:59] want some traffic on varnish? [15:48:34] RECOVERY - NTP on ms-be6 is OK: NTP OK: Offset 0.003421783447 secs [15:48:41] sdj fails, hostbyte=DID_NO_CONNECT etc... (still checking the other three) [15:50:27] mark ; yes ;-P [15:50:38] sdg the same, [15:52:27] pick a small country ;) [15:53:40] how about an african one? [15:53:53] what about malaysia ;-p [15:54:06] sdl, sdk the same [15:54:17] sure [15:54:26] bahasa melayu [15:54:31] what time is it there [15:54:34] cmjohnson1: ^^ so those I assume are the same four disks as before, ids may be different because of pulling the two ssds [15:54:36] or bahasa indonesia [15:54:42] so backplane is next. [15:54:50] almost midnight [15:54:54] meh [15:54:56] no traffic [15:54:56] woosters, whilst you are here, for the OTRS upgrade I imagine we're going to need an ops contact - who do you think would be able to be one, given that Jeff will be busy with the fundraiser now? [15:55:16] mark, maybe somewhere small in europe? 
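The disk triage narrated above (driver-level errors on sdj/sdg/sdl/sdk, four filesystems missing, manual mount attempts) boils down to a few checks; the device names are from the conversation, while the swift mount-point path and partition number are guesses at the layout.

```bash
# Look for the driver-level failures (e.g. hostbyte=DID_NO_CONNECT) on the suspect disks.
dmesg | grep -E 'sd[jgkl]|DID_NO_CONNECT'

# See which of the data filesystems actually made it into the mount table.
mount | grep swift

# Try mounting one of the missing devices by hand; partition and path are illustrative.
mount -t xfs /dev/sdj1 /srv/swift-storage/sdj1
```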
[15:55:21] not in europe [15:55:26] perhaps argentina [15:55:38] it's bigger, but still ok [15:55:39] egypt [15:55:56] i'll do argentina, it's 1% of traffic apparently [15:55:57] cmjohnson1: also see yer mail (from dell) [15:56:06] also needs to be a country we'll hear from if there are issues [15:56:34] what are we testing btw mark? [15:56:40] thehelpfulone - let me discuss with some of my team members and get back to u [15:56:42] I will happily be on that call, I would like robh to be on it also at aminimum, you if you think it makes sense. [15:56:50] woosters, sure thanks [15:57:29] * apergos goes to update the ticket [15:57:43] thehelpfulone - he is testing the latest varnish build with some much needed fixes [15:58:00] is there a way to find out when puppet ran for the last time on a given host ? [15:58:14] hashar: login to it? [15:58:18] The last Puppet run was at Thu Feb 16 19:13:00 UTC 2012 (32 minutes ago). [15:58:20] ;) [15:58:30] motd of course! ;-D [16:01:24] woosters: so I tested the same videos we had issues with months ago [16:01:28] seem to work just fine now [16:01:31] updated. [16:01:41] real traffic is of course the real test [16:01:46] but it's looking good so far [16:02:00] that is good news [16:02:06] the next question is, ttf-lyx package is supposed to be installed on imagescaler box per https://gerrit.wikimedia.org/r/22705 but it is not on the box. [16:02:17] could it be that the change did not get pulled on the puppetmaster ? [16:02:43] hashar: hi, let me check that one for you [16:03:12] hashar: it is on sockpuppet.. so it's not that [16:03:34] mutante: got some input on the bug report https://bugzilla.wikimedia.org/show_bug.cgi?id=38299#c32 [16:04:24] hashar: running puppet on srv219 .. hold on [16:05:02] !log Sending upload traffic from Argentina to upload-lb.eqiad (Varnish with streaming and persistence patches) [16:05:03] hashar: package installs fail because there are unmet dependencies [16:05:08] ohh [16:05:11] Logged the message, Master [16:05:12] hashar: cm-super: Depends: cm-super-minimal but it is not going to be installed [16:05:35] grmblbl did my tests on Precise :( [16:05:37] cmjohnson1: as soon as robh gets in I will ask him what he thinks about the call; in the meantime can you go ahead and request they send the backplane/expander? I dunno what your schedule is, are you in tomorrow if it shows up? [16:06:07] apergos: yes, i am here tomorrow when it will show [16:07:03] hashar: wait..trying to fix manually [16:07:25] mutante: you are my hero [16:07:32] mutante: I am not sure why there is that probeem [16:08:00] maybe some packages are conflicting each other [16:08:19] dpkg: error processing /var/cache/apt/archives/cm-super-minimal_0.3.4-3_all.deb (--unpack): trying to overwrite '/usr/share/texmf/doc', which is also in package tex-common 0:2.06ubuntu0.1 [16:08:36] apergos / paravoid: there will be more swift traffic now, as varnish is requesting thumbs/originals/temps directly from swift, bypassing the squids there [16:08:42] (with an empty cache of course) [16:08:51] hashar: yes, cm-super-minimal vs. tex-common [16:08:58] we are doomed [16:09:21] I am removing cm-super [16:09:24] how much traffic are you passing through the varnishes again? 
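On the side question above — how to tell when puppet last ran on a host — the motd line quoted is one answer; puppet's own state files are another. A sketch, assuming a stock puppet 2.x layout (the exact path may differ on these hosts):

```bash
# The motd carries the "last Puppet run was at ..." line quoted above.
ssh srv219 head -n 5 /etc/motd

# Or check the timestamp on puppet's run summary (file path is an assumption).
ssh srv219 ls -l /var/lib/puppet/state/last_run_summary.yaml
```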
[16:09:29] percentagewise or something [16:09:34] ma [16:09:36] grr [16:09:37] 1% [16:09:37] mark [16:09:41] ok thanks [16:09:47] so not a lot yet [16:10:01] it should be just fine but thanks for the heads up [16:10:43] nice, there's a new parameter fetch_streamed [16:10:46] New patchset: Hashar; "(bug 38299) remove cm-super" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22907 [16:10:53] every time varnish streams a large object [16:11:02] mutante: https://gerrit.wikimedia.org/r/22907 removes cm-super [16:11:31] !log srv219 - apt-get -f install, removing cm-super, dist-upgrading, [16:11:37] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22907 [16:11:40] Logged the message, Master [16:12:05] hashar: ok, gotcha, if they are provided by ttf-lyx anyways.. cool [16:12:35] New review: Dzahn; "yep, this conflicted" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/22907 [16:12:36] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22907 [16:12:52] I should have tested on a Lucid instance with the imagescaler class [16:13:39] grmlbl out for a few minutes, brb [16:13:58] !log srv219 - apt-get auto-remove unneeded packages, running puppet [16:14:07] Logged the message, Master [16:16:41] hashar: ii ttf-lyx 1.6.5-1ubuntu1. and puppet runs fine [16:19:29] I'm gonna shut down ms-be6 again cmjohnson1 since it's not doing anything useful, unless you want to get something off it first [16:20:03] no...i am good. i have a dell tech here for mw8...once he is gone I will call support and get the backplane sent [16:20:13] great [16:22:17] !Log powering off ms-be6 til we try the next round of replacements [16:22:27] mutante: thanks :-) [16:23:33] mutante: will wait for puppet to kick in and will then close the bug ;) [16:23:35] hashar: i am fixing the other ones [16:23:52] hashar: eh, it does not remove cm-super by itself [16:24:11] but no worries, i can do them, they are just 6, right? [16:24:16] PROBLEM - Host ms-be6 is DOWN: PING CRITICAL - Packet loss = 100% [16:24:48] hashar: btw, also this one failed, f.e. on srv224 install php-luasandbox' [16:25:44] mutante: though it is installed on the other servers :/ [16:27:52] hrmm, need to remove apt lock file and stuff [16:29:46] hashar: srv224, well it has ttf-lyx, but it has other issues with the php-luasandbox [16:29:57] so far so good [16:35:30] I am not sure what luasandbox is for on imagescaler though :) [16:36:50] mark: so in my quest to get crap off of ms7, I'm looking at the monitorurl setting for the apaches in squid's upload-settings.php, it's set to http://upload.wikimedia.org/pybaltestfile.txt [16:36:57] how does that get handled? 
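The srv219 cleanup in the !log entries above — getting past the cm-super-minimal / tex-common file conflict so ttf-lyx can install — amounts to roughly the following; the package names come from the conversation, the exact order is a sketch.

```bash
# Let dpkg/apt finish the half-configured state, drop the conflicting packages,
# then pull in the font package the imagescaler change actually wants.
apt-get -f install
apt-get remove --purge cm-super cm-super-minimal
apt-get install ttf-lyx
apt-get autoremove   # clear out now-unneeded dependencies, per the second !log
```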
[16:38:45] my question is quite literal: what retrieves it, from where, and how is this treated as a system health check [16:42:16] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours [16:42:16] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours [16:42:16] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours [16:42:16] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [16:42:16] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [16:42:17] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [16:42:17] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [16:42:18] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [16:42:18] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [16:44:15] New patchset: preilly; "add DTAC Thailand (DT)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22913 [16:45:13] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22913 [16:55:24] New patchset: Cmjohnson; "Replaced mother board on mw8. updating mac address in dhcpd file hardware ethernet 77;" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22914 [16:56:15] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22914 [16:56:39] notpeter: can you merge that change plz? 22914 [16:57:10] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22914 [16:57:12] done [17:03:13] swift looks not happy now in the stats [17:03:41] apergos: in the squid config, it's simply squid that retrieves that file to see if that backend is still active [17:04:00] it can be changed into any other object that is guaranteed to be in swift always [17:05:33] anybody know how or if it is possible to set upa github <-> gerrit mirror? [17:05:55] Not too way... [17:05:56] ok my question was how it does the retrieval, I mean upload.wikimedia.org would go through lvs to the squids again, [17:06:01] *two [17:06:14] wouldn't it? [17:06:19] it retrieves it from the backend as it's configured in squid [17:06:35] one way is fine [17:06:38] ottomata: Mirroring out is possible as of 2.5 (?), mirroring back hasn't been written by anyone (hence our gerrit contractor posting) [17:06:39] using that hostname as the request [17:06:41] don't really care which [17:06:49] ^demon should be able to confirm [17:07:00] ah he's online now too :) [17:07:03] <^demon> Mirroring from gerrit to github is possible, and I've been working on it this week. [17:07:09] oh cool! [17:07:15] i have two cases where I"d like to do that [17:07:19] can I work with you as a trial? [17:07:20] so Host header? [17:07:29] <^demon> ottomata: Well we'll be replicating all the repos :) [17:07:42] <^demon> But I may need a guinea pig, thanks for volunteering [17:07:43] apergos: yeah, or as a proxy request, that i'm not sure of [17:07:51] ok. thanks, that cleared it up [17:08:09] <^demon> ottomata: Pulling stuff back from github into gerrit is a little more involved, and something we'll have the contractor work on like Reedy said. 
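To make the answer above concrete: the monitor URL is just an object squid fetches from each configured backend to decide whether that backend is healthy. A manual approximation (whether squid sends a plain Host header or a proxy-style request was left open above, and the backend FQDN here is a guess):

```bash
# Fetch the monitor object from one specific backend with the upload hostname;
# an HTTP 200 is what lets that backend count as alive.
curl -sI -H 'Host: upload.wikimedia.org' \
  'http://ms7.wikimedia.org/pybaltestfile.txt' | head -n 1
```

As noted above, the object can be swapped for anything guaranteed to always exist in swift once ms7 goes away.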
[17:09:02] i'm fine with making gerrit the master [17:09:04] don't really care [17:10:32] Hi, not sure if this is the right place to ask, but is there any movement towards getting wikipedia.org (and associated domains) dnssec signed? DANE (http://tools.ietf.org/html/rfc6698) now being a full RFC it would be a step towards more trustable HTTPS everywhere, and with Jimmy talking about making wikipedia HTTPS only as a responce to the snoopers charter it would be nice to have... [17:11:14] no, there's no movement towards dnssec [17:11:54] Jimmy should really talk to ops before suggesting things [17:13:24] RECOVERY - Host mw8 is UP: PING OK - Packet loss = 0%, RTA = 0.54 ms [17:14:30] is there no interestin dnssec fir any particular reason? or has it just not been considered due to many other things going on (tm). [17:14:39] mostly the latter [17:15:10] also dnssec has its risks associated with it; it needs to be managed well or it can cause significant downtime [17:15:21] so it's not a decision we can take lightly [17:15:51] yes, you need a reliable system to manage key roll over and so on. [17:15:56] yeah [17:16:34] http://www.theregister.co.uk/2012/09/06/jimmy_wales_complains_about_uk_snoopers_charter/ [17:17:00] jimmy wales, talking purely in a personal capacity [17:17:12] Is that even possible at the moment for all (including non-logged in) users? [17:17:16] sorry, talking in a purely personal capacity ;) [17:17:18] PROBLEM - SSH on mw8 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:17:22] does wikipedia use a HSM for the existign https stuff? [17:17:27] PROBLEM - Memcached on mw8 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:17:32] <^demon> mark: At least he qualified it with "err, I shouldn't speak for our technical staff" this time :) [17:17:42] jasperw: no [17:18:32] * csteipp would really like to see us implement dnssec [17:18:38] fwiw i use zkt on my zones: http://www.hznet.de/dns/zkt/ [17:19:05] which works well, but i suspect the wiki* setup is a little more complicated than my odd handful of zones :) [17:19:30] jasperw, please read https://www.mediawiki.org/wiki/Wikipmediawiki [17:20:22] ahhh computer is dyyying [17:20:27] oops [17:20:28] wrong chat [17:20:44] ok, wiki* == wikimedia (or wikimedia servers) [17:21:03] PROBLEM - Host mw8 is DOWN: PING CRITICAL - Packet loss = 100% [17:24:05] So what happened to operations/mediawiki-config master? [17:24:26] It's not listed on https://gerrit.wikimedia.org/r/#/admin/projects/operations/mediawiki-config,branches [17:24:32] And I'm getting fatal: Couldn't find remote ref master [17:26:34] * apergos wishes sqiud config was in puppet = in gerrit so they could commit and request a review [17:27:45] it's in git though now [17:27:46] I think [17:27:56] as of 2 weeks ago by paravoid I think [17:28:11] so you can request a review just fine, but not in gerrit ;) [17:28:32] or perhaps he didn't do it yet, i'm not sure [17:28:56] it is in git [17:29:11] I could add and not commit, then ask him or someone to look at the diff I guess [17:29:47] or you put it in a branch [17:29:57] New review: Sumanah; "No, Reedy isn't the only person who *can* do stuff like this, but too often he is. I'm going to ask..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21326 [17:30:10] right! [17:31:50] Am I the only one having trouble getting mw-config master right now? :/ [17:32:26] Krenair: ZOMG me too [17:32:26] ^demon: WTF is wrong with mediawiki-config? 
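On the DNSSEC question above: checking where a zone currently stands only takes a couple of queries. dig is stock BIND tooling; the resolver choice below is arbitrary.

```bash
# Is there a DS record at the parent, i.e. is the zone part of the signed chain?
dig +short DS wikipedia.org @8.8.8.8

# Does the zone return RRSIGs when queried with the DNSSEC-OK bit set?
dig +dnssec SOA wikipedia.org @8.8.8.8 | grep -c RRSIG
```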
[17:32:29] Nope, it's not got any branches for me [17:32:33] catrope@roanLaptop:~/mediawiki/git/mediawiki-config (master)$ git pull [17:32:34] Your configuration specifies to merge with the ref 'master' [17:32:36] from the remote, but no such ref was fetched. [17:32:42] https://gerrit.wikimedia.org/r/gitweb?p=operations/mediawiki-config.git;a=tree 404 [17:33:54] <^demon> RoanKattouw: Ugh, no clue. Haven't touched it. [17:34:08] <^demon> We attempted some git gc last night, wonder if that fubared it. [17:34:14] <^demon> Which would *suck* royally. [17:34:50] We have it on fenari [17:34:55] So the data isn't lost [17:35:29] <^demon> What the HELL [17:35:35] <^demon> https://gerrit.wikimedia.org/r/#/admin/projects/operations/mediawiki-config,branches [17:35:49] <^demon> master and refs/meta/config just *disappeared* [17:36:16] master is something we can reinstate fairly easily [17:36:20] But r/m/c is concercin [17:36:22] g [17:36:25] *concerning [17:36:53] <^demon> Well the repo's permissions weren't complicated. [17:36:59] <^demon> I'm just more worried why we had data loss. [17:37:58] <^demon> I can think of no plausible reason why this would've happened. [17:38:31] <^demon> It was just a `git gc --quiet`, no --prune options or anything. [17:42:52] What about the replication host? [17:43:00] Maybe it was the git fsck [17:43:01] Did it replicate the disappearances there too? [17:43:51] <^demon> Damianz: We only fsck'd core, not all repos. [17:43:58] New patchset: Demon; "Disabling `git gc` cron for gerrit for now" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22924 [17:44:00] <^demon> RoanKattouw: Shouldn't have, a gc is local. [17:44:08] <^demon> I can log into formey and check [17:44:46] <^demon> Actually though, the cron went on both hosts so we might've gotten bitten [17:44:48] <^demon> Lemme login now [17:44:52] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22924 [17:45:10] Only if it messed up in the same way [17:45:20] Also, let me add that running any sort of gc / repack on a backup host is a seriously bad idea [17:46:11] <^demon> Yup, both got hosed with this. [17:46:20] OK [17:46:25] Replication isn't a backup solution anyway, it's a HA solution [17:46:39] Grant Create Reference and we can push it back up from fenari [17:48:07] Depends on what you want it to be [17:48:10] <^demon> RoanKattouw: Granted, plus ownership, review, submit [17:48:18] I don't think formey is used as an HA solution, more like a backup [17:48:21] OK pushing from fenari [17:48:32] <^demon> It's a backup really, yeah [17:49:01] Which is why we shouldn't gc/repack it or touch it in any way ever [17:49:10] Also, is refs/review/* intact? [17:49:53] <^demon> I doubt it. [17:49:54] !log Pushing master of operations/mediawiki-config.git into gerrit from fenari [17:50:03] Because that would be truly lost [17:50:03] Logged the message, Mr. Obvious [17:50:23] ^demon: I need Forge {Author,Committer} Identity too [17:50:27] <^demon> Actually, everything still appears to be in refs/* [17:50:51] <^demon> Done [17:51:02] https://gerrit.wikimedia.org/r/gitweb?p=operations%2Fmediawiki-config.git;a=commit;h=0f50fa923ecf15ac32aba993699f443e1c407a50 seems to work [17:51:11] So refs/review seems to still be there [17:51:15] OK pushed [17:51:20] <^demon> refs/changes/* is all there. [17:51:22] <^demon> afaict. [17:51:58] master is back [17:54:02] New review: Kaldari; "I believe it was just closed and locked in 2003." 
[operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/22534 [18:06:45] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22924 [18:10:29] New patchset: RobH; "allocating ersch as secondary poolcounter server (tampa)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22931 [18:11:21] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22931 [18:17:24] !log stopping search indexing on searchidx1001 and searchidx2 to sync to eqiad [18:17:34] Logged the message, notpeter [18:23:41] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22931 [18:23:51] woooooo self review! \o/ [18:29:10] !log installing ersch as poolcounter server [18:29:19] Logged the message, RobH [18:30:05] $ connect com2 [18:30:05] connect: com2 port is currently in use [18:30:10] didnt miss that. [18:37:45] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [18:38:26] heh [18:40:17] !log authdns-update for ersch ip and removing old decom servers from pdns templates [18:40:26] Logged the message, RobH [18:42:15] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 235 seconds [18:44:47] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22856 [18:47:21] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , enwiki (19678) [18:48:03] New patchset: Jeremyb; "change all $ircecho_server to use the chat record" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22698 [18:48:24] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , enwiki (19211) [18:48:55] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22698 [18:49:27] * jeremyb wonders if mutante's been doing something to make it keep trying to merge? otherwise gerrit seems to be freaking out in a way I never saw before. (usually it just says it once?) [18:49:46] anyway, i realized it's not actually a CNAME so i fixed the commit msg [18:49:52] jeremyb: no, i have been wondering about that myself. it keeps retrying that [18:49:55] ARGHWEQHIOUWe;oiawr;lk [18:49:55] connect: com2 port is currently in use [18:50:03] HATE DRAC 5 >_< [18:50:13] crap [18:50:14] RobH: and DRAC 4 ??? [18:50:17] seems this repo is broken too [18:50:19] <^demon> mutante, jeremyb: It keeps retrying when there's dependencies. It's kinda annoying. [18:50:28] <^demon> Ryan_Lane: Which one? [18:50:31] jeremyb: dunno dont recall it [18:50:37] ^demon: i've seen deps before but never seen the perpetual retry [18:50:38] https://gerrit.wikimedia.org/r/#/c/22698/ [18:51:06] <^demon> That's not broken, I've seen it before. [18:51:11] <^demon> Not sure if it's a feature or a bug. [18:51:11] ah [18:51:12] good [18:51:25] <^demon> It happens when there's dependencies or somesuch. [18:51:38] <^demon> It keeps trying to remerge, thinking it'll suddenly magically succeed. [18:51:44] <^demon> So yeah, I'm pretty sure it's a "feature" [18:51:59] can't it just detect if there's still a dep and be silent if there is? [18:52:15] <^demon> jeremyb: File a bug upstream ;-) [18:52:15] (unless maybe it's a different dep? 
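Circling back to the mediawiki-config repair logged a bit earlier: once Create Reference and Forge Author/Committer were granted on the project, reinstating the lost branch from the surviving copy on fenari is a plain push. Only the idea is from the log; the working-copy path and remote name below are guesses.

```bash
# On fenari, from the checkout of operations/mediawiki-config:
cd /home/wikipedia/common      # path is an assumption
git remote -v                  # confirm which remote points at gerrit
git push gerrit master         # recreate refs/heads/master on the server
```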
[18:52:28] ^demon: maybe later ;) [18:52:44] anyway, going away for a while, have to get some work done [18:53:00] i also thought i have seen before how it does not merge because of the dep, but the retying is new [18:53:16] <^demon> It doesn't happen in all scenarios. [18:53:27] <^demon> I've probably seen it less than 10 times. [18:53:52] I really hope that doesn't send an email out each time [18:53:59] Damianz: it does! [18:54:05] Oh god... [18:54:05] <^demon> Oh yes, it certainly does. [18:54:46] <^demon> Gerrit's policy towards e-mail notifs seems to be "if anything happens, ever, send an e-mail, someone might care." [18:55:01] Damianz: re: ascii cats. find . | xargs cowsay -f hellokitty [18:55:14] surely a hook could suppress [18:55:31] but someone would have to write it! [18:55:33] <^demon> There's no hook for that. [18:55:47] <^demon> Would need an @ExtensionPoint or somesuch [18:55:51] * ^demon has bigger fish to fry [18:55:54] bye [18:56:37] Heh I never knew cowsay could do cats :D [18:57:05] hehe, yes, it has tons of "fonts" :) [18:57:14] you can write your own:) [18:57:15] I wonder how hard it is to get commits in upstream, would suck to effectivly have to fork gerrit. [18:57:29] <^demon> Damianz: Not hard, at all. [18:57:34] :) [18:57:37] <^demon> I've contributed several :) [18:57:55] <^demon> https://gerrit-review.googlesource.com/#/q/owner:%22Chad+Horohoe%22,n,z [18:58:43] That's a few several, they need a better theme though heh [19:00:07] <^demon> Yeah, they still use the puke green :) [19:05:21] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 286 seconds [19:06:47] New patchset: Aaron Schulz; "Added global backend config for things like math." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/22941 [19:31:15] New patchset: Ottomata; "udp2log.pp - ensuring that udp-filter is installed instead of latest." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22947 [19:32:09] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22947 [19:37:08] hioy, could someone merge that pretty please? [19:37:08] https://gerrit.wikimedia.org/r/22947 [19:37:09] notpeter? [19:38:27] sup? [19:38:47] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22947 [19:38:50] thanks [19:38:51] dunno why [19:38:53] but sure :) [19:51:37] New review: Helder.wiki; "Is this supposed to be live once it is "merged"?" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21475 [19:53:21] !log stopping puppet on oxygen. (puppet upgraded udp-filter before I was ready, I have to make sure ip filtering still works before I can turn it back on) [19:53:30] Logged the message, Master [20:06:40] New review: Reedy; "IT's there as is in the config, and I know InitialiseSettings has been sync'd numerous times since" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21475 [20:12:15] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 27 seconds [20:14:39] RECOVERY - MySQL disk space on storage3 is OK: DISK OK [20:14:48] RECOVERY - Host storage3 is UP: PING OK - Packet loss = 0%, RTA = 0.71 ms [20:14:49] New patchset: Ori.livneh; "Remove udp2log instance from vanadium" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22988 [20:15:43] New review: gerrit2; "Lint check passed." 
[20:15:43] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22988
[20:16:07] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22988
[20:17:25] jeff_green: storage3 is back up and you are able to access the array
[20:17:33] excellent, thanks
[20:17:39] so it was two failed disks?
[20:17:48] yep.. the 2nd disk crashed it
[20:18:05] qwality
[20:18:55] TimStarling: binasher Looks like we've not been getting anything in the memcached error logs on fluorine since the 29th/30th August
[20:18:59] Seems a bit suspect
[20:20:25] New patchset: Jgreen; "deprecating misc::fundraising::impressionlog::compress" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22989
[20:21:17] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22989
[20:32:50] New patchset: Jgreen; "remove misc::fundraising::impressionlog::compress" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22992
[20:33:42] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22992
[20:37:36] RECOVERY - mysqld processes on storage3 is OK: PROCS OK: 1 process with command name mysqld
[20:40:36] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 92082 seconds
[20:41:57] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours
[20:41:57] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours
[20:41:57] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours
[20:47:57] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[20:50:55] about to run scap
[21:01:21] New patchset: Dereckson; "(bug 39942) Disables UseRCPatrol on fi.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/22999
[21:22:07] New review: Dereckson; "shellpolicy" [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/22999
[21:29:33] New patchset: RobH; "ersch set to autopart with 250gb raid 1" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23005
[21:30:27] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/23005
[21:31:11] apergos: can you guess the best kind of review?
[21:31:14] New review: Alex Monk; "Hello?" [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/12556
[21:31:17] New review: RobH; "self review \o/" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/23005
[21:31:17] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23005
[21:31:22] review by someone *else*!
[21:31:57] i amuse the hell out of me.
[21:34:42] It's when you self-review and give criticism then reject the change we'll worry
[21:35:04] heh… that i like to see
[21:51:45] Damianz: now i feel the need to do so.
[22:00:01] anyone wanna review a dns change for me?
[22:00:24] /tmp/atg-msbe13-diff.txt on sockpuppet, db70 being renamed to ms-be13, new internal ip but same old mgmt ip
[22:02:48] RobH: you can telnet to poolcounterd on port 7531 and type STATS FULL
[22:03:09] binasher: ohhh, thanks!
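For reference, the check binasher describes looks roughly like this from a shell — a minimal sketch, with the hostname purely illustrative and port 7531 taken from the message above:

```bash
# Interactive: connect, type STATS FULL, and leave with the telnet escape
# character (Ctrl+] then "quit" at the telnet> prompt).
telnet ersch 7531

# Non-interactive alternative, if netcat is available:
echo "STATS FULL" | nc -q 1 ersch 7531
```

On a freshly pooled server the counters should be mostly zeros and start moving once traffic reaches it, which matches the next message in the log.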
[22:03:43] on a newly started instance it should have a lot of 0's but that should change quickly after adding it to the site
[22:04:01] yep, i can add and watch it spin into service, at least i can compare output against tarin too
[22:04:02] thank you
[22:04:12] (that's gonna go on the wikitech page right? :-P)
[22:04:25] it will once I confirm it works for me in pushing it live, yep
[22:04:32] yay!
[22:05:01] stinks the only test is once it's live, though i suppose the telnet before pool shows service is responding
[22:05:07] so seems a good basic test.
[22:06:16] apergos: looks good to me. maybe just the newlines in 10.in-addr-arpa between 209/210,211/212 etc.
[22:07:07] line 213,216,219,222
[22:09:52] wtf is the telnet esc sequence
[22:11:13] lol
[22:11:33] telnet...old.
[22:11:39] and the one i thought it was, it was, but isn't working
[22:11:41] awesome
[22:11:47] apergos: so is https://gerrit.wikimedia.org/r/#/c/22941/ ok to you?
[22:12:04] ctrl+]?
[22:12:12] PROBLEM - ps1-d1-sdtpa-infeed-load-tower-A-phase-Y on ps1-d1-sdtpa is CRITICAL: ps1-d1-sdtpa-infeed-load-tower-A-phase-Y CRITICAL - *2688*
[22:12:22] just a sec
[22:12:50] I thought about just re-using commons, but that's evil
[22:12:57] Damianz: yea i tried that, seems it's just not working when i do it
[22:13:07] i think it's cuz poolcounter is technically still outputting from my show command
[22:13:11] even though it's not scrolling, oh well
[22:13:26] apergos: also, how many files are in math/ ?
[22:13:29] Yeah... telnet sorta sucks like that
[22:13:33] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours
[22:13:33] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours
[22:13:46] math-render is ok I guess,
[22:13:50] lemme look at the last bit
[22:14:30] yeah that seems ok
[22:14:49] * apergos thinks about this for a minute
[22:16:03] we'll have to migrate the contents to the container(s) but also I think rewrite.py is going to need to be fixed up, it's not set to handle anything outside of thumbs, temp, orig
[22:16:19] lemme look at the file count in math, for your other question
[22:16:31] apergos: sure, rewrite will need to change after migration
[22:16:42] PROBLEM - ps1-d1-sdtpa-infeed-load-tower-A-phase-Y on ps1-d1-sdtpa is CRITICAL: ps1-d1-sdtpa-infeed-load-tower-A-phase-Y CRITICAL - *2588*
[22:16:50] I'd like for us to strengthen the rewrite regex too
[22:16:54] good
[22:16:59] similar to the squid ones
[22:17:04] well after the migration but before the config goes out
[22:18:02] waiting for the count to come back
[22:21:21] PROBLEM - ps1-d1-sdtpa-infeed-load-tower-A-phase-Y on ps1-d1-sdtpa is CRITICAL: ps1-d1-sdtpa-infeed-load-tower-A-phase-Y CRITICAL - *2575*
[22:24:21] PROBLEM - ps1-d1-sdtpa-infeed-load-tower-A-phase-Y on ps1-d1-sdtpa is CRITICAL: ps1-d1-sdtpa-infeed-load-tower-A-phase-Y CRITICAL - *2638*
[22:27:21] RECOVERY - ps1-d1-sdtpa-infeed-load-tower-A-phase-Y on ps1-d1-sdtpa is OK: ps1-d1-sdtpa-infeed-load-tower-A-phase-Y OK - 2400
[22:36:30] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours
[22:36:30] PROBLEM - Puppet freshness on virt1000 is CRITICAL: Puppet has not run in the last 10 hours
[23:14:37] New patchset: Catrope; "Set $wgForceUIAsContentMsg for bewikimedia, per request on IRC" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23015
[23:17:28] Change merged: Catrope; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23015
[23:21:31] New patchset: Catrope; "...and anonnotice too" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23018
[23:21:49] Change merged: Catrope; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23018
[23:33:48] RECOVERY - poolcounter on helium is OK: PROCS OK: 1 process with command name poolcounterd
[23:46:12] PROBLEM - poolcounter on ersch is CRITICAL: PROCS CRITICAL: 0 processes with command name poolcounterd
[23:52:30] PROBLEM - Puppet freshness on search14 is CRITICAL: Puppet has not run in the last 10 hours
[23:53:33] PROBLEM - Puppet freshness on search26 is CRITICAL: Puppet has not run in the last 10 hours
[23:58:30] PROBLEM - Puppet freshness on search24 is CRITICAL: Puppet has not run in the last 10 hours
[23:58:30] PROBLEM - Puppet freshness on search16 is CRITICAL: Puppet has not run in the last 10 hours
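The poolcounter alerts above are plain process-count checks, so they can be reproduced by hand on the affected host. A minimal sketch; the plugin path is the usual Debian location rather than anything confirmed in this log:

```bash
# Is poolcounterd running at all?
pgrep -fl poolcounterd || echo "poolcounterd is not running"

# Roughly what the monitoring check does: expect exactly one poolcounterd process.
/usr/lib/nagios/plugins/check_procs -w 1:1 -c 1:1 -C poolcounterd
```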