[00:10:56] !log touch /etc/exim4/defer_domains on streber [00:11:04] Logged the message, Master [00:12:27] New patchset: Andrew Bogott; "Create an empty /etc/exim4/defer_domains if it does not exist." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69267 [00:12:51] mutante, ^ [00:13:07] ah:) [00:13:54] I'll run that by Mark in the morning. [00:14:01] New review: Dzahn; "yes please, just touch'ing that file fixed exim on streber after the template change" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/69267 [00:14:01] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69267 [00:14:13] Oh, nevermind :) [00:16:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:18:03] andrewbogott: mutante: You don't need to do "Verified: +2" when jenkins-bot is already running. When jenkins-bot has already set it, doing it again does nothing. By doing it as a habit you only increase the chances of accidentally merging something that has a critical lint error or whatever [00:18:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [00:18:37] Krinkle, ok, good to know -- I thought jenkins only did +1 verified [00:20:10] We haven't revoked the ability to set V+2 from human accounts because lately jenkins/gerrit has been a bit unstable. So if you need to do an emergency merge and jenkins-bot is unavailable (e.g. not voting -2 or +2 either way), then it allows you do override it since the Submit merge button is only clickable if Verified+2 is set. [00:20:31] Be be sure to be aware of overriding it, only when neccecary and aware of it. [00:20:32] Thanks :) [00:30:27] New review: GWicke; "A single-layer cache with two parses per edit would complicate some things for us. API failures can ..." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/68404 [00:31:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:32:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [00:47:42] !log reloading squid config on esams text squids [00:47:51] Logged the message, Master [00:49:01] New patchset: Ottomata; "Fixing packet_loss_log file on oxygen udp2log instance" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69272 [00:49:29] New patchset: Ottomata; "Fixing packet_loss_log file on oxygen udp2log instance" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69272 [00:49:52] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69272 [00:50:07] !log reloading squid config on pmtpa.txt [00:50:17] Logged the message, Master [00:52:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:53:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.146 second response time [01:01:54] RECOVERY - NTP on ssl3003 is OK: NTP OK: Offset -0.005719184875 secs [01:02:56] RECOVERY - NTP on ssl3002 is OK: NTP OK: Offset -0.004811644554 secs [01:03:19] !log Gracefully reloading Zuul to deploy {{gerrit|I5695a3b988e9ec3138f}} [01:03:30] Logged the message, Master [01:31:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:32:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [01:53:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:54:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [02:01:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:02:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [02:07:32] !log LocalisationUpdate completed (1.22wmf7) at Tue Jun 18 02:07:31 UTC 2013 [02:07:42] Logged the message, Master [02:13:24] !log LocalisationUpdate completed (1.22wmf6) at Tue Jun 18 02:13:24 UTC 2013 [02:13:32] Logged the message, Master [02:25:43] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue Jun 18 02:25:42 UTC 2013 [02:25:53] Logged the message, Master [02:41:44] Account creation seems to be broken on en.wp - anybody knows what's going on (or if somebody is still around)? [02:56:03] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours [02:56:14] andre__, I'm looking into it. If anyone knows anything, please ping me mentioning my username. [02:56:46] superm401: ah, you're mflaschen. didn't know your nick. :) Thanks! [02:58:29] superm401: I'd love to see the stacktrace for that exception (not that I could change anything though) - can you access that? [02:58:47] I should be able to, testing locally first. [02:59:51] Can't reproduce locally on the same commit, so it's a prod issue most likely. 
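For reference, the defer_domains fix discussed at the start of this log (change 69267) only needs to guarantee that the file exim's new template reads is present, mirroring the manual `touch` on streber. A minimal Puppet sketch of that kind of resource; the ownership and mode shown here are assumptions, not necessarily what the merged manifest uses:

```puppet
# Sketch only: make sure /etc/exim4/defer_domains exists without ever
# managing its contents, so the exim template can reference it safely.
file { '/etc/exim4/defer_domains':
    ensure  => present,
    owner   => 'root',
    group   => 'root',   # assumption; the real file may belong to Debian-exim
    mode    => '0644',
    replace => false,    # create it if missing, never overwrite local entries
}
```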
[03:01:43] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:02:33] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.131 second response time [03:02:47] serious stuff: every account creation on enwiki fails with "Fatal exception of type MWException". fluorine finds "Exception from line 215 of /usr/local/apache/common-local/php-1.22wmf6/includes/Hooks.php: Detected bug in an extension! Hook VisualEditorHooks::onAddNewAccount failed to return a value; should return true to continue hook processing or false to abort." [03:02:48] Anyone from VE team around? [03:05:33] in #wikimedia-tech somebody says that account creation works e.g. on meta [03:05:53] spagewmf: https://bugzilla.wikimedia.org/show_bug.cgi?id=49727 [03:06:08] superm401 is taking a look [03:06:36] spagewmf, okay, well that tells us what it is. [03:06:50] I'll do a hotfix to workaround or fix it. [03:08:08] * andre__ assigns the bug report in the meantime [03:09:32] !log Started hotfix for account creation on wmf6. [03:09:39] !log mflaschen synchronized php-1.22wmf6/extensions/VisualEditor/VisualEditor.php 'Emergency hotfix to fix account creation' [03:09:45] Logged the message, Master [03:09:48] superm401 I see what you did there, thanks! [03:09:53] Logged the message, Master [03:10:32] Thanks so much! [03:11:40] Didn't work [03:12:14] superm401 what's the new error code I'll grep it on fluorine [03:12:25] spagewmf, cf48759f [03:13:10] Same, [cf48759f] /w/index.php?title=Special:UserLogin&action=submitlogin&type=signup&returnto=Main+Page Exception from line 215 of /usr/local/apache/common-local/php-1.22wmf6/includes/Hooks.php: Detected bug in an extension! Hook VisualEditorHooks::onAddNewAccount failed to return a value; should return true to continue hook processing or false to abort. [03:13:36] you need to sync VisualEditor.hooks.php ? [03:13:43] spagewmf, yeah, just saw it. [03:14:09] Sorry [03:14:14] Resyncing now [03:14:21] !log mflaschen synchronized php-1.22wmf6/extensions/VisualEditor/VisualEditor.hooks.php 'Emergency hotfix to fix account creation' [03:14:30] Logged the message, Master [03:15:08] There we go [03:15:19] Tested on enwiki successfully. [03:15:21] Thanks, spagewmf [03:15:39] awesome [03:15:48] Oh, there you are, hah. [03:16:21] FWIW https://gerrit.wikimedia.org/r/69277 fixes it "properly" [03:27:26] superm401, spagewmf, nice work [03:27:58] Thanks, spagewmf gets a big chunk of the credit [03:28:20] marktraceur, your fix has the same checksum as superm401's! It's like y'all are psychic :) [03:29:10] Because it's a supersimple fix! :) [03:29:45] And the winner of the dunce cap is James_F|Away, by the way [03:30:10] * marktraceur is just trolling, poor James_F [03:30:33] PROBLEM - Host wtp1008 is DOWN: PING CRITICAL - Packet loss = 100% [03:31:53] RECOVERY - Host wtp1008 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [03:31:55] don't troll, it isn't nice [03:32:59] what do we do about the accounts that were half-created? [03:34:20] ori-l, can you check the user table for 'Test 2013-06-17 unusable 3'? [03:34:28] I'm wondering what if anything was saved to there. [03:35:23] * ori-l looks [03:37:50] 1.22wmf7 has the same problem, OK if I sync that ? [03:38:35] superm401 should if he can, I think [03:38:57] he's deployed one; i don't think it's good to swap hats mid-fix [03:38:59] spagewmf, will do [03:39:10] superm401 ^ the file is in 1.22wmf7, just needs the sync-file [03:39:28] Should have checked that. 
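The exception quoted above is MediaWiki's generic complaint about a hook handler that falls off the end of the function without returning anything; the hotfix (and the proper fix in change 69277) comes down to making the handler return a value. A minimal sketch of a conforming handler, with VisualEditor's actual hook body elided:

```php
<?php
// Sketch of the hook contract, not VisualEditor's real code. Hooks::run()
// (includes/Hooks.php line 215 in 1.22wmf6) throws "Detected bug in an
// extension!" whenever a handler returns null, which is what happens when
// the function ends without an explicit return statement.
class VisualEditorHooks {
	public static function onAddNewAccount( $user, $byEmail ) {
		// ...whatever the extension does when an account is created...
		return true; // true lets the remaining handlers run; false aborts
	}
}
```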
[03:40:06] there's no "should have" in teamwork either [03:40:25] !log mflaschen synchronized php-1.22wmf7/extensions/VisualEditor/VisualEditor.hooks.php 'Hotfix to fix account creation' [03:40:33] Logged the message, Master [03:42:52] superm401: you or some other enwiki admin could temporarily override one of the messages used in Special:UserLogin and add a notice acknowledging the problem and urging users to try again. Do you think that would be appropriate? [03:44:07] The half-created accounts are strange: you're logged in, but trying to go to your page gives 'User account "SpageTest AC 0617-2" is not registered.' [03:44:13] ori-l, not sure, how long was it broken? [03:44:21] spagewmf, yeah, it looks like there's no user table entry at all. [03:44:28] superm401, it's there [03:44:37] user_id 19205308 [03:44:42] but there's no password or e-mail [03:45:51] ori-l, weird, that's not what I get locally, maybe it depends on the master/slave setup. [03:46:18] superm401, I think 23:52 logmsgbot: catrope Finished syncing Wikimedia installation... : Updating VisualEditor to master to 03:15 logmsgbot: mflaschen synchronized php-1.22wmf6/extensions/VisualEditor/VisualEditor.hooks.php 'Emergency hotfix to fix account creation' [03:47:28] I was going to say no, but that's almost three and half hours, so maybe it does make sense. [03:47:56] SELECT count(*) FROM `user` WHERE user_registration > '20130617000000' AND `user_password` = ''; [03:47:58] = 1970 [03:48:01] ouch. [03:48:51] I think we need to escalate this and get more input. [03:48:56] RoanKattouw, you around? [03:49:00] Ys [03:49:13] I supplied e-mails in my account creations, and got the "e-mail address confirmation" mail, but unsurprisingly the confirm URL doesn't work. [03:49:17] So we have 1970 half-created accounts with no password? [03:49:27] RoanKattouw, yeah. [03:49:53] Given that spagewmf got a confirmation email, I guess we do have email addresses for these people? [03:50:08] for those who provided one... [03:50:12] Right [03:50:29] Because one thing we could do is do password resets for the ones that have an email address [03:50:47] I don't know how to trigger those off the top of my head but it seems like a reasonable idea [03:50:48] only for 54 out of 1970 [03:50:51] Ugh [03:51:24] We can either delete the accounts or mangle the user name (by adding an underscore, say) which would at least release it so that it can be re-used, but I'm not sure how to go about doing that without breaking SUL [03:51:43] s/it/them [03:53:21] For SUL you shouldn't look at me, that's better left to Chris [03:53:56] I'm thinking we should delete the accounts, so the names are released and they can be recreated [03:54:13] We should reset the ones that do have emails. [03:54:14] Additionally, the 54 people that did provide an email address could be contacted [03:54:16] Yes [03:54:39] As for deleting the other accounts, they'd probably need to be deleted from the centralauth tables too and I don't know how that works [03:55:13] So let me figure out the password reset [03:55:15] I think an admin can just go to Special:PasswordReset for the 54. [03:55:19] The bug report says resets don't work, but I haven't verified that. [03:55:22] Oh that would be nice [03:55:37] The initial confirmation probably doesn't work [03:55:42] The resets should work, hopefully [03:56:04] When done by an admin at least [03:56:22] RoanKattouw, I'm an admin. I don't think we have any special reset privileges. [03:56:55] To reset, I think the emails will have to be force-confirmed. 
[03:57:04] Right [03:57:10] You can only reset with a confirmed email, which is probbly the issue they hit on the bug report. [03:57:24] Yeah I figured [03:57:29] But I thought admins could get around that maybe [03:57:47] Don't think we have any special abilties. Maybe stewards? [03:57:52] I think it'd be OK to force-confirm the e-mails, since these users will only be able to log in if they actually receive the reset e-mail [03:58:43] ori-l, also, I don't have a staff account on the content wikis, and I'm not supposed to use my personal one for this kind of thing. [03:59:15] OK let's just force-confirm them [03:59:33] ori-l, I'm not 100% convinced it's that useful to change the login/signup message, but if you think it is, we need to find someone (preferably to do it on all the VE wikis, not just enwiki). [03:59:46] RoanKattouw: want me to do it? [03:59:53] Sure go ahead [04:01:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:02:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.145 second response time [04:04:03] what other wikis is VE deployed to? [04:05:39] ori-l, lots of Wikipedias, and MW.org [04:06:04] hoooooooooooo boy. [04:06:18] I'm looking at s1-analytics-slave's enwiki user table and my new account doesn't have a user_email. Is it stored somewhere else before it's confirmed? [04:06:21] ori-l, https://git.wikimedia.org/blob/operations%2Fmediawiki-config.git/HEAD/visualeditor.dblist, I believe. [04:07:06] ok, sanity check my pseudo-code: [04:07:22] for each wiki in visualeditor.dblist: [04:07:38] for each account where account created today and account password is blank and account e-mail is not blank: [04:08:18] confirm email & save [04:08:22] right? [04:09:38] ori-l, maybe do it since the start of the deploy window. [04:09:47] Sounds good [04:09:56] Maybe we should page someone to have them double-check. [04:10:02] At least we should check with Chris. [04:10:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:11:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.131 second response time [04:12:46] !log Manually marked the e-mail address of 54 enwiki users with e-mail address configured but no password (due to bug 49727) as confirmed. [04:12:56] Logged the message, Master [04:13:25] superm401: I did those before your last message; I'm not disregarding your suggestion [04:14:57] ori-l, RoanKattouw, I emailed ops, just in case someone's not watching IRC. [04:15:55] Yeah I saw [04:15:57] Thanks a lot guys [04:16:02] And sorry for not spotting that in CR :) [04:16:04] * :( [04:16:20] (brb) [04:16:43] I told James I didn't test it, but I wish I saw that or took the time to test. [04:17:09] spagewmf, I think it's all in the user table, is it possible Analytics sanitizes that field? [04:18:24] I have to put Noam to bed, I'll be back ASAP [04:21:39] Random tidbit, when testing locally, it seems to grab a user_id, but never actually write the row. [04:21:54] So when I created some with the broken version, then a working one, it jumped for user_id 1 to user_id 8. [04:21:57] With only two actual rows. [04:23:19] superm401 perhaps hook order is dependent on extension ordering, so you may get different behavior than a WMF wiki even if you load the same extensions. 
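ori-l's pseudo-code above maps fairly directly onto a MediaWiki maintenance script run once per wiki in visualeditor.dblist (for example via the foreachwikiindblist wrapper). The sketch below is only an illustration of that loop, not the script that was actually run; the registration cutoff and the selection criteria are taken from the discussion:

```php
<?php
// Illustration only: confirm the e-mail address of accounts half-created by
// bug 49727 (blank password, non-blank e-mail) so Special:PasswordReset works.
require_once __DIR__ . '/Maintenance.php';

class ConfirmBrokenSignups extends Maintenance {
	public function execute() {
		$dbw = $this->getDB( DB_MASTER );
		$res = $dbw->select(
			'user',
			'user_id',
			array(
				'user_password' => '',
				"user_email != ''",
				// start of the affected deploy window (assumed value)
				'user_registration > ' . $dbw->addQuotes( '20130617235200' ),
			),
			__METHOD__
		);
		foreach ( $res as $row ) {
			$user = User::newFromId( $row->user_id );
			$user->confirmEmail();  // sets user_email_authenticated
			$user->saveSettings();
			$this->output( "Confirmed e-mail for {$user->getName()}\n" );
		}
	}
}

$maintClass = 'ConfirmBrokenSignups';
require_once RUN_MAINTENANCE_IF_MAIN;
```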
[04:23:37] StevenW wasn't me boss :) [04:23:40] spagewmf, plus, I'm certainly not loading the same extensions (including important ones like CentralAuth) [04:25:35] lunchtime, BIAB [04:51:28] PROBLEM - NTP on ssl3003 is CRITICAL: NTP CRITICAL: No response from NTP server [04:52:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:54:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [04:57:48] PROBLEM - NTP on ssl3002 is CRITICAL: NTP CRITICAL: No response from NTP server [05:16:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:17:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [05:26:57] FYI Ganglia > Wikimedia Grid > Miscellaneous eqiad > vanadium.eqiad.wmnet, choose day, see the spike in the Exceptions graph. Also there's a link to a canned graph of exceptions & fatals in the last 2 hours in https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Test_and_monitor_your_live_code [05:39:32] Thanks, spagewmf [05:40:25] it's all Ori, I just rite stuff down [05:43:42] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:44:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [06:58:53] back [07:11:04] !log Marking e-mail addresses of accounts created in last 24h with no password as "confirmed" on all wikis on which VE is enabled. [07:11:13] Logged the message, Master [07:18:21] !log Force-confirmed e-mail for 160 accounts. full list in fluorine:/home/olivneh/users-forced-confirm-email-18-Jun-2013.list [07:18:30] Logged the message, Master [07:19:19] New review: Yurik; "even though I would have probably inlined the $zeroRated variable, looks ok." 
[operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/64629 [07:48:15] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [07:48:15] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours [07:48:15] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [07:48:15] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [07:48:15] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours [07:48:16] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [07:48:16] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [07:48:17] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [07:48:17] PROBLEM - Puppet freshness on spence is CRITICAL: No successful Puppet run in the last 10 hours [07:48:18] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours [07:48:18] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [08:01:35] RECOVERY - NTP on ssl3003 is OK: NTP OK: Offset 0.008830189705 secs [08:05:20] New patchset: Nemo bis; "Add tools- and etherpad.wmflabs.org to $wgNoFollowDomainExceptions" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69289 [08:13:34] !log nikerabbit synchronized php-1.22wmf7/extensions/UniversalLanguageSelector/ 'ULS to master' [08:13:44] Logged the message, Master [08:15:26] !log nikerabbit synchronized php-1.22wmf6/extensions/UniversalLanguageSelector/ 'ULS to master' [08:15:35] Logged the message, Master [08:17:40] PROBLEM - Puppet freshness on sodium is CRITICAL: No successful Puppet run in the last 10 hours [08:18:42] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68146 [08:21:40] PROBLEM - Puppet freshness on magnesium is CRITICAL: No successful Puppet run in the last 10 hours [08:25:46] !log nikerabbit synchronized wmf-config/InitialiseSettings.php 'ULS phase 2' [08:25:53] Logged the message, Master [08:32:20] RECOVERY - NTP on ssl3002 is OK: NTP OK: Offset 0.009659051895 secs [09:12:55] hello [09:43:26] New review: Hashar; "Ah that is what fixed it. Thanks!" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68833 [09:46:51] New review: Hashar; "I needed them for https://integration.wikimedia.org/ci/view/Analytics/job/analytics-wikistats/ which..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68327 [10:10:06] New patchset: ArielGlenn; "remove wgSquidServersNoPurge for labs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69299 [10:17:08] New review: Hashar; "Editing a page over HTTPS, the request will pass via the nginx proxy and then the text cache. The re..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69299 [10:31:46] New patchset: ArielGlenn; "clean up squid NoPurge list for beta labs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69299 [10:33:32] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69299 [11:40:40] New review: Petrb; "yes we can, but this was it is FAR more simple and it has same potential as having a variable (we ca..." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/69133 [12:28:36] New patchset: ArielGlenn; "add upload cache to squids list for labs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69304 [12:43:25] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69304 [12:56:24] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours [13:13:53] New review: coren; "It's a reasonable way of doing it." [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/69133 [13:13:54] Change merged: coren; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69133 [13:52:19] New patchset: Ottomata; "Lowering alert thresholds on kakfa-broker-ProduceRequestsPerSecond" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69307 [13:52:33] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69307 [13:56:11] it is going to get noisy here in a few minutes...relocating search [14:00:12] !log powering down search1001-1024 to relocate to row c5 [14:00:20] Logged the message, Master [14:02:25] PROBLEM - Host search1024 is DOWN: PING CRITICAL - Packet loss = 100% [14:03:05] PROBLEM - Host search1023 is DOWN: PING CRITICAL - Packet loss = 100% [14:03:05] PROBLEM - Host search1022 is DOWN: PING CRITICAL - Packet loss = 100% [14:03:15] PROBLEM - Host search1021 is DOWN: PING CRITICAL - Packet loss = 100% [14:03:15] PROBLEM - Host search1020 is DOWN: PING CRITICAL - Packet loss = 100% [14:03:35] PROBLEM - Host search1019 is DOWN: PING CRITICAL - Packet loss = 100% [14:04:05] PROBLEM - Host search1017 is DOWN: PING CRITICAL - Packet loss = 100% [14:04:05] PROBLEM - Host search1018 is DOWN: PING CRITICAL - Packet loss = 100% [14:04:37] PROBLEM - Host search1016 is DOWN: PING CRITICAL - Packet loss = 100% [14:04:38] PROBLEM - Host search1015 is DOWN: PING CRITICAL - Packet loss = 100% [14:04:38] PROBLEM - LVS Lucene on search-pool5.svc.eqiad.wmnet is CRITICAL: No route to host [14:04:45] PROBLEM - Host search1014 is DOWN: PING CRITICAL - Packet loss = 100% [14:04:56] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: No route to host [14:05:02] PROBLEM - LVS Lucene on search-prefix.svc.eqiad.wmnet is CRITICAL: No route to host [14:05:29] PROBLEM - Host search1008 is DOWN: PING CRITICAL - Packet loss = 100% [14:05:29] PROBLEM - Host search1010 is DOWN: PING CRITICAL - Packet loss = 100% [14:05:29] PROBLEM - Host search1012 is DOWN: PING CRITICAL - Packet loss = 100% [14:05:29] PROBLEM - Host search1009 is DOWN: PING CRITICAL - Packet loss = 100% [14:05:29] PROBLEM - Host search1011 is DOWN: PING CRITICAL - Packet loss = 100% [14:05:30] PROBLEM - Host search1013 is DOWN: PING CRITICAL - Packet loss = 100% [14:05:38] PROBLEM - LVS Lucene on search-pool3.svc.eqiad.wmnet is CRITICAL: No route to host [14:06:18] PROBLEM - Host search1007 is DOWN: PING CRITICAL - Packet loss = 100% [14:06:48] PROBLEM - Host search1006 is DOWN: PING CRITICAL - Packet loss = 100% [14:06:48] PROBLEM - Host search1005 is DOWN: PING CRITICAL - Packet loss = 100% [14:06:49] PROBLEM - LVS Lucene on search-pool2.svc.eqiad.wmnet is CRITICAL: No route to host [14:07:18] PROBLEM - Host search1002 is DOWN: PING CRITICAL - Packet loss = 100% [14:07:18] PROBLEM - Host search1004 is DOWN: PING CRITICAL - Packet loss = 100% [14:07:18] PROBLEM - Host search1003 is DOWN: PING CRITICAL - Packet loss = 100% [14:07:18] PROBLEM - Host search1001 is DOWN: 
PING CRITICAL - Packet loss = 100% [14:07:58] PROBLEM - LVS Lucene on search-pool1.svc.eqiad.wmnet is CRITICAL: No route to host [14:11:46] hiii anybody up? [14:12:00] can I add Stefan Petrea to the admins:::mortals class? [14:12:08] he needs to be able to access stat1002.eqiad.wmnet [14:12:11] which does not have a public IP [14:12:21] which means he needs an account on bast1001 and/or fenari [14:14:17] <^demon> ottomata: New shell users have to go through RT I believe. [14:18:00] he's not a new user [14:18:05] he's already got the account on stat1002 [14:18:14] he just hasn't logged into it yet [14:18:17] i'm trying to help him now [14:18:26] and I noticed that he doesn't have a fenari or bast1001 account [14:27:18] baackk [14:36:13] New patchset: Andrew Bogott; "Move mail manifests to a module called 'exim'" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68584 [14:41:20] mark (if around) would you have a look at https://gerrit.wikimedia.org/r/#/c/68584/ ? [14:41:32] <^demon> ottomata: I guess just push for review. If someone has a problem I'm sure they'll say so :) [14:42:03] New review: Andrew Bogott; "Before this is merged we need to change the node definition in ldap for labs instances." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68584 [15:20:25] New patchset: Ottomata; "Adding spetrea to admins::mortals so he has an account on bastion hosts." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69313 [15:21:25] New review: Ottomata; "Do I need to make an RT for this? He already hass server access on several machines, including one ..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69313 [15:25:49] paravoid, hi, have you had a chance to review my two patches? one is tiny style update, but the other removes X-Carrier [15:25:59] I'm looking at them now [15:26:10] nice coincidence [15:26:27] so, where does the 30d comes from? [15:27:04] mediawiki sets 30d maxage in cache-control? [15:30:24] Change abandoned: Ottomata; "Not going to use this at all." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48041 [15:30:57] yurik: ^ [15:31:37] paravoid, yep [15:31:45] okay [15:31:48] they need rebasing [15:32:00] hm, or maybe not [15:32:17] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68840 [15:33:03] thanks! [15:33:11] what about X-Carrier in zero.inc.vcl? [15:33:24] ah wait [15:33:50] New patchset: Faidon; "Removed X-Carrier header" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68841 [15:34:32] ? [15:34:38] nevermind [15:34:43] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68841 [15:34:59] ottomata: heads up, X-Carrier is about to be gone, in case it matters for analytics [15:35:12] I think that's fine, we're only using X-Analytics afaik [15:35:19] i don't think X-Carrier is in the logs anymore [15:35:37] yeah that's not a problem at all, we never used that [15:35:43] andrewbogott: I don't like mailman being in the exim module [15:36:02] hey guys, do I need an RT for this? 
[15:36:03] https://gerrit.wikimedia.org/r/#/c/69313/ [15:36:04] mark: ok, I can break it out into a separate one [15:36:16] stefan already has access to stat1002.eqiad.wmnet, just not an official bastion host [15:37:09] otherwise I think it's fine, it's not a very segregated module and all, but let's not care about that too much at this point, at least it's being moved into a module ;) [15:38:22] yurik: btw, has any progress been made into creating a mobile/zero-enabled index page? [15:38:31] instead of doing all these redirects in varnish [15:38:35] as we previously talked about [15:39:16] paravoid, not really - were mostly trying to figure out a way to get rid of X-CS fragmentation [15:39:34] have you seen the latest email about it ? [15:39:40] i sent a "short" version [15:40:27] btw, you might want to look at it as it adds a lot new redirects to the mex [15:40:30] *mix [15:44:34] so, how about that VE team, eh? ;) [15:46:23] ? [15:46:33] @notify binasher [15:46:33] I'll let you know when I see binasher around here [15:51:02] apergos: the "account creation downtime" thread on Ops. From VE code. [15:53:06] ah [15:53:12] crap happens [15:55:02] mark, do you think spamassassin should be a separate module as well? [15:55:06] apergos: yeah, just joking around, hence the ;). The response/fix looked quick and well done. And it did bring up an unfortunate limitation in our monitoring, so, all good I guess :) [15:55:18] andrewbogott: well yeah [15:55:23] you could just make it all a "mail" module [15:55:25] and not worry about it now [15:55:33] we can always make it cleaner and split up later if we need to [15:56:01] it wasn't permanent harm to revision content, most everything else can be worked around... [15:56:47] I think it's easy enough to break up now, I'll do that [15:58:59] yurik: I just answered [15:59:16] * yurik looking [16:00:07] andrewbogott: also you can migrate that to 4-space indent now if you're moving the files over anyway [16:00:16] :) ok [16:00:39] Mark, in this file https://gerrit.wikimedia.org/r/#/c/68584/5/manifests/mail.pp line 204 [16:01:06] which 'mailman' is included? The mailman subclass that's in scope, or the top level mailman class? [16:01:07] yeah that's dirty [16:01:17] the in-scope one [16:01:26] but it would be better to fully qualify that [16:01:33] In that case what does line 210 do? [16:01:44] because there is no in-scope spamassassin class [16:02:04] that's taking the global one I think [16:02:07] really confusing :) [16:02:08] eek [16:02:17] ok :) [16:02:22] apergos: yep, agree. [16:04:37] New patchset: Andrew Bogott; "Move mail manifests to a module called 'exim'" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68584 [16:05:41] New review: Andrew Bogott; "work in progress - do not merge" [operations/puppet] (production) C: -2; - https://gerrit.wikimedia.org/r/68584 [16:08:10] mark thanks, i really wish we could get rid of all the variances, but zero has no images in HTML - only redirects to them [16:08:25] i know [16:08:27] but m does too for some [16:08:36] not that i know of [16:08:47] m always shows images [16:09:03] i think i'm still confused about the differences [16:09:14] i thought you wanted to replace all that by js (if you can) [16:10:01] mark - m has images, zero - no images. As simple as that really. I do want to replace redirects (to images or to other sites) with either direct links or in-browser confirmation boxes [16:10:18] ok [16:10:53] hey #ops - is there any way for me to query access logs, specifically arround search? 
I wanna know what search features people are actually using. [16:11:16] manybubbles: i'm not entirely sure, notpeter would know [16:11:58] I think having *two* variants, one with images and one without is fine for now [16:12:03] if all the rest go [16:12:31] paravoid, well, 3 variants - one staying as is today [16:12:38] for non-zero carriers [16:13:08] unless you want me to add redirect replacing js to ALL mobile deviecs [16:13:20] mobilefrontend will be happy... [16:13:35] er? [16:14:34] paravoid, "non-zero-carrier" variant, "m@zero-carrier" and "zero@zero-carrier" [16:15:18] the first one can be removed technically, in which case ALL m.* traffic will have its site-site links rewritten with redirects [16:15:49] and javascript will make them direct again on the client [16:17:08] that's a terrible idea [16:17:15] I didn't propose that [16:18:47] agree :) Although if we (zero) are successful, we will sign up ALL the world's carriers, making the first group almost nothing [16:19:36] but then you can put whether to redirect or not in the varnish per-carrier config, can't you? [16:25:53] New patchset: Jdlrobson; "Make use of centralauth tokens for MobileFrontend photo uploads" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69318 [16:29:52] yurik: so as far as I understand it, for carrier views all external links are sent to a redirector page, and that redirector page is basically not cacheable at all? [16:32:57] mark, redirector is not really the right term we use - its more of a confirmation page that shows Yes/No if the user agrees to switch to non-free resource. In case the resource is also whitelisted, it becomes a silent redirect. And the confirmation page can be made cachable - per X-CS [16:33:27] how does it know where to redirect to then? [16:33:36] URL params [16:33:47] so that's separate cache objects for that too [16:34:01] every external link gets its own cache object * X-CS [16:34:32] potentially - yes, but much much smaller than having every wiki page * X-CS [16:34:46] smaller - the HTML snippets are tiny [16:35:19] and due to javascript clients rewriting those links, much rarer [16:36:13] ok so in the (increasingly common) js case, there's just no call to the "redirector/confirmation page" at all, right [16:36:21] correct [16:38:27] I assume the confirmation page looks different per carrier? [16:40:36] mark, no, confirmation is actually the asme [16:40:38] same [16:41:02] its just a red banner at the top with yes/no, and an empty page thereafcter [16:41:15] then it doesn't really need to vary on X-CS I guess [16:41:25] or rather, just on X-CS being present or not [16:41:37] and javascript working or not [16:41:38] bleh [16:41:47] it does - because sometimes it redirects and sometimes it shows the prompt [16:41:51] depending on the carrier [16:41:57] (redirects silently) [16:42:08] yeah but you can put that config in varnish soon right [16:42:26] with javascript on the page wouldn't even be hit, so that doesn't matter [16:42:39] then again, whether js works or not, not so much [16:43:08] the confirmation page doesn't care if js is there or not - if the user goes to it, it means JS is off [16:43:16] regardless of the reasons for it [16:43:20] ok [16:44:05] then you can put redirect/prompt in the varnish vmod config [16:44:08] as for varnish -- yes, we could technically do the whole redirector logic in varnish -- by generating yet another IP->redirect? 
logic [16:44:22] well [16:44:24] sorry, not logic - a database [16:44:36] yes [16:45:11] i'll check to see what exactly we can do with that [16:45:17] and then answer again to your mail on the list [16:45:32] mark, but then we would have to have a database for each target site [16:45:47] thanks for explaining again, it's all a bit confusing for us now intimately involved in the different scenarios ;) [16:45:58] target site? [16:46:02] s/now/not/ [16:46:20] if the user is heading to ru.wikipedia -- some carriers have whitelisted ru, but not others [16:46:27] ah right [16:46:32] * mark sighs [16:46:35] whereas if the user goes fr.wiki, that list is different [16:46:47] could still be done fairly easily [16:46:57] if the vmod databases are fairly light [16:47:09] yeah don't want to put too much stuff in there [16:47:30] we could make it a two key database -- IP+string [16:47:35] ->string [16:48:20] if varnish allows string manipulation - like extracting language code from URL [16:48:46] yuck [16:49:08] anyway [16:49:13] this is all only for non-js devices [16:49:23] who said it will be easy :) [16:49:23] so let's not put too much effort into those [16:49:23] yep [16:49:23] agree [16:49:34] hence - we might as well cache all those redirect pages [16:49:59] in some cases it will simply cache 302, in others - a short snippet of HTML code [16:50:31] ok [16:50:41] how well does varnish deal with LOTS of small objects? [16:50:54] no idea [16:50:56] (as oppose to few large ones) [16:51:03] to be discovered the hard way i guess [16:51:40] thanks for all your comments though!!! [16:54:05] mark / paravoid - on a separate note, who wants to review robots change? we are getting lots of links from google. https://gerrit.wikimedia.org/r/#/c/64629/ [16:56:24] done [17:04:10] ^demon and others who care about CirrusSearch: I've just pushed a fix to index dates and caching configuration. [17:04:31] <^demon> Mmk. [17:06:01] the caching we'll need at some point. for now it shaves about 5ms off of cached requests. its still so fast that you don't notice the solr time. [17:06:05] !log reloading squid frontends in pmtpa.text [17:06:13] Logged the message, Master [17:08:04] !log reloading squid frontends in squid_esams_text and eqiad.text [17:08:13] Logged the message, Master [17:08:34] akosiaris, paravoid: when you get a chance: https://gerrit.wikimedia.org/r/#/c/50385/ [17:09:39] ottomata: i briefly talked to diederik yesterday, we said we could prepare ana1019 for reinstall and then i'd try PXE boot there to compare to 1020 to see a pattern or not [17:09:57] haha, diederik JUST said that to me in our stand up 5 seconds ago [17:10:04] can do. [17:10:07] heh, ok [17:14:28] mutante: ottomata: an1020 is always stuck here Scanning for devices. Please wait, this may take several minutes... [17:14:54] any ideas ? [17:15:07] i am btw recycling it ... i hope that is ok [17:15:18] power cycling* it that is [17:15:40] New patchset: awjrichards; "Enable Special:LoginHandshake on betlabs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69322 [17:16:18] New patchset: awjrichards; "Enable Special:LoginHandshake on betlabs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69322 [17:16:54] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69322 [17:17:39] akosiaris: here's what i know: it is NOT DHCP, it does get an DHCPACK from brewster but i dont get to see an installer. 
i saw an installer ONCE but then could not repeat it [17:18:06] akosiaris: the scanning for devices thing seems still normal, after that it tries PXE, for debugging i disabled booting from disk in BIOS [17:19:43] akosiaris: but after DHCP it fails to get the installer files. could it be TFTP still blocked by network ACL? or it could be what happened to other analytics boxes, it gets a stream of udp2log packets [17:19:46] binasher: I can haz S7? [17:20:30] akosiaris: the goal is just to do a reinstall, if you could get it to an installer , just let it run, would be great [17:21:34] mutante: the stream of udp2log packets only happened on the udp2log boxes [17:21:43] haven't actually seen that yet on the analytics nodes [17:21:46] although this could be it [17:22:04] the udp2log boxes join the multicast group and get a barrage of udp data [17:22:19] but, none of the nodes on the same subnet as an20 join that group [17:22:30] at least not regularly. I have joined it to debug things occasionally [17:26:29] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64629 [17:27:26] !log wtp1008 is going down for main board replacement [17:27:34] Logged the message, Master [17:28:05] gwicke roankattouw james_f ^ [17:28:15] cmjohnson1: Thanks. [17:30:34] PROBLEM - Host wtp1008 is DOWN: PING CRITICAL - Packet loss = 100% [17:36:10] New review: Dzahn; "works now after reloading frontend squids. f.e. in http://www.mobilephoneemulator.com/ i get Real lo..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66991 [17:39:21] !log dns update [17:39:28] Logged the message, Master [17:41:25] New patchset: Andrew Bogott; "Move mail manifests to a module called 'exim'" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68584 [17:41:54] ottomata: what box are the search logs being collected on? 
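Going back to the exchange between yurik, mark and paravoid about the Zero confirmation/redirect page: caching it "per X-CS" is essentially one extra line of VCL, at the cost of one cache object per carrier for that URL. A sketch assuming Varnish 3 syntax; the Special-page path is a placeholder, not the real endpoint:

```vcl
# Sketch: add the carrier code to the cache key only for the confirmation
# page, so ordinary article objects are not fragmented per X-CS.
sub vcl_hash {
    if (req.http.X-CS && req.url ~ "^/wiki/Special:ZeroConfirm") {
        hash_data(req.http.X-CS);
    }
}
```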
[17:42:09] bwerrr, i believe oxygen lemme check [17:42:24] yes [17:42:26] /a/log/lucene [17:42:47] kk, thanks [17:42:56] mark, the latest: https://gerrit.wikimedia.org/r/#/c/68584/ [17:46:09] New patchset: awjrichards; "Enable centralauth token for mobile on betalabs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69325 [17:46:42] New patchset: awjrichards; "Make use of centralauth tokens for MobileFrontend photo uploads" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69318 [17:48:25] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69325 [17:49:12] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours [17:49:12] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [17:49:12] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [17:49:12] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [17:49:12] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours [17:49:13] PROBLEM - Puppet freshness on spence is CRITICAL: No successful Puppet run in the last 10 hours [17:49:13] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [17:49:14] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [17:49:14] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [17:49:15] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [17:49:15] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours [18:05:12] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: No route to host [18:05:23] PROBLEM - LVS Lucene on search-prefix.svc.eqiad.wmnet is CRITICAL: No route to host [18:05:52] PROBLEM - LVS Lucene on search-pool5.svc.eqiad.wmnet is CRITICAL: No route to host [18:06:03] PROBLEM - LVS Lucene on search-pool3.svc.eqiad.wmnet is CRITICAL: No route to host [18:06:56] !log search is based in tampa [18:07:05] Logged the message, RobH [18:07:18] "that's fine" :) [18:07:29] !log ignore eqiad based search errors for now, as tampa is hosting all search traffic while hardware is repaired [18:07:38] Logged the message, RobH [18:07:39] PROBLEM - LVS Lucene on search-pool2.svc.eqiad.wmnet is CRITICAL: No route to host [18:08:02] !log LVS monitoring pages caused by neon recovering [18:08:10] Logged the message, Master [18:08:13] !log log log log log [18:08:21] Logged the message, RobH [18:08:24] PROBLEM - LVS Lucene on search-pool1.svc.eqiad.wmnet is CRITICAL: No route to host [18:11:25] !log it's better than bad, it's good! [18:11:32] Logged the message, Master [18:14:40] greg-g: :) [18:14:43] thanks for the incident report [18:17:47] PROBLEM - Puppet freshness on sodium is CRITICAL: No successful Puppet run in the last 10 hours [18:18:48] ori-l: no problem :) [18:19:07] ori-l: thanks again for working last night on repairing the incident ;) [18:21:47] PROBLEM - Puppet freshness on magnesium is CRITICAL: No successful Puppet run in the last 10 hours [18:24:05] New review: Andrew Bogott; "Making this happen automatically makes it somewhat annoying to test local changes to the syncronised..." 
[operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/68309 [18:32:24] cmjohnson1: heh of course ….. i finally am like fuck it and am about to opena juniper case .. do a commit synchronize and of course, it's working now [18:34:32] mutante: the SM CLP of an1020 does not show the PXE sequence... had to use the DRAC webif to see it. So i confirm that it is stuck in TFTP [18:34:41] now... let's find out why [18:35:19] akosiaris: update i just received! it might be that carbon is down, RobH is currently looking [18:35:44] RobH: [18:35:57] lesliecarr: of course :-) [18:36:14] New patchset: Asher; "additional [centralauth] tables to filter at the sanitarium" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69335 [18:36:15] akosiaris: are you on it now? if so and you arent doing something, i would like to take it over [18:36:27] i wanna take a gander at what you all are seeing. [18:36:36] carbon's tftpd is running [18:36:48] (though doesnt mean it was when you tried last) [18:36:50] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69335 [18:37:21] ....No more sessions are available for this type of connection! [18:37:53] RobH: yeah you can take over [18:38:30] i can help though if you want... and I wanna find out wtf the problem is... [18:38:47] the "no more sessions" thing might happen if we try to mix java console and ssh console stuff [18:39:07] xmm [18:39:07] as in one user per webif console and one per ssh to mgmt [18:39:13] that is me ... two sessions [18:39:18] in ssh both though [18:39:20] i am logging out [18:39:35] yea you are taking up all the session slots. [18:39:53] now its workin [18:39:56] logged out from both [18:40:01] you could share a screen and then ssh to mgmt from there [18:40:17] yea thats what iron is for [18:40:19] but no one does [18:40:24] heh, ok:) [18:40:42] ah, we should really remember that :) [18:40:50] true [18:41:16] iron ? [18:42:06] https://wikitech.wikimedia.org/wiki/Systems_management [18:42:09] this iron ? [18:42:48] ", do not reference this page yet ":) [18:43:09] but yeah, iron is a Wikimedia IPMI Management (misc::management::ipmi). [18:43:29] Last login: Thu Jun 6 [18:43:30] on it [18:43:57] !log installing kernel and package upgrades on iron [18:44:11] Logged the message, Master [18:45:03] so we had to set that up so the ipmi mgmt bash wrapper could puppet install on it [18:45:10] since the c2100s werent drac, ipmi only [18:45:20] about to run scap [18:45:22] now no more c2100s, heh [18:47:20] it was kind of slow installing those upgrades but done [18:47:32] reboot? [18:48:15] not while we troubleshoot the TFTP problem [18:48:27] 5 mins after we figure it out :-) [18:48:44] sorry, i saw no other user on it so already went ahead [18:48:50] in that second [18:49:16] ok ok... no worries [18:49:29] New patchset: Andrew Bogott; "Move ldap into a module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69337 [18:49:32] !log rebooting iron [18:49:32] iron is a bastion not tftp [18:49:40] Logged the message, Master [18:50:40] New review: Andrew Bogott; "Work in Progress -- do not merge" [operations/puppet] (production) C: -2; - https://gerrit.wikimedia.org/r/69337 [18:50:43] waits for console output :p [18:51:13] you can go ahead with carbon/ana1020 from anywhere [18:53:02] uhm, iron .. i don't exactly see it coming back, sigh [18:53:13] something is blocking TFTP [18:53:41] way to few packets coming and going... 
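The line-204 versus line-210 question above is the classic Puppet relative-name-resolution trap under the 2.7-era rules in use here: a bare `include mailman` inside a nested class finds a same-named class in the enclosing namespace, while `include spamassassin` silently falls back to the top-level class because no nested one exists. A hypothetical reduction of that mail.pp situation; these are not the real manifests:

```puppet
# Hypothetical reduction of the mail.pp scoping question, not the real code.
class spamassassin {
}

class mail {
    class mailman {              # nested class; its full name is mail::mailman
    }

    class exim {
        include mailman          # resolves to the in-scope mail::mailman
        include spamassassin     # no mail::spamassassin exists, so this picks
                                 # up the top-level spamassassin class instead
    }
}

# Fully qualifying removes the ambiguity: `include mail::mailman` always means
# the nested class, and a top-level class would be written `include ::name`.
```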
[18:53:52] note that analytics is on its own vlan [18:53:55] and it keeps requesting precise-installer/ubuntu-installer/amd64/pxeli [18:55:08] lol, so iron is already back , i just couldnt see it [18:55:08] thats why [18:55:09] and monitoring didnt notice either [18:55:14] tftp is failing on the server ide [18:55:16] it times out trying [18:55:23] this is network based. [18:55:39] yeah but requests are arriving at carbon [18:55:49] i think the replies are never getting to an1020 [18:55:56] yea its allowing out [18:55:57] not in. [18:56:31] New patchset: Kaldari; "Turn on Disambiguator on en.wiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69340 [18:56:42] also feel free to actually use iron again, it's up and has new kernel [18:57:08] !log kaldari Started syncing Wikimedia installation... : [18:57:13] akosiaris: ottomata: so most likely it is network ACL [18:57:16] Logged the message, Master [18:57:22] yeahhhhhh that sounds like ACL [18:58:47] the ACL is supposed to only restrict traffic going out from analytics cluster [18:59:12] ottomata: yes, i have confirmed that, though it could be some request from analytics failing .... [18:59:19] yeah [18:59:26] but you already looked at tftp thing, right? [18:59:34] maybe it uses another port to initialize connection? dunno? [18:59:49] it needs to ask for the right installer file first and then receive it, so both directions needed, right [19:00:12] unless it's detected as related? [19:00:23] doubtfull [19:00:25] it does … so that's o n carbon, right ? [19:00:28] differents ports [19:00:37] tftp requests are allowed to carbon and brewster [19:00:39] god... why do i do so many typos ? [19:00:53] 18:52:18.531140 IP 10.64.36.120.2071 > 208.80.154.10.69: 73 RRQ "precise-installer/ubuntu-installer/amd64/pxeli" [|tftp] [19:00:53] 18:52:18.531469 IP 208.80.154.10.53926 > 10.64.36.120.2071: UDP, length 15 [19:01:03] 53926 ? [19:01:11] so ports are changing... [19:01:14] destination port 69 [19:01:23] lunch meeting time now though... [19:01:27] notpeter: all the search has been moved except for searchidx..going to reinstall all of them and update bios while I am at it. [19:01:29] but, the reverse shouldn't matter to us ... [19:01:33] ok, we have to go to lunch thing [19:01:35] bbiab [19:01:43] yea, same heere [19:02:09] cool. see you later [19:02:21] gonna get something to eat too [19:03:12] New patchset: Andrew Bogott; "Move ldap into a module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69337 [19:09:04] New patchset: Andrew Bogott; "Move mail manifests to a module called 'exim'" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68584 [19:15:59] !log kaldari Finished syncing Wikimedia installation... : [19:16:08] Logged the message, Master [19:17:29] New review: Alex Monk; "Caused bug 49759" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69130 [19:18:26] New patchset: Cmjohnson; "updating wtp1008 mac" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69345 [19:30:50] Change merged: Cmjohnson; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69345 [19:33:18] New review: Dr0ptp4kt; "This may need to be reverted, depending on the expected lifetime of the various robots.txt files in ..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64629 [19:34:33] yurik, mark, paravoid, jdlrobson ^^ [19:34:58] dr0ptp4kt, sup [19:35:30] yurik, see the comments in change 64629. 
i think that actually needs to *not* be released into production yet. do you know if that change is already synced to production? [19:36:20] yurik, it seems that either a cached robots.txt file is being served, or that the patch isn't working as expected, probably due to caching. this is good, as it actually needs to be delayed for a good week extra yet. [19:37:13] dr0ptp4kt, reverting [19:37:22] yurik, thanks pal. [19:37:29] New patchset: Yurik; "Revert "Instruct robots to stop indexing zero.wikipedia.org and its subdomains."" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69349 [19:37:39] Change merged: Yurik; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69349 [19:37:42] dr0ptp4kt, done [19:37:56] https://gerrit.wikimedia.org/r/#/c/69349/ [19:38:56] New review: Reedy; "(1 comment)" [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/68415 [19:40:02] Change merged: Kaldari; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69340 [19:40:52] yurik, thanks. mark, paravoid, jdlrobson, yurik: do you know how long it normally takes for stuff merged into operations/mediawiki-config to be actually deployed out to the production wikipedias? [19:43:04] !log kaldari synchronized wmf-config/InitialiseSettings.php 'Turn Disambiguator on on en.wiki' [19:43:12] Logged the message, Master [19:43:35] New patchset: Nemo bis; "Update gitweb/gitblit RSS" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68415 [19:48:04] yo ori-l, you there? quick puppet design brainbounce? [19:48:16] ottomata: hey, what's up? [19:48:24] i'm working on puppetizing some hive stuff [19:48:52] there's basically two kinds of puppetizations: client + server (there's more than that but thats good enough for our brainbounce) [19:49:08] in most of my other cdh4 puppetization stuff, i've installed the exact same config files on all nodes [19:49:23] even if it some of the configs are not necessarily relevant to all of them [19:49:39] that keeps the puppetization easier, and also makes examining config files easier [19:49:44] i *can* do this in this case as well [19:49:56] except that on the server's config file, I need to store a mysql database password [19:50:16] so I wanted to conditionally only include that part of the config file on the server [19:50:21] and on the server make it not world readable [19:51:03] but i'm not sure of the best way to do that, there are a few different ways, and none of them are as elegant or consistent with what I've been doing so far [19:51:21] can i see what you have so far somewhere? [19:51:28] yeah, hmmmmmMmMm [19:51:32] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/67300 [19:51:33] I think I can push for review and say its not ready yet [19:51:35] one sec [19:53:57] New patchset: Ottomata; "Puppetizing hive client, server and metastore." [operations/puppet/cdh4] (master) - https://gerrit.wikimedia.org/r/69353 [19:54:12] ori-l^ [19:54:26] mainly just look at [19:54:26] * ori-l looks [19:54:38] hive.pp, hive/metastore.pp and templates/hive/hive-site.xml.erb [19:55:30] in the .erb template, i want to be able to do something like "<% if is_metastore_host _%> metastore db creds here <% end -%>" [19:55:30] New review: Alex Monk; "(1 comment)" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/68309 [19:55:57] ottomata: k, and can you explain what "client" and "server" mean in the context of hive? 
[19:56:02] yeah [19:56:14] client is just whatever binaries you use to connect to run client queries [19:56:18] its kind of the same as mysql [19:56:24] in this case, server is a bit more than jsut a server [19:56:28] there is a hive-server [19:56:32] and a hive-metastore [19:56:40] the 'client' talks only to hive-server [19:56:46] hive-server talks to hive-metastore [19:57:02] hive-metastore is (afaik) just a proxy JDBC service [19:57:05] New review: MaxSem; "(1 comment)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68309 [19:57:06] it probably does more than that [19:57:20] but for our purposes thats what it does [19:57:27] it needs to be configured to talk to a database backend of some kind [19:57:30] by default we are using mysql [19:57:32] so [19:57:39] the hive-site.xml file on the hive-metastore node [19:57:45] needs to have a mysql username and password in it [19:58:02] none of the hive 'client' nodes need that info [19:58:12] they only need to know how to talk to hive-server and hive-metastore [19:59:39] i'm thinking i need to separate out the hive client and hive server/metastore configuration [19:59:52] right now hive server/metastore require that cdh4::hive (client) configs are installed first [20:00:21] what about adding an 'extra_settings' parameter [20:00:39] and then <%= @settings %> in the erb file [20:00:57] and you can do extra_settings => template('to_be_interpolated_into_main_erb_tempalte.erb') [20:01:03] hmm, i don't need a whole extra_settings thing, but I guess I could add a parameter to cdh4::hive that said metastore_host => true [20:01:29] both seem a little inelegant though [20:01:50] i could decouple cdh4::hive from server/metastore hosts altogether [20:02:01] but that is inconsistent with everything else i've been doing in this module [20:02:12] and would duplicate a lot of the parameters in multiple classes [20:02:29] hmmm [20:02:34] i could make another wrapper class [20:02:36] well, you can hide extra_settings inside a 'cdh4::metastore' [20:02:37] right [20:02:37] i already have cdh4::hive::Master [20:02:46] which includes cdh4::hive with the extra_settings param [20:02:47] if I made a cdh4::hive::client [20:02:56] or that, sure [20:03:08] the issue is which class renders the hive-site.xml file [20:03:27] if I remove that from cdh4::hive then I can do this [20:03:29] hm [20:03:35] 'cdh4::hive' will [20:03:38] and render it differenting in ::client vs ::master [20:03:46] differently* [20:03:51] cannot delete non-empty directory: php-1.22wmf2 [20:04:25] hmmm, but I still ahve to duplicate a lot of parameters then. [20:04:38] not a huge deal, i guess [20:04:40] gr [20:04:44] don't like it. 
[20:04:55] class 'cdh4::hive::server' { class { 'cdh4::hive': extra_settings => template('server_configs.xml') } } [20:05:11] class 'cdh4::hive::client' { class { 'cdh4::hive': } } # no extra settings [20:05:16] right [20:05:18] that would be fine [20:05:35] but then i'd have to duplicate all of the common parameters between cdh4::hive::client and cdh4::hive::master [20:05:41] oh yeah, I see what you're saying [20:05:42] hrm [20:06:37] its not that hard to just add an extra param to cdh4::hive for metastore_host, i guess, and then have the role class include cdh4::hive differently on the server/metastore hose [20:06:39] host* [20:07:00] yeah, that might be neater after all [20:07:02] its a wee inelegant, since that role class will also be including cdh4::hive::master (which includes ::metastore) [20:07:05] but let me think about this for another minute [20:07:11] seems redundant to have to tell cdh4::hive this as well [20:07:32] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69318 [20:08:40] OH [20:08:42] ori-l, I know [20:08:42] ottomata: what about having a common 'params' class? [20:08:59] my ::defaults class is analogous to the ::params class pattern [20:09:16] puppet-lint tells me not to use the params pattern because it doesn't work for older versions of puppet [20:09:39] so I use '::defaults' to get around lint…and because I think the name is more descriptive anyway [20:09:39] but! [20:09:41] i know! [20:09:51] I'm actually missing a config that I haven't done yet [20:09:59] I need to add [20:09:59] hive.metastore.uris [20:09:59] thrift://:9083 [20:10:00] I think you can disregard that. I don't agree with puppet-lint on this. [20:10:06] so [20:10:27] that needs to be given to all hive nodes [20:10:33] and then I can just do [20:10:54] if $::ipaddress == $metastore_ipaddress { ..include metastore db creds } [20:10:59] or $::fqdn, whatever I use [20:11:20] that seems magical in a bad way [20:11:49] ok, back [20:12:41] naw its way cool [20:12:55] ottomata: https://cwiki.apache.org/Hive/adminmanual-configuration.html seems to suggest that hive-site.xml should contain site-wide settings [20:12:58] so much more elegant, and i've done something similar for other things [20:13:13] so maybe you can specify the mysql configuration using -hiveconf command [20:13:26] command-line argument [20:13:50] yargh, then I'd have to specify a special hive-env.sh for the metastore service, or modify the init.d file [20:13:54] that starts the metastore [20:14:30] i think the latter would be OK; they're different services after all [20:14:44] you don't want to completely hide the distinction between metastore / nonmetastore instances using puppet magic [20:15:18] so … wtf with the analytics subnet [20:16:42] haha, LeslieCarr wtf indeed! [20:16:49] ori-l, it doesn't [20:17:01] you still have to include cdh4::hive::metastore on your metastore host to get it working [20:17:16] all this does is keep the db creds out of non-metastore host's hive-site.xml file [20:17:18] so, let's walk through where it dies … first anX requests a dhcp address, which it receives …. then it requests the tftp image, which fails ? [20:17:51] LeslieCarr: mutante and RobH and maybe akosiaris know more about the current status, but from what I know, yes that is true [20:20:25] hrm, so it's an1020 -- mind if i reboot it (yet again?)
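A sketch of the single-template approach described above, where every node renders the same hive-site.xml and the metastore database credentials only appear on the metastore host itself. hive.metastore.uris is the property named in the discussion; the javax.jdo.option.* properties are the standard Hive metastore JDBC settings; the @metastore_host, @jdbc_username and @jdbc_password variables and the example connection URL are assumptions about what the class would expose, not the actual module's names.

    <%# Illustrative hive-site.xml.erb: one template for all nodes, with the  %>
    <%# metastore database credentials rendered only on the metastore host.   %>
    <configuration>
      <property>
        <name>hive.metastore.uris</name>
        <value>thrift://<%= @metastore_host %>:9083</value>
      </property>
    <% if @fqdn == @metastore_host -%>
      <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://localhost/hive_metastore</value>
      </property>
      <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value><%= @jdbc_username %></value>
      </property>
      <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value><%= @jdbc_password %></value>
      </property>
    <% end -%>
    </configuration>

On its own this only keeps the credentials out of the other hosts' files; as the conversation notes, the rendered file on the metastore still needs to be made non-world-readable, and cdh4::hive::metastore still has to be included on that host.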
[20:20:55] go right ahead [20:20:57] its down for the count [20:21:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:21:24] LeslieCarr: in case you haven't seen it [20:21:27] here is more info: [20:21:27] https://rt.wikimedia.org/Ticket/Display.html?id=5281 [20:21:39] thanks [20:21:46] mutante: i was considering upgrading iron to precise .... [20:21:56] when I looked into this, I didn't realize it would be loading installer to carbon (didn't know we used anything but brewster) [20:22:52] ottomata: what you say (re: cdh4 puppet configs) makes sense; i'm reading the docs a bit. [20:23:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [20:24:55] k aye cool [20:25:32] blog's dead? [20:25:58] odder: looks ok to me [20:26:44] didn't want to load for me for a while, looks ok now [20:26:44] ^demon: gerrit should have a mrconfig output like following imo: https://gist.github.com/azatoth/5809029 [20:27:41] <^demon> Output? It uses that format in refs/meta/config and gerrit.config and so-forth... [20:27:53] <^demon> That's standard gitconfig [20:28:54] LeslieCarr: reinstall or in-place ? [20:29:03] mutante: in place [20:29:16] New patchset: awjrichards; "Commons to commons.m for mobile uploads" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69392 [20:29:22] LeslieCarr: can do, i'm on it [20:29:24] New patchset: MaxSem; "Switch mobile login handshake to commons.m" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69393 [20:29:28] sure [20:29:33] awjr, mid-air collision:P [20:29:55] god an1020 takes forever to reboot [20:29:57] Change abandoned: MaxSem; "LEEEEROY JENKINS!!1" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69393 [20:30:05] lol MaxSem [20:30:09] ^demon: never seen "checkout" as a key in gitconfig [20:30:30] ^demon: or are you referring to the "ini" format [20:30:47] I was referring to a fully functional mrconfig file [20:31:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:31:25] I assume you've used mr ^demon? [20:31:31] <^demon> No, I haven't. [20:31:33] LeslieCarr: uhm, it can't find the new release [20:31:35] ok [20:31:48] Checking for a new ubuntu release [20:31:48] No new release found [20:31:51] ^demon: it's a multi-repository thingi [20:31:57] <^demon> ah [20:32:04] cmjohnson1: awesome. [20:32:09] apt-get install mr [20:32:10] do you mean all of or half of? [20:32:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time [20:32:15] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69392 [20:32:19] I thought that were splitting across racks? [20:32:22] (do not type "mr T") [20:32:28] all of search is relocated [20:32:32] huh, ok [20:32:39] cool! [20:32:40] thanks! [20:32:54] i though that was the plan...besides iirc we should keep search together [20:33:08] notpeter...what is the deal with moving searchidx1001? [20:33:11] hhhhhmmmmm, I thought that mark had said split across racks... [20:33:25] LeslieCarr: i'm doing it the Debian way then... 
[20:33:26] ^demon: mr run reset --hard && mr run clean -fdx && mr update [20:33:30] i don't think mark wanted any search in that 10g rack [20:33:41] ah, yeah, that makes sense [20:33:54] !log upgrading iron to precise [20:34:01] well, I guess we can move searchidx1001 as well [20:34:02] Logged the message, Master [20:34:14] ^demon: had to construct that file myself :( [20:34:15] this will give it a new ip [20:34:26] I thought we were only moving half, so I wanted to make sure that idx wasn't moved [20:34:29] yeah.... [20:34:41] wasn't that hard, but a bit cumbersome [20:34:56] so, dunno if we have to reinstall... [20:35:12] we can probably just reinstall and not wipe /a [20:35:18] ok, think i see what was blocked … trying again :) [20:35:21] I can do that if you'd like [20:37:27] grrrr [20:37:30] wtf [20:39:48] no good [20:39:49] ? [20:40:00] when upgrading a server to precise in place and you run into the "OMG, you are going to remove lzma, type "Do as I say" thing", then it's https://bugs.launchpad.net/ubuntu/+source/update-manager/+bug/944452 and the work-around should be to first upgrade the "dpkg" package itself [20:40:05] ottomata: nope, this time i put a silly udp logging filter on because goddamnit we WILL see what is going on [20:40:20] !log added temp logging filter to cr2-eqiad analytics4 filter [20:40:21] i like your style [20:40:29] Logged the message, Mistress of the network gear. [20:42:39] ottomata: I think the way to go about this is to have separate hive-default.xml & hive-site.xml files [20:43:02] hive automatically looks for / auto-loads both files, with hive-site.xml taking precedence over hive-default.xml [20:43:07] oh, and next you will run into https://bugs.launchpad.net/ubuntu/+source/python-defaults/+bug/990740 [20:44:29] so you should have your current xml template generate /etc/hive/conf/hive-default.xml, which should be the same for all hive hosts, and then separately generate an additional hive-site.xml on the metastore that contains the jdbc connection string with the password [20:45:03] the cdh4 packages don't ship with a hive-default.xml so you don't have to worry about reproducing any built-in defaults [20:45:03] workaround for python-minimal issue = -o APT::Immediate-Configure=false [20:46:04] ottomata: https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-12/running-hive search for the bit that starts with "There is a precedence hierarchy" ... [20:47:19] OH Hrrrmmmmmm [20:47:27] cdh4 doesn't ship with -default*? [20:47:28] uhh [20:47:29] hm [20:47:49] i guess it just ships with hive-default.xml.template [20:47:49] hm [20:47:58] New review: Andrew Bogott; "This is tricky to test in labs due to conflicts, but I've verified that some roles run cleanly." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69337 [20:48:18] oof that file has so many settings in it [20:59:56] RobH, I wasn't aware I was reopening that ticket. When I submitted the reply the status dropdown was left at 'resolved', so I didn't think I was changing anything [21:00:20] New patchset: RobH; "Add user for jforrester and grant access to stat1" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68821 [21:00:40] ori-l is now doing access request tickets like mutante and having the patchsets queued up for me. [21:00:41] i like it.
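To make the two-file suggestion above concrete, here is a minimal Puppet sketch; the class name, template paths, and the 'hive' owner/group are assumptions for illustration, not the real cdh4 module. The shared settings are rendered into hive-default.xml on every node, and only the metastore host gets a hive-site.xml holding the JDBC credentials, restricted to the service user.

    # Illustrative sketch of the hive-default.xml / hive-site.xml split.
    class cdh4::hive::config($metastore_host, $jdbc_password = undef) {
      # Shared settings, identical on every Hive node.
      file { '/etc/hive/conf/hive-default.xml':
        content => template('cdh4/hive/hive-default.xml.erb'),
        mode    => '0444',
      }

      # Only the metastore host gets the credentials file, kept non-world-readable.
      if $::fqdn == $metastore_host {
        file { '/etc/hive/conf/hive-site.xml':
          content => template('cdh4/hive/hive-site-metastore.xml.erb'),
          owner   => 'hive',
          group   => 'hive',
          mode    => '0440',
        }
      }
    }

Because hive-site.xml takes precedence over hive-default.xml, the handful of metastore-only properties can live in the restricted file while everything else stays world-readable, which also covers the earlier requirement of keeping the database password off the client nodes.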
[21:00:54] RobH: :) [21:01:21] that's awesome [21:01:32] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68821 [21:01:51] andrewbogott: \O/ :) [21:02:15] New review: Lcarr; "office@wikimedia is the address that the office admins use - maybe officeit ?" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/68127 [21:02:19] hashar, might be a while before it gets merged [21:04:35] ok, wtf, i finally see the issue [21:04:44] it's trying to send from source port 2072 to destination port 44740 [21:04:45] oh? [21:04:50] wtf why is that in existence [21:07:03] PROBLEM - NTP on iron is CRITICAL: NTP CRITICAL: No response from NTP server [21:07:34] and now 2074 ... [21:07:35] hrm [21:07:40] it may just have freaking random ports [21:09:25] hrm, so if we can make atftpd use source port 69 to respond, it should go back and forth correctly ... [21:09:38] LeslieCarr: that is what tftp does [21:09:47] as well as ftp [21:10:04] just switches from using the specific port to random source/dest ? [21:10:05] grr [21:10:19] yeah... by design [21:10:35] and it is not random [21:10:46] it answers on the ephemeral port of the client [21:10:55] but from an ephemeral port of its own [21:11:13] New patchset: Ottomata; "Puppetizing hive client, server and metastore." [operations/puppet/cdh4] (master) - https://gerrit.wikimedia.org/r/69353 [21:11:26] like c:2071 -> s:69 then s:random -> c:2071 [21:11:28] and then the client is responding from its ephemeral port, to the tftp server's ephemeral port … which is where the problem comes in [21:11:40] :-( [21:11:44] hm [21:12:03] well, we'll just have to widen the firewall hole [21:13:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:13:28] New patchset: Pyoungmeister; "adding a labsdb management tool and asociated class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64862 [21:14:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [21:14:55] man, my shit's hella not pep8 [21:15:00] good to know! [21:15:20] thanks for implementing, andrewbogott! [21:19:16] puppet...kill me. [21:19:34] New patchset: RobH; "RT5330 ve.wikimedia.org redirect to wikimedia.org.ve" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/69404 [21:25:23] RECOVERY - Host analytics1020 is UP: PING OK - Packet loss = 0%, RTA = 0.64 ms [21:26:04] ottomata: ok, keep your fingers crossed this time [21:26:43] toes crossed! [21:26:48] New patchset: Pyoungmeister; "adding a labsdb management tool and asociated class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64862 [21:26:54] about to scap... [21:27:22] Change merged: RobH; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/69404 [21:27:25] PROBLEM - Host analytics1020 is DOWN: PING CRITICAL - Packet loss = 100% [21:29:01] ottomata: Loading ubuntu-installer/amd64/linux....... [21:29:02] :) [21:29:20] wee:) [21:30:39] WOooo [21:31:10] what did you end up doing? IDing atftp traffic some other way?
[21:32:33] RECOVERY - Host analytics1020 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [21:34:43] PROBLEM - DPKG on analytics1020 is CRITICAL: Connection refused by host [21:35:14] PROBLEM - RAID on analytics1020 is CRITICAL: Connection refused by host [21:35:14] PROBLEM - Disk space on analytics1020 is CRITICAL: Connection refused by host [21:35:15] thanks Leslie, this was RT-5281, i cant find the other ticket anymore [21:35:23] PROBLEM - SSH on analytics1020 is CRITICAL: Connection refused [21:36:08] !log maxsem Started syncing Wikimedia installation... : Weekly mobile deployment [21:36:17] Logged the message, Master [21:36:24] ottomata: allowing destination ports of ephemeral ports with udp to brewster and carbon [21:36:38] grrr, it wants to be manually partitioned [21:37:01] hm, it shouldn't, not for / [21:37:22] it's a hater [21:37:48] LeslieCarr: if you like, leave it there for me! If the installer runs then I can take it up from here [21:37:57] ottomata: it's all yours [21:38:00] k [21:38:00] Coren|Away: redacted s7 is now being copied to labsdb1003 [21:38:03] thanks so soso much! [21:38:04] yay! [21:38:08] !log apache-graceful-all for redirects [21:38:14] i won't get to it today, but i'll pick it up tomorrow for sure [21:38:15] * Coren|Away dances [21:38:17] Logged the message, Master [21:38:19] Thanks binasher [21:42:52] New review: Jdlrobson; "(1 comment)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68309 [21:43:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:44:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.156 second response time [21:46:42] PROBLEM - DPKG on sockpuppet is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:46:51] PROBLEM - NTP on analytics1020 is CRITICAL: NTP CRITICAL: No response from NTP server [21:47:19] New review: MaxSem; "@Andrew Bogott: This cronjob is supposed to be run daily, which is seldom enough to allow local test..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68309 [21:47:41] RECOVERY - DPKG on sockpuppet is OK: All packages OK [21:49:16] !log maxsem Finished syncing Wikimedia installation... : Weekly mobile deployment [21:49:25] Logged the message, Master [21:49:34] MaxSem: redirects for commons/wikimania2013 work now [21:49:53] mutante, awesome, thanks [21:49:59] MaxSem, do you want to add those other two servers to https://gerrit.wikimedia.org/r/#/c/68309/ or shall I merge it as is? [21:51:27] andrewbogott, I wanted hashar to take a look at it first - he had some ideas about it [21:51:34] 'k [21:54:21] !log maxsem synchronized php-1.22wmf6/extensions/MobileFrontend/ [21:54:30] Logged the message, Master [21:56:21] LeslieCarr: any idea if iron really needs "multiverse" for some reason, i'm gonna disable it i think [21:56:36] no idea, i would just keep it to the normal defaults [21:56:43] we don't need to specialize it [21:57:20] nod. ok [21:57:47] <^demon> manybubbles: I'm stepping out to dinner. Once I'm back, gonna try to and finish up my update overhaul. It depends on that core change that just went in, fyi. [21:58:22] I figured. have a nice dinner.
I'm looking at incategory: [22:02:20] MaxSem: yeah I like the idea of syncing the CSS , we could use that for gadgets too [22:02:33] :) [22:02:37] MaxSem: but that would mean having to rely on ops to tweak the list of articles [22:03:13] * MaxSem mumbles something about a local unpuppetized config and runs away:P [22:03:21] !log maxsem synchronized php-1.22wmf6/extensions/MobileFrontend/ [22:03:29] Logged the message, Master [22:03:31] PROBLEM - mysqld processes on labsdb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:03:31] PROBLEM - MySQL Recent Restart Port 3306 on labsdb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:03:31] PROBLEM - MySQL Idle Transactions Port 3306 on labsdb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:03:56] !log rebooting iron to finish upgrade [22:04:04] Logged the message, Master [22:04:24] MaxSem: so yeah lets go with your shell script :) [22:04:29] MaxSem: could migrate to jenkins later on [22:05:01] PROBLEM - Host iron is DOWN: PING CRITICAL - Packet loss = 100% [22:05:02] hashar, is jenkins appropriate for basically a cronjob-type stuff? [22:05:08] yuo [22:05:20] Jenkins is basically a huge scheduler [22:05:33] LeslieCarr: Linux iron 3.2.0-48-generic [22:05:33] RECOVERY - MySQL Recent Restart Port 3306 on labsdb1001 is OK: OK seconds since restart [22:05:33] RECOVERY - mysqld processes on labsdb1001 is OK: PROCS OK: 1 process with command name mysqld [22:05:33] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: No route to host [22:05:34] we could replace all of our cron jobs with it :-) [22:05:39] woo [22:05:43] RECOVERY - MySQL Idle Transactions Port 3306 on labsdb1001 is OK: OK longest blocking idle transaction sleeps for seconds [22:05:44] PROBLEM - LVS Lucene on search-prefix.svc.eqiad.wmnet is CRITICAL: No route to host [22:05:53] yes icinga-wm we know [22:06:03] PROBLEM - LVS Lucene on search-pool3.svc.eqiad.wmnet is CRITICAL: No route to host [22:06:06] RECOVERY - Host iron is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [22:06:13] PROBLEM - LVS Lucene on search-pool5.svc.eqiad.wmnet is CRITICAL: No route to host [22:06:45] so, why do I keep getting pages for search [22:06:48] is anyone looking at it? [22:07:05] hardware work on eqiad, it's not the current cluster [22:07:09] see the log [22:07:24] why log and not scheduled downtime in nagios? [22:07:35] that's a q for those doing the work [22:07:46] PROBLEM - LVS Lucene on search-pool2.svc.eqiad.wmnet is CRITICAL: No route to host [22:08:48] New review: Hashar; "I have filled a bug to have it migrated to Jenkins eventually: https://bugzilla.wikimedia.org/show_..." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/68309 [22:08:54] PROBLEM - LVS Lucene on search-pool1.svc.eqiad.wmnet is CRITICAL: No route to host [22:09:11] andrewbogott: maxsem sync script at https://gerrit.wikimedia.org/r/#/c/68309/ is good to me :) [22:09:49] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68309 [22:10:41] New patchset: Dr0ptp4kt; "Instruct robots to not index Wikipedia Zero. No deploy before 25-June-2013." 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69420 [22:10:43] MaxSem: misc::beta::sync-site-resources is in :) [22:10:59] New patchset: Ori.livneh; "Provide a sensible resource name for Daniel Kinzler's SSH key" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69421 [22:11:13] MaxSem: I guess the class needs to be applied on the beta bastion isn't it ? [22:11:28] yep [22:11:35] New review: Dr0ptp4kt; "This supercedes change 64629, which was reverted." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69420 [22:11:56] MaxSem: will do [22:11:57] good night, need to get up early [22:12:11] MaxSem: rest well! [22:12:29] night MaxSem [22:13:38] New review: Yurik; "if you don't want something deployed, pls mark it with -2" [operations/mediawiki-config] (master) C: -2; - https://gerrit.wikimedia.org/r/69420 [22:13:47] New review: Dr0ptp4kt; "Must wait until 26-June-2013. On that day the Google index needs to be validated as refreshed." [operations/mediawiki-config] (master) C: -2; - https://gerrit.wikimedia.org/r/69420 [22:14:58] New patchset: Asher; "reduce bufferpool size on labsdb s1 slave, set max_user_connections = 10 for all labs instances" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69422 [22:15:05] !log scheduled 24 hour downtime for eqiad search-pools to stop paging [22:15:14] Logged the message, Master [22:15:48] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69422 [22:16:11] heya paravoid, if you are still up and working: [22:16:12] https://gerrit.wikimedia.org/r/#/c/50385/ [22:16:27] i'm out for the eve here, but I'd love to be able to pick that up and work on any further comments in the morning here [22:16:31] dankeees, later! [22:16:44] ottomata: did you see my note earlier about the default/site xml configs? [22:17:02] just making sure they weren't lost [22:20:00] New patchset: Pyoungmeister; "adding a labsdb management tool and asociated class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64862 [22:20:39] New patchset: Ori.livneh; "Re-enable EventLogging on labs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69423 [22:21:08] greg-g: ^ could I sync that during LD? [22:21:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:22:01] New review: Pyoungmeister; "no pep8 for now. sorry!" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/64862 [22:22:07] ori-l: i saw, not sure if i want to do that or not yet [22:22:07] ori-l: on labs? go right ahead [22:22:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [22:22:17] * greg-g doesn't care about labs [22:22:18] New review: Faidon; "I don't think anything new that stands out, just these as a followup:" [operations/puppet/kafka] (master) C: -1; - https://gerrit.wikimedia.org/r/50385 [22:22:18] :P [22:22:25] ottomata: np, what you were suggesting seemed fine too [22:22:51] greg-g: now, you mean? if it's fair game to sync labs stuff off-schedule, i wouldn't mind [22:22:53] ottomata: done [22:23:08] ori-l: yeah, nothing depends on it other than testing stuff, right? 
[22:23:11] yep [22:23:19] go forth the, good sir [22:23:23] s/the/then/ [22:23:51] thanks [22:24:29] New patchset: Jgreen; "add public key for root@barium to backupmover@various_fundraising_hosts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69424 [22:25:06] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69424 [22:25:28] New patchset: Pyoungmeister; "adding a labsdb management tool and asociated class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64862 [22:26:15] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69423 [22:26:19] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69149 [22:27:48] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64862 [22:29:09] !log olivneh synchronized wmf-config/CommonSettings.php 'I7fc1a5ad4: Ensure forward-compatibility by setting ' [22:29:17] Logged the message, Master [22:29:26] New patchset: Pyoungmeister; "adding skrillex db management tool to tin" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69425 [22:30:01] !log olivneh synchronized wmf-config/InitialiseSettings-labs.php 'Ibeb07a4b9: Re-enable EventLogging on beta cluster' [22:30:09] Logged the message, Master [22:30:09] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69425 [22:32:16] New patchset: Pyoungmeister; "template needs correct name" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69427 [22:32:55] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69427 [22:34:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:39:50] dr0ptp4kt: good catch indeed [22:40:10] binasher should tell you a story about a prior incident with such a robots.txt going live on enwiki :) [22:40:20] paravoid, thx, ha. 
[22:40:33] RECOVERY - Host wtp1008 is UP: PING OK - Packet loss = 0%, RTA = 0.41 ms [22:40:43] PROBLEM - RAID on wtp1008 is CRITICAL: Connection refused by host [22:40:53] PROBLEM - Disk space on wtp1008 is CRITICAL: Connection refused by host [22:40:53] PROBLEM - Parsoid on wtp1008 is CRITICAL: Connection refused [22:40:53] PROBLEM - NTP on wtp1008 is CRITICAL: NTP CRITICAL: No response from NTP server [22:41:03] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.119 second response time [22:41:53] PROBLEM - NTP on ssl3002 is CRITICAL: NTP CRITICAL: No response from NTP server [22:43:33] PROBLEM - DPKG on wtp1008 is CRITICAL: Connection refused by host [22:43:33] PROBLEM - SSH on wtp1008 is CRITICAL: Connection refused [22:48:04] New patchset: Ori.livneh; "Configure $wgEventLoggingBaseUri & $wgEventLoggingFile on beta cluster" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69432 [22:49:39] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69432 [22:51:28] !log olivneh synchronized wmf-config/CommonSettings-labs.php 'Iaade2591b: Configure $wgEventLoggingBaseUri & $wgEventLoggingFile on beta cluster' [22:51:37] Logged the message, Master [22:54:03] PROBLEM - NTP on ssl3003 is CRITICAL: NTP CRITICAL: No response from NTP server [22:54:33] RECOVERY - SSH on wtp1008 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [22:57:23] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours [22:59:44] PROBLEM - Disk space on db9 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=91%): [23:00:24] PROBLEM - MySQL disk space on db9 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=91%): [23:00:25] ouch, freeing disk on db9 [23:01:29] RECOVERY - MySQL disk space on db9 is OK: DISK OK [23:01:43] RECOVERY - Disk space on db9 is OK: DISK OK [23:04:43] RECOVERY - mysqld processes on labsdb1003 is OK: PROCS OK: 3 processes with command name mysqld [23:05:35] RECOVERY - DPKG on wtp1008 is OK: All packages OK [23:05:44] RECOVERY - RAID on wtp1008 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [23:05:53] RECOVERY - Disk space on wtp1008 is OK: DISK OK [23:21:51] RECOVERY - NTP on wtp1008 is OK: NTP OK: Offset -0.003150463104 secs [23:30:07] heh, i should have done rollback 3 instead of rollback 10 …. waiting for access to return.... [23:30:17] (to a new oob connection … nothing in production) [23:39:04] New patchset: Asher; "make skrillex.py executable by wikidev group" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69436 [23:39:28] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69436 [23:51:43] damn, downloader operations/dumps/test a whopping 122MB [23:51:47] New patchset: Asher; "new prod redis servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69438 [23:51:48] downloaded* [23:52:26] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69438 [23:54:28] ^demon: Could you look at Gerrit? It's slow again [23:55:46] <^demon> god dammit. [23:56:06] <^demon> There is absolutely nothing in the queue. [23:56:35] <^demon> Nothing exceptional in the log. [23:56:52] <^demon> Fast for me. [23:56:53] <^demon> ##check_your_connection