[00:10:56] !log touch /etc/exim4/defer_domains on streber [00:11:04] Logged the message, Master [00:12:27] New patchset: Andrew Bogott; "Create an empty /etc/exim4/defer_domains if it does not exist." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69267 [00:12:51] mutante, ^ [00:13:07] ah:) [00:13:54] I'll run that by Mark in the morning. [00:14:01] New review: Dzahn; "yes please, just touch'ing that file fixed exim on streber after the template change" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/69267 [00:14:01] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69267 [00:14:13] Oh, nevermind :) [00:16:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:18:03] andrewbogott: mutante: You don't need to do "Verified: +2" when jenkins-bot is already running. When jenkins-bot has already set it, doing it again does nothing. By doing it as a habit you only increase the chances of accidentally merging something that has a critical lint error or whatever [00:18:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [00:18:37] Krinkle, ok, good to know -- I thought jenkins only did +1 verified [00:20:10] We haven't revoked the ability to set V+2 from human accounts because lately jenkins/gerrit has been a bit unstable. So if you need to do an emergency merge and jenkins-bot is unavailable (e.g. not voting -2 or +2 either way), then it allows you do override it since the Submit merge button is only clickable if Verified+2 is set. [00:20:31] Be be sure to be aware of overriding it, only when neccecary and aware of it. [00:20:32] Thanks :) [00:30:27] New review: GWicke; "A single-layer cache with two parses per edit would complicate some things for us. API failures can ..." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/68404 [00:31:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:32:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [00:47:42] !log reloading squid config on esams text squids [00:47:51] Logged the message, Master [00:49:01] New patchset: Ottomata; "Fixing packet_loss_log file on oxygen udp2log instance" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69272 [00:49:29] New patchset: Ottomata; "Fixing packet_loss_log file on oxygen udp2log instance" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69272 [00:49:52] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69272 [00:50:07] !log reloading squid config on pmtpa.txt [00:50:17] Logged the message, Master [00:52:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:53:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.146 second response time [01:01:54] RECOVERY - NTP on ssl3003 is OK: NTP OK: Offset -0.005719184875 secs [01:02:56] RECOVERY - NTP on ssl3002 is OK: NTP OK: Offset -0.004811644554 secs [01:03:19] !log Gracefully reloading Zuul to deploy {{gerrit|I5695a3b988e9ec3138f}} [01:03:30] Logged the message, Master [01:31:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:32:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [01:53:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:54:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [02:01:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:02:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [02:07:32] !log LocalisationUpdate completed (1.22wmf7) at Tue Jun 18 02:07:31 UTC 2013 [02:07:42] Logged the message, Master [02:13:24] !log LocalisationUpdate completed (1.22wmf6) at Tue Jun 18 02:13:24 UTC 2013 [02:13:32] Logged the message, Master [02:25:43] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue Jun 18 02:25:42 UTC 2013 [02:25:53] Logged the message, Master [02:41:44] Account creation seems to be broken on en.wp - anybody knows what's going on (or if somebody is still around)? [02:56:03] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours [02:56:14] andre__, I'm looking into it. If anyone knows anything, please ping me mentioning my username. [02:56:46] superm401: ah, you're mflaschen. didn't know your nick. :) Thanks! [02:58:29] superm401: I'd love to see the stacktrace for that exception (not that I could change anything though) - can you access that? [02:58:47] I should be able to, testing locally first. [02:59:51] Can't reproduce locally on the same commit, so it's a prod issue most likely. 
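For reference, the defer_domains fix discussed at the start of this log (change 69267) only needs to guarantee that the file exim's new template reads is present, mirroring the manual `touch` on streber. A minimal Puppet sketch of that kind of resource; the ownership and mode shown here are assumptions, not necessarily what the merged manifest uses:

```puppet
# Sketch only: make sure /etc/exim4/defer_domains exists without ever
# managing its contents, so the exim template can reference it safely.
file { '/etc/exim4/defer_domains':
    ensure  => present,
    owner   => 'root',
    group   => 'root',   # assumption; the real file may belong to Debian-exim
    mode    => '0644',
    replace => false,    # create it if missing, never overwrite local entries
}
```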
[03:01:43] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:02:33] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.131 second response time [03:02:47] serious stuff: every account creation on enwiki fails with "Fatal exception of type MWException". fluorine finds "Exception from line 215 of /usr/local/apache/common-local/php-1.22wmf6/includes/Hooks.php: Detected bug in an extension! Hook VisualEditorHooks::onAddNewAccount failed to return a value; should return true to continue hook processing or false to abort." [03:02:48] Anyone from VE team around? [03:05:33] in #wikimedia-tech somebody says that account creation works e.g. on meta [03:05:53] spagewmf: https://bugzilla.wikimedia.org/show_bug.cgi?id=49727 [03:06:08] superm401 is taking a look [03:06:36] spagewmf, okay, well that tells us what it is. [03:06:50] I'll do a hotfix to workaround or fix it. [03:08:08] * andre__ assigns the bug report in the meantime [03:09:32] !log Started hotfix for account creation on wmf6. [03:09:39] !log mflaschen synchronized php-1.22wmf6/extensions/VisualEditor/VisualEditor.php 'Emergency hotfix to fix account creation' [03:09:45] Logged the message, Master [03:09:48] superm401 I see what you did there, thanks! [03:09:53] Logged the message, Master [03:10:32] Thanks so much! [03:11:40] Didn't work [03:12:14] superm401 what's the new error code I'll grep it on fluorine [03:12:25] spagewmf, cf48759f [03:13:10] Same, [cf48759f] /w/index.php?title=Special:UserLogin&action=submitlogin&type=signup&returnto=Main+Page Exception from line 215 of /usr/local/apache/common-local/php-1.22wmf6/includes/Hooks.php: Detected bug in an extension! Hook VisualEditorHooks::onAddNewAccount failed to return a value; should return true to continue hook processing or false to abort. [03:13:36] you need to sync VisualEditor.hooks.php ? [03:13:43] spagewmf, yeah, just saw it. [03:14:09] Sorry [03:14:14] Resyncing now [03:14:21] !log mflaschen synchronized php-1.22wmf6/extensions/VisualEditor/VisualEditor.hooks.php 'Emergency hotfix to fix account creation' [03:14:30] Logged the message, Master [03:15:08] There we go [03:15:19] Tested on enwiki successfully. [03:15:21] Thanks, spagewmf [03:15:39] awesome [03:15:48] Oh, there you are, hah. [03:16:21] FWIW https://gerrit.wikimedia.org/r/69277 fixes it "properly" [03:27:26] superm401, spagewmf, nice work [03:27:58] Thanks, spagewmf gets a big chunk of the credit [03:28:20] marktraceur, your fix has the same checksum as superm401's! It's like y'all are psychic :) [03:29:10] Because it's a supersimple fix! :) [03:29:45] And the winner of the dunce cap is James_F|Away, by the way [03:30:10] * marktraceur is just trolling, poor James_F [03:30:33] PROBLEM - Host wtp1008 is DOWN: PING CRITICAL - Packet loss = 100% [03:31:53] RECOVERY - Host wtp1008 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [03:31:55] don't troll, it isn't nice [03:32:59] what do we do about the accounts that were half-created? [03:34:20] ori-l, can you check the user table for 'Test 2013-06-17 unusable 3'? [03:34:28] I'm wondering what if anything was saved to there. [03:35:23] * ori-l looks [03:37:50] 1.22wmf7 has the same problem, OK if I sync that ? [03:38:35] superm401 should if he can, I think [03:38:57] he's deployed one; i don't think it's good to swap hats mid-fix [03:38:59] spagewmf, will do [03:39:10] superm401 ^ the file is in 1.22wmf7, just needs the sync-file [03:39:28] Should have checked that. 
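The exception quoted above is MediaWiki's generic complaint about a hook handler that falls off the end of the function without returning anything; the hotfix (and the proper fix in change 69277) comes down to making the handler return a value. A minimal sketch of a conforming handler, with VisualEditor's actual hook body elided:

```php
<?php
// Sketch of the hook contract, not VisualEditor's real code. Hooks::run()
// (includes/Hooks.php line 215 in 1.22wmf6) throws "Detected bug in an
// extension!" whenever a handler returns null, which is what happens when
// the function ends without an explicit return statement.
class VisualEditorHooks {
	public static function onAddNewAccount( $user, $byEmail ) {
		// ...whatever the extension does when an account is created...
		return true; // true lets the remaining handlers run; false aborts
	}
}
```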
[03:40:06] there's no "should have" in teamwork either [03:40:25] !log mflaschen synchronized php-1.22wmf7/extensions/VisualEditor/VisualEditor.hooks.php 'Hotfix to fix account creation' [03:40:33] Logged the message, Master [03:42:52] superm401: you or some other enwiki admin could temporarily override one of the messages used in Special:UserLogin and add a notice acknowledging the problem and urging users to try again. Do you think that would be appropriate? [03:44:07] The half-created accounts are strange: you're logged in, but trying to go to your page gives 'User account "SpageTest AC 0617-2" is not registered.' [03:44:13] ori-l, not sure, how long was it broken? [03:44:21] spagewmf, yeah, it looks like there's no user table entry at all. [03:44:28] superm401, it's there [03:44:37] user_id 19205308 [03:44:42] but there's no password or e-mail [03:45:51] ori-l, weird, that's not what I get locally, maybe it depends on the master/slave setup. [03:46:18] superm401, I think 23:52 logmsgbot: catrope Finished syncing Wikimedia installation... : Updating VisualEditor to master to 03:15 logmsgbot: mflaschen synchronized php-1.22wmf6/extensions/VisualEditor/VisualEditor.hooks.php 'Emergency hotfix to fix account creation' [03:47:28] I was going to say no, but that's almost three and half hours, so maybe it does make sense. [03:47:56] SELECT count(*) FROM `user` WHERE user_registration > '20130617000000' AND `user_password` = ''; [03:47:58] = 1970 [03:48:01] ouch. [03:48:51] I think we need to escalate this and get more input. [03:48:56] RoanKattouw, you around? [03:49:00] Ys [03:49:13] I supplied e-mails in my account creations, and got the "e-mail address confirmation" mail, but unsurprisingly the confirm URL doesn't work. [03:49:17] So we have 1970 half-created accounts with no password? [03:49:27] RoanKattouw, yeah. [03:49:53] Given that spagewmf got a confirmation email, I guess we do have email addresses for these people? [03:50:08] for those who provided one... [03:50:12] Right [03:50:29] Because one thing we could do is do password resets for the ones that have an email address [03:50:47] I don't know how to trigger those off the top of my head but it seems like a reasonable idea [03:50:48] only for 54 out of 1970 [03:50:51] Ugh [03:51:24] We can either delete the accounts or mangle the user name (by adding an underscore, say) which would at least release it so that it can be re-used, but I'm not sure how to go about doing that without breaking SUL [03:51:43] s/it/them [03:53:21] For SUL you shouldn't look at me, that's better left to Chris [03:53:56] I'm thinking we should delete the accounts, so the names are released and they can be recreated [03:54:13] We should reset the ones that do have emails. [03:54:14] Additionally, the 54 people that did provide an email address could be contacted [03:54:16] Yes [03:54:39] As for deleting the other accounts, they'd probably need to be deleted from the centralauth tables too and I don't know how that works [03:55:13] So let me figure out the password reset [03:55:15] I think an admin can just go to Special:PasswordReset for the 54. [03:55:19] The bug report says resets don't work, but I haven't verified that. [03:55:22] Oh that would be nice [03:55:37] The initial confirmation probably doesn't work [03:55:42] The resets should work, hopefully [03:56:04] When done by an admin at least [03:56:22] RoanKattouw, I'm an admin. I don't think we have any special reset privileges. [03:56:55] To reset, I think the emails will have to be force-confirmed. 
[03:57:04] Right [03:57:10] You can only reset with a confirmed email, which is probbly the issue they hit on the bug report. [03:57:24] Yeah I figured [03:57:29] But I thought admins could get around that maybe [03:57:47] Don't think we have any special abilties. Maybe stewards? [03:57:52] I think it'd be OK to force-confirm the e-mails, since these users will only be able to log in if they actually receive the reset e-mail [03:58:43] ori-l, also, I don't have a staff account on the content wikis, and I'm not supposed to use my personal one for this kind of thing. [03:59:15] OK let's just force-confirm them [03:59:33] ori-l, I'm not 100% convinced it's that useful to change the login/signup message, but if you think it is, we need to find someone (preferably to do it on all the VE wikis, not just enwiki). [03:59:46] RoanKattouw: want me to do it? [03:59:53] Sure go ahead [04:01:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:02:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.145 second response time [04:04:03] what other wikis is VE deployed to? [04:05:39] ori-l, lots of Wikipedias, and MW.org [04:06:04] hoooooooooooo boy. [04:06:18] I'm looking at s1-analytics-slave's enwiki user table and my new account doesn't have a user_email. Is it stored somewhere else before it's confirmed? [04:06:21] ori-l, https://git.wikimedia.org/blob/operations%2Fmediawiki-config.git/HEAD/visualeditor.dblist, I believe. [04:07:06] ok, sanity check my pseudo-code: [04:07:22] for each wiki in visualeditor.dblist: [04:07:38] for each account where account created today and account password is blank and account e-mail is not blank: [04:08:18] confirm email & save [04:08:22] right? [04:09:38] ori-l, maybe do it since the start of the deploy window. [04:09:47] Sounds good [04:09:56] Maybe we should page someone to have them double-check. [04:10:02] At least we should check with Chris. [04:10:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:11:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.131 second response time [04:12:46] !log Manually marked the e-mail address of 54 enwiki users with e-mail address configured but no password (due to bug 49727) as confirmed. [04:12:56] Logged the message, Master [04:13:25] superm401: I did those before your last message; I'm not disregarding your suggestion [04:14:57] ori-l, RoanKattouw, I emailed ops, just in case someone's not watching IRC. [04:15:55] Yeah I saw [04:15:57] Thanks a lot guys [04:16:02] And sorry for not spotting that in CR :) [04:16:04] * :( [04:16:20] (brb) [04:16:43] I told James I didn't test it, but I wish I saw that or took the time to test. [04:17:09] spagewmf, I think it's all in the user table, is it possible Analytics sanitizes that field? [04:18:24] I have to put Noam to bed, I'll be back ASAP [04:21:39] Random tidbit, when testing locally, it seems to grab a user_id, but never actually write the row. [04:21:54] So when I created some with the broken version, then a working one, it jumped for user_id 1 to user_id 8. [04:21:57] With only two actual rows. [04:23:19] superm401 perhaps hook order is dependent on extension ordering, so you may get different behavior than a WMF wiki even if you load the same extensions. 
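ori-l's pseudo-code above maps fairly directly onto a MediaWiki maintenance script run once per wiki in visualeditor.dblist (for example via the foreachwikiindblist wrapper). The sketch below is only an illustration of that loop, not the script that was actually run; the registration cutoff and the selection criteria are taken from the discussion:

```php
<?php
// Illustration only: confirm the e-mail address of accounts half-created by
// bug 49727 (blank password, non-blank e-mail) so Special:PasswordReset works.
require_once __DIR__ . '/Maintenance.php';

class ConfirmBrokenSignups extends Maintenance {
	public function execute() {
		$dbw = $this->getDB( DB_MASTER );
		$res = $dbw->select(
			'user',
			'user_id',
			array(
				'user_password' => '',
				"user_email != ''",
				// start of the affected deploy window (assumed value)
				'user_registration > ' . $dbw->addQuotes( '20130617235200' ),
			),
			__METHOD__
		);
		foreach ( $res as $row ) {
			$user = User::newFromId( $row->user_id );
			$user->confirmEmail();  // sets user_email_authenticated
			$user->saveSettings();
			$this->output( "Confirmed e-mail for {$user->getName()}\n" );
		}
	}
}

$maintClass = 'ConfirmBrokenSignups';
require_once RUN_MAINTENANCE_IF_MAIN;
```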
[04:23:37] StevenW wasn't me boss :) [04:23:40] spagewmf, plus, I'm certainly not loading the same extensions (including important ones like CentralAuth) [04:25:35] lunchtime, BIAB [04:51:28] PROBLEM - NTP on ssl3003 is CRITICAL: NTP CRITICAL: No response from NTP server [04:52:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:54:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [04:57:48] PROBLEM - NTP on ssl3002 is CRITICAL: NTP CRITICAL: No response from NTP server [05:16:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:17:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [05:26:57] FYI Ganglia > Wikimedia Grid > Miscellaneous eqiad > vanadium.eqiad.wmnet, choose day, see the spike in the Exceptions graph. Also there's a link to a canned graph of exceptions & fatals in the last 2 hours in https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Test_and_monitor_your_live_code [05:39:32] Thanks, spagewmf [05:40:25] it's all Ori, I just rite stuff down [05:43:42] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:44:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [06:58:53] back [07:11:04] !log Marking e-mail addresses of accounts created in last 24h with no password as "confirmed" on all wikis on which VE is enabled. [07:11:13] Logged the message, Master [07:18:21] !log Force-confirmed e-mail for 160 accounts. full list in fluorine:/home/olivneh/users-forced-confirm-email-18-Jun-2013.list [07:18:30] Logged the message, Master [07:19:19] New review: Yurik; "even though I would have probably inlined the $zeroRated variable, looks ok." 
[operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/64629 [07:48:15] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [07:48:15] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours [07:48:15] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [07:48:15] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [07:48:15] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours [07:48:16] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [07:48:16] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [07:48:17] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [07:48:17] PROBLEM - Puppet freshness on spence is CRITICAL: No successful Puppet run in the last 10 hours [07:48:18] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours [07:48:18] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [08:01:35] RECOVERY - NTP on ssl3003 is OK: NTP OK: Offset 0.008830189705 secs [08:05:20] New patchset: Nemo bis; "Add tools- and etherpad.wmflabs.org to $wgNoFollowDomainExceptions" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69289 [08:13:34] !log nikerabbit synchronized php-1.22wmf7/extensions/UniversalLanguageSelector/ 'ULS to master' [08:13:44] Logged the message, Master [08:15:26] !log nikerabbit synchronized php-1.22wmf6/extensions/UniversalLanguageSelector/ 'ULS to master' [08:15:35] Logged the message, Master [08:17:40] PROBLEM - Puppet freshness on sodium is CRITICAL: No successful Puppet run in the last 10 hours [08:18:42] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68146 [08:21:40] PROBLEM - Puppet freshness on magnesium is CRITICAL: No successful Puppet run in the last 10 hours [08:25:46] !log nikerabbit synchronized wmf-config/InitialiseSettings.php 'ULS phase 2' [08:25:53] Logged the message, Master [08:32:20] RECOVERY - NTP on ssl3002 is OK: NTP OK: Offset 0.009659051895 secs [09:12:55] hello [09:43:26] New review: Hashar; "Ah that is what fixed it. Thanks!" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68833 [09:46:51] New review: Hashar; "I needed them for https://integration.wikimedia.org/ci/view/Analytics/job/analytics-wikistats/ which..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68327 [10:10:06] New patchset: ArielGlenn; "remove wgSquidServersNoPurge for labs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69299 [10:17:08] New review: Hashar; "Editing a page over HTTPS, the request will pass via the nginx proxy and then the text cache. The re..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69299 [10:31:46] New patchset: ArielGlenn; "clean up squid NoPurge list for beta labs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69299 [10:33:32] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69299 [11:40:40] New review: Petrb; "yes we can, but this was it is FAR more simple and it has same potential as having a variable (we ca..." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/69133 [12:28:36] New patchset: ArielGlenn; "add upload cache to squids list for labs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69304 [12:43:25] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69304 [12:56:24] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours [13:13:53] New review: coren; "It's a reasonable way of doing it." [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/69133 [13:13:54] Change merged: coren; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69133 [13:52:19] New patchset: Ottomata; "Lowering alert thresholds on kakfa-broker-ProduceRequestsPerSecond" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69307 [13:52:33] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69307 [13:56:11] it is going to get noisy here in a few minutes...relocating search [14:00:12] !log powering down search1001-1024 to relocate to row c5 [14:00:20] Logged the message, Master [14:02:25] PROBLEM - Host search1024 is DOWN: PING CRITICAL - Packet loss = 100% [14:03:05] PROBLEM - Host search1023 is DOWN: PING CRITICAL - Packet loss = 100% [14:03:05] PROBLEM - Host search1022 is DOWN: PING CRITICAL - Packet loss = 100% [14:03:15] PROBLEM - Host search1021 is DOWN: PING CRITICAL - Packet loss = 100% [14:03:15] PROBLEM - Host search1020 is DOWN: PING CRITICAL - Packet loss = 100% [14:03:35] PROBLEM - Host search1019 is DOWN: PING CRITICAL - Packet loss = 100% [14:04:05] PROBLEM - Host search1017 is DOWN: PING CRITICAL - Packet loss = 100% [14:04:05] PROBLEM - Host search1018 is DOWN: PING CRITICAL - Packet loss = 100% [14:04:37] PROBLEM - Host search1016 is DOWN: PING CRITICAL - Packet loss = 100% [14:04:38] PROBLEM - Host search1015 is DOWN: PING CRITICAL - Packet loss = 100% [14:04:38] PROBLEM - LVS Lucene on search-pool5.svc.eqiad.wmnet is CRITICAL: No route to host [14:04:45] PROBLEM - Host search1014 is DOWN: PING CRITICAL - Packet loss = 100% [14:04:56] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: No route to host [14:05:02] PROBLEM - LVS Lucene on search-prefix.svc.eqiad.wmnet is CRITICAL: No route to host [14:05:29] PROBLEM - Host search1008 is DOWN: PING CRITICAL - Packet loss = 100% [14:05:29] PROBLEM - Host search1010 is DOWN: PING CRITICAL - Packet loss = 100% [14:05:29] PROBLEM - Host search1012 is DOWN: PING CRITICAL - Packet loss = 100% [14:05:29] PROBLEM - Host search1009 is DOWN: PING CRITICAL - Packet loss = 100% [14:05:29] PROBLEM - Host search1011 is DOWN: PING CRITICAL - Packet loss = 100% [14:05:30] PROBLEM - Host search1013 is DOWN: PING CRITICAL - Packet loss = 100% [14:05:38] PROBLEM - LVS Lucene on search-pool3.svc.eqiad.wmnet is CRITICAL: No route to host [14:06:18] PROBLEM - Host search1007 is DOWN: PING CRITICAL - Packet loss = 100% [14:06:48] PROBLEM - Host search1006 is DOWN: PING CRITICAL - Packet loss = 100% [14:06:48] PROBLEM - Host search1005 is DOWN: PING CRITICAL - Packet loss = 100% [14:06:49] PROBLEM - LVS Lucene on search-pool2.svc.eqiad.wmnet is CRITICAL: No route to host [14:07:18] PROBLEM - Host search1002 is DOWN: PING CRITICAL - Packet loss = 100% [14:07:18] PROBLEM - Host search1004 is DOWN: PING CRITICAL - Packet loss = 100% [14:07:18] PROBLEM - Host search1003 is DOWN: PING CRITICAL - Packet loss = 100% [14:07:18] PROBLEM - Host search1001 is DOWN: 
PING CRITICAL - Packet loss = 100% [14:07:58] PROBLEM - LVS Lucene on search-pool1.svc.eqiad.wmnet is CRITICAL: No route to host [14:11:46] hiii anybody up? [14:12:00] can I add Stefan Petrea to the admins:::mortals class? [14:12:08] he needs to be able to access stat1002.eqiad.wmnet [14:12:11] which does not have a public IP [14:12:21] which means he needs an account on bast1001 and/or fenari [14:14:17] <^demon> ottomata: New shell users have to go through RT I believe. [14:18:00] he's not a new user [14:18:05] he's already got the account on stat1002 [14:18:14] he just hasn't logged into it yet [14:18:17] i'm trying to help him now [14:18:26] and I noticed that he doesn't have a fenari or bast1001 account [14:27:18] baackk [14:36:13] New patchset: Andrew Bogott; "Move mail manifests to a module called 'exim'" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68584 [14:41:20] mark (if around) would you have a look at https://gerrit.wikimedia.org/r/#/c/68584/ ? [14:41:32] <^demon> ottomata: I guess just push for review. If someone has a problem I'm sure they'll say so :) [14:42:03] New review: Andrew Bogott; "Before this is merged we need to change the node definition in ldap for labs instances." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68584 [15:20:25] New patchset: Ottomata; "Adding spetrea to admins::mortals so he has an account on bastion hosts." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69313 [15:21:25] New review: Ottomata; "Do I need to make an RT for this? He already hass server access on several machines, including one ..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69313 [15:25:49] paravoid, hi, have you had a chance to review my two patches? one is tiny style update, but the other removes X-Carrier [15:25:59] I'm looking at them now [15:26:10] nice coincidence [15:26:27] so, where does the 30d comes from? [15:27:04] mediawiki sets 30d maxage in cache-control? [15:30:24] Change abandoned: Ottomata; "Not going to use this at all." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48041 [15:30:57] yurik: ^ [15:31:37] paravoid, yep [15:31:45] okay [15:31:48] they need rebasing [15:32:00] hm, or maybe not [15:32:17] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68840 [15:33:03] thanks! [15:33:11] what about X-Carrier in zero.inc.vcl? [15:33:24] ah wait [15:33:50] New patchset: Faidon; "Removed X-Carrier header" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68841 [15:34:32] ? [15:34:38] nevermind [15:34:43] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68841 [15:34:59] ottomata: heads up, X-Carrier is about to be gone, in case it matters for analytics [15:35:12] I think that's fine, we're only using X-Analytics afaik [15:35:19] i don't think X-Carrier is in the logs anymore [15:35:37] yeah that's not a problem at all, we never used that [15:35:43] andrewbogott: I don't like mailman being in the exim module [15:36:02] hey guys, do I need an RT for this? 
[15:36:03] https://gerrit.wikimedia.org/r/#/c/69313/ [15:36:04] mark: ok, I can break it out into a separate one [15:36:16] stefan already has access to stat1002.eqiad.wmnet, just not an official bastion host [15:37:09] otherwise I think it's fine, it's not a very segregated module and all, but let's not care about that too much at this point, at least it's being moved into a module ;) [15:38:22] yurik: btw, has any progress been made into creating a mobile/zero-enabled index page? [15:38:31] instead of doing all these redirects in varnish [15:38:35] as we previously talked about [15:39:16] paravoid, not really - were mostly trying to figure out a way to get rid of X-CS fragmentation [15:39:34] have you seen the latest email about it ? [15:39:40] i sent a "short" version [15:40:27] btw, you might want to look at it as it adds a lot new redirects to the mex [15:40:30] *mix [15:44:34] so, how about that VE team, eh? ;) [15:46:23] ? [15:46:33] @notify binasher [15:46:33] I'll let you know when I see binasher around here [15:51:02] apergos: the "account creation downtime" thread on Ops. From VE code. [15:53:06] ah [15:53:12] crap happens [15:55:02] mark, do you think spamassassin should be a separate module as well? [15:55:06] apergos: yeah, just joking around, hence the ;). The response/fix looked quick and well done. And it did bring up an unfortunate limitation in our monitoring, so, all good I guess :) [15:55:18] andrewbogott: well yeah [15:55:23] you could just make it all a "mail" module [15:55:25] and not worry about it now [15:55:33] we can always make it cleaner and split up later if we need to [15:56:01] it wasn't permanent harm to revision content, most everything else can be worked around... [15:56:47] I think it's easy enough to break up now, I'll do that [15:58:59] yurik: I just answered [15:59:16] * yurik looking [16:00:07] andrewbogott: also you can migrate that to 4-space indent now if you're moving the files over anyway [16:00:16] :) ok [16:00:39] Mark, in this file https://gerrit.wikimedia.org/r/#/c/68584/5/manifests/mail.pp line 204 [16:01:06] which 'mailman' is included? The mailman subclass that's in scope, or the top level mailman class? [16:01:07] yeah that's dirty [16:01:17] the in-scope one [16:01:26] but it would be better to fully qualify that [16:01:33] In that case what does line 210 do? [16:01:44] because there is no in-scope spamassassin class [16:02:04] that's taking the global one I think [16:02:07] really confusing :) [16:02:08] eek [16:02:17] ok :) [16:02:22] apergos: yep, agree. [16:04:37] New patchset: Andrew Bogott; "Move mail manifests to a module called 'exim'" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68584 [16:05:41] New review: Andrew Bogott; "work in progress - do not merge" [operations/puppet] (production) C: -2; - https://gerrit.wikimedia.org/r/68584 [16:08:10] mark thanks, i really wish we could get rid of all the variances, but zero has no images in HTML - only redirects to them [16:08:25] i know [16:08:27] but m does too for some [16:08:36] not that i know of [16:08:47] m always shows images [16:09:03] i think i'm still confused about the differences [16:09:14] i thought you wanted to replace all that by js (if you can) [16:10:01] mark - m has images, zero - no images. As simple as that really. I do want to replace redirects (to images or to other sites) with either direct links or in-browser confirmation boxes [16:10:18] ok [16:10:53] hey #ops - is there any way for me to query access logs, specifically arround search? 
I wanna know what search features people are actually using. [16:11:16] manybubbles: i'm not entirely sure, notpeter would know [16:11:58] I think having *two* variants, one with images and one without is fine for now [16:12:03] if all the rest go [16:12:31] paravoid, well, 3 variants - one staying as is today [16:12:38] for non-zero carriers [16:13:08] unless you want me to add redirect replacing js to ALL mobile deviecs [16:13:20] mobilefrontend will be happy... [16:13:35] er? [16:14:34] paravoid, "non-zero-carrier" variant, "m@zero-carrier" and "zero@zero-carrier" [16:15:18] the first one can be removed technically, in which case ALL m.* traffic will have its site-site links rewritten with redirects [16:15:49] and javascript will make them direct again on the client [16:17:08] that's a terrible idea [16:17:15] I didn't propose that [16:18:47] agree :) Although if we (zero) are successful, we will sign up ALL the world's carriers, making the first group almost nothing [16:19:36] but then you can put whether to redirect or not in the varnish per-carrier config, can't you? [16:25:53] New patchset: Jdlrobson; "Make use of centralauth tokens for MobileFrontend photo uploads" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69318 [16:29:52] yurik: so as far as I understand it, for carrier views all external links are sent to a redirector page, and that redirector page is basically not cacheable at all? [16:32:57] mark, redirector is not really the right term we use - its more of a confirmation page that shows Yes/No if the user agrees to switch to non-free resource. In case the resource is also whitelisted, it becomes a silent redirect. And the confirmation page can be made cachable - per X-CS [16:33:27] how does it know where to redirect to then? [16:33:36] URL params [16:33:47] so that's separate cache objects for that too [16:34:01] every external link gets its own cache object * X-CS [16:34:32] potentially - yes, but much much smaller than having every wiki page * X-CS [16:34:46] smaller - the HTML snippets are tiny [16:35:19] and due to javascript clients rewriting those links, much rarer [16:36:13] ok so in the (increasingly common) js case, there's just no call to the "redirector/confirmation page" at all, right [16:36:21] correct [16:38:27] I assume the confirmation page looks different per carrier? [16:40:36] mark, no, confirmation is actually the asme [16:40:38] same [16:41:02] its just a red banner at the top with yes/no, and an empty page thereafcter [16:41:15] then it doesn't really need to vary on X-CS I guess [16:41:25] or rather, just on X-CS being present or not [16:41:37] and javascript working or not [16:41:38] bleh [16:41:47] it does - because sometimes it redirects and sometimes it shows the prompt [16:41:51] depending on the carrier [16:41:57] (redirects silently) [16:42:08] yeah but you can put that config in varnish soon right [16:42:26] with javascript on the page wouldn't even be hit, so that doesn't matter [16:42:39] then again, whether js works or not, not so much [16:43:08] the confirmation page doesn't care if js is there or not - if the user goes to it, it means JS is off [16:43:16] regardless of the reasons for it [16:43:20] ok [16:44:05] then you can put redirect/prompt in the varnish vmod config [16:44:08] as for varnish -- yes, we could technically do the whole redirector logic in varnish -- by generating yet another IP->redirect? 
logic [16:44:22] well [16:44:24] sorry, not logic - a database [16:44:36] yes [16:45:11] i'll check to see what exactly we can do with that [16:45:17] and then answer again to your mail on the list [16:45:32] mark, but then we would have to have a database for each target site [16:45:47] thanks for explaining again, it's all a bit confusing for us now intimately involved in the different scenarios ;) [16:45:58] target site? [16:46:02] s/now/not/ [16:46:20] if the user is heading to ru.wikipedia -- some carriers have whitelisted ru, but not others [16:46:27] ah right [16:46:32] * mark sighs [16:46:35] whereas if the user goes fr.wiki, that list is different [16:46:47] could still be done fairly easily [16:46:57] if the vmod databases are fairly light [16:47:09] yeah don't want to put too much stuff in there [16:47:30] we could make it a two key database -- IP+string [16:47:35] ->string [16:48:20] if varnish allows string manipulation - like extracting language code from URL [16:48:46] yuck [16:49:08] anyway [16:49:13] this is all only for non-js devices [16:49:23] who said it will be easy :) [16:49:23] so let's not put too much effort into those [16:49:23] yep [16:49:23] agree [16:49:34] hence - we might as well cache all those redirect pages [16:49:59] in some cases it will simply cache 302, in others - a short snippet of HTML code [16:50:31] ok [16:50:41] how well does varnish deal with LOTS of small objects? [16:50:54] no idea [16:50:56] (as oppose to few large ones) [16:51:03] to be discovered the hard way i guess [16:51:40] thanks for all your comments though!!! [16:54:05] mark / paravoid - on a separate note, who wants to review robots change? we are getting lots of links from google. https://gerrit.wikimedia.org/r/#/c/64629/ [16:56:24] done [17:04:10] ^demon and others who care about CirrusSearch: I've just pushed a fix to index dates and caching configuration. [17:04:31] <^demon> Mmk. [17:06:01] the caching we'll need at some point. for now it shaves about 5ms off of cached requests. its still so fast that you don't notice the solr time. [17:06:05] !log reloading squid frontends in pmtpa.text [17:06:13] Logged the message, Master [17:08:04] !log reloading squid frontends in squid_esams_text and eqiad.text [17:08:13] Logged the message, Master [17:08:34] akosiaris, paravoid: when you get a chance: https://gerrit.wikimedia.org/r/#/c/50385/ [17:09:39] ottomata: i briefly talked to diederik yesterday, we said we could prepare ana1019 for reinstall and then i'd try PXE boot there to compare to 1020 to see a pattern or not [17:09:57] haha, diederik JUST said that to me in our stand up 5 seconds ago [17:10:04] can do. [17:10:07] heh, ok [17:14:28] mutante: ottomata: an1020 is always stuck here Scanning for devices. Please wait, this may take several minutes... [17:14:54] any ideas ? [17:15:07] i am btw recycling it ... i hope that is ok [17:15:18] power cycling* it that is [17:15:40] New patchset: awjrichards; "Enable Special:LoginHandshake on betlabs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69322 [17:16:18] New patchset: awjrichards; "Enable Special:LoginHandshake on betlabs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69322 [17:16:54] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69322 [17:17:39] akosiaris: here's what i know: it is NOT DHCP, it does get an DHCPACK from brewster but i dont get to see an installer. 
i saw an installer ONCE but then could not repeat it [17:18:06] akosiaris: the scanning for devices thing seems still normal, after that it tries PXE, for debugging i disabled booting from disk in BIOS [17:19:43] akosiaris: but after DHCP it fails to get the installer files. could it be TFTP still blocked by network ACL? or it could be what happened to other analytics boxes, it gets a stream of udp2log packets [17:19:46] binasher: I can haz S7? [17:20:30] akosiaris: the goal is just to do a reinstall, if you could get it to an installer , just let it run, would be great [17:21:34] mutante: the stream of udp2log packets only happened on the udp2log boxes [17:21:43] haven't actually seen that yet on the analytics nodes [17:21:46] although this could be it [17:22:04] the udp2log boxes join the multicast group and get a barrage of udp data [17:22:19] but, none of the nodes on the same subnet as an20 join that group [17:22:30] at least not regularly. I have joined it to debug things occasionally [17:26:29] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64629 [17:27:26] !log wtp1008 is going down for main board replacement [17:27:34] Logged the message, Master [17:28:05] gwicke roankattouw james_f ^ [17:28:15] cmjohnson1: Thanks. [17:30:34] PROBLEM - Host wtp1008 is DOWN: PING CRITICAL - Packet loss = 100% [17:36:10] New review: Dzahn; "works now after reloading frontend squids. f.e. in http://www.mobilephoneemulator.com/ i get Real lo..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66991 [17:39:21] !log dns update [17:39:28] Logged the message, Master [17:41:25] New patchset: Andrew Bogott; "Move mail manifests to a module called 'exim'" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68584 [17:41:54] ottomata: what box are the search logs being collected on? 
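Going back to the exchange between yurik, mark and paravoid about the Zero confirmation/redirect page: caching it "per X-CS" is essentially one extra line of VCL, at the cost of one cache object per carrier for that URL. A sketch assuming Varnish 3 syntax; the Special-page path is a placeholder, not the real endpoint:

```vcl
# Sketch: add the carrier code to the cache key only for the confirmation
# page, so ordinary article objects are not fragmented per X-CS.
sub vcl_hash {
    if (req.http.X-CS && req.url ~ "^/wiki/Special:ZeroConfirm") {
        hash_data(req.http.X-CS);
    }
}
```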
[17:42:09] bwerrr, i believe oxygen lemme check [17:42:24] yes [17:42:26] /a/log/lucene [17:42:47] kk, thanks [17:42:56] mark, the latest: https://gerrit.wikimedia.org/r/#/c/68584/ [17:46:09] New patchset: awjrichards; "Enable centralauth token for mobile on betalabs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69325 [17:46:42] New patchset: awjrichards; "Make use of centralauth tokens for MobileFrontend photo uploads" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69318 [17:48:25] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69325 [17:49:12] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours [17:49:12] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [17:49:12] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [17:49:12] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [17:49:12] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours [17:49:13] PROBLEM - Puppet freshness on spence is CRITICAL: No successful Puppet run in the last 10 hours [17:49:13] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [17:49:14] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [17:49:14] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [17:49:15] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [17:49:15] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours [18:05:12] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: No route to host [18:05:23] PROBLEM - LVS Lucene on search-prefix.svc.eqiad.wmnet is CRITICAL: No route to host [18:05:52] PROBLEM - LVS Lucene on search-pool5.svc.eqiad.wmnet is CRITICAL: No route to host [18:06:03] PROBLEM - LVS Lucene on search-pool3.svc.eqiad.wmnet is CRITICAL: No route to host [18:06:56] !log search is based in tampa [18:07:05] Logged the message, RobH [18:07:18] "that's fine" :) [18:07:29] !log ignore eqiad based search errors for now, as tampa is hosting all search traffic while hardware is repaired [18:07:38] Logged the message, RobH [18:07:39] PROBLEM - LVS Lucene on search-pool2.svc.eqiad.wmnet is CRITICAL: No route to host [18:08:02] !log LVS monitoring pages caused by neon recovering [18:08:10] Logged the message, Master [18:08:13] !log log log log log [18:08:21] Logged the message, RobH [18:08:24] PROBLEM - LVS Lucene on search-pool1.svc.eqiad.wmnet is CRITICAL: No route to host [18:11:25] !log it's better than bad, it's good! [18:11:32] Logged the message, Master [18:14:40] greg-g: :) [18:14:43] thanks for the incident report [18:17:47] PROBLEM - Puppet freshness on sodium is CRITICAL: No successful Puppet run in the last 10 hours [18:18:48] ori-l: no problem :) [18:19:07] ori-l: thanks again for working last night on repairing the incident ;) [18:21:47] PROBLEM - Puppet freshness on magnesium is CRITICAL: No successful Puppet run in the last 10 hours [18:24:05] New review: Andrew Bogott; "Making this happen automatically makes it somewhat annoying to test local changes to the syncronised..." 
[operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/68309 [18:32:24] cmjohnson1: heh of course ….. i finally am like fuck it and am about to opena juniper case .. do a commit synchronize and of course, it's working now [18:34:32] mutante: the SM CLP of an1020 does not show the PXE sequence... had to use the DRAC webif to see it. So i confirm that it is stuck in TFTP [18:34:41] now... let's find out why [18:35:19] akosiaris: update i just received! it might be that carbon is down, RobH is currently looking [18:35:44] RobH: [18:35:57] lesliecarr: of course :-) [18:36:14] New patchset: Asher; "additional [centralauth] tables to filter at the sanitarium" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69335 [18:36:15] akosiaris: are you on it now? if so and you arent doing something, i would like to take it over [18:36:27] i wanna take a gander at what you all are seeing. [18:36:36] carbon's tftpd is running [18:36:48] (though doesnt mean it was when you tried last) [18:36:50] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69335 [18:37:21] ....No more sessions are available for this type of connection! [18:37:53] RobH: yeah you can take over [18:38:30] i can help though if you want... and I wanna find out wtf the problem is... [18:38:47] the "no more sessions" thing might happen if we try to mix java console and ssh console stuff [18:39:07] xmm [18:39:07] as in one user per webif console and one per ssh to mgmt [18:39:13] that is me ... two sessions [18:39:18] in ssh both though [18:39:20] i am logging out [18:39:35] yea you are taking up all the session slots. [18:39:53] now its workin [18:39:56] logged out from both [18:40:01] you could share a screen and then ssh to mgmt from there [18:40:17] yea thats what iron is for [18:40:19] but no one does [18:40:24] heh, ok:) [18:40:42] ah, we should really remember that :) [18:40:50] true [18:41:16] iron ? [18:42:06] https://wikitech.wikimedia.org/wiki/Systems_management [18:42:09] this iron ? [18:42:48] ", do not reference this page yet ":) [18:43:09] but yeah, iron is a Wikimedia IPMI Management (misc::management::ipmi). [18:43:29] Last login: Thu Jun 6 [18:43:30] on it [18:43:57] !log installing kernel and package upgrades on iron [18:44:11] Logged the message, Master [18:45:03] so we had to set that up so the ipmi mgmt bash wrapper could puppet install on it [18:45:10] since the c2100s werent drac, ipmi only [18:45:20] about to run scap [18:45:22] now no more c2100s, heh [18:47:20] it was kind of slow installing those upgrades but done [18:47:32] reboot? [18:48:15] not while we troubleshoot the TFTP problem [18:48:27] 5 mins after we figure it out :-) [18:48:44] sorry, i saw no other user on it so already went ahead [18:48:50] in that second [18:49:16] ok ok... no worries [18:49:29] New patchset: Andrew Bogott; "Move ldap into a module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69337 [18:49:32] !log rebooting iron [18:49:32] iron is a bastion not tftp [18:49:40] Logged the message, Master [18:50:40] New review: Andrew Bogott; "Work in Progress -- do not merge" [operations/puppet] (production) C: -2; - https://gerrit.wikimedia.org/r/69337 [18:50:43] waits for console output :p [18:51:13] you can go ahead with carbon/ana1020 from anywhere [18:53:02] uhm, iron .. i don't exactly see it coming back, sigh [18:53:13] something is blocking TFTP [18:53:41] way to few packets coming and going... 
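The line-204 versus line-210 question above is the classic Puppet relative-name-resolution trap under the 2.7-era rules in use here: a bare `include mailman` inside a nested class finds a same-named class in the enclosing namespace, while `include spamassassin` silently falls back to the top-level class because no nested one exists. A hypothetical reduction of that mail.pp situation; these are not the real manifests:

```puppet
# Hypothetical reduction of the mail.pp scoping question, not the real code.
class spamassassin {
}

class mail {
    class mailman {              # nested class; its full name is mail::mailman
    }

    class exim {
        include mailman          # resolves to the in-scope mail::mailman
        include spamassassin     # no mail::spamassassin exists, so this picks
                                 # up the top-level spamassassin class instead
    }
}

# Fully qualifying removes the ambiguity: `include mail::mailman` always means
# the nested class, and a top-level class would be written `include ::name`.
```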
[18:53:52] note that analytics is on its own vlan [18:53:55] and it keeps requesting precise-installer/ubuntu-installer/amd64/pxeli [18:55:08] lol, so iron is already back , i just couldnt see it [18:55:08] thats why [18:55:09] and monitoring didnt notice either [18:55:14] tftp is failing on the server ide [18:55:16] it times out trying [18:55:23] this is network based. [18:55:39] yeah but requests are arriving at carbon [18:55:49] i think the replies are never getting to an1020 [18:55:56] yea its allowing out [18:55:57] not in. [18:56:31] New patchset: Kaldari; "Turn on Disambiguator on en.wiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69340 [18:56:42] also feel free to actually use iron again, it's up and has new kernel [18:57:08] !log kaldari Started syncing Wikimedia installation... : [18:57:13] akosiaris: ottomata: so most likely it is network ACL [18:57:16] Logged the message, Master [18:57:22] yeahhhhhh that sounds like ACL [18:58:47] the ACL is supposed to only restrict traffic going out from analytics cluster [18:59:12] ottomata: yes, i have confirmed that, though it could be some request from analytics failing .... [18:59:19] yeah [18:59:26] but you already looked at tftp thing, right? [18:59:34] maybe it uses another port to initialize connection? dunno? [18:59:49] it needs to ask for the right installer file first and then receive it, so both directions needed, right [19:00:12] unless it's detected as related? [19:00:23] doubtfull [19:00:25] it does … so that's o n carbon, right ? [19:00:28] differents ports [19:00:37] tftp requests are allowed to carbon and brewster [19:00:39] god... why do i do so many typos ? [19:00:53] 18:52:18.531140 IP 10.64.36.120.2071 > 208.80.154.10.69: 73 RRQ "precise-installer/ubuntu-installer/amd64/pxeli" [|tftp] [19:00:53] 18:52:18.531469 IP 208.80.154.10.53926 > 10.64.36.120.2071: UDP, length 15 [19:01:03] 53926 ? [19:01:11] so ports are changing... [19:01:14] destination port 69 [19:01:23] lunch meeting time now though... [19:01:27] notpeter: all the search has been moved except for searchidx..going to reinstall all of them and update bios while I am at it. [19:01:29] but, the reverse shouldn't matter to us ... [19:01:33] ok, we have to go to lunch thing [19:01:35] bbiab [19:01:43] yea, same heere [19:02:09] cool. see you later [19:02:21] gonna get something to eat too [19:03:12] New patchset: Andrew Bogott; "Move ldap into a module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69337 [19:09:04] New patchset: Andrew Bogott; "Move mail manifests to a module called 'exim'" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68584 [19:15:59] !log kaldari Finished syncing Wikimedia installation... : [19:16:08] Logged the message, Master [19:17:29] New review: Alex Monk; "Caused bug 49759" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69130 [19:18:26] New patchset: Cmjohnson; "updating wtp1008 mac" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69345 [19:30:50] Change merged: Cmjohnson; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69345 [19:33:18] New review: Dr0ptp4kt; "This may need to be reverted, depending on the expected lifetime of the various robots.txt files in ..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64629 [19:34:33] yurik, mark, paravoid, jdlrobson ^^ [19:34:58] dr0ptp4kt, sup [19:35:30] yurik, see the comments in change 64629. 
i think that actually needs to *not* be released into production yet. do you know if that change is already synced to production? [19:36:20] yurik, it seems that either a cached robots.txt file is being served, or that the patch isn't working as expected, probably due to caching. this is good, as it actually needs to be delayed for a good week extra yet. [19:37:13] dr0ptp4kt, reverting [19:37:22] yurik, thanks pal. [19:37:29] New patchset: Yurik; "Revert "Instruct robots to stop indexing zero.wikipedia.org and its subdomains."" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69349 [19:37:39] Change merged: Yurik; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69349 [19:37:42] dr0ptp4kt, done [19:37:56] https://gerrit.wikimedia.org/r/#/c/69349/ [19:38:56] New review: Reedy; "(1 comment)" [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/68415 [19:40:02] Change merged: Kaldari; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69340 [19:40:52] yurik, thanks. mark, paravoid, jdlrobson, yurik: do you know how long it normally takes for stuff merged into operations/mediawiki-config to be actually deployed out to the production wikipedias? [19:43:04] !log kaldari synchronized wmf-config/InitialiseSettings.php 'Turn Disambiguator on on en.wiki' [19:43:12] Logged the message, Master [19:43:35] New patchset: Nemo bis; "Update gitweb/gitblit RSS" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68415 [19:48:04] yo ori-l, you there? quick puppet design brainbounce? [19:48:16] ottomata: hey, what's up? [19:48:24] i'm working on puppetizing some hive stuff [19:48:52] there's basically two kinds of puppetizations: client + server (there's more than that but thats good enough for our brainbounce) [19:49:08] in most of my other cdh4 puppetization stuff, i've installed the exact same config files on all nodes [19:49:23] even if it some of the configs are not necessarily relevant to all of them [19:49:39] that keeps the puppetization easier, and also makes examining config files easier [19:49:44] i *can* do this in this case as well [19:49:56] except that on the server's config file, I need to store a mysql database password [19:50:16] so I wanted to conditionally only include that part of the config file on the server [19:50:21] and on the server make it not world readable [19:51:03] but i'm not sure of the best way to do that, there are a few different ways, and none of them are as elegant or consistent with what I've been doing so far [19:51:21] can i see what you have so far somewhere? [19:51:28] yeah, hmmmmmMmMm [19:51:32] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/67300 [19:51:33] I think I can push for review and say its not ready yet [19:51:35] one sec [19:53:57] New patchset: Ottomata; "Puppetizing hive client, server and metastore." [operations/puppet/cdh4] (master) - https://gerrit.wikimedia.org/r/69353 [19:54:12] ori-l^ [19:54:26] mainly just look at [19:54:26] * ori-l looks [19:54:38] hive.pp, hive/metastore.pp and templates/hive/hive-site.xml.erb [19:55:30] in the .erb template, i want to be able to do something like "<% if is_metastore_host _%> metastore db creds here <% end -%>" [19:55:30] New review: Alex Monk; "(1 comment)" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/68309 [19:55:57] ottomata: k, and can you explain what "client" and "server" mean in the context of hive? 
[19:56:02] yeah [19:56:14] client is just whatever binaries you use to connect to run client queries [19:56:18] its kind of the same as mysql [19:56:24] in this case, server is a bit more than jsut a server [19:56:28] there is a hive-server [19:56:32] and a hive-metastore [19:56:40] the 'client' talks only to hive-server [19:56:46] hive-server talks to hive-metastore [19:57:02] hive-metastore is (afaik) just a proxy JDBC service [19:57:05] New review: MaxSem; "(1 comment)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68309 [19:57:06] it probably does more than that [19:57:20] but for our purposes thats what it does [19:57:27] it needs to be configured to talk to a database backend of some kind [19:57:30] by default we are using mysql [19:57:32] so [19:57:39] the hive-site.xml file on the hive-metastore node [19:57:45] needs to have a mysql username and password in it [19:58:02] none of the hive 'client' nodes need that info [19:58:12] they only need to know how to talk to hive-server and hive-metastore [19:59:39] i'm thinking i need to separate out the hive client and hive server/metastore configuration [19:59:52] right now hive server/metastore require that cdh4::hive (client) configs are installed first [20:00:21] what about adding an 'extra_settings' parameter [20:00:39] and then <%= @settings %> in the erb file [20:00:57] and you can do extra_settings => template('to_be_interpolated_into_main_erb_tempalte.erb') [20:01:03] hmm, i don't need a whole extra_settings thing, but I guess I could add a parameter to cdh4::hive that said metastore_host => true [20:01:29] both seem a little inelegant though [20:01:50] i could decouple cdh4::hive from server/metastore hosts altogether [20:02:01] but that is inconsistent with everything else i've been doing in this module [20:02:12] and would duplicate a lot of the parameters in multiple classes [20:02:29] hmmm [20:02:34] i could make another wrapper class [20:02:36] well, you can hide extra_settings inside a 'cdh4::metastore' [20:02:37] right [20:02:37] i already have cdh4::hive::Master [20:02:46] which includes cdh4::hive with the extra_settings param [20:02:47] if I made a cdh4::hive::client [20:02:56] or that, sure [20:03:08] the issue is which class renders the hive-site.xml file [20:03:27] if I remove that from cdh4::hive then I can do this [20:03:29] hm [20:03:35] 'cdh4::hive' will [20:03:38] and render it differenting in ::client vs ::master [20:03:46] differently* [20:03:51] cannot delete non-empty directory: php-1.22wmf2 [20:04:25] hmmm, but I still ahve to duplicate a lot of parameters then. [20:04:38] not a huge deal, i guess [20:04:40] gr [20:04:44] don't like it. 
[20:04:55] class 'cdh4::hive::server' { class { 'cdh4::hive': extra_settings => template('server_configs.xml') } } [20:05:11] class 'cdh4::hive::client' { class { 'cdh4::hive': } } # no extra settings [20:05:16] right [20:05:18] that would be fine [20:05:35] but then i'd have to duplicate all of the common parameters between cdh4::hive::client and cdh4::hive::master [20:05:41] oh yeah, I see what you're saying [20:05:42] hrm [20:06:37] its not that hard to just add an extra param to cdh4::hive for metastore_host, i guess, and then have the role class include cdh4::hive differently on the server/metastore hose [20:06:39] host* [20:07:00] yeah, that might be neater after all [20:07:02] its a wee inelegant, since that role class will also be including cdh4::hive::master (which includes ::metastore) [20:07:05] but let me think about this for another minute [20:07:11] seems redundant to have to tell cdh4::hive this as well [20:07:32] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69318 [20:08:40] OH [20:08:42] ori-l, I know [20:08:42] ottomata: what about having a common 'params' class? [20:08:59] my ::defaults class is analogous to the ::params class pattern [20:09:16] puppet-lint tells me not to use the params pattern because it doesn't work for older versions of puppet [20:09:39] so I use '::defaults' to get around lint…and because I think the name is more descriptive anyway [20:09:39] but! [20:09:41] i know! [20:09:51] I'm actually missing a config that I haven't done yet [20:09:59] I need to add [20:09:59] hive.metastore.uris [20:09:59] thrift://:9083 [20:10:00] I think you can disregard that. I don't agree with puppet-lint on this. [20:10:06] so [20:10:27] that needs to be given to all hive nodes [20:10:33] and then I can just do [20:10:54] if $::ipaddress == $metastore_ipaddress { ..include metastore db creds } [20:10:59] or $::fqdn, whatever I use [20:11:20] that seems magical in a bad way [20:11:49] ok, back [20:12:41] naw its way cool [20:12:55] ottomata: https://cwiki.apache.org/Hive/adminmanual-configuration.html seems to suggest that hive-site.xml should contain site-wide settings [20:12:58] so much more elegant, and i've done something similar for other things [20:13:13] so maybe you can specify the mysql configuration using -hiveconf command [20:13:26] command-line argument [20:13:50] yargh, then I'd have to specify a special hive-env.sh for the metastore service, or modify the init.d file [20:13:54] that starts the metastore [20:14:30] i think the latter would be OK; they're different services after all [20:14:44] you don't want to completely hide the distinction between metastore / nonmetastore instances using puppet magic [20:15:18] so … wtf with the analytics subnet [20:16:42] haha, LeslieCarr wtf indeed! [20:16:49] ori-l, it doesn't [20:17:01] you still have to include cdh4::hive::metastore on your metastore host to get it working [20:17:16] all this does is keep the db creds out of non-metastore host's hive-site.xml file [20:17:18] so, let's walk through where it dies … first anX requests a dhcp address, which it receives …. then it requests the tftp image, which fails ? [20:17:51] LeslieCarr: mutante and RobH and maybe akosiaris know more about the current status, but from what I know, yes that is true [20:20:25] hrm, so it's an1020 -- mind if i reboot it (yet again?)
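A sketch of the single-template approach described above, where every node renders the same hive-site.xml and the metastore database credentials only appear on the metastore host itself. hive.metastore.uris is the property named in the discussion; the javax.jdo.option.* properties are the standard Hive metastore JDBC settings; the @metastore_host, @jdbc_username and @jdbc_password variables and the example connection URL are assumptions about what the class would expose, not the actual module's names.

    <%# Illustrative hive-site.xml.erb: one template for all nodes, with the  %>
    <%# metastore database credentials rendered only on the metastore host.   %>
    <configuration>
      <property>
        <name>hive.metastore.uris</name>
        <value>thrift://<%= @metastore_host %>:9083</value>
      </property>
    <% if @fqdn == @metastore_host -%>
      <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://localhost/hive_metastore</value>
      </property>
      <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value><%= @jdbc_username %></value>
      </property>
      <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value><%= @jdbc_password %></value>
      </property>
    <% end -%>
    </configuration>

On its own this only keeps the credentials out of the other hosts' files; as the conversation notes, the rendered file on the metastore still needs to be made non-world-readable, and cdh4::hive::metastore still has to be included on that host.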
[20:20:55] go right ahead [20:20:57] its down for the count [20:21:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:21:24] LeslieCarr: in case you haven't seen it [20:21:27] here is more info: [20:21:27] https://rt.wikimedia.org/Ticket/Display.html?id=5281 [20:21:39] thanks [20:21:46] mutante: i was considering upgrading iron to precise .... [20:21:56] when I looked into this, I didn't realize it would be loading installer to carbon (didn't know we used anything but brewster) [20:22:52] ottomata: what you say (re: cdh4 puppet configs) makes sense; i'm reading the docs a bit. [20:23:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [20:24:55] k aye cool [20:25:32] blog's dead? [20:25:58] odder: looks ok to me [20:26:44] didn't want to load for me for a while, looks ok now [20:26:44] ^demon: gerrit should have a mrconfig output like following imo: https://gist.github.com/azatoth/5809029 [20:27:41] <^demon> Output? It uses that format in refs/meta/config and gerrit.config and so-forth... [20:27:53] <^demon> That's standard gitconfig [20:28:54] LeslieCarr: reinstall or in-place ? [20:29:03] mutante: in place [20:29:16] New patchset: awjrichards; "Commons to commons.m for mobile uploads" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69392 [20:29:22] LeslieCarr: can do, i'm on it [20:29:24] New patchset: MaxSem; "Switch mobile login handshake to commons.m" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69393 [20:29:28] sure [20:29:33] awjr, mid-air collision:P [20:29:55] god an1020 takes forever to reboot [20:29:57] Change abandoned: MaxSem; "LEEEEROY JENKINS!!1" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69393 [20:30:05] lol MaxSem [20:30:09] ^demon: never seen "checkout" as a key in gitconfig [20:30:30] ^demon: or are you referring to the "ini" format [20:30:47] I was referring to a fully functional mrconfig file [20:31:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:31:25] I assume you've used mr ^demon? [20:31:31] <^demon> No, I haven't. [20:31:33] LeslieCarr: uhm, it can't find the new release [20:31:35] ok [20:31:48] Checking for a new ubuntu release [20:31:48] No new release found [20:31:51] ^demon: it's a multi-repository thingi [20:31:57] <^demon> ah [20:32:04] cmjohnson1: awesome. [20:32:09] apt-get install mr [20:32:10] do you mean all of or half of? [20:32:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time [20:32:15] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69392 [20:32:19] I thought that were splitting across racks? [20:32:22] (do not type "mr T") [20:32:28] all of search is relocated [20:32:32] huh, ok [20:32:39] cool! [20:32:40] thanks! [20:32:54] i though that was the plan...besides iirc we should keep search together [20:33:08] notpeter...what is the deal with moving searchidx1001? [20:33:11] hhhhhmmmmm, I thought that mark had said split across racks... [20:33:25] LeslieCarr: i'm doing it the Debian way then... 
[20:33:26] ^demon: mr run reset --hard && mr run clean -fdx && mr update [20:33:30] i don't think mark wanted any search in that 10g rack [20:33:41] ah, yeah, that makes sense [20:33:54] !log upgrading iron to precise [20:34:01] well, I guess we can move searchidx1001 as well [20:34:02] Logged the message, Master [20:34:14] ^demon: had to construct that file myself :( [20:34:15] this will give it a new ip [20:34:26] I thought we were only moving half, so I wanted to make sure that idx wasn't moved [20:34:29] yeah.... [20:34:41] wasn't that hard, but a bit cumbersome [20:34:56] so, dunno if we have to reinstall... [20:35:12] we can probably just reinstall and not wipe /a [20:35:18] ok, think i see what was blocked … trying again :) [20:35:21] I can do that if you'd like [20:37:27] grrrr [20:37:30] wtf [20:39:48] no good [20:39:49] ? [20:40:00] when upgrading a server to precise in place and you run into the "OMG, you are going to remove lzma, type "Do as I say" thing", then it's https://bugs.launchpad.net/ubuntu/+source/update-manager/+bug/944452 and the work-around should be to first upgrade the "dpkg" package itself [20:40:05] ottomata: nope, this time i put a silly udp logging filter on because goddamnit we WILL see what is going on [20:40:20] !log added temp logging filter to cr2-eqiad analytics4 filter [20:40:21] i like your style [20:40:29] Logged the message, Mistress of the network gear. [20:42:39] ottomata: I think the way to go about this is to have separate hive-default.xml & hive-site.xml files [20:43:02] hive automatically looks for / auto-loads both files, with hive-site.xml taking precedence over hive-default.xml [20:43:07] oh, and next you will run into https://bugs.launchpad.net/ubuntu/+source/python-defaults/+bug/990740 [20:44:29] so you should have your current xml template generate /etc/hive/conf/hive-default.xml, which should be the same for all hive hosts, and then separately generate an additional hive-site.xml on the metastore that contains the jdbc connection string with the password [20:45:03] the cdh4 packages don't ship with a hive-default.xml so you don't have to worry about reproducing any built-in defaults [20:45:03] workaround for python-minimal issue = -o APT::Immediate-Configure=false [20:46:04] ottomata: https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-12/running-hive search for the bit that starts with "There is a precedence hierarchy" ... [20:47:19] OH Hrrrmmmmmm [20:47:27] cdh4 doesn't ship with -default*? [20:47:28] uhh [20:47:29] hm [20:47:49] i guess it just ships with hive-default.xml.template [20:47:49] hm [20:47:58] New review: Andrew Bogott; "This is tricky to test in labs due to conflicts, but I've verified that some roles run cleanly." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69337 [20:48:18] oof that file has so many settings in it [20:59:56] RobH, I wasn't aware I was reopening that ticket. When I submitted the reply the status dropdown was left at 'resolved', so I didn't think I was changing anything [21:00:20] New patchset: RobH; "Add user for jforrester and grant access to stat1" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68821 [21:00:40] ori-l is now doing access request tickets like mutante and having the patchsets queued up for me. [21:00:41] i like it.
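To make the two-file suggestion above concrete, here is a minimal Puppet sketch; the class name, template paths, and the 'hive' owner/group are assumptions for illustration, not the real cdh4 module. The shared settings are rendered into hive-default.xml on every node, and only the metastore host gets a hive-site.xml holding the JDBC credentials, restricted to the service user.

    # Illustrative sketch of the hive-default.xml / hive-site.xml split.
    class cdh4::hive::config($metastore_host, $jdbc_password = undef) {
      # Shared settings, identical on every Hive node.
      file { '/etc/hive/conf/hive-default.xml':
        content => template('cdh4/hive/hive-default.xml.erb'),
        mode    => '0444',
      }

      # Only the metastore host gets the credentials file, kept non-world-readable.
      if $::fqdn == $metastore_host {
        file { '/etc/hive/conf/hive-site.xml':
          content => template('cdh4/hive/hive-site-metastore.xml.erb'),
          owner   => 'hive',
          group   => 'hive',
          mode    => '0440',
        }
      }
    }

Because hive-site.xml takes precedence over hive-default.xml, the handful of metastore-only properties can live in the restricted file while everything else stays world-readable, which also covers the earlier requirement of keeping the database password off the client nodes.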
[21:00:54] RobH: :) [21:01:21] that's awesome [21:01:32] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68821 [21:01:51] andrewbogott: \O/ :) [21:02:15] New review: Lcarr; "office@wikimedia is the address that the office admins use - maybe officeit ?" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/68127 [21:02:19] hashar, might be a while before it gets merged [21:04:35] ok, wtf, i finally see the issue [21:04:44] it's trying to send from source port 2072 to destination port 44740 [21:04:45] oh? [21:04:50] wtf why is that in existence [21:07:03] PROBLEM - NTP on iron is CRITICAL: NTP CRITICAL: No response from NTP server [21:07:34] and now 2074 ... [21:07:35] hrm [21:07:40] it may just have freaking random ports [21:09:25] hrm, so if we can make atftpd use source port 69 to respond, it should go back and forth correctly ... [21:09:38] LeslieCarr: that is what tftp does [21:09:47] as well as ftp [21:10:04] just switches from using the specific port to random source/dest ? [21:10:05] grr [21:10:19] yeah... by design [21:10:35] and it is not random [21:10:46] it answers on the ephemeral port of the client [21:10:55] but from an ephemeral port of its own [21:11:13] New patchset: Ottomata; "Puppetizing hive client, server and metastore." [operations/puppet/cdh4] (master) - https://gerrit.wikimedia.org/r/69353 [21:11:26] like c:2071 -> s:69 then s:random -> c:2071 [21:11:28] and then the client is responding from its ephemeral port, to the tftp server's ephemeral port … which is where the problem comes in [21:11:40] :-( [21:11:44] hm [21:12:03] well, we'll just have to widen the firewall hole [21:13:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:13:28] New patchset: Pyoungmeister; "adding a labsdb management tool and asociated class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64862 [21:14:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [21:14:55] man, my shit's hella not pep8 [21:15:00] good to know! [21:15:20] thanks for implementing, andrewbogott! [21:19:16] puppet...kill me. [21:19:34] New patchset: RobH; "RT5330 ve.wikimedia.org redirect to wikimedia.org.ve" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/69404 [21:25:23] RECOVERY - Host analytics1020 is UP: PING OK - Packet loss = 0%, RTA = 0.64 ms [21:26:04] ottomata: ok, keep your fingers crossed this time [21:26:43] toes crossed! [21:26:48] New patchset: Pyoungmeister; "adding a labsdb management tool and asociated class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64862 [21:26:54] about to scap... [21:27:22] Change merged: RobH; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/69404 [21:27:25] PROBLEM - Host analytics1020 is DOWN: PING CRITICAL - Packet loss = 100% [21:29:01] ottomata: Loading ubuntu-installer/amd64/linux....... [21:29:02] :) [21:29:20] wee:) [21:30:39] WOooo [21:31:10] what did you end up doing? IDing atftp traffic some other way?
[21:32:33] RECOVERY - Host analytics1020 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [21:34:43] PROBLEM - DPKG on analytics1020 is CRITICAL: Connection refused by host [21:35:14] PROBLEM - RAID on analytics1020 is CRITICAL: Connection refused by host [21:35:14] PROBLEM - Disk space on analytics1020 is CRITICAL: Connection refused by host [21:35:15] thanks Leslie, this was RT-5281, i cant find the other ticket anymore [21:35:23] PROBLEM - SSH on analytics1020 is CRITICAL: Connection refused [21:36:08] !log maxsem Started syncing Wikimedia installation... : Weekly mobile deployment [21:36:17] Logged the message, Master [21:36:24] ottomata: allowing destination ports of ephemeral ports with udp to brewster and carbon [21:36:38] grrr, it wants to be manually partitioned [21:37:01] hm, it shouldn't, not for / [21:37:22] it's a hater [21:37:48] LeslieCarr: if you like, leave it there for me! If the installer runs then I can take it up from here [21:37:57] ottomata: it's all yours [21:38:00] k [21:38:00] Coren|Away: redacted s7 is now being copied to labsdb1003 [21:38:03] thanks so soso much! [21:38:04] yay! [21:38:08] !log apache-graceful-all for redirects [21:38:14] i won't get to it today, but i'll pick it up tomorrow for sure [21:38:15] * Coren|Away dances [21:38:17] Logged the message, Master [21:38:19] Thanks binasher [21:42:52] New review: Jdlrobson; "(1 comment)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68309 [21:43:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:44:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.156 second response time [21:46:42] PROBLEM - DPKG on sockpuppet is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:46:51] PROBLEM - NTP on analytics1020 is CRITICAL: NTP CRITICAL: No response from NTP server [21:47:19] New review: MaxSem; "@Andrew Bogott: This cronjob is supposed to be run daily, which is seldom enough to allow local test..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68309 [21:47:41] RECOVERY - DPKG on sockpuppet is OK: All packages OK [21:49:16] !log maxsem Finished syncing Wikimedia installation... : Weekly mobile deployment [21:49:25] Logged the message, Master [21:49:34] MaxSem: redirects for commons/wikimania2013 work now [21:49:53] mutante, awesome, thanks [21:49:59] MaxSem, do you want to add those other two servers to https://gerrit.wikimedia.org/r/#/c/68309/ or shall I merge it as is? [21:51:27] andrewbogott, I wanted hashar to take a look at it first - he had some ideas about it [21:51:34] 'k [21:54:21] !log maxsem synchronized php-1.22wmf6/extensions/MobileFrontend/ [21:54:30] Logged the message, Master [21:56:21] LeslieCarr: any idea if iron really needs "multiverse" for some reason, i'm gonna disable it i think [21:56:36] no idea, i would just keep it to the normal defaults [21:56:43] we don't need to specialize it [21:57:20] nod. ok [21:57:47] <^demon> manybubbles: I'm stepping out to dinner. Once I'm back, gonna try to and finish up my update overhaul. It depends on that core change that just went in, fyi. [21:58:22] I figured. have a nice dinner.
I'm looking at incategory: [22:02:20] MaxSem: yeah I like the idea of syncing the CSS , we could use that for gadgets too [22:02:33] :) [22:02:37] MaxSem: but that would mean having to rely on ops to tweak the list of articles [22:03:13] * MaxSem mumbles something about a local unpuppetized config and runs away:P [22:03:21] !log maxsem synchronized php-1.22wmf6/extensions/MobileFrontend/ [22:03:29] Logged the message, Master [22:03:31] PROBLEM - mysqld processes on labsdb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:03:31] PROBLEM - MySQL Recent Restart Port 3306 on labsdb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:03:31] PROBLEM - MySQL Idle Transactions Port 3306 on labsdb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:03:56] !log rebooting iron to finish upgrade [22:04:04] Logged the message, Master [22:04:24] MaxSem: so yeah lets go with your shell script :) [22:04:29] MaxSem: could migrate to jenkins later on [22:05:01] PROBLEM - Host iron is DOWN: PING CRITICAL - Packet loss = 100% [22:05:02] hashar, is jenkins appropriate for basically a cronjob-type stuff? [22:05:08] yuo [22:05:20] Jenkins is basically a huge scheduler [22:05:33] LeslieCarr: Linux iron 3.2.0-48-generic [22:05:33] RECOVERY - MySQL Recent Restart Port 3306 on labsdb1001 is OK: OK seconds since restart [22:05:33] RECOVERY - mysqld processes on labsdb1001 is OK: PROCS OK: 1 process with command name mysqld [22:05:33] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: No route to host [22:05:34] we could replace all of our cron jobs with it :-) [22:05:39] woo [22:05:43] RECOVERY - MySQL Idle Transactions Port 3306 on labsdb1001 is OK: OK longest blocking idle transaction sleeps for seconds [22:05:44] PROBLEM - LVS Lucene on search-prefix.svc.eqiad.wmnet is CRITICAL: No route to host [22:05:53] yes icinga-wm we know [22:06:03] PROBLEM - LVS Lucene on search-pool3.svc.eqiad.wmnet is CRITICAL: No route to host [22:06:06] RECOVERY - Host iron is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [22:06:13] PROBLEM - LVS Lucene on search-pool5.svc.eqiad.wmnet is CRITICAL: No route to host [22:06:45] so, why do I keep getting pages for search [22:06:48] is anyone looking at it? [22:07:05] hardware work on eqiad, it's not the current cluster [22:07:09] see the log [22:07:24] why log and not scheduled downtime in nagios? [22:07:35] that's a q for those doing the work [22:07:46] PROBLEM - LVS Lucene on search-pool2.svc.eqiad.wmnet is CRITICAL: No route to host [22:08:48] New review: Hashar; "I have filled a bug to have it migrated to Jenkins eventually: https://bugzilla.wikimedia.org/show_..." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/68309 [22:08:54] PROBLEM - LVS Lucene on search-pool1.svc.eqiad.wmnet is CRITICAL: No route to host [22:09:11] andrewbogott: maxsem sync script at https://gerrit.wikimedia.org/r/#/c/68309/ is good to me :) [22:09:49] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68309 [22:10:41] New patchset: Dr0ptp4kt; "Instruct robots to not index Wikipedia Zero. No deploy before 25-June-2013." 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69420 [22:10:43] MaxSem: misc::beta::sync-site-resources is in :) [22:10:59] New patchset: Ori.livneh; "Provide a sensible resource name for Daniel Kinzler's SSH key" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69421 [22:11:13] MaxSem: I guess the class needs to be applied on the beta bastion isn't it ? [22:11:28] yep [22:11:35] New review: Dr0ptp4kt; "This supercedes change 64629, which was reverted." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69420 [22:11:56] MaxSem: will do [22:11:57] good night, need to get up early [22:12:11] MaxSem: rest well! [22:12:29] night MaxSem [22:13:38] New review: Yurik; "if you don't want something deployed, pls mark it with -2" [operations/mediawiki-config] (master) C: -2; - https://gerrit.wikimedia.org/r/69420 [22:13:47] New review: Dr0ptp4kt; "Must wait until 26-June-2013. On that day the Google index needs to be validated as refreshed." [operations/mediawiki-config] (master) C: -2; - https://gerrit.wikimedia.org/r/69420 [22:14:58] New patchset: Asher; "reduce bufferpool size on labsdb s1 slave, set max_user_connections = 10 for all labs instances" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69422 [22:15:05] !log scheduled 24 hour downtime for eqiad search-pools to stop paging [22:15:14] Logged the message, Master [22:15:48] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69422 [22:16:11] heya paravoid, if you are still up and working: [22:16:12] https://gerrit.wikimedia.org/r/#/c/50385/ [22:16:27] i'm out for the eve here, but I'd love to be able to pick that up and work on any further comments in the morning here [22:16:31] dankeees, later! [22:16:44] ottomata: did you see my note earlier about the default/site xml configs? [22:17:02] just making sure they weren't lost [22:20:00] New patchset: Pyoungmeister; "adding a labsdb management tool and asociated class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64862 [22:20:39] New patchset: Ori.livneh; "Re-enable EventLogging on labs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69423 [22:21:08] greg-g: ^ could I sync that during LD? [22:21:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:22:01] New review: Pyoungmeister; "no pep8 for now. sorry!" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/64862 [22:22:07] ori-l: i saw, not sure if i want to do that or not yet [22:22:07] ori-l: on labs? go right ahead [22:22:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [22:22:17] * greg-g doesn't care about labs [22:22:18] New review: Faidon; "I don't think anything new that stands out, just these as a followup:" [operations/puppet/kafka] (master) C: -1; - https://gerrit.wikimedia.org/r/50385 [22:22:18] :P [22:22:25] ottomata: np, what you were suggesting seemed fine too [22:22:51] greg-g: now, you mean? if it's fair game to sync labs stuff off-schedule, i wouldn't mind [22:22:53] ottomata: done [22:23:08] ori-l: yeah, nothing depends on it other than testing stuff, right? 
[22:23:11] yep [22:23:19] go forth the, good sir [22:23:23] s/the/then/ [22:23:51] thanks [22:24:29] New patchset: Jgreen; "add public key for root@barium to backupmover@various_fundraising_hosts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69424 [22:25:06] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69424 [22:25:28] New patchset: Pyoungmeister; "adding a labsdb management tool and asociated class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64862 [22:26:15] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69423 [22:26:19] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69149 [22:27:48] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64862 [22:29:09] !log olivneh synchronized wmf-config/CommonSettings.php 'I7fc1a5ad4: Ensure forward-compatibility by setting ' [22:29:17] Logged the message, Master [22:29:26] New patchset: Pyoungmeister; "adding skrillex db management tool to tin" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69425 [22:30:01] !log olivneh synchronized wmf-config/InitialiseSettings-labs.php 'Ibeb07a4b9: Re-enable EventLogging on beta cluster' [22:30:09] Logged the message, Master [22:30:09] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69425 [22:32:16] New patchset: Pyoungmeister; "template needs correct name" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69427 [22:32:55] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69427 [22:34:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:39:50] dr0ptp4kt: good catch indeed [22:40:10] binasher should tell you a story about a prior incident with such a robots.txt going live on enwiki :) [22:40:20] paravoid, thx, ha. 
[22:40:33] RECOVERY - Host wtp1008 is UP: PING OK - Packet loss = 0%, RTA = 0.41 ms [22:40:43] PROBLEM - RAID on wtp1008 is CRITICAL: Connection refused by host [22:40:53] PROBLEM - Disk space on wtp1008 is CRITICAL: Connection refused by host [22:40:53] PROBLEM - Parsoid on wtp1008 is CRITICAL: Connection refused [22:40:53] PROBLEM - NTP on wtp1008 is CRITICAL: NTP CRITICAL: No response from NTP server [22:41:03] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.119 second response time [22:41:53] PROBLEM - NTP on ssl3002 is CRITICAL: NTP CRITICAL: No response from NTP server [22:43:33] PROBLEM - DPKG on wtp1008 is CRITICAL: Connection refused by host [22:43:33] PROBLEM - SSH on wtp1008 is CRITICAL: Connection refused [22:48:04] New patchset: Ori.livneh; "Configure $wgEventLoggingBaseUri & $wgEventLoggingFile on beta cluster" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69432 [22:49:39] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69432 [22:51:28] !log olivneh synchronized wmf-config/CommonSettings-labs.php 'Iaade2591b: Configure $wgEventLoggingBaseUri & $wgEventLoggingFile on beta cluster' [22:51:37] Logged the message, Master [22:54:03] PROBLEM - NTP on ssl3003 is CRITICAL: NTP CRITICAL: No response from NTP server [22:54:33] RECOVERY - SSH on wtp1008 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [22:57:23] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours [22:59:44] PROBLEM - Disk space on db9 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=91%): [23:00:24] PROBLEM - MySQL disk space on db9 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=91%): [23:00:25] ouch, freeing disk on db9 [23:01:29] RECOVERY - MySQL disk space on db9 is OK: DISK OK [23:01:43] RECOVERY - Disk space on db9 is OK: DISK OK [23:04:43] RECOVERY - mysqld processes on labsdb1003 is OK: PROCS OK: 3 processes with command name mysqld [23:05:35] RECOVERY - DPKG on wtp1008 is OK: All packages OK [23:05:44] RECOVERY - RAID on wtp1008 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [23:05:53] RECOVERY - Disk space on wtp1008 is OK: DISK OK [23:21:51] RECOVERY - NTP on wtp1008 is OK: NTP OK: Offset -0.003150463104 secs [23:30:07] heh, i should have done rollback 3 instead of rollback 10 …. waiting for access to return.... [23:30:17] (to a new oob connection … nothing in production) [23:39:04] New patchset: Asher; "make skrillex.py executable by wikidev group" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69436 [23:39:28] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69436 [23:51:43] damn, downloader operations/dumps/test a whopping 122MB [23:51:47] New patchset: Asher; "new prod redis servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69438 [23:51:48] downloaded* [23:52:26] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69438 [23:54:28] ^demon: Could you look at Gerrit? It's slow again [23:55:46] <^demon> god dammit. [23:56:06] <^demon> There is absolutely nothing in the queue. [23:56:35] <^demon> Nothing exceptional in the log. [23:56:52] <^demon> Fast for me. [23:56:53] <^demon> ##check_your_connection