[00:02:19] RoanKattouw: Reedy needs your help. :-) [00:02:28] OK [00:02:35] You have until :15-ish [00:02:40] RoanKattouw: so, we added 2 new wikis, login and iegcom [00:02:59] Apache knows about them, sees them as virtual hosts [00:03:07] DNS has them , point to lb [00:03:15] yet, we get "No wiki found" [00:03:37] Which lb [00:03:44] https://iegcom.wikimedia.org/ https://login.wikimedia.org/ [00:03:44] wikimedia-lb [00:03:47] New patchset: Dr0ptp4kt; "Add extra IP block to Varnish for Wikipedia Zero partner TATA." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62763 [00:03:49] OK [00:03:55] wikimedia.org:iegcom 1H IN CNAME wikimedia-lb [00:04:00] wikimedia.org:login 1H IN CNAME wikimedia-lb [00:04:14] port * namevhost iegcom.wikimedia.org (/etc/apache2/wmf/remnant.conf:1715) [00:04:21] port * namevhost login.wikimedia.org (/etc/apache2/wmf/wikimedia.conf:257) [00:04:32] Oh, wait [00:04:39] Reedy: What are the db names? [00:04:48] iegcomwiki and .. [00:04:53] loginwiki [00:05:04] OK [00:05:08] Let me check something real quick [00:05:15] mediawiki is fine with them [00:05:25] Ryan_Lane: Could you enable $wgVectorUseIconWatch = true; on wikitech? (and btw, where is its configuration?) [00:05:34] * RoanKattouw clones mediawiki-config onto his loaner machine [00:05:42] actually, i should probably move iegcom to wikimedia.conf , but unrelated [00:05:45] configuration is only on virt0 [00:05:48] I need to puppetize it [00:05:56] I think andrewbogott may be working on that, actually [00:06:32] done [00:06:39] what does that setting do? [00:06:42] Reedy: adding to s3.dblist and special.dblist, that was already done by ./refresh-dblist [00:06:53] ah [00:06:54] I see [00:07:21] Ryan_Lane: lol, you ask me now? 
[00:07:30] ) [00:07:33] err [00:07:33] ;) [00:07:39] mutante: Need committing ;) [00:07:47] Ryan_Lane: It turns the "Watch" item in the action menu into a primary item on the toolbar with the star icon [00:07:51] yeah [00:08:02] I looked it up right after I asked [00:08:04] Ryan_Lane: It's a nice change to have. :-) [00:08:09] indeed [00:08:34] PROBLEM - DPKG on analytics1008 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [00:08:40] that is me [00:08:55] Reedy: slowness when git pulling [00:08:58] trying to fix these machines up ! [00:09:01] done [00:09:14] PROBLEM - DPKG on analytics1025 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [00:09:14] PROBLEM - DPKG on analytics1014 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [00:09:34] PROBLEM - DPKG on analytics1026 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [00:09:34] RECOVERY - DPKG on analytics1008 is OK: All packages OK [00:09:34] PROBLEM - DPKG on analytics1016 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [00:09:44] PROBLEM - DPKG on analytics1022 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [00:10:14] PROBLEM - DPKG on analytics1015 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [00:10:14] RECOVERY - DPKG on analytics1025 is OK: All packages OK [00:10:14] PROBLEM - DPKG on analytics1012 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [00:10:34] RECOVERY - DPKG on analytics1026 is OK: All packages OK [00:10:44] RECOVERY - DPKG on analytics1022 is OK: All packages OK [00:10:56] Oh OK [00:11:05] Multiversion does some weird stuff based on the docroot [00:11:07] Checking Apache config [00:11:29] i'm moving something from remnants.conf to wikimedia.conf but not making changes to the actual config [00:12:00] very weird stuff [00:12:14] RECOVERY - DPKG on analytics1015 is OK: All packages OK [00:12:14] RECOVERY - DPKG on analytics1014 is OK: All packages OK [00:12:34] RECOVERY - DPKG on analytics1016 is OK: All packages OK [00:12:43] New 
patchset: Dzahn; "move newer private wikis (iegcom, transitionteam) out of remants.conf use wikimedia.conf instead" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/62764 [00:13:14] RECOVERY - DPKG on analytics1012 is OK: All packages OK [00:13:46] New review: Dzahn; "no changes to the actual config, just move to other file" [operations/apache-config] (master) C: 2; - https://gerrit.wikimedia.org/r/62764 [00:13:46] Change merged: Dzahn; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/62764 [00:14:12] Hmm this totally looks like it should work [00:14:18] The docroot looks right [00:14:38] mutante: I do note, this is exactly what we've both done for other wikis in the not so distant past and it worked fine ;) [00:14:49] acknowledged, yea [00:15:19] OK I gotta run [00:15:21] lemme graceful one more time after having merged that [00:15:27] Sorry I couldn't figure it out [00:15:29] * RoanKattouw disappears to midterm [00:15:38] RoanKattouw: thanks for confirming it's weird , cya, heh [00:16:08] RoanKattouw: Have fun. [00:16:54] And at the same time, othe wikis haven't broken.. [00:17:21] yea, did not break existing wikis [00:17:27] apaches dont complain about it [00:17:54] and .. affects both new wikis created by separate people ..hrrm [00:18:29] * James_F blames cosmic rays. [00:20:04] Maybe we should add index.html with "It works!" [00:20:30] Reedy: can also confirm them both on http://noc.wikimedia.org/dbtree/ on s3 [00:20:58] Mediawiki seems ok run via eval.php [00:21:04] !log upgrading a bunch of tampa squids [00:22:05] Reedy: --> http://login.wikimedia.org/Foo and wait a couple seconds [00:22:12] gets us to wikimediafoundation.org [00:23:01] iegcom is hitting missing.php [00:25:03] Reedy: so Apache wikimedia.conf, line 170 ServerAlias *.wikimedia.org that is before the others [00:25:24] hits the wildcard first? 
[00:25:30] mutante: The missing.php seems to suggest it's not correctly in wikiversions.cdb [00:25:44] PROBLEM - DPKG on ms-be1 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [00:26:05] Reedy: hrmm, i did run sync-wikiversions though [00:26:14] PROBLEM - DPKG on mc12 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [00:26:33] reedy@fenari:/home/wikipedia/common$ grep iegcom wikiversions.* [00:26:33] Binary file wikiversions.cdb matches [00:26:33] wikiversions.dat:iegcomwiki php-1.22wmf3 * [00:26:34] PROBLEM - DPKG on mc1013 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [00:26:34] PROBLEM - DPKG on mc1011 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [00:26:44] PROBLEM - DPKG on ms-fe4 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [00:26:44] PROBLEM - DPKG on ms-be4 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [00:27:14] RECOVERY - DPKG on mc12 is OK: All packages OK [00:27:21] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: [00:27:34] RECOVERY - DPKG on mc1013 is OK: All packages OK [00:27:34] RECOVERY - DPKG on mc1011 is OK: All packages OK [00:27:44] RECOVERY - DPKG on ms-fe4 is OK: All packages OK [00:27:59] !log reedy synchronized wmf-config/ [00:28:09] mutante: Aha. Cached things. Touching helped [00:28:18] It's now at the same screen as loginwiki [00:31:40] https://login.wikimedia.org/ [00:31:50] ^ Anyone know what landing screen that is? [00:31:56] wikimedia.org Reedy [00:32:12] controlled via Meta [00:32:15] Right, thanks [00:32:23] mutante: Which suggests it is hitting the wildcard one first [00:32:34] mutante: Does it do something weird like order of virtualhost definitions? 
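The missing.php symptom discussed here comes from MediaWiki multiversion resolving the request's dbname against the compiled wikiversions map. A rough sketch of that lookup — simplified Python in place of the real PHP/CDB machinery, with the entries taken from the grep output above:

```python
def parse_wikiversions(dat_text):
    """Build a dbname -> MediaWiki-version lookup from wikiversions.dat-style
    lines ("dbname version ...").  In production the .dat is compiled into
    wikiversions.cdb by rebuild/sync scripts; a dbname absent from the
    *compiled* file is what sends a request to missing.php, even when the
    .dat already lists it (hence the stale-cache confusion in the log)."""
    versions = {}
    for line in dat_text.splitlines():
        parts = line.split()
        if len(parts) >= 2:
            versions[parts[0]] = parts[1]
    return versions


dat = "iegcomwiki php-1.22wmf3 *\nloginwiki php-1.22wmf3 *\n"
versions = parse_wikiversions(dat)
print(versions["iegcomwiki"])                    # php-1.22wmf3
print(versions.get("notawiki", "missing.php"))   # missing.php
```

This is illustrative only; the real resolution also depends on the docroot, as noted above ("Multiversion does some weird stuff based on the docroot").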
[00:32:44] RECOVERY - DPKG on ms-be1 is OK: All packages OK [00:32:45] i just tested that on mw1044 [00:32:48] (which you said above) [00:32:49] putting them first [00:32:56] but i don't see a difference yet [00:33:09] http://iegcom.wikimedia.org/wiki/ * 301 Moved Permanently https://iegcom.wikimedia.org/wiki/ [00:33:14] http://iegcom.wikimedia.org/wiki/Foo * 301 Moved Permanently https://iegcom.wikimedia.org/wiki/Foo [00:33:19] http://login.wikimedia.org/wiki/ * 301 Moved Permanently http://wikimediafoundation.org/wiki/ [00:33:23] https://login.wikimedia.org * 200 OK 6251 [00:33:44] RECOVERY - DPKG on ms-be4 is OK: All packages OK [00:33:46] mutante: [00:33:47] Name-based virtual hosts for the best-matching set of <VirtualHost>s are processed in the order they appear in the configuration. The first matching ServerName or ServerAlias is used, with no different precedence for wildcards (nor for ServerName vs. ServerAlias). [00:33:56] So order does matter, yes [00:33:58] yes, it does [00:34:05] just had the same thing when moving secure [00:34:14] heh [00:34:18] but..it doesnt appear to fix it anyways.. wth [00:34:50] Wouldn't hurt to move them and push them out as is [00:35:07] yea, doing [00:36:43] New review: Reedy; "(1 comment)" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/62566 [00:37:02] ^ Same problem would happen with my wikimania docroot moves. Comment left so I remember to "fix" it [00:38:28] that would mean transitionteam would have broken by moving it [00:38:56] and it did! 
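The ordering rule quoted from the Apache documentation can be illustrated with a small simulation — Python with fnmatch-style wildcards standing in for Apache's ServerAlias matching, and hostnames taken from the log; this is a simplification, not Apache's actual matcher:

```python
from fnmatch import fnmatch


def pick_vhost(host, vhosts):
    """First-match vhost selection, mirroring the rule quoted in the log:
    vhosts are scanned in configuration order, and the first ServerName
    or ServerAlias that matches wins; a wildcard alias gets no lower
    precedence than an exact name."""
    for name, aliases in vhosts:
        for pattern in (name, *aliases):
            if fnmatch(host, pattern):
                return name
    return None


# Before the fix: the wikimedia.org vhost, with its *.wikimedia.org
# ServerAlias (wikimedia.conf line 170), appeared before the new wikis,
# so it swallowed their requests.
before = [
    ("wikimedia.org", ["*.wikimedia.org"]),
    ("login.wikimedia.org", []),
]
# After the fix (gerrit change 62767): the private wikis sit above the
# wildcard.
after = [
    ("login.wikimedia.org", []),
    ("wikimedia.org", ["*.wikimedia.org"]),
]
print(pick_vhost("login.wikimedia.org", before))  # wikimedia.org
print(pick_vhost("login.wikimedia.org", after))   # login.wikimedia.org
```

This explains both symptoms at once: why login.wikimedia.org fell through to the generic wikimedia.org redirect, and why moving the three private wikis before the wildcard ServerAlias was the fix.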
[00:39:02] :D [00:40:56] New patchset: Dzahn; "move 3 private wikis before wildcard ServerName" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/62767 [00:41:34] PROBLEM - Puppet freshness on db44 is CRITICAL: No successful Puppet run in the last 10 hours [00:41:36] New review: Dzahn; "this should work better :p" [operations/apache-config] (master) C: 2; - https://gerrit.wikimedia.org/r/62767 [00:41:36] Change merged: Dzahn; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/62767 [00:43:33] i hate it when my ssh agent dies on graceful [00:43:55] Yeah,not helpful [00:45:41] so, transitionteam is back [00:46:04] login redirects to foundation Main_Page [00:46:13] mutante: Wohooo [00:46:17] and iegcom is not found [00:46:18] I just purged login [00:46:20] and logged in :) [00:46:25] and i was about to purge..heh [00:46:25] nice [00:46:43] Hmm, iegcom is back to not found :/ [00:48:25] mutante: To answer our previous discussion. It was something stupidly simple [00:49:01] which one?:) but yea, it usually is [00:50:27] !log reedy synchronized wmf-config/InitialiseSettings.php 'loginwiki logo' [00:51:45] !log reedy synchronized wmf-config/InitialiseSettings.php 'favico' [00:51:58] New patchset: Reedy; "Set favico and logo for loginwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62769 [00:52:36] hmm what's login wiki going to be used for? [00:53:12] Thehelpfulone: oauth/openid stuffs [00:53:16] Thehelpfulone: Logins. :-) [00:53:51] Thehelpfulone: https://login.wikimedia.org/wiki/Special:ListUsers :-) [00:53:54] oh it's a SUL wiki [00:54:05] Thehelpfulone: Yeah. [00:54:13] I didn't see a create account link so was wondering :) [00:54:35] https://login.wikimedia.org/wiki/Special:ListGroupRights is amusingly less populated [00:54:59] We should probably drop a few of those groups too [00:54:59] Thehelpfulone: No local accounts, only global. [00:55:37] ok [00:55:49] Reedy: Yeah, drop importers at least. 
[00:55:53] I guess people should be able to create accounts from there. But wouldn't be the common starting point for most [00:55:54] And bots [00:56:02] Hmm, no.. [00:56:04] Reedy: And edituserJS should be false for sysops. [00:56:07] Bots might be used for notifications [00:56:13] yeah, bots should stay. [00:56:14] oh you can't even edit James_F? [00:56:24] At least until OAuth is working well. ;-) [00:56:50] Thehelpfulone: Yeah, this is a very locked-down wiki; essentially no user-editable parts, at least for now. [00:57:28] https://login.wikimedia.org/w/index.php?title=Main_Page&diff=2&oldid=1 [00:57:33] Hmm, why can I edit the main page? [00:57:37] show off :p [00:57:43] Reedy, global sysadmin/staff? [00:57:44] Reedy: It's unprotected. [00:57:45] Global groups? [00:57:46] Yeah [00:58:14] Reedy: Now protected. [00:58:20] My vandalism!! [00:58:53] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62769 [00:59:19] Reedy: The Sister Projects section forgets Commons and Wikidata. [00:59:53] https://gerrit.wikimedia.org/r/gitweb?p=mediawiki/extensions/WikimediaMaintenance.git;a=blob;f=addWiki.php;h=dab7d255f68d68bb3ab7091891dc905cba522f78;hb=HEAD [01:00:18] Easily added, not sure in what order they should go [01:00:50] * James_F nods. [01:01:06] For stuff like this, it'd be really nice if gerrit had a web interface for creating changesets [01:01:13] I added Wikidata then Commons, which probably suffices. [01:01:15] Yeah, I'd kill for that. [01:01:24] I'm mostly make language and setting changes. [01:01:48] Exactly [01:01:54] Or fixing small typos in larger commits [01:02:07] rather than shell, login to server... [01:02:28] Indeed. [01:02:35] You can fix commit summaries. Why not commits? [01:02:42] Indeed [01:02:43] * James_F grumbles with Reedy. [01:02:50] magical button "fix typos" [01:02:50] That's a newer feature though [01:03:04] Sure. Maybe we should throw the ideas into the gerrit upstream tracker? 
[01:03:17] Where's ^Demon to tell me where that is when I need him? ;-) [01:03:54] http://code.google.com/p/gerrit/issues/list [01:04:35] so what else besides dblists can cause "No wiki found" [01:04:56] it's wikiversions not dblists [01:05:06] I'm slightly confused why that regressed when you fixed the apache rules [01:05:08] eh, yea, both, both have it [01:05:21] wikversions.cdb and .dat and the dblists [01:05:34] dont see a diff to loginwiki [01:05:51] mutante: Thanks! [01:07:40] * Jasper_Deng draws attention to http://en.wikipedia.org/wiki/Wikipedia:Village_pump_%28proposals%29#Developer.27s_Noticeboard .... doubting anyone on this channel would follow that noticeboard [01:09:01] what are they asking for? another mailing list? IRC channel number 5? [01:09:19] mutante: A wiki page for us to all follow. [01:09:21] SAL transclusion ? [01:09:28] !log reedy synchronized wmf-config/CommonSettings.php [01:10:05] James_F: sounds like they want SAL [01:10:18] plus gerrit pending changes [01:11:25] !log reedy synchronized wmf-config/CommonSettings.php [01:12:22] doesn't follow the "put you are putting it where there is high-traffic, that's why we don't see it" [01:13:05] New patchset: Reedy; "More misc config for loginwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62771 [01:13:35] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62771 [01:14:01] " You know, from bitter experience, that the enwp community isn't going to follow you to mediawiki.org and ferret out what changes you're planning on making to code that affects enwp." [01:14:11] !log reedy synchronized wmf-config/ [01:14:18] <-- doesn't work, that would mean every single language wiki has to be told separately [01:14:31] Echo! [01:14:43] We should leave every user on every wiki a talk page message [01:14:56] mutante: Indeed. 
[01:14:57] ah, yes, can you turn SAL into that:) [01:15:07] Reedy: Well, actually, that is going to be a part of Notifications. [01:15:11] Reedy: But it's not built yet [01:15:30] Reedy: Partially because it's only deployed to one wiki, so it's no better than now. [01:16:41] Reedy: loginwiki sitename is Wikipeida [01:16:48] " a place for Wikimedia developer announcements, not general tech discussion" <- as soon as a developer makes an announcement it will likely turn into tech discussion.. until people say to do that elsewhere [01:17:12] mutante: We could hard-protect the page in the DB? [01:17:13] What should it be called? Loginwiki? Login Wiki? Login? [01:17:19] mutante: Page only editable via SQL? [01:17:25] Reedy: Wikimedia Login Wiki. [01:17:35] mutante: Special:HearMeMortals? [01:17:38] What is this used for exactly? [01:17:42] Nothing yet [01:17:44] Reedy: James_F: Wikimedia Login? [01:17:53] i don't see how en.wp expects to be treated other than non-en wikis though [01:17:53] Krinkle: Loginwiki? It's the new replacement login system for CA. [01:17:57] Reedy: I gathered that much [01:18:05] Need to name the Project/Wikipedia namespace something sensible too [01:18:21] i'd recommend Project as Project namespace [01:18:21] Reedy: "Login" is fine. [01:18:28] like on mediawiki.org [01:18:28] And yeah, Project works fine. [01:18:32] It's not going to have much. [01:20:27] !log reedy synchronized wmf-config/InitialiseSettings.php [01:20:35] "We, sitting here on enwp, would really like to know about upcoming deployments. " <-- i can't help it, i'd have to say "soo.. subscribe to lists, join IRC, read SAL, have one of the bots talk to you, use Twitter or Identi.ca or whatnot.. ..." [01:21:05] New patchset: Reedy; "Set sitename and meta namespace for loginwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62772 [01:21:11] or open up the deployment calendar? 
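The "loginwiki sitename is Wikipeida" problem above is fixed through a per-wiki configuration override. A minimal sketch of how a "default plus per-dbname override" lookup behaves — Python standing in for the real PHP InitialiseSettings.php, which also has tag/dblist-based inheritance that this ignores:

```python
def get_setting(settings, name, dbname):
    """Resolve one setting for one wiki: a wiki-specific entry wins,
    otherwise the 'default' entry applies.  A new wiki with no override
    therefore inherits defaults like the "Wikipedia" sitename until a
    config change (here, gerrit change 62772) adds one."""
    per_wiki = settings[name]
    return per_wiki.get(dbname, per_wiki["default"])


# Hypothetical settings map; the key names mimic MediaWiki globals.
settings = {
    "wgSitename": {
        "default": "Wikipedia",
        "loginwiki": "Wikimedia Login Wiki",
    },
}
print(get_setting(settings, "wgSitename", "loginwiki"))  # Wikimedia Login Wiki
print(get_setting(settings, "wgSitename", "enwiki"))     # Wikipedia
```

The same mechanism covers the other per-wiki tweaks synced in this session (logo, favicon, meta namespace).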
[01:21:18] mutante: Right, and for anything we need to push (instead of having them poll) we use wikitech-ambasadors [01:21:23] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62772 [01:21:27] Another channel is certainly not the solution [01:23:22] wikitech-ambasadors: major change announcements [01:23:33] mutante: don't forget technology category on blog.wikimedia.org [01:23:44] Krinkle: heh, and the monthly reports [01:23:48] those are both on a high level, ideal for the community I think [01:23:52] we have too many channels [01:24:03] for more details, SAL, IRC, wikitech-l, .. [01:24:11] Definitely. [01:24:40] then people are going to say they don't understand the technical discussion on wikitech [01:24:51] after they said they want to know more about the technical side of things :p [01:25:04] been there [01:25:36] If wikitech is too detailed, then that means you shouldn't subscribe to it and instead using something on higher level (such as techblog.wikimedia.org) [01:25:45] Yeah [01:25:47] :) [01:25:51] Same same, but differenc [01:25:51] simple.sal.wikimedia.org :) [01:25:59] !log fixed [01:26:48] btw, no logbot:) [01:27:27] hrmm, if we could just see iegcom now that would be nice [01:27:44] Thehelpfulone: it's like practically done, just that it's not found:) [01:27:48] but it's there [01:29:21] Reedy: need swift containers for login? i guess not [01:29:35] need search index for login? [01:30:15] mutante: NOOOOOO not multiple subdomains!!!! [01:30:44] Reedy: haha, i wasn't that serious about simple.sal [01:31:01] I think search indexing is needed [01:31:13] If we're going to have some documentation there, we might aswell [01:31:30] can do tomorrow? it takes soo long :p [01:31:50] Reedy: What would be documented there that doesn't belong on metawiki, mw.org or wikitech? [01:31:52] and last time i added one.. oh wait.. 
did that ever work [01:32:21] I think loginwiki would ideally only have special pages, and no content :P [01:32:21] I've no idea [01:33:04] it did not.. wikimania2014 wiki search .. The search backend returned an error: [01:33:14] so i wouldn't know how to fix that :( [01:33:30] I was only really asked to create the wiki at this point ;) [01:33:58] kk [01:34:47] That was to Krinkle :p [01:34:55] http://iegcom.wikimedia.org * 301 Moved Permanently https://iegcom.wikimedia.org/ [01:34:58] https://iegcom.wikimedia.org * 404 Not Found [01:36:09] so that tells me the Apache part is ok [01:36:10] because: [01:36:18] RewriteCond %{HTTP:X-Forwarded-Proto} !https [01:36:29] RewriteRule ^/(.*)$ https://iegcom.wikimedia.org/$1 [R=301,L] [01:36:39] Yup, that's wanted behavior [01:37:36] I wonder what made it go back to 404ing [01:39:42] https://login.wikimedia.org/wiki/Special:Version?printable=yes [01:39:47] Special pages should have printable links [01:40:52] Special:Random?printable=yes works:) [01:41:32] Exactly [01:41:49] Not going to be the most common thing to do, but if you wanted to print it, you really dont want the border stuff [01:43:15] !date-au [01:43:44] Reedy: 2:43 AM .. 
you wanted earlier than 4:30 [01:43:54] i'll have to look at iegcom again tomorrow [01:44:00] Yup, I'm in the process of going to bed ;) [01:44:06] Guinea pigs need bringing in and feeding next [01:44:10] i don't see it and too tired too keep staring at it [01:44:19] guinea pigs :) [01:45:19] alright, cya later [02:11:52] quit [02:13:34] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours [02:15:00] !log LocalisationUpdate completed (1.22wmf3) at Wed May 8 02:15:00 UTC 2013 [02:17:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:18:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.132 second response time [02:24:54] !log LocalisationUpdate completed (1.22wmf2) at Wed May 8 02:24:54 UTC 2013 [02:27:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:29:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [03:03:14] PROBLEM - Host mw72 is DOWN: PING CRITICAL - Packet loss = 100% [03:03:44] RECOVERY - Host mw72 is UP: PING OK - Packet loss = 0%, RTA = 26.58 ms [03:04:34] PROBLEM - Puppet freshness on neon is CRITICAL: No successful Puppet run in the last 10 hours [03:06:44] PROBLEM - Apache HTTP on mw72 is CRITICAL: Connection refused [03:27:54] !log morebots is down? [03:28:30] legoktm: were you still going to work on it? [03:28:45] Hi [03:28:52] Yeah, I just have one final left on Thursday [03:29:03] ok, that comes first :) [03:29:23] So I'll work on it when I get home on Saturday :) [03:44:11] i don't see how en.wp expects to be treated other than non-en wikis though # The Wikimedia Foundation seems to have no trouble treating the English Wikipedia differently. ;-) [03:45:14] What kind of logo is being used on login.wikimedia.org... 
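The RewriteCond/RewriteRule pair quoted earlier for iegcom (redirect unless X-Forwarded-Proto is already https, since TLS is terminated in front of Apache) behaves roughly like this sketch — Python, with the status codes and URLs matching the checks in the log:

```python
from urllib.parse import urlsplit, urlunsplit


def https_redirect(url, x_forwarded_proto=None):
    """Mimic the quoted rewrite pair: if the request did not arrive over
    HTTPS (per the X-Forwarded-Proto header set by the SSL terminator),
    answer 301 pointing at the https:// form of the same URL; otherwise
    let it through (200).  A simplification of mod_rewrite, not a port."""
    if x_forwarded_proto == "https":
        return 200, url
    scheme, netloc, path, query, frag = urlsplit(url)
    return 301, urlunsplit(("https", netloc, path, query, frag))


print(https_redirect("http://iegcom.wikimedia.org/wiki/Foo"))
# (301, 'https://iegcom.wikimedia.org/wiki/Foo')
print(https_redirect("https://login.wikimedia.org/", "https"))
# (200, 'https://login.wikimedia.org/')
```

This matches the observation that the plain-HTTP 301s were working ("so that tells me the Apache part is ok") while the HTTPS request itself still 404ed.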
[03:45:50] Susan, it's a wikimedia foundation logo, what else could it be? [03:45:56] the logo [03:46:25] It's not rendering proproperly. [03:46:27] Properly. [03:46:31] It has some weird in it. [03:46:34] Can you see it? [03:46:42] It's fine for me [03:46:48] There's a green line connecting the two green pieces. [03:46:51] On my screen. [03:46:55] I see it [03:47:02] very faint [03:47:34] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [03:47:34] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [03:47:34] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [03:47:34] PROBLEM - Puppet freshness on virt1005 is CRITICAL: No successful Puppet run in the last 10 hours [03:52:04] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed May 8 03:52:04 UTC 2013 [03:56:08] https://bugzilla.wikimedia.org/show_bug.cgi?id=47228 [05:10:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:13:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [05:22:20] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62581 [05:29:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:32:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [05:57:46] New review: ArielGlenn; "I'd prefer we choose one and stick to it (string OR boolean but not both)." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/62583 [05:58:42] New review: ArielGlenn; "will +2 once dependencies are straightened out" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62603 [06:03:59] AaronSchulz: https://bugzilla.wikimedia.org/show_bug.cgi?id=48164 is tracking the bug btw [06:04:35] PROBLEM - SSH on searchidx2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:05:24] RECOVERY - SSH on searchidx2 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [06:06:45] morebots is dead! <-- I believe this more effective than linking a bug few will click [06:07:10] So is StewardBot in -stewards [06:07:21] 02.20 -!- morebots [~morebots@wikitech-static.wikimedia.org] has quit [Ping timeout: 264 seconds] [06:08:15] completely unrelated: other code, other cluster and 06.35 -!- StewardBot [stewardbot@wikimedia/bot/StewardBot] has quit [Excess Flood] [06:08:45] you will need to talk to someone with access to wikitech-static to reboot moreboots (eg: TimStarling for one) [06:09:36] talk to ori-l, he was the one who specifically asked me to not merge the fix for its periodic failures [06:10:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:10:38] i dutifully harassed the person who committed to doing it earlier and the latter said they'd update the patch on saturday after their last exam [06:11:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [06:11:46] TimStarling: having the logs logged, even sporadically, would be better in this case, no? [06:13:36] legoktm: would you mind if i touched up your patch? i wouldn't ask if morebots's current broken state was not so disruptive [06:13:50] go for it [06:13:56] !log restarted adminbot (morebot) on wikitech-static [06:14:05] Logged the message, Master [06:14:37] thanks. 
i think the code is kind of crappy so there will be plenty left to do. [06:14:58] :P [06:15:32] the bot's code, i mean. your patch is fine other than the issues i pointed it. [06:15:38] * out [06:15:47] right :) [06:22:28] I guess I killed the wrong one [06:23:37] apergos: maybe next time you restart it, you should kill the one that is running already first [06:24:51] I did stop it (but it claimed the process was not running) [06:25:50] and when I looked to see what was running, after starting it, there was only the one... [06:27:14] well, I suppose it's possible that we both stopped it simultaneously [06:29:46] guess so [06:45:22] !fakelog hrm [06:45:23] Logged the message, Python nerd [06:45:36] :>>> [06:49:54] New patchset: Ori.livneh; "Convert logbot to use ircbot.SingleServerIRCBot" [operations/debs/adminbot] (master) - https://gerrit.wikimedia.org/r/60240 [06:50:59] it has 42 fewer lines of code and supports SSL w/SASL authentication [06:52:31] New review: ArielGlenn; "Why not move the ganglia stuff in role/protoproxy.pp to a class in the module?" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62582 [06:53:36] appropriately PEP8 is upset because "06:49:57 ./adminlogbot.py:263:1: E303 too many blank lines (3) " [06:54:23] New patchset: Ori.livneh; "Convert logbot to use ircbot.SingleServerIRCBot" [operations/debs/adminbot] (master) - https://gerrit.wikimedia.org/r/60240 [06:54:31] * 43 fewer lines [06:57:52] New review: Ori.livneh; "Tested, should be good to go. Be advised that bot config files will need to have their 'port' settin..." 
[operations/debs/adminbot] (master) - https://gerrit.wikimedia.org/r/60240 [08:13:14] PROBLEM - Packetloss_Average on oxygen is CRITICAL: CRITICAL: packet_loss_average is 9.74084898438 (gt 8.0) [08:13:24] PROBLEM - Packetloss_Average on analytics1003 is CRITICAL: CRITICAL: packet_loss_average is 9.3427396875 (gt 8.0) [08:13:34] PROBLEM - Packetloss_Average on analytics1006 is CRITICAL: CRITICAL: packet_loss_average is 9.98934252033 (gt 8.0) [08:14:34] PROBLEM - Packetloss_Average on analytics1004 is CRITICAL: CRITICAL: packet_loss_average is 10.1732789844 (gt 8.0) [08:17:14] RECOVERY - Packetloss_Average on oxygen is OK: OK: packet_loss_average is -0.262847542373 [08:17:24] RECOVERY - Packetloss_Average on analytics1003 is OK: OK: packet_loss_average is 1.02041419355 [08:17:34] PROBLEM - Packetloss_Average on gadolinium is CRITICAL: CRITICAL: packet_loss_average is 9.04610550459 (gt 8.0) [08:17:35] RECOVERY - Packetloss_Average on analytics1006 is OK: OK: packet_loss_average is -0.0390388135593 [08:18:34] RECOVERY - Packetloss_Average on analytics1004 is OK: OK: packet_loss_average is 1.0285370339 [08:18:44] PROBLEM - Packetloss_Average on analytics1008 is CRITICAL: CRITICAL: packet_loss_average is 9.48142945736 (gt 8.0) [08:21:34] RECOVERY - Packetloss_Average on gadolinium is OK: OK: packet_loss_average is 0.167124536082 [08:22:34] PROBLEM - Puppet freshness on db45 is CRITICAL: No successful Puppet run in the last 10 hours [08:22:44] RECOVERY - Packetloss_Average on analytics1008 is OK: OK: packet_loss_average is 0.915050166667 [08:25:32] New review: Hashar; "So yeah the fixme is because I have removed the configuration dependency. IIRC the nginx service is..." 
[operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/62582 [08:27:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:28:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [08:53:30] hey. I found a sql injection and have a patch ready for review. Any tips? [08:54:53] awight: bugzilla -> Security, and attach the patch [08:55:06] (its a private section where only a few can see) [08:55:10] great, thanks [09:01:30] ciao! [09:06:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:07:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [09:09:34] PROBLEM - Puppet freshness on cp3003 is CRITICAL: No successful Puppet run in the last 10 hours [09:09:34] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [09:48:34] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [09:49:57] New patchset: Ori.livneh; "Puppetize Bugzilla" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62404 [10:19:11] New review: ArielGlenn; "I'm not opposed in principle but where do we want to send folks instead for information?" 
[operations/debs/squid] (master) - https://gerrit.wikimedia.org/r/61950 [10:40:14] PROBLEM - Host srv291 is DOWN: PING CRITICAL - Packet loss = 100% [10:42:34] PROBLEM - Puppet freshness on db44 is CRITICAL: No successful Puppet run in the last 10 hours [10:45:05] New patchset: ArielGlenn; "option to run until all wikis have dumps more recent than cutoff date" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/62800 [10:58:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:59:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [11:13:49] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/62800 [11:17:59] New patchset: ArielGlenn; "clean up import statements in main dump script" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/62802 [11:18:21] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/62802 [11:18:33] New patchset: Ori.livneh; "Puppetize Bugzilla" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62404 [11:25:31] New patchset: ArielGlenn; "usage message to stderr, not stdout" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/62803 [11:26:27] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/62803 [11:33:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:35:08] New patchset: Matthias Mullie; "Add AFTv5 archive feedback cron job" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62602 [11:35:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [11:48:11] New patchset: ArielGlenn; "more prints converted to stderr writes in main dump script" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/62805 [11:50:29] Change 
merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/62805 [12:06:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:07:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [12:11:24] PROBLEM - Disk space on ms-be1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:11:44] PROBLEM - DPKG on ms-be1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:11:54] PROBLEM - RAID on ms-be1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:12:34] PROBLEM - SSH on ms-be1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:13:24] PROBLEM - HTTP radosgw on ms-fe1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:13:24] PROBLEM - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:13:34] PROBLEM - HTTP radosgw on ms-fe1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:13:34] PROBLEM - HTTP radosgw on ms-fe1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:13:44] PROBLEM - HTTP radosgw on ms-fe1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:14:34] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours [12:15:04] nice [12:17:04] 1003 fell over I guess (swap) [12:17:42] yeah looks like it [12:18:21] the peering process is taking rather long [12:18:32] ms-be1003 login: [8503083.418489] INFO: task kworker/10:4:5301 blocked for more than 120 seconds. [12:18:32] [8503083.426498] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
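The console message quoted above comes from the kernel's hung-task watchdog (and, as the second line says, writing 0 to /proc/sys/kernel/hung_task_timeout_secs silences it). A minimal sketch of the first triage step: pull the name of the blocked task out of such output. The sample line is the one pasted in the log, standing in here for real `dmesg` output on the affected host.

```shell
# Extract which task the kernel flagged as blocked.
# Sample line copied from the ms-be1003 console above; on a live box
# you would pipe `dmesg` through the same sed instead.
log='[8503083.418489] INFO: task kworker/10:4:5301 blocked for more than 120 seconds.'
echo "$log" | sed -n 's/.*INFO: task \([^ ]*\) blocked.*/\1/p'
```

A kworker blocked for 120+ seconds usually points at stuck I/O rather than the worker itself, which fits the swapping/OOM picture that follows.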
[12:18:43] stuff like that on console, I do have a login prompt [12:19:51] 1700 pgs left to peer [12:20:14] gonna reboot the box [12:20:17] noo [12:20:19] wait [12:20:25] if I get a shell prompt (ok waiting) [12:22:57] that if is turning out to be false btw [12:25:33] let's see if it converges this time [12:25:49] heh [12:25:50] oom [12:26:03] [8503525.410170] Out of memory: Kill process 25446 (ceph-osd) score 835 or sacrifice child [12:26:03] [8503525.419173] Killed process 25446 (ceph-osd) total-vm:53525504kB, anon-rss:43269732kB, file-rss:0kB [12:26:09] all of them? [12:26:11] now have shell prompt :-D [12:26:14] RECOVERY - Disk space on ms-be1003 is OK: DISK OK [12:26:15] ah [12:26:24] RECOVERY - SSH on ms-be1003 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [12:26:40] RECOVERY - DPKG on ms-be1003 is OK: All packages OK [12:26:44] RECOVERY - RAID on ms-be1003 is OK: OK: State is Optimal, checked 1 logical device(s) [12:27:00] OSDs are coming back up [12:27:22] hmm it had been ramping up for an hour or two [12:27:24] RECOVERY - HTTP radosgw on ms-fe1004 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 8.420 second response time [12:27:25] memleak perhaps [12:27:34] RECOVERY - HTTP radosgw on ms-fe1003 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 1.333 second response time [12:28:24] RECOVERY - HTTP radosgw on ms-fe1001 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 8.125 second response time [12:29:35] correlates with the network spike [12:29:43] http://ganglia.wikimedia.org/latest/?r=4hr&cs=&ce=&c=Ceph+eqiad&h=ms-be1003.eqiad.wmnet&tab=m&vn=&mc=2&z=medium&metric_group=ALLGROUPS [12:30:59] maybe atop would tell you a little more [12:31:24] PROBLEM - HTTP radosgw on ms-fe1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:31:44] PROBLEM - HTTP radosgw on ms-fe1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:34:33] we'll lose be1001 within the next hour [12:34:38] 
http://ganglia.wikimedia.org/latest/?r=4hr&cs=&ce=&c=Ceph+eqiad&h=ms-be1001.eqiad.wmnet&tab=m&vn=&mc=2&z=medium&metric_group=ALLGROUPS [12:35:14] RECOVERY - HTTP radosgw on ms-fe1001 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.004 second response time [12:35:14] RECOVERY - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.216 second response time [12:35:34] RECOVERY - HTTP radosgw on ms-fe1003 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.057 second response time [12:35:39] 1010 too [12:36:23] yeah, a single osd which is very large [12:39:24] RECOVERY - HTTP radosgw on ms-fe1002 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.004 second response time [12:41:08] i wonder if that's the 3 OSDs responsible for a certain PG [12:48:11] how do you list any of that? [12:48:31] i'm poking around the ceph docs without much success [12:49:44] PROBLEM - HTTP radosgw on ms-fe1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:50:24] PROBLEM - HTTP radosgw on ms-fe1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:50:24] PROBLEM - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:50:34] PROBLEM - HTTP radosgw on ms-fe1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:50:34] PROBLEM - HTTP radosgw on ms-fe1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:53:09] hey apergos, we have some backup processes running on stat1 since yesterday afternoon and they consume a lot of memory and have put stat1 in swapping mode; could you maybe have a look to see what's going on?
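The "Out of memory: Kill process ... (ceph-osd)" lines pasted earlier carry everything needed to confirm the OOM victim: pid, process name, and resident set size. A sketch of parsing one of those lines; the sample stands in for real `dmesg` output, the field layout is the stock Linux OOM-killer format.

```shell
# Confirm which process the OOM killer took down and how big it was.
# Sample kernel line copied from the ms-be1003 console above.
oomlog='[8503525.419173] Killed process 25446 (ceph-osd) total-vm:53525504kB, anon-rss:43269732kB, file-rss:0kB'
echo "$oomlog" | sed -n 's/.*Killed process \([0-9]*\) (\([^)]*\)).*anon-rss:\([0-9]*\)kB.*/pid=\1 proc=\2 anon-rss=\3kB/p'
```

Here anon-rss of ~43 GB for a single ceph-osd matches the "single osd which is very large" and "memleak perhaps" observations in the conversation.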
[12:56:14] RECOVERY - HTTP radosgw on ms-fe1001 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.007 second response time [12:56:14] RECOVERY - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.004 second response time [12:56:24] RECOVERY - HTTP radosgw on ms-fe1002 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.003 second response time [12:56:24] RECOVERY - HTTP radosgw on ms-fe1004 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.003 second response time [12:56:34] RECOVERY - HTTP radosgw on ms-fe1003 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.003 second response time [13:03:51] drdee: I don't really know the amanda setup. I see there are two of the home_0 jobs running, I could shoot the older one of those [13:04:34] there are 4 processes in total, right? [13:04:46] I see 4 going, right [13:05:06] I have no idea what the different home_x are [13:05:26] me neither, nothing changed recently afaik [13:05:34] PROBLEM - Puppet freshness on neon is CRITICAL: No successful Puppet run in the last 10 hours [13:06:12] are there some log files that we can look at?
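For the stat1 triage above, a generic first step (before shooting anything) is to list the memory-heavy, long-lived processes and pick out the backup jobs. The "amanda" match pattern is taken from the log; the rest is plain `ps` usage, not the exact commands apergos ran.

```shell
# List processes sorted by resident memory, keeping the header and any
# amanda backup jobs; ELAPSED shows how long each has been running.
ps -eo pid,etime,rss,comm --sort=-rss | awk 'NR==1 || $4 ~ /amanda/'
```

On a box in swap, the same listing without the filter quickly shows which processes to inspect or kill, which is what happens a few minutes later in the log.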
[13:06:24] PROBLEM - HTTP radosgw on ms-fe1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:06:24] PROBLEM - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:06:28] I looked at the amandad ones and didn't get much out of them [13:06:34] PROBLEM - HTTP radosgw on ms-fe1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:06:44] PROBLEM - HTTP radosgw on ms-fe1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:07:07] the client ones are pretty opaque to me as well [13:07:24] RECOVERY - HTTP radosgw on ms-fe1001 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 8.403 second response time [13:07:34] PROBLEM - HTTP radosgw on ms-fe1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:08:43] hm this one log seems to indicate that a process still running is actually finished [13:09:02] odd [13:09:08] ok I'm going to make a command decision and shoot all these. [13:09:24] RECOVERY - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 8.085 second response time [13:09:26] RECOVERY - HTTP radosgw on ms-fe1002 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.008 second response time [13:09:26] RECOVERY - HTTP radosgw on ms-fe1004 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 2.558 second response time [13:09:34] RECOVERY - HTTP radosgw on ms-fe1003 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.015 second response time [13:12:02] let's see what happens on the next run [13:17:25] ok, ceph finally came out of the peering process, after more than an hour [13:17:34] that was horrible [13:17:37] heh [13:18:38] ty apergos [13:18:51] sure [13:30:24] PROBLEM - Host search1024 is DOWN: PING CRITICAL - Packet loss = 100% [13:30:24] PROBLEM - Host mc1009 is DOWN: PING CRITICAL - Packet loss = 100% [13:30:24] PROBLEM - Host mc1007 is DOWN: PING CRITICAL - Packet loss = 100% [13:30:24] PROBLEM - Host mc1011 is DOWN: PING CRITICAL - Packet loss = 100% [13:30:34] 
RECOVERY - Host mc1007 is UP: PING OK - Packet loss = 0%, RTA = 2.19 ms [13:30:44] PROBLEM - Apache HTTP on mw1160 is CRITICAL: Connection timed out [13:30:44] PROBLEM - Apache HTTP on mw1108 is CRITICAL: Connection timed out [13:30:44] PROBLEM - Apache HTTP on mw1106 is CRITICAL: Connection timed out [13:30:44] PROBLEM - Apache HTTP on mw1020 is CRITICAL: Connection timed out [13:30:44] PROBLEM - Apache HTTP on mw1030 is CRITICAL: Connection timed out [13:32:51] bits are teh br0ken. [13:33:02] Ah, coming back. [13:33:44] PROBLEM - Apache HTTP on mw1078 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:33:44] PROBLEM - Apache HTTP on mw1175 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:33:44] PROBLEM - Apache HTTP on mw1166 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:33:44] PROBLEM - Apache HTTP on mw1161 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:33:44] PROBLEM - Apache HTTP on mw1179 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:33:46] wtf all the mw* hosts. well mw1017 and up [13:34:22] Yeah, I got a page without CSS. [13:34:32] Then the blue and green error. Then some CSS. [13:34:40] en.wikipedia.org via HTTPS. 
[13:34:54] PROBLEM - Frontend Squid HTTP on amssq33 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:35:03] we have a switch failure [13:35:25] PROBLEM - Backend Squid HTTP on knsq23 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:35:35] PROBLEM - Backend Squid HTTP on sq66 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:36:24] PROBLEM - Backend Squid HTTP on amssq44 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:36:35] PROBLEM - Backend Squid HTTP on sq60 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:36:44] PROBLEM - Backend Squid HTTP on sq62 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:36:54] PROBLEM - Frontend Squid HTTP on knsq23 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:36:54] PROBLEM - Frontend Squid HTTP on amssq32 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:36:54] PROBLEM - Backend Squid HTTP on knsq29 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:36:54] PROBLEM - Backend Squid HTTP on amssq34 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:24] PROBLEM - Frontend Squid HTTP on amssq40 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:24] PROBLEM - Frontend Squid HTTP on amssq35 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:25] PROBLEM - Backend Squid HTTP on sq71 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:25] PROBLEM - Frontend Squid HTTP on amssq36 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:25] PROBLEM - Frontend Squid HTTP on knsq24 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:25] PROBLEM - Frontend Squid HTTP on knsq28 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:25] PROBLEM - Backend Squid HTTP on amssq39 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:25] PROBLEM - Frontend Squid HTTP on amssq45 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:34] PROBLEM - Frontend Squid HTTP on 
cp1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:34] PROBLEM - Backend Squid HTTP on knsq28 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:34] PROBLEM - Backend Squid HTTP on amssq42 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:34] PROBLEM - Backend Squid HTTP on cp1008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:34] PROBLEM - Frontend Squid HTTP on amssq46 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:35] PROBLEM - Backend Squid HTTP on sq63 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:35] PROBLEM - Backend Squid HTTP on sq37 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:50] PROBLEM - Backend Squid HTTP on sq59 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:50] PROBLEM - Backend Squid HTTP on sq64 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:54] PROBLEM - Frontend Squid HTTP on knsq27 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:54] PROBLEM - Frontend Squid HTTP on amssq42 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:55] PROBLEM - Backend Squid HTTP on amssq40 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:55] PROBLEM - Backend Squid HTTP on amssq32 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:55] PROBLEM - Backend Squid HTTP on amssq37 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:55] PROBLEM - Backend Squid HTTP on knsq27 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:55] PROBLEM - Backend Squid HTTP on amssq41 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:55] PROBLEM - Frontend Squid HTTP on amssq37 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:38:04] PROBLEM - Frontend Squid HTTP on amssq38 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:38:14] PROBLEM - Frontend Squid HTTP on amssq31 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:38:14] PROBLEM - 
Backend Squid HTTP on knsq24 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:38:34] PROBLEM - Backend Squid HTTP on sq61 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:38:34] PROBLEM - Backend Squid HTTP on sq65 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:38:44] PROBLEM - Frontend Squid HTTP on cp1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:38:44] PROBLEM - Backend Squid HTTP on amssq36 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:38:54] PROBLEM - Frontend Squid HTTP on amssq44 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:38:54] PROBLEM - Frontend Squid HTTP on knsq29 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:39:34] PROBLEM - Frontend Squid HTTP on cp1020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:39:34] PROBLEM - Backend Squid HTTP on cp1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:39:44] PROBLEM - Frontend Squid HTTP on cp1017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:39:44] PROBLEM - Frontend Squid HTTP on cp1010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:39:44] PROBLEM - Frontend Squid HTTP on cp1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:39:44] PROBLEM - Backend Squid HTTP on cp1009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:40:24] RECOVERY - Frontend Squid HTTP on cp1020 is OK: HTTP OK: HTTP/1.0 200 OK - 1293 bytes in 0.002 second response time [13:40:24] RECOVERY - Backend Squid HTTP on cp1006 is OK: HTTP OK: HTTP/1.0 200 OK - 1259 bytes in 0.001 second response time [13:40:34] RECOVERY - Frontend Squid HTTP on cp1017 is OK: HTTP OK: HTTP/1.0 200 OK - 1293 bytes in 0.003 second response time [13:41:44] PROBLEM - LVS HTTPS IPv6 on wikidata-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:41:54] PROBLEM - Varnish HTTP bits on cp3022 is CRITICAL: Connection timed out [13:42:44] PROBLEM - LVS HTTPS IPv4 on 
wikidata-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:42:54] RECOVERY - Varnish HTTP bits on cp3022 is OK: HTTP OK: HTTP/1.1 200 OK - 634 bytes in 6.691 second response time [13:42:54] PROBLEM - Varnish HTTP bits on cp3020 is CRITICAL: Connection timed out [13:43:25] PROBLEM - LVS HTTP IPv4 on bits.esams.wikimedia.org is CRITICAL: Connection timed out [13:43:34] PROBLEM - Backend Squid HTTP on cp1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:43:34] PROBLEM - Frontend Squid HTTP on cp1020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:43:44] PROBLEM - LVS HTTP IPv6 on bits-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection timed out [13:43:45] PROBLEM - Frontend Squid HTTP on cp1019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:43:46] PROBLEM - Frontend Squid HTTP on cp1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:43:46] PROBLEM - Frontend Squid HTTP on cp1017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:43:46] PROBLEM - Frontend Squid HTTP on cp1015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:45:35] RECOVERY - LVS HTTPS IPv4 on wikidata-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 601 bytes in 0.010 second response time [13:45:37] RECOVERY - LVS HTTPS IPv6 on wikidata-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 601 bytes in 0.009 second response time [13:45:54] PROBLEM - Varnish HTTP bits on cp3022 is CRITICAL: Connection timed out [13:45:54] PROBLEM - LVS HTTPS IPv4 on bits.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:45:54] PROBLEM - Varnish HTTP bits on cp3021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:46:32] Wikipedia isn't working #apocalypse [13:46:33] haha [13:46:35] RECOVERY - Frontend Squid HTTP on cp1019 is OK: HTTP OK: HTTP/1.0 200 OK - 1294 bytes in 0.004 second response time [13:46:35] RECOVERY - Frontend Squid 
HTTP on cp1017 is OK: HTTP OK: HTTP/1.0 200 OK - 1294 bytes in 0.002 second response time [13:46:54] PROBLEM - Varnish HTTP bits on cp3019 is CRITICAL: Connection timed out [13:46:54] RECOVERY - LVS HTTPS IPv4 on bits.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 3851 bytes in 7.964 second response time [13:47:24] PROBLEM - LVS HTTP IPv6 on wikidata-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:47:44] PROBLEM - LVS HTTP IPv4 on wikidata-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:47:54] RECOVERY - Varnish HTTP bits on cp3020 is OK: HTTP OK: HTTP/1.1 200 OK - 634 bytes in 6.303 second response time [13:48:24] PROBLEM - LVS HTTPS IPv4 on wikidata-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:48:34] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [13:48:34] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [13:48:34] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [13:48:34] PROBLEM - Puppet freshness on virt1005 is CRITICAL: No successful Puppet run in the last 10 hours [13:48:35] PROBLEM - LVS HTTPS IPv6 on wikidata-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:49:24] RECOVERY - Backend Squid HTTP on cp1006 is OK: HTTP OK: HTTP/1.0 200 OK - 1260 bytes in 0.001 second response time [13:49:54] PROBLEM - LVS HTTPS IPv6 on bits-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:50:54] RECOVERY - LVS HTTPS IPv6 on bits-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 3851 bytes in 7.963 second response time [13:50:56] PROBLEM - Varnish HTTP bits on cp3020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:51:44] PROBLEM - Frontend Squid HTTP on cp1019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
[13:51:44] PROBLEM - Frontend Squid HTTP on cp1017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:51:44] RECOVERY - Varnish HTTP bits on cp3021 is OK: HTTP OK: HTTP/1.1 200 OK - 634 bytes in 3.162 second response time [13:51:54] RECOVERY - Varnish HTTP bits on cp3020 is OK: HTTP OK: HTTP/1.1 200 OK - 634 bytes in 7.851 second response time [13:52:21] can't reach chris [13:52:34] RECOVERY - Frontend Squid HTTP on cp1019 is OK: HTTP OK: HTTP/1.0 200 OK - 1294 bytes in 0.003 second response time [13:52:34] RECOVERY - Frontend Squid HTTP on cp1017 is OK: HTTP OK: HTTP/1.0 200 OK - 1294 bytes in 0.006 second response time [13:52:54] RECOVERY - Varnish HTTP bits on cp3019 is OK: HTTP OK: HTTP/1.1 200 OK - 634 bytes in 5.908 second response time [13:53:40] <^demon> mark: Do you need someone in eqiad? [13:53:47] yes [13:54:21] <^demon> I can start heading that way in case he was hit by a bus, but it'll take me just under 2h. [13:54:28] do you have access? [13:54:38] <^demon> Not without hand-holding :( [13:54:45] then it's no use I'm afraid [13:54:54] PROBLEM - Varnish HTTP bits on cp3021 is CRITICAL: Connection timed out [13:54:54] PROBLEM - Varnish HTTP bits on cp3020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:55:18] <^demon> Well, unless someone gave me access without me knowing. 
[13:55:35] PROBLEM - Backend Squid HTTP on cp1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:55:44] PROBLEM - Frontend Squid HTTP on cp1017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:55:44] PROBLEM - Frontend Squid HTTP on cp1019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:55:54] RECOVERY - Varnish HTTP bits on cp3020 is OK: HTTP OK: HTTP/1.1 200 OK - 634 bytes in 4.780 second response time [13:56:15] so we've lost a chunk of memcached it seems [13:56:21] but i'm not entirely sure how much [13:56:24] RECOVERY - Backend Squid HTTP on cp1006 is OK: HTTP OK: HTTP/1.0 200 OK - 1260 bytes in 0.001 second response time [13:56:34] RECOVERY - Frontend Squid HTTP on cp1015 is OK: HTTP OK: HTTP/1.0 200 OK - 1294 bytes in 0.007 second response time [13:56:34] RECOVERY - Frontend Squid HTTP on cp1002 is OK: HTTP OK: HTTP/1.0 200 OK - 1292 bytes in 0.004 second response time [13:57:01] the same would hold for redis [13:57:54] RECOVERY - Varnish HTTP bits on cp3021 is OK: HTTP OK: HTTP/1.1 200 OK - 634 bytes in 7.783 second response time [13:58:31] !log Removing mc1009 from the eqiad memcached pool [13:58:40] Logged the message, Master [13:58:41] !log mark synchronized wmf-config/mc-eqiad.php [13:58:44] RECOVERY - Varnish HTTP bits on cp3022 is OK: HTTP OK: HTTP/1.1 200 OK - 634 bytes in 1.276 second response time [13:58:45] RECOVERY - Apache HTTP on mw1057 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 742 bytes in 9.342 second response time [13:58:45] RECOVERY - Backend Squid HTTP on amssq32 is OK: HTTP OK: HTTP/1.0 200 OK - 1414 bytes in 4.448 second response time [13:58:49] Logged the message, Master [13:58:54] RECOVERY - Frontend Squid HTTP on amssq37 is OK: HTTP OK: HTTP/1.0 200 OK - 1408 bytes in 4.872 second response time [13:58:54] RECOVERY - Apache HTTP on mw1076 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 5.797 second response time [13:58:54] RECOVERY - Backend Squid HTTP on amssq34 is OK: HTTP OK: 
HTTP/1.0 200 OK - 1414 bytes in 5.245 second response time [13:58:54] RECOVERY - Frontend Squid HTTP on knsq23 is OK: HTTP OK: HTTP/1.0 200 OK - 1406 bytes in 5.399 second response time [13:58:54] RECOVERY - Apache HTTP on mw1039 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 6.307 second response time [13:58:54] RECOVERY - Backend Squid HTTP on amssq37 is OK: HTTP OK: HTTP/1.0 200 OK - 1414 bytes in 5.873 second response time [13:58:54] RECOVERY - Apache HTTP on mw1111 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 6.185 second response time [13:58:55] RECOVERY - Frontend Squid HTTP on knsq29 is OK: HTTP OK: HTTP/1.0 200 OK - 1406 bytes in 6.211 second response time [13:58:55] RECOVERY - Backend Squid HTTP on amssq41 is OK: HTTP OK: HTTP/1.0 200 OK - 1414 bytes in 6.481 second response time [13:58:56] RECOVERY - Frontend Squid HTTP on knsq27 is OK: HTTP OK: HTTP/1.0 200 OK - 1406 bytes in 6.545 second response time [13:58:56] RECOVERY - Backend Squid HTTP on knsq27 is OK: HTTP OK: HTTP/1.0 200 OK - 1411 bytes in 6.709 second response time [13:58:57] RECOVERY - Apache HTTP on mw1066 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 8.097 second response time [13:58:57] RECOVERY - LVS HTTP IPv4 on appservers.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 64868 bytes in 7.765 second response time [13:58:58] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 64868 bytes in 0.494 second response time [13:58:58] RECOVERY - Frontend Squid HTTP on amssq42 is OK: HTTP OK: HTTP/1.0 200 OK - 1408 bytes in 8.675 second response time [13:58:59] RECOVERY - Frontend Squid HTTP on amssq32 is OK: HTTP OK: HTTP/1.0 200 OK - 1408 bytes in 9.201 second response time [13:58:59] RECOVERY - Backend Squid HTTP on knsq29 is OK: HTTP OK: HTTP/1.0 200 OK - 1411 bytes in 9.471 second response time [13:59:00] RECOVERY - Frontend Squid HTTP on amssq38 is OK: HTTP OK: HTTP/1.0 200 OK - 1406 bytes in 2.625 second 
response time [13:59:04] RECOVERY - Backend Squid HTTP on knsq24 is OK: HTTP OK: HTTP/1.0 200 OK - 1411 bytes in 0.453 second response time [13:59:04] RECOVERY - Frontend Squid HTTP on amssq31 is OK: HTTP OK: HTTP/1.0 200 OK - 1408 bytes in 0.459 second response time [13:59:04] RECOVERY - Apache HTTP on mw1220 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.053 second response time [13:59:14] RECOVERY - Apache HTTP on mw1017 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.059 second response time [13:59:14] RECOVERY - Frontend Squid HTTP on amssq40 is OK: HTTP OK: HTTP/1.0 200 OK - 1414 bytes in 0.175 second response time [13:59:14] RECOVERY - LVS HTTPS IPv4 on wikidata-lb.pmtpa.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 987 bytes in 0.250 second response time [13:59:14] RECOVERY - Apache HTTP on mw1087 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.051 second response time [13:59:14] RECOVERY - Apache HTTP on mw1170 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.055 second response time [13:59:14] RECOVERY - Apache HTTP on mw1047 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.063 second response time [13:59:14] RECOVERY - Frontend Squid HTTP on amssq35 is OK: HTTP OK: HTTP/1.0 200 OK - 1408 bytes in 0.460 second response time [13:59:15] RECOVERY - Apache HTTP on mw1067 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.047 second response time [13:59:15] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.045 second response time [13:59:17] wtf [13:59:24] removing that single memcached unblocks everything? [13:59:54] urk [14:00:20] ^demon: i did an uncommitted change to mc-eqiad.php on fenari, what's the process these days? 
;) [14:00:34] RECOVERY - LVS HTTP IPv6 on bits-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 3903 bytes in 0.422 second response time [14:00:41] yeah all the mw*s are back according to icinga [14:00:42] <^demon> Commit to gerrit so it's not lost, that repo should be cloned with ssh. [14:00:44] RECOVERY - Varnish HTTP bits on cp3019 is OK: HTTP OK: HTTP/1.1 200 OK - 632 bytes in 0.171 second response time [14:00:44] RECOVERY - Varnish HTTP bits on cp3020 is OK: HTTP OK: HTTP/1.1 200 OK - 634 bytes in 0.177 second response time [14:00:52] ok [14:01:14] RECOVERY - LVS HTTP IPv4 on bits.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 3911 bytes in 0.171 second response time [14:01:34] PROBLEM - SSH on cp3022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:01:35] PROBLEM - Apache HTTP on mw1152 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:01:41] New patchset: Mark Bergsma; "Depool mc1009" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62812 [14:02:24] RECOVERY - SSH on cp3022 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [14:02:25] RECOVERY - Apache HTTP on mw1152 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.051 second response time [14:02:31] Change merged: Demon; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62812 [14:03:52] hey [14:04:18] you managed to miss two separate outages ;p [14:04:24] I missed them both? [14:04:26] damn [14:05:09] reading back [14:05:33] good work mark ^demon and co :) [14:07:31] sigh [14:13:41] mark saved the day, again. [14:14:26] how do you to the beer stuff using ascii? :-P [14:14:31] do you do* [14:17:12] odder: Wait till Amsterdam and buy in person? ;) [14:19:20] hi, nobody seems to have a solution for this in #wikimedia-tech but the gadgetsection on nlwiki ( http://nl.wikipedia.org/wiki/Speciaal:Voorkeuren ) disappeared after the downtime [14:19:35] do you guys know why? 
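The depool of mc1009 above amounts to removing one server from the memcached pool list in wmf-config/mc-eqiad.php and syncing that file out. A sketch of the mechanics on a throwaway copy; the file contents and IP addresses below are invented for illustration and are not the real config.

```shell
# Hypothetical layout of the eqiad memcached pool file (invented values).
cat > mc-eqiad.php <<'EOF'
$wgMemCachedServers = array(
    '10.64.0.180:11211', # mc1008
    '10.64.0.181:11211', # mc1009
    '10.64.0.182:11211', # mc1010
);
EOF
sed -i '/mc1009/d' mc-eqiad.php   # drop the dead server from the pool
grep -c 'mc10' mc-eqiad.php       # count the entries that remain
# on the cluster this is followed by syncing the file to the apaches,
# which is what the "mark synchronized wmf-config/mc-eqiad.php" log line records
```

As the log shows, pulling the one dead server was enough to unblock the appservers, since requests were stalling on timeouts to it.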
[14:20:09] on every other wiki I checked there were gadgets [14:20:47] !log ceph osd out 128, primary for stuck unclean pg 3.19e1 [14:20:55] Logged the message, Master [14:21:45] New patchset: Anomie; "Make $wgCodeEditorEnableCore configurable per wiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62814 [14:21:48] Wiki13: Very likely due to the memcached issues [14:21:58] ah [14:22:03] thanks for the info [14:22:15] !log ceph osd out 39, primary for stuck unclean pg 3.5c6 [14:22:22] Logged the message, Master [14:22:54] Wiki13: Fixed [14:40:24] RECOVERY - Host mc1009 is UP: PING OK - Packet loss = 0%, RTA = 1.27 ms [14:54:38] hey apergos, could you please restart apache on stat1001 ? [14:56:10] drdee: done [14:56:20] ty [14:58:59] New patchset: Mark Bergsma; "Revert "Depool mc1009"" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62816 [14:59:26] Change merged: Mark Bergsma; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62816 [15:00:44] !log mark synchronized wmf-config/mc-eqiad.php [15:00:52] Logged the message, Master [15:03:28] New patchset: Odder; "(bug 48236) Fix login.wm.o's (and other wikis') logo" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62817 [15:21:40] apergos: only 1 of the 3 disks will arrive in tampa today. I contacted dell support late. 
you/we/paravoid will have to coordinate with steve this week to replace them [15:21:53] ok, bummer [15:21:59] thanks for the update [15:39:04] New patchset: Demon; "Add second key for myself" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60846 [15:42:00] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60846 [16:01:25] New patchset: coren; "Preliminary toollabs module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59969 [16:03:09] Can someone check https://gerrit.wikimedia.org/r/#/c/59969 so I can finally puppetize the tool labs? Pretty please? :-) [16:10:58] it's almost an empty shell ;) [16:11:20] no real problems from me right now, but that may of course change once there's actual content in those manifests ;) [16:11:22] It is; but once it's in place I can add stuff with +2 self without disrupting anything. :-) [16:11:38] Because nothing else will use that module. [16:11:51] If I break tools, I get to keep both pieces. :-) [16:11:58] New review: Mark Bergsma; "(1 comment)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59969 [16:12:37] it's true that I care far less about the toollabs module than anything that is in production ;) [16:17:47] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62576 [16:23:39] about to deploy, does anyone know who made docroot/ changes, and is it safe to deploy? 
[16:24:52] reedy@fenari:/home/wikipedia/common$ git diff [16:24:52] reedy@fenari:/home/wikipedia/common$ [16:25:27] it's never safe to deploy [16:25:39] thank you mark [16:26:00] Reedy, try git status;) [16:26:03] at any point you may feel a knife stuck in your back ;) [16:26:16] mark, you are too late, its already there [16:26:21] in multiple instances [16:26:29] MaxSem: And they've been there for AGES [16:26:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:26:58] Reedy, ages === hours? [16:27:02] days [16:27:03] months [16:27:06] a year? [16:27:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time [16:38:22] scaping... [16:39:16] scap forth and prosper [16:41:15] god scap [16:41:30] New review: Alex Monk; "Add wikimania2014wiki and loginwiki" [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/47307 [16:57:49] New patchset: Ori.livneh; "Puppetize Bugzilla" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62404 [16:59:52] addshore I am here too :P [17:00:01] if u need to fix ur file [17:00:05] petan: petan everywhere [17:00:12] oh, damn wrong channel lol [17:03:30] New patchset: coren; "Preliminary toollabs module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59969 [17:04:16] marktraceur: moved to labs.pp; that should make a nice place to put project-specific stuff in a nice namespace (role::labs::foo::xxx) [17:04:21] New patchset: Aaron Schulz; "Disabled TTMServerMessageUpdateJob jobs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62834 [17:04:59] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62834 [17:05:05] Coren: Whatnow? 
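The empty `git diff` followed by MaxSem's "try git status" hinges on a subtlety worth spelling out: `git diff` says nothing about untracked files, so a clean diff on fenari does not mean a clean tree. A minimal demo in a throwaway repo (file names invented):

```shell
# git diff is silent about untracked files; git status is not.
git init -q depl-demo && cd depl-demo
git -c user.email=t@example.org -c user.name=t commit -q --allow-empty -m init
touch docroot-stray.php                  # untracked, like the docroot/ changes
git diff --quiet && echo "git diff: clean"
git status --porcelain                   # reports the untracked file
```

This is why the stray docroot/ files could sit there "for AGES" without anyone's pre-deploy `git diff` noticing them.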
[17:05:31] marktraceur: wrong nick autocomplete [17:05:42] mark: moved to labs.pp; that should make a nice place to put project-specific stuff in a nice namespace (role::labs::foo::xxx) [17:05:46] *nod* [17:06:36] New review: Isarra; "If there's extra compression from that of the usual png, have you checked if it works in IE6-?" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62817 [17:07:29] !log yurik Started syncing Wikimedia installation... : Zero extension [17:07:38] Logged the message, Master [17:11:16] !log aaron synchronized php-1.22wmf2/includes/job/JobQueueGroup.php 'revert hack' [17:11:25] Logged the message, Master [17:11:36] !log aaron synchronized php-1.22wmf3/includes/job/JobQueueGroup.php 'revert hack' [17:11:44] Logged the message, Master [17:13:12] !log aaron synchronized wmf-config/CommonSettings.php 'Disabled TTMServerMessageUpdateJob jobs' [17:13:19] Logged the message, Master [17:13:42] RobH: can you run ddsh -g job-runners "/etc/init.d/mw-job-runner restart" ? [17:16:16] New review: Odder; "I haven't used Windows-based machines for more than, I guess, 50 hours since 2007, and I am unable (..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62817 [17:16:30] oi, stalkers [17:16:52] AaronSchulz: sure, though my ddsh terminal sometimes is wonky (i get agent timeouts on large batches, but this is small) [17:17:37] !log restarting all mw jobrunners [17:17:45] Logged the message, RobH [17:17:50] AaronSchulz: completed [17:18:05] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62763 [17:18:09] looks like some werent running.... as they failed to restart (and just upstarted normally) [17:18:22] how many? [17:18:35] the logs say all the boxes were doing stuff yesterday [17:19:00] oh god damn it [17:19:06] i just hotkeyed to swap tabs [17:19:12] and cleared my terminal by accident ;_; [17:19:30] fuckin a.
[17:19:45] AaronSchulz: well, i can restart again, dunno if needed. [17:19:54] cuz well, they are more than likely fine now [17:20:05] and it was a failure to find the process id of the mw jobrunner on restart [17:20:12] not 'they werent running' correction on my part [17:20:27] they could have been running before the restart, just not the process id that was expected. [17:21:27] New review: Isarra; "Well, if it's a normal png that just happens to be optimised, it shouldn't be an issue. Generally I'..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62817 [17:22:05] RobH: https://gdash.wikimedia.org/dashboards/jobq/ \o/ [17:22:15] maybe the backlog will stop building up now [17:23:05] there was one job taking up 88% of time even though it had .02% of jobs [17:23:34] PROBLEM - Host aluminium is DOWN: CRITICAL - Host Unreachable (208.80.154.6) [17:25:56] what job was that? [17:26:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:26:55] Some translate thing.. [17:27:00] apergos: you can run /home/aaron/getJobProfileTimes on fluorine [17:27:08] * apergos goes to do so [17:27:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [17:27:54] after looking at it first :-) [17:28:16] Password: _ [17:28:37] * AaronSchulz needs to make that allow using stdin for doing zcat with older logs [17:29:13] !log yurik Finished syncing Wikimedia installation... : Zero extension [17:29:15] TTMServerMessageUpdateJob [17:29:17] yeow [17:29:20] Logged the message, Master [17:29:59] refreshlinks2 is pretty far down there [17:30:27] yurik: MaxSem, so your code update to 1.22wmf2 Zero is going to live for what 30 minutes? Worth it, eh? [17:31:06] apergos: which is funny :) [17:31:14] * AaronSchulz makes it reverse sort instead of regular sort [17:32:08] apergos: it's #2 [17:32:24] AaronSchulz: jobs look happier.
[17:32:36] or sometimes #3, fighting with ChangeNotification (one spawns the other) [17:32:51] so both refreshLinks2 and changeNotification spawn refreshLinks [17:33:48] http://ganglia.wikimedia.org/latest/?c=Miscellaneous%20pmtpa&h=www.wikimedia.org&m=cpu_report&r=hour&s=descending&hc=4&mc=2#metric_Global_JobQueue_length [17:34:02] ah changenotification too [17:34:07] RobH: yeah nice [17:34:09] http://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&c=Miscellaneous+pmtpa&h=www.wikimedia.org&tab=m&vn=&mc=2&z=medium&metric_group=ALLGROUPS [17:34:14] it was building up for a while [17:34:16] wow that is a nice sudden drop [17:34:57] can't wait to see how low that goes [17:41:58] !log kernel updates and reboot on pappas [17:42:06] Logged the message, Master [17:45:45] !log about to graceful apaches *and api servers* (which somehow are not in the actual apache graceful all script [17:45:52] Logged the message, Master [17:45:58] !log ) [17:46:06] Logged the message, Master [17:46:09] hate hanging parens [17:50:01] You know what would be cool? A +2 to https://gerrit.wikimedia.org/r/#/c/59969 would be cool. Because then I could finally puppetize the tool labs, and add new compute nodes, and all of that is cool stuff. Really. [17:53:40] morebots: ... [17:53:40] I am a logbot running on wikitech-static. [17:53:40] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [17:53:40] To log a message, type !log . [17:53:47] someone must have restarted it? [17:55:52] Coren: I saw the review request but though Ryan_Lane was handling it :-) [17:56:06] heh [17:56:35] Ryan has about eleventwelve umptilion things on his plate. :-) [17:56:43] Coren: include role::labsnfs::client # temporary measure ? [17:56:49] oh [17:56:50] rigt [17:56:52] *right [17:56:54] till we switch ldap [17:56:59] * Coren nodsnods. 
[17:57:20] I'm just working on upgrading openstack to folsom [17:57:22] it's not high priority [17:57:23] this is [17:57:42] I'm forgetful, don't hesitate to bug me :) [17:57:44] RECOVERY - Apache HTTP on mw72 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 742 bytes in 0.344 second response time [17:57:59] New patchset: Andrew Bogott; "Fix up deps and unless clause for privacy policy import." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62840 [17:58:06] Ryan_Lane: Muahaha. You *know* I'll hold you to that at the most inconvenient times. :-) [17:58:12] :D [17:58:24] ori-l: ^ [17:58:27] New patchset: RobH; "making new racktables host racktables2 for testing" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62841 [17:58:33] Besides, the upgrade to folsom is full of win; it's also important. Just not time-critical. [17:58:36] Coren: $grid_master = "tools-master.pmtpa.wmflabs" [17:58:55] you're defining that in like 80 places :) [17:59:10] In every role; should I include a superclass? [17:59:10] you could split that into a config class and use the variable from there [17:59:28] the benefit of doing so is that you only have to change common config in one spot [17:59:33] I didn't see examples doing that while looking in /roles/, but that makes sense. [17:59:36] it's ugly in other ways [17:59:51] but puppet is ugly no matter how you go about it :) [17:59:54] It'd also be a sane place to include role::labsnfs::client [18:00:05] yep [18:00:05] DEPLOY TIME [18:00:13] !log authdnsupdate for racktables2 [18:00:14] DRY ;) [18:00:22] Logged the message, RobH [18:00:23] :) [18:00:35] nothing extra to do for wikidata [18:01:25] Ryan_Lane: Did the use of a labs:: namespace look sane to you? [18:01:28] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62841 [18:01:49] yeah [18:02:00] include ssh::bastion ?
[18:02:04] https://gerrit.wikimedia.org/r/#/c/59969/9/modules/toollabs/manifests/bastion.pp,unified [18:02:10] include toollabs ? [18:02:25] also, why include on every line rather than using commas? [18:03:09] Ryan_Lane: Easier editing, but yeah. I forgot I already included ssh::bastion as a matter of course. :-) [18:03:25] what is ssh::bastion? [18:03:32] another module? [18:03:41] something in the spaghetti code repo? [18:03:48] ssh::bastion is the class we added way back that adds the "problems logging in" /etc/issue [18:03:56] ah. right [18:04:22] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Everything else to 1.22wmf3 [18:04:29] ah. toollabs class is empty [18:04:29] Logged the message, Master [18:04:46] is that class meant to be included by everything? [18:05:09] Ryan_Lane: Ayup. [18:05:12] ok [18:05:24] New patchset: Reedy; "Everything else to 1.22wmf3" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62842 [18:05:47] Ryan_Lane: Do you prefer the commas to the multiple includes? I'll push it along with the config role [18:06:08] New patchset: Reedy; "Everything else to 1.22wmf3" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62842 [18:06:13] it would be more consistent with what we have in the repo [18:06:22] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62842 [18:06:26] I have no really strong preference [18:06:36] Consistent is good [18:06:41] indeed [18:06:49] are you using spaces or tabs? [18:07:11] !log Created EducationProgram tables on mkwiki [18:07:19] Logged the message, Master [18:07:22] Thehelpfulone: https://iegcom.wikimedia.org/wiki/Main_Page [18:07:50] Coren: make those changes and it looks good otherwise [18:07:57] great :) you've got my email address to create me an account right? [18:08:00] want me to add the comments on the change?
[18:08:10] !log reedy synchronized wmf-config/InitialiseSettings.php 'Enable EducationProgram on mkwiki' [18:08:18] Logged the message, Master [18:08:34] Ryan_Lane: I used cindent by accident; it mixed them, but I didn't want to push a change just for whitespace. [18:08:48] I can use the opportunity to fix them. [18:08:57] you're going to make people angry if your whitespace is bad :) [18:09:02] New patchset: Reedy; "Enable EducationProgram on mkwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62843 [18:09:08] people are incredibly nitpicky about that [18:09:31] Yeah, put the comments in so we have a history of why things got where they are. [18:09:33] because gerrit makes it red :p [18:09:44] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62843 [18:10:21] Wait, do we use tabs or all spaces? If I'm going to change it, I want to change it to the right thing. :-) [18:11:11] spaces [18:11:27] Ryan_Lane: Oh, another thing. I now have a couple of .debs for dependencies that came from pip. Do we want to push those in our repo and if so... how? :-) [18:11:30] two spaces [18:11:49] we're going with upstream puppet's (stupid) style guide [18:12:02] heh [18:12:06] :) [18:12:06] our process for this sucks [18:12:18] Coren: if you like you can install puppet-lint locally and have it tell you [18:12:25] add the packaging code to a repo under operations/debs [18:12:42] New patchset: Reedy; "(bug 48026) Close wikimania2012 wiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62626 [18:12:52] then create the package and push it to brewster [18:12:53] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62626 [18:13:02] Coren: https://wikitech.wikimedia.org/wiki/Reprepro [18:13:09] Ryan_Lane: Bookmarked.
[18:13:29] this is a process I'd love for us to improve, but it's never been very high priority [18:13:32] so it never gets done [18:14:03] New review: Aaron Schulz; "Does the ForeignDBViaLBRepo I see there even work at all? I'd hope not, but why is it there?" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62606 [18:14:03] if it worked something like launchpad it would be nice [18:14:08] that's like the only nice thing in launchpad [18:14:11] actually i opened a bug to make the jenkins check ignore the tab/space issues but still report all others [18:14:11] New patchset: Reedy; "(bug 47820) Localise $wgMetaNamespace for udmwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62244 [18:14:21] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62244 [18:14:28] i would just like it to match, whichever we use [18:14:30] I think we should enforce spaces for modules [18:14:41] and as we move code to modules, then we use spaces [18:14:45] otherwise we use tabs [18:14:50] New patchset: Reedy; "(bug 47620) Exclude user and talk pages from wikidata features in clients" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/60978 [18:15:02] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/60978 [18:15:47] !log reedy synchronized database lists files: [18:15:55] Logged the message, Master [18:15:56] Coren: fwiw, for the 2-space softtab one in vim: set tabstop=2 set shiftwidth=2 set expandtab [18:16:20] highlight ExtraWhitespace ctermbg=red guibg=red [18:16:22] !log reedy synchronized wmf-config/ [18:16:23] mutante: I know. I have a black-belt in .exrc-fu. 
:-) [18:16:29] Logged the message, Master [18:16:41] Coren: kk:) [18:16:43] New patchset: coren; "Preliminary toollabs module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59969 [18:17:00] Ryan_Lane: New changeset w/ whitespace fixies and the config class [18:18:03] ori-l: this good to merge? https://gerrit.wikimedia.org/r/#/c/62751/ [18:18:09] Coren: heh [18:18:12] this isn't going to work :) [18:18:18] mutante: It's just that, by force of habit, I'm in ci with cino={.5s:.5s=.5sl1g.5sh.5s(0u0U1 rather than ai. [18:18:26] gridmaster => $grid_master, <— $grid_master isn't defined [18:18:43] it would now be $role::labs::tools::config::grid_master [18:18:57] Ryan_Lane: Wait what? [18:19:01] yep [18:19:06] No, I mean where? [18:19:18] https://gerrit.wikimedia.org/r/#/c/59969/9..10/manifests/role/labs.pp,unified [18:19:21] New patchset: RobH; "fixing typo in racktables class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62845 [18:19:33] Ah! Curses. [18:19:41] I see what you mean. [18:19:43] !log reedy synchronized wmf-config/InitialiseSettings.php [18:19:50] Logged the message, Master [18:20:01] New patchset: Reedy; "Fix wgCanonicalServer for loginwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62846 [18:20:04] racktables host? [18:20:07] oh [18:20:07] heh [18:20:10] rebased change [18:20:27] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62846 [18:21:00] I need to inherit and not include. [18:21:09] yep [18:21:48] can you do some kind of grid_master in browser? [18:21:51] maybe some kinda grid_master::flash ? 
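The config-class pattern Ryan suggests above (define $grid_master once, reference it from every role) together with Coren's later "inherit and not include" fix comes out roughly like this. This is an illustrative sketch with hypothetical class names, not the actual merged changeset:

```puppet
# Sketch only: shared Tool Labs settings live in one config class
# instead of being redefined in ~80 roles.
class role::labs::tools::config {
    $grid_master = 'tools-master.pmtpa.wmflabs'
}

# Inheriting the config class puts its variables directly in scope.
class role::labs::tools::execnode inherits role::labs::tools::config {
    include role::labsnfs::client  # temporary measure

    class { 'toollabs::execnode':
        gridmaster => $grid_master,
    }
}
```

A plain `include role::labs::tools::config` would also work, but then the reference has to be the fully-qualified `$role::labs::tools::config::grid_master` — which is exactly the undefined-variable problem Ryan points out below.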
[18:22:44] to the bit bot abitty abitty to the bit bit bot and you don't stop tooling [18:23:01] New patchset: coren; "Preliminary toollabs module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59969 [18:23:09] Ryan_Lane: I think we have a spec :) [18:23:14] :D [18:23:34] PROBLEM - Puppet freshness on db45 is CRITICAL: No successful Puppet run in the last 10 hours [18:23:52] Ryan_Lane: Now that I look at this, I should probably wrap the whole thing in a single class role::labs::tools { [18:23:59] for clarity? [18:24:01] sounds good to me [18:26:11] New patchset: Ryan Lane; "Add support for essex and drop support for diablo" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62848 [18:26:41] oh. I forgot I need to add the cloud archive repo [18:26:52] yeah no need for gaming in labs [18:27:01] :D [18:27:22] New patchset: coren; "Preliminary toollabs module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59969 [18:27:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:27:36] Last changeset is most bestest evar!!1!one [18:27:41] heh [18:28:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [18:28:40] puppet style best practices says not to embed classes in other classes :) [18:29:03] see [18:29:05] of course some folks at wikimedia disagree with that recommendation [18:29:06] another thing I disagree with :P [18:29:12] ;) [18:29:19] Marc's best practices says "readable is maintainable screw best practice" [18:29:32] Mark's best practices say the same [18:29:36] (in this case ;p) [18:29:42] I actually find not embedding them to be *more* readable [18:30:02] when I'm diving through the code, it's way easier to find a class if it's not embedded [18:30:18] You're losing the hierarchical nature if you do that, though. Structure is what guides reading.
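The two styles under debate, sketched with a hypothetical bastion class. Worth noting (and part of why the style guide discourages nesting): in Puppet, defining a class inside another only namespaces it — the inner class is not declared automatically.

```puppet
# Embedded: the hierarchy is visible in one place, which is the
# "structure guides reading" argument. The inner class is still
# named role::labs::tools::bastion.
class role::labs::tools {
    class bastion {
        include toollabs
    }
}

# Flat: the alternative spelling of the same class, declared at top
# level — easier to grep for, and what upstream style recommends.
# (Pick one form; both together would be a duplicate definition.)
class role::labs::tools::bastion {
    include toollabs
}
```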
[18:30:18] if we used modules this would be less of an issue [18:30:25] i don't mind if people don't embed, but, I won't let that stupid best practices document stop me if I find it better [18:30:32] yeah :( [18:30:38] hate puppetlabs [18:30:52] they don't make life easy, I'll say that [18:30:59] !log updating and rebooting grosley [18:31:06] Personally, I would have made the roles into modules too. Right now, having this be consistent won. [18:31:06] Logged the message, Master [18:31:08] Coren: so, I'm not going to −1 on that or anything :) [18:31:23] we kind of leave it up to personal preference [18:31:27] AaronSchulz: this still +1 in your books? https://gerrit.wikimedia.org/r/#/c/61479/ [18:31:33] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62845 [18:31:38] Coren: +2'd [18:31:57] notpeter: yes [18:32:07] Ryan_Lane: In this particular case, because of the added namespace logic, the embedding makes semantic sense "this is the 'tools' project" [18:32:17] AaronSchulz: cool! thanks [18:32:26] New patchset: Pyoungmeister; "Assign high priority to EchoNotificationJob" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61479 [18:32:39] Ryan_Lane: Yeay! [18:32:42] * Coren deploys. [18:32:51] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61479 [18:32:54] yeah. I don't have an issue with it. just mentioning what's recommended [18:33:41] AaronSchulz: also, this one: https://gerrit.wikimedia.org/r/#/c/62751/ [18:33:47] looks good to me, but want to take a quick look? [18:35:03] !log reedy synchronized php-1.22wmf3/extensions/CentralAuth [18:35:11] Logged the message, Master [18:35:55] New patchset: Aaron Schulz; "Added abandonded job stats to gdash."
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/62852 [18:36:02] Change merged: coren; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59969 [18:37:34] notpeter: after noms :) [18:45:17] let me guess, this syntax won't work: require class { "openstack::repo": openstack_version => $openstack_version } [18:45:37] So, Ryan_Lane, I'm going to be tweaking the actual toollabs module a lot during the day; since it's isolated, do you think it's a problem if I +2 myself? [18:45:45] yes [18:45:55] err [18:45:56] sorry [18:46:00] I misread that :) [18:46:08] it's fine for you to +2/merge [18:46:45] Right. Part of the point of doing the skeleton is to make sure that I can't break anything in prod. :-) [18:49:18] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59995 [18:55:27] Ryan_Lane: Ohcrap. If I remove a puppet class from the per-project config, it doesn't actually remove the keys from LDAP if it was turned on somewhere? [18:56:00] So now I have to add them back, turn 'em off, and remove them again. I'm calling this a bug. :-) [18:57:36] <^demon> Welcome to puppet! [18:57:41] no [18:57:46] wait [18:58:00] per-project config? [18:58:10] I'm confused about what you're asking :) [18:58:46] the thing that generates the keys does a query in ldap for users [18:59:01] it pulls the keys from their entry and writes it out to a file [18:59:11] nothing will stop it from running except for screwing up ldap :) [18:59:17] or disabling the cron [19:00:04] !log Ensuring securepoll database tables exist on all wikis [19:00:15] Logged the message, Master [19:00:38] man I fucking hate puppet [19:00:45] there's: require [19:00:55] but if you switch to using parameterized classes, you can't do that [19:01:54] <^demon> Does it imply require or include when you use a param'd class? [19:02:02] <^demon> Or some new dark magic that's neither? 
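Ryan's guess is right: `require` only takes class references, so `require class { "openstack::repo": ... }` is not valid, and a parameterized class can't simply be included from several places. The usual workaround for the ordering he wants is a `defined()` guard plus an explicit chaining arrow — a hedged sketch (class bodies are illustrative):

```puppet
class openstack::repo($openstack_version = 'folsom') {
    # hypothetical body: set up the Ubuntu Cloud Archive apt source
    # for the requested release
}

class openstack::common($openstack_version) {
    # A parameterized class may only be declared once, so guard the
    # declaration in case another class already made it...
    if ! defined(Class['openstack::repo']) {
        class { 'openstack::repo':
            openstack_version => $openstack_version,
        }
    }
    # ...and state the ordering that "require" would have implied.
    Class['openstack::repo'] -> Class['openstack::common']
}
```

The guard is exactly the `if ! defined` Ryan resigns himself to a few minutes later; it trades puppet's duplicate-declaration error for a first-declaration-wins race on the parameters.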
[19:02:15] well, I want the ordering specifically [19:02:26] <^demon> Indeed... [19:03:18] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59996 [19:05:09] ugh. and I can't fucking include it more than once [19:05:25] this is bullshit [19:07:03] I guess I'll need to do an if ! defined [19:07:37] New patchset: Pyoungmeister; "style cleanup" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62856 [19:08:32] hm, I see you guys submitting lots of stuff to the puppet repo [19:08:50] How long does an addition to planet.wikimedia.org usually take? [19:08:56] https://gerrit.wikimedia.org/r/#/c/60902/ [19:10:34] PROBLEM - Puppet freshness on cp3003 is CRITICAL: No successful Puppet run in the last 10 hours [19:10:34] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [19:10:45] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60000 [19:17:44] New patchset: Ryan Lane; "Add support for essex and drop support for diablo" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62848 [19:19:40] New patchset: Ryan Lane; "Change eqiad to use folsom release" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62859 [19:20:21] mutante, when you get a chance, could you run that script to create an account for me? [19:21:27] New patchset: Ryan Lane; "Add Orikrin1998 to the French Planet Wikimedia" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60902 [19:21:36] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60902 [19:23:36] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62856 [19:23:51] odder: merged.
it'll apply in the next 30 mins or so [19:25:20] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62848 [19:27:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:28:19] Thehelpfulone: i forgot the name .. searching [19:28:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 5.117 second response time [19:29:05] thank Ryan_Lane [19:29:10] yw [19:33:56] New patchset: Ryan Lane; "Only bond interface in pmtpa for nova-network" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62862 [19:34:06] New patchset: Ryan Lane; "Change eqiad to use folsom release" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62859 [19:34:20] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62859 [19:34:29] New patchset: Ryan Lane; "Only bond interface in pmtpa for nova-network" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62862 [19:34:53] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62862 [19:36:59] New patchset: coren; "SSH keystore for toollabs class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62863 [19:39:21] New patchset: coren; "SSH keystore for toollabs class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62863 [19:39:54] RECOVERY - Puppet freshness on virt1005 is OK: puppet ran at Wed May 8 19:39:48 UTC 2013 [19:40:21] !log accidently removed power from magnesium [19:40:29] Logged the message, Master [19:40:31] * Coren needs to find a better workflow with a puppetmaster::self. 
[19:40:35] New patchset: RobH; "fixing typos in config file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62864 [19:40:44] PROBLEM - Host magnesium is DOWN: CRITICAL - Host Unreachable (208.80.154.5) [19:41:05] !log magnesium is future racktables host, disregard errors for now [19:41:08] New patchset: coren; "SSH keystore for toollabs class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62863 [19:41:12] Logged the message, RobH [19:42:15] New review: coren; "Self +2 (local to tools)" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/62863 [19:42:16] Change merged: coren; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62863 [19:43:34] RECOVERY - Host magnesium is UP: PING OK - Packet loss = 0%, RTA = 0.84 ms [19:43:58] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62864 [19:44:31] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62751 [19:47:04] Thehelpfulone: check your inbox .. "A password reset email has been sent. " [19:47:20] Thehelpfulone: i set email to the one i know of you and used reset [19:47:32] and you got user_id 1 that way [19:47:36] yep works, thanks [19:47:36] heh [19:47:40] didn't even have to add myself [19:47:43] cool [19:47:57] I got user ID 1 on wikimania 2013 wiki too, but Reedy took it for 2014 wiki :P [19:48:22] i guess Reedy is uid 1 on almost all :) [19:49:03] AaronSchulz: redis ganglia stats back up [19:49:06] thanks ori-l ! 
[19:49:15] oh, yeah, that looked sane [19:49:34] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [19:49:51] New patchset: Ryan Lane; "Remove explicit requirement for essex from role" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62865 [19:50:02] New patchset: Ryan Lane; "Remove explicit requirement for essex from role" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62865 [19:50:19] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62865 [19:52:36] Ryan_Lane: So yeah, I've circumvented the lack of storeconfig on puppet by storing stuff in the shared NFS on puppet runs instead. It's teh uglies, but it works. :-) [19:52:58] oh mutante could you +crat me too so I can create accounts? [19:53:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:53:45] Thehelpfulone: he just walked out to lunch [19:54:01] notpeter, thanks, yeah this can wait :) [19:54:24] notpeter: https://gerrit.wikimedia.org/r/#/c/62852/1 [19:54:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [19:55:17] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62852 [19:55:35] AaronSchulz: merged on sockpuppet [20:02:05] Thehelpfulone: You has 'crat [20:02:12] thanks [20:14:33] New patchset: Yurik; "Removed X-Carrier and testing IP ranges" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62867 [20:14:57] !log authdns-update to swap in grosley for dead aluminium [20:15:05] Logged the message, Master [20:15:57] New patchset: coren; "More tweaks and bugfixes for the Tool Labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62869 [20:19:11] New review: coren; "This gets the last of the base stuff working." 
[operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/62869 [20:19:12] Change merged: coren; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62869 [20:19:31] notpeter: np :) [20:20:17] New patchset: Andrew Bogott; "First pass at a labsconsole puppet setup" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53989 [20:20:18] New patchset: Andrew Bogott; "Fix up deps and unless clause for privacy policy import." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62840 [20:20:18] New patchset: Andrew Bogott; "Switch the openstack manifest to use webserver::php5." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51798 [20:20:27] !log mwalker synchronized php-1.22wmf3/extensions/CentralNotice/ 'CentralNotice security bugfix 48255' [20:20:35] Logged the message, Master [20:26:10] New patchset: Petrb; "Inserted some more packages to development set" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62913 [20:26:32] New patchset: coren; "Fix duplicate package definitions in toollabs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62914 [20:26:57] I should probably make a branch and stop spamming -operations. Does gerrit support that workflow at all? [20:27:10] Ah, nevermind, I couldn't deploy it if I did. [20:30:01] Coren: it's a standard for people to be self-conscious about that, but I wouldn't worry about it -- people don't notice it as much as you think. There are lots of noisy event-emitting things around and after a while people just learn to pick out the things they care about. Do whatever feels productive. [20:30:07] New review: coren; "Can't work; puppet doesn't want multiple definitions of the same package." 
[operations/puppet] (production) C: -2; - https://gerrit.wikimedia.org/r/62913 [20:31:40] New review: coren; "(1 comment)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62913 [20:31:43] New patchset: Petrb; "Inserted some more packages to development set" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62913 [20:32:51] Is jenkins ill again? [20:32:56] Pressing the rebase button is fun [20:43:34] PROBLEM - Puppet freshness on db44 is CRITICAL: No successful Puppet run in the last 10 hours [20:44:24] New review: coren; "LGM. Will submit once Jenkins comes back." [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/62913 [20:45:02] * Coren kicks Jenkins. [20:46:28] Reedy: I'd like to run a mini maintenance script for each wiki (it echoes $wgRC2UDPPrefix, $lang.$site and $wgServer sans .org). But it appears mwscript won't take a path as argument. [20:46:33] Any experience in this area? [20:46:41] e.g. how should I be doing it. [20:47:04] <^demon> mwscript expects stuff to be relative to $IP, iirc. [20:47:27] Right. Looks like you can fool it with ../../../home/johndoe/foo.php [20:48:12] hm.. works when I run it manually, it fails when I do it with foreachwiki [20:48:32] ah, different path. foreachwiki needs sudo -u apache, in which case the path is slightly deeper. [20:56:13] New review: coren; "LGM, change local to tools."
[operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/62914 [20:56:14] Change merged: coren; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62914 [20:56:27] Change merged: coren; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62913 [21:00:00] sudo -u apache foreachwiki path/to/script.php --parameter [21:06:50] !log Running small maintenance script foreachwiki to gather data about odd wgRC2UDPPrefix values (bug 28276) [21:06:58] Logged the message, Master [21:11:10] Krinkle: thank you for working on that bug :D [21:11:30] Reedy: legoktm: https://gist.github.com/Krinkle/5543694 [21:11:53] Now that we have the odd values, we can easily configure the "new" way without breaking the existing exceptions [21:12:30] hm [21:12:47] that includes private wikis which dont have IRC channels right? [21:13:34] Doesn't matter [21:13:45] ok [21:13:56] the variable is set for each wiki, whether it is used is another matter [21:15:27] right [21:17:44] PROBLEM - Host barium is DOWN: CRITICAL - Host Unreachable (208.80.154.12) [21:23:34] New patchset: RobH; "updating racktables role to include php5-mysql" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62921 [21:24:33] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62921 [21:28:15] New patchset: Krinkle; "Fix various path inflexibilities and inconsistencies" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62923 [21:28:49] New review: Krinkle; "Tested locally but not 100% whether it will work as expected on fenari. Needs to be tested carefully." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62923 [21:29:00] Seriously. You need to login to github to view a gist? [21:29:13] Really? 
[21:29:28] works for me without signing in
[21:29:30] Second time it didn't
[21:29:33] really helpful
[21:34:30] New patchset: coren; "Collect Tool Labs' SSH keys into a ssh_known_host" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62924
[21:36:27] !log mholmquist Started syncing Wikimedia installation... : Fix UploadWizard concurrency and minor message updates
[21:36:35] Logged the message, Master
[21:36:51] New review: Reedy; "Needs rebasing" [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/62814
[21:36:52] Woooooo
[21:38:49] New patchset: coren; "Collect Tool Labs' SSH keys into a ssh_known_host" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62924
[21:42:12] New patchset: Krinkle; "populateWikiversions: Use MULTIVER_COMMON_HOME instead of /h/w/common" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62926
[21:42:21] New review: coren; "Evil, but correct." [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/62924
[21:42:22] Change merged: coren; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62924
[21:42:59] New patchset: Krinkle; "Bug 28276 - Inconsistently named IRC recent changes channels" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47307
[21:44:46] !log mholmquist Finished syncing Wikimedia installation... : Fix UploadWizard concurrency and minor message updates
[21:44:54] Logged the message, Master
[21:46:13] New patchset: coren; "Fix: gdb already part of base.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62927
[21:47:41] New review: coren; "Trivial fix" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/62927
[21:47:41] Change merged: coren; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62927
[21:50:11] New patchset: coren; "Typo fix: Puppet doesn't like proper grammar" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62928
[21:54:37] New review: coren; "Last one for today." [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/62928
[21:54:38] Change merged: coren; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62928
[21:58:37] New patchset: coren; "Typo fix: Another instance of the same typo." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62929
[21:58:37] I lied.
[21:59:53] New review: coren; "Not going to say last one, because that will clearly /cause/ a problem that needs a fix." [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/62929
[21:59:54] Change merged: coren; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62929
[22:06:11] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62926
[22:15:34] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours
[22:17:38] New patchset: RobH; "final steps of racktables puppetization and migration" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62932
[22:18:55] New patchset: RobH; "final steps of racktables puppetization and migration" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62932
[22:20:35] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62932
[22:23:49] !log mholmquist synchronized php-1.22wmf3/extensions/UploadWizard/UploadWizardHooks.php 'Fixing Special:Preferences for wikis with UploadWizard'
[22:23:57] Logged the message, Master
[22:25:24] PROBLEM - NTP on magnesium is CRITICAL: NTP CRITICAL: Offset unknown
[22:26:49] !log authdns update to migrate racktables from hooper to magnesium. both servers are already using db1001 as master, which is why hooper is slow to load racktables data
[22:26:58] Logged the message, RobH
[22:29:24] RECOVERY - NTP on magnesium is OK: NTP OK: Offset 0.00342297554 secs
[22:44:39] New patchset: Anomie; "Make $wgCodeEditorEnableCore configurable per wiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62814
[23:01:09] andrewbogott_afk: ping me when you get back?
[23:06:34] PROBLEM - Puppet freshness on neon is CRITICAL: No successful Puppet run in the last 10 hours
[23:22:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:23:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time
[23:27:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:28:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time
[23:49:34] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours
[23:49:34] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours
[23:49:34] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours