[00:00:19] hoo: Psh. :-) [00:00:24] (03CR) 10Chad: [C: 032] Elasticsearch upgrade starting [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117095 (owner: 10Manybubbles) [00:00:31] <^d> In that case, lez go. [00:00:31] Fine [00:00:32] thanks ^d [00:00:38] (03Merged) 10jenkins-bot: Elasticsearch upgrade starting [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117095 (owner: 10Manybubbles) [00:00:50] I have the conch [00:02:32] !log Starting Elasticsearch upgrade [00:02:41] Logged the message, Master [00:02:41] !log manybubbles synchronized wmf-config/jobqueue-eqiad.php 'Pausing Cirrus jobs for the duration of the upgrade.' [00:02:49] Logged the message, Master [00:03:07] !log manybubbles synchronized wmf-config/InitialiseSettings.php 'Turn Cirrus of for the duration of the upgrade' [00:03:15] Logged the message, Master [00:04:05] greg-g: I'm done syncing for now [00:04:14] I'm not sure if I want to do both submodule updates at once...probably faster that way though [00:04:58] ^d: cirrus jobs are still running. I imagine the puppet change hasn't hit all the job runners yet [00:05:01] manybubbles: cool [00:05:16] rdwrer: you ready? or should I pass the conch to jgonera / kaldari? [00:05:37] <^d> manybubbles: puppet change just removed them from prioritized queue. [00:05:48] Give it to jgonera [00:05:49] <^d> It was the wmf-config change that added them to excluded from default. [00:05:50] and that config change removed them from the default queue [00:06:00] so they should stop [00:06:56] jgonera: you ready? [00:07:02] maybe they are still running because the job runners have to finish their loops? [00:07:11] <^d> manybubbles: Could be. [00:07:12] greg-g, kaldari will start and I will continue [00:07:25] I'd like to go before you guys scap though [00:07:37] Or I can scap I guess [00:07:39] you are scapping again? give me a sec [00:07:58] jgonera: can you have kaldari join this channel please? [00:08:11] yep [00:08:12] about to remove snapshot hosts from dsh [00:08:33] if i could load the change .. [00:08:34] scap's log output looks a little funky right now [00:08:50] It's in a state of flux [00:08:59] So don't panic! [00:09:21] Also you will get failures from shapshot[1234] [00:09:23] (03CR) 10Dzahn: [C: 032] Re-remove snapshot[1234] from mediawiki-installation dsh group [operations/puppet] - 10https://gerrit.wikimedia.org/r/117326 (owner: 10BryanDavis) [00:09:33] Or maybe not :) [00:09:55] ^d: confirmed that cirrus is not available as a betafeature on enwiki [00:09:58] bd808: there you go, i saw it, i saw scap.. [00:10:11] mutante: My hero [00:10:20] <^d> manybubbles: Yeah, we've failed back. [00:10:30] <^d> You can see the drop off in traffic on Elastic. [00:10:30] that part worked [00:10:40] holy mother of god [00:10:44] you can [00:10:48] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=es_query_time&s=by+name&c=Elasticsearch+cluster+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=2&z=small&hc=4 [00:11:11] so the load just generated by the jobs is ~5% which is cool to know [00:11:57] OK I have three commits up and ready to go [00:12:02] Whenever [00:13:20] manybubbles, rdwrer: is it OK for me to sync things or are you guys still deploying? [00:13:29] you may go any time [00:13:33] I have given up the conch [00:13:46] kaldari: I haven't started, nobody's said go [00:14:05] ^d: I'm thinking we wait 20 minutes after the sync before we start going "why hasn't it stopped yet?" [00:14:08] <^d> manybubbles: So, how to shut these jobs off. 
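Two separate switches are in play in the exchange above: puppet's jobs-loop.sh controls which job types the runners treat as prioritized, while wmf-config's jobqueue-eqiad.php controls which types are excluded from the default queue. A minimal way to eyeball both from tin, assuming the host and paths named elsewhere in this log (mw1002, /usr/local/bin/jobs-loop.sh, /a/common):

    ssh mw1002 'grep -i cirrus /usr/local/bin/jobs-loop.sh'   # cirrus still listed => runner has the old puppet config
    grep -ni cirrus /a/common/wmf-config/jobqueue-eqiad.php   # the exclusion list the sync just changed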
[00:14:14] <^d> Heh [00:14:16] well [00:14:29] the loop has times in it like: lpmaxdelay=600 [00:15:31] if you can shell into the job runners and take a look then be my guest. [00:15:54] if they have outdated versions of /usr/local/bin/jobs-loop.sh then we'll be here a while [00:16:01] basically if that lists cirrus [00:16:54] be back in a minute [00:18:06] kaldari: Are you going now? [00:18:11] <^d> manybubbles|away: Worst case: we could set $wgDisableSearchUpdate to true. It'll turn all our jobs into no-ops. [00:18:12] yep [00:18:15] 'k [00:18:30] ^d: yeah. we can recover from that by reindexing a wide swath of things [00:19:19] I don't see the jobs slacking.... [00:19:22] <^d> mw1002 (chosen at random) has the up to date version of jobs-loop [00:19:38] Reedy: around by chance? (not that you should at midnight) [00:19:44] I am [00:19:55] jamesofur: You don't know Reedy very well do you [00:20:02] heh [00:20:12] rdwrer: I do ;) I just feel the need to put a caveat ;) [00:20:13] * James_F grins. [00:20:50] Reedy: I'm trying to figure out what's up with legalteamwiki, it seems like everything was done but so far the dns seems to be going elsewhere. https://legalteam.wikimedia.org/ (for me) goes to a wikimedia.org type site and https://legalteam.wikimedia.org/wiki redirects to foundation wiki [00:21:03] is it just apache's catching up [00:21:05] (03CR) 10Dzahn: "we need to find a replacement host for this either or, we are really trying to get away from fenari instead of adding new things. so the q" [operations/puppet] - 10https://gerrit.wikimedia.org/r/117250 (owner: 10Nemo bis) [00:21:06] ? [00:21:33] Nope... [00:21:43] We were having a couple of issues earlier, but they should've been fixed [00:21:59] ^d: https://gist.github.com/nik9000/8514c609412bcbe745a3 [00:22:03] got it, I think [00:22:35] that list doesn't have cirrusSearchLinksUpdate which is the only one I see still running [00:23:10] springle: is it just me or do we have unnecessarily unique indexes on the _links tables? [00:23:13] like something is stripping it [00:23:15] <^d> Also doesn't have delete. [00:23:32] what happened! [00:23:40] CREATE UNIQUE INDEX /*i*/il_from ON /*_*/imagelinks (il_from,il_to); [00:23:41] CREATE UNIQUE INDEX /*i*/il_to ON /*_*/imagelinks (il_to,il_from); [00:24:15] * AaronSchulz prefers the first being unique (random preference) [00:25:25] ^d: maybe replace += with array_merge? [00:26:54] The + operator returns the right-hand array appended to the left-hand array; for keys that exist in both arrays, the elements from the left-hand array will be used, and the matching elements from the right-hand array will be ignored. [00:26:57] <^d> Yeah. [00:27:48] <^d> Fixing now. [00:28:11] (03PS1) 10Manybubbles: Shut off cirrus jobs for real this time [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117353 [00:28:34] greg-g: gonna want the conch back when we can get it [00:28:46] Apparently we're going over time. [00:28:49] ^d: the other += is probably wrong too [00:28:53] AaronSchulz: yeah, don't need both strictly speaking [00:28:56] <^d> Yeah [00:28:59] <^d> I'm fixing it all now. [00:29:07] jgonera: kaldari how ya'll doing? [00:29:09] springle: I should check if we have those on prod [00:29:15] rdwrer: technically I have all night..... 
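For context on the fix ^d is preparing: $wgJobTypesExcludedFromDefaultQueue is a numerically indexed list, so PHP's + operator keeps the left-hand element for every colliding key and the appended Cirrus job types are silently dropped, exactly as the manual text quoted above says; array_merge() renumbers the keys and keeps everything. A tiny demonstration with illustrative job names (not the real config):

    php -r '
    $existing = array( "AssembleUploadChunks", "PublishStashedFile" );
    $cirrus   = array( "cirrusSearchLinksUpdate", "cirrusSearchDeletePages" );
    var_export( $existing + $cirrus );                # keys 0 and 1 collide: the cirrus entries vanish
    echo "\n";
    var_export( array_merge( $existing, $cirrus ) );  # all four job types survive
    '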
[00:29:18] AaronSchulz: still need the reverse as normal key though [00:29:28] yeah, don't much worry about manybubbles's stuff [00:29:41] mine can wait for you though [00:29:41] right [00:30:04] (03PS1) 10Chad: Don't use += with $wgJobTypesExcludedFromDefaultQueue [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117354 [00:30:09] greg-g, fine, if I don't run out of disk space updating all the submodules [00:30:22] (03Abandoned) 10Manybubbles: Shut off cirrus jobs for real this time [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117353 (owner: 10Manybubbles) [00:30:24] Hahahah [00:30:26] (03CR) 10Manybubbles: [C: 031] Don't use += with $wgJobTypesExcludedFromDefaultQueue [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117354 (owner: 10Chad) [00:30:45] rdwrer: how prep'd are you? [00:30:47] we can merge that when we have the conch [00:30:54] | this | prepped, greg-g [00:31:06] (ready to pull code onto tin and sync) [00:31:22] Well, merge things first, then that [00:31:29] * greg-g nods [00:32:55] Jamesofur: seems it's a few rogue apaches [00:33:21] !log graceful mw1040 [00:33:26] that's good, do they generally just clean out? I know I've seen similar things for new wikis before [00:33:30] Logged the message, Master [00:33:35] that was one of them [00:33:39] ^d: gonna get what I can moving [00:33:42] AaronSchulz: although, thinking about it, we should benchmark the second unique index vs non-unique. the optimizer can use the uniqueness for choosing a query plan, but whether it's of more value that the overhead of maintaining uniqueness... [00:33:54] jamesofur: for some reason a random server did not get restarted [00:33:56] * springle needs more coffee [00:33:59] !log [Elasticsearch upgrade] Running puppet everywhere to make sure we have the newest config [00:34:06] Logged the message, Master [00:34:08] jamesofur: after normal graceful that also redirects [00:34:45] springle: what time is it on your side of the world? [00:35:00] mid-morning [00:35:08] ah, perfect time for coffee [00:35:15] but coffee is an all-day requirement ;) [00:35:17] heh [00:35:26] It's always 06:15 somewhere [00:35:26] I could use some, now that I think about it [00:35:44] * manybubbles goes to start brewing coffee [00:36:04] (03CR) 10Chad: [C: 032] Don't use += with $wgJobTypesExcludedFromDefaultQueue [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117354 (owner: 10Chad) [00:36:08] mutante: redirects to foundation wiki? [00:36:45] (03Merged) 10jenkins-bot: Don't use += with $wgJobTypesExcludedFromDefaultQueue [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117354 (owner: 10Chad) [00:36:46] jamesofur: no, to legalteam. i said redirect because i see it as a 301 to the https version [00:36:49] 301 Moved Permanently https://legalteam.wikimedia.org/wiki/ [00:37:09] springle: I doubt it would help considering that uniqueness means poor change buffer usage and all the write queries are either inserts or DELETE WHERE *_from = X [00:37:15] ahh, yeah.. but so far I've always going from there to https://wikimediafoundation.org/wiki/Main_Page (3 different servers so far) [00:37:36] !log demon synchronized wmf-config/jobqueue-eqiad.php 'Fixing $wgJobTypesExcludedFromDefaultQueue config' [00:37:44] Logged the message, Master [00:38:41] <^d> manybubbles: Slowing down now. [00:38:55] <^d> And stopped. 
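Chasing a "rogue apache" like mw1040 usually comes down to reloading its config and then hitting it directly with the right Host header; a sketch only, since the backend FQDN and the assumption that it answers plain HTTP are mine, not the log's:

    ssh mw1040.eqiad.wmnet 'sudo apache2ctl graceful'   # reload vhosts without dropping in-flight requests
    curl -sI -H 'Host: legalteam.wikimedia.org' http://mw1040.eqiad.wmnet/wiki/ | head -n 3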
[00:39:40] indeed [00:40:01] cool [00:40:16] we look dormant [00:40:24] I'm still getting one last puppet run in on the machine [00:40:26] machines [00:40:56] !log gracefull'ing rogue apaches, mw1070,mw1089,mw1104,mw1111 [00:40:57] <^d> I think the only person who noticed we failed back was Dan because he was still playing with that delete bug. [00:40:59] <^d> :) [00:41:03] ha [00:41:04] Logged the message, Master [00:41:14] I'll be back in a moment. one kid is sick [00:41:34] (03PS1) 10Cmjohnson: Adding dns entries for db1061-63 [operations/dns] - 10https://gerrit.wikimedia.org/r/117358 [00:42:49] <^d> Heh, even with all those wikis going back to lsearchd lsearchd is still bored. [00:43:40] !log gracefull'ing rogue apaches, mw1131,mw1189,mw1190,mw1215 [00:43:47] Logged the message, Master [00:43:52] kaldari, jgonera: Are you guys walking the updates to the datacenter over there? :P [00:43:58] AaronSchulz: probably correct, but difficult to predict for all traffic patterns. technically insert buffer can slow down reads on non-unique indexes if many unmerged records, but of course you'll say reads are easier to scale than writes :) [00:44:06] rdwrer, almost there [00:44:19] back [00:44:26] Reedy: jamesofur ^ that is more of that [00:44:28] !log [Elasticsearch upgrade] Elasticsearch is now quiescent [00:44:28] heh [00:44:36] Logged the message, Master [00:44:41] thanks much mutante [00:44:46] something must have gone wrong with apache-graceful-all [00:44:50] when that was added [00:45:17] mutante: as I find other servers do you want them? :) [00:45:41] jamesofur: we got the list with apache-fast-test pybal option that goes through all [00:45:43] rdwrer, we had to update both wmf16 and wmf17 and jenkins is taking it's time merging [00:45:52] jamesofur: so _should_ be all now, Reedy checked [00:45:53] wmf16 is done, waiting for 17 [00:46:14] Oh, good, I have that to look forward to then [00:46:44] * greg-g ponders a deployment-only jenkins [00:47:34] !log [Elasticsearch upgrade] Disabling puppet so it doesn't restart Elasticsearch while we're upgrading it [00:47:37] Smart plan [00:47:42] Logged the message, Master [00:47:58] (03PS3) 10Chad: Elasticsearch upgrade ending [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117251 [00:47:58] mutante: I just had it on mw1078 mw1066 and mw1173 (if the comment is to be believed) haven't yet gotten to the wiki :( [00:48:08] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (213572) [00:48:11] rdwrer: there's already some thoughts on a 'private' gerrit/jenkins for dealing with eg security patches [00:48:16] (03PS2) 10Cmjohnson: Adding dns entries for db1061-63 [operations/dns] - 10https://gerrit.wikimedia.org/r/117358 [00:48:16] Hm [00:48:18] yeah yeah yeah [00:48:19] Interesting [00:48:22] we're going to have lots of jobs [00:48:23] surprise surprise [00:48:30] we're just sucking them up while we do the upgrade [00:48:43] !log [Elasticsearch upgrade] Turning off shard reallocation so we don't thrash while Elasticsearch shuts down [00:48:51] Logged the message, Master [00:49:22] !log [Elasticsearch upgrade] Shutting down Elasticsearch [00:49:30] Logged the message, Master [00:49:55] rdwrer: we're all done [00:50:00] rdwrer, we're done, just need you to scap after yo deploy [00:50:02] Righto [00:50:07] * rdwrer does this thing [00:50:12] (03CR) 10MarkTraceur: [C: 032 V: 032] Enable ULS 'compact language links' Beta Feature on all normal wikis 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117344 (owner: 10Jforrester) [00:50:24] Two more and I'll start pulling [00:50:38] in a few minutes icinga will warn about elasticsearch being down.... [00:50:45] <^d> manybubbles: got your e-mail, ack. [00:50:52] cool [00:50:58] half elasticsearch nodes down [00:51:03] springle: https://en.wiktionary.org/wiki/Special:WhatLinksHere/Module:languages/data3/w?namespace=0 how did this not give me a gateway timeout? [00:51:05] (03Merged) 10jenkins-bot: Enable ULS 'compact language links' Beta Feature on all normal wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117344 (owner: 10Jforrester) [00:51:11] jamesofur: Reedy apache-fast-test legalteam.url mw1078 [00:51:11] By the power of Thor and Mjolnir, I beseech the gods to guide this lightening deploy. [00:51:20] that's been spinning for like 30min in FF ;) [00:51:24] jamesofur: Reedy * 301 Moved Permanently https://legalteam.wikimedia.org/wiki/ [00:51:44] (03PS1) 10Chad: Revert "Remove cirrus jobs from priority list" [operations/puppet] - 10https://gerrit.wikimedia.org/r/117360 [00:51:58] !log [Elasticsearch upgrade] Upgrading Elasticsearch [00:52:05] (03CR) 10Chad: [C: 04-1] "Not yet, just prepping." [operations/puppet] - 10https://gerrit.wikimedia.org/r/117360 (owner: 10Chad) [00:52:06] Logged the message, Master [00:52:08] PROBLEM - ElasticSearch health check on elastic1005 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.112 [00:52:08] PROBLEM - ElasticSearch health check on elastic1008 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.140 [00:52:08] PROBLEM - ElasticSearch health check on elastic1015 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.12 [00:52:12] jamesofur: so.. what we fixed is making them all the same state [00:52:13] !log mholmquist updated /a/common to {{Gerrit|Iad8c84a7d}}: Don't use += with $wgJobTypesExcludedFromDefaultQueue [00:52:18] PROBLEM - ElasticSearch health check on elastic1003 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.110 [00:52:19] PROBLEM - ElasticSearch health check on elastic1004 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.111 [00:52:19] PROBLEM - ElasticSearch health check on elastic1002 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.109 [00:52:19] PROBLEM - ElasticSearch health check on elastic1011 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.143 [00:52:19] PROBLEM - ElasticSearch health check on elastic1001 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.108 [00:52:19] PROBLEM - ElasticSearch health check on elastic1006 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.113 [00:52:19] PROBLEM - ElasticSearch health check on elastic1010 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.142 [00:52:20] PROBLEM - ElasticSearch health check on elastic1014 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.11 [00:52:20] PROBLEM - ElasticSearch health check on elastic1016 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.13 [00:52:21] PROBLEM - ElasticSearch health check on elastic1013 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.10 [00:52:21] Logged the message, Master [00:52:21] PROBLEM - ElasticSearch health check on elastic1012 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.144 [00:52:27] jamesofur: but i also never saw the wiki , so maybe that was reverted [00:52:28] PROBLEM - ElasticSearch health check on elastic1009 is CRITICAL: CRITICAL - 
Could not connect to server 10.64.32.141 [00:52:34] rdwrer I'm buying you a drink sometime for being the only one who follows the directions. [00:52:35] surprise! [00:52:43] Hehehe [00:53:08] PROBLEM - LVS HTTP IPv4 on search.svc.eqiad.wmnet is CRITICAL: Connection refused [00:53:25] mutante: Reedy yeah, they are certainly all doing the same thing but for me that means 'https://legalteam.wikimedia.org/' is showing wikimedia.org (though not actually redirecting) and /wiki is going to wikimediafoundation.org [00:54:28] uh, it isn't downloading the deb.... [00:54:35] 0% [1 elasticsearch 49.2 kB/18.5 MB 0%] [00:55:07] well... [00:55:07] there it goes [00:55:10] it was stuck..... [00:55:13] jamesofur: yea, uhm, i wasn't involved in adding the actual redirect, i just heard that something was reverted [00:55:23] Jamesofur: How about now? [00:55:32] \o/ [00:55:34] works [00:55:35] I purged some stuffs from apache/varnish [00:55:43] you've got shell, right? [00:55:47] I do yes [00:55:47] ah, varnish ? [00:55:53] AaronSchulz: there are a bunch of /* SpecialWhatLinksHere::showIndirectLinks Aaron Schulz */ queries logged all getting sniped at 5m, then reappear. would it retry endlessly? [00:55:53] I can create an account for myself [00:55:58] If pages show weirdly.. [00:55:59] echo "http://legalteam.wikimedia.org" | mwscript purgeList.php aawiki [00:56:07] Reedy: cool! works [00:56:10] nice [00:56:23] good old aawiki [00:56:23] try using the url in there (full path), it'll make sure it's purged [00:56:39] There's always a few that get caught up when we have a few servers playing up [00:57:19] needs logo [00:57:25] yup, I need to upload [00:57:30] :) [00:57:39] thank you very much both of you [00:57:40] hi [00:58:10] yw, i wonder how the "legal" logo will look like [00:58:25] !log Killing a duplicate Jenkins java process on gallium (init.d script sucks, I really need to get it fixed one day) [00:58:33] Logged the message, Master [00:58:36] With my choice? I can tell you, it'll be warm and fuzzy and look like a tiger ;) [00:58:40] we'll see if that lasts [00:58:43] rdwrer, let me know when scap is finished [00:59:28] jgonera: Still waiting on slowass^WJenkins [00:59:29] !log [Elasticsearch upgrade] Starting Elasticsearch [00:59:34] jamesofur: oooh, that makes sense, you also still have those mail redirects, his name in a bunch of spelling variants [00:59:37] Logged the message, Master [00:59:38] rdwrer: hah [00:59:46] rdwrer, no worries, I just need to test things when it's done [00:59:50] 'kay [00:59:53] !log [Elasticsearch upgrade] Verifying versions [01:00:00] Logged the message, Master [01:00:09] https://integration.wikimedia.org/zuul/ I'm waiting on 117350 [01:00:09] !log killed wrong jenkins process (+1 for 2am fix up). Restarting jenkins [01:00:12] mutante: heheh, when we get email at those it's great, usually someone takes lunch or free time and writes a response in character (with correct legal info and all) [01:00:15] Oh fuck [01:00:18] Logged the message, Master [01:00:36] surely handling a jenkins issue at 2am with X beers consumed is not the smartest idea [01:00:39] springle: I'm not refreshing on my end [01:00:39] our internal lca tools already use a picture of him [01:00:52] hashar: During the lightning deploy, while we're waiting for things to merge? maybe not [01:00:57] rdwrer: I killed Jenkins sorry. Will be back short [01:00:59] * rdwrer grumbles [01:00:59] greg-g: Do you have shell access? [01:01:02] !log [Elasticsearch upgrade] Wait for the cluster to recover. 
[01:01:06] meta-jenkins, go validate the status of the jenkis admin :) [01:01:06] rdwrer: sorry :/ Jenkins went wild btw [01:01:08] RECOVERY - LVS HTTP IPv4 on search.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 387 bytes in 0.002 second response time [01:01:09] Logged the message, Master [01:01:24] rdwrer: luckily it is restarting fast nowadays [01:01:27] Yeah [01:01:27] hashar: !:) hi, thanks, i think it already fixed itself after ^d restarted [01:01:40] bd808: yeah [01:01:57] hashar: Jenkins has been a slow jerk all day. ^d restarted it earlier [01:02:15] greg-g: on fluorine you can do this `tail -50f /a/mw-log/scap.log | python ~bd808/scaplog.py` [01:02:25] And watch scap happen irl [01:02:30] ^d: yeah so when restarting jenkins, it emits a kill to the jenkins process but for some interesting reason the java process never died and a SECOND Jenkins is started while the first slowly consume all the memory (ping mutante) [01:02:41] <^d> Blegh. [01:02:44] Jenkins piss me off [01:03:00] to be honest that is merely become I have Zero Java knowledge or I would fix i [01:03:00] !log [Elasticsearch upgrade] All primary shards have started. Waiting on secondary. [01:03:00] t [01:03:10] Logged the message, Master [01:03:26] !log Jenkins backup [01:03:30] !log Jenkins back up [01:03:31] !log [Elasticsearch upgrade] Reenabling puppet [01:03:33] Logged the message, Master [01:03:40] Logged the message, Master [01:03:45] rdwrer: sorry for the trouble during lightning deplot [01:03:47] Logged the message, Master [01:03:52] Welcome to the second hour of the lightning deploy, with your DJs greg-g and marktraceur... [01:04:02] ^d: https://test2.wikipedia.org/w/index.php?search=asdf&title=Special%3ASearch&go=Go&fulltext=1&srbackend=CirrusSearch worked [01:04:07] * hashar get yet another booze and listen to the DJs [01:04:11] And now a word from our sponsors: Greased Lightning [01:04:44] ^d: bd808: I have tricked some third parties in implementing a python job runner which should be able to replace most of Jenkins functionalities :-] [01:05:20] ^d: also good: https://commons.wikimedia.org/w/index.php?search=file:space&title=Special%3ASearch&go=Go&fulltext=1&srbackend=CirrusSearch [01:05:28] hashar: sweet. [01:05:46] FINALLY [01:05:48] Fuck [01:06:03] <^d> manybubbles: How many nodes are we on? [01:06:52] ^d: all six nodes are up. we have about 2000 shards left to start [01:06:58] but all primary shards have started [01:07:02] OK, scapping [01:07:03] mutante: a good thing is that the Icinga alarms for contint works properly :] [01:07:04] Like a boss [01:07:18] bd808: I thought I did, I haven't used it in a long time, and I'm getting permission denied :/ [01:07:21] LIGHTNING DEPLOY!!!!111!!!!!!ONE!!!!!ELEVEN!!!!12111 [01:07:21] mutante: thank you very much. You have saved my friday morning! [01:07:23] !log mholmquist Started scap: (no message) [01:07:32] Logged the message, Master [01:07:32] ...oh, I'm bad. [01:07:35] rdwrer: append BBQ !!111HOLY [01:07:36] yes.... [01:07:38] what a n00b [01:07:49] hashar: haha, nice how you always find the positive part, but yea, good to know you got mail [01:07:50] !log That scap was for ULS, VE, and MobileFrontend fixes and updates. [01:07:59] Logged the message, Master [01:07:59] hashar: cheers [01:08:00] ^d: I think it restores the primary shards at a higher priority [01:08:07] <^d> I want bbq now. 
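The !log steps above (turn off shard reallocation, shut down, upgrade, restart, wait for recovery) map onto Elasticsearch's cluster settings API roughly as below. Treat it as a sketch: the exact setting name depends on the version (0.90 used cluster.routing.allocation.disable_allocation, 1.x uses cluster.routing.allocation.enable), and elastic1001 is just an arbitrary node:

    ES=http://elastic1001:9200
    # 1. stop shard reallocation so the cluster doesn't thrash while nodes restart
    curl -s -XPUT "$ES/_cluster/settings" -d '{"transient":{"cluster.routing.allocation.enable":"none"}}'
    # 2. upgrade and restart elasticsearch on each node (via puppet/dsh)
    # 3. re-enable allocation, then watch recovery until the status goes green
    curl -s -XPUT "$ES/_cluster/settings" -d '{"transient":{"cluster.routing.allocation.enable":"all"}}'
    watch -n 5 "curl -s '$ES/_cluster/health?pretty'"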
[01:08:07] the last 2000 are going slower then the first 1000 [01:08:16] mutante: given our workload and all the crazy projects we have to handle, we better have to find the good parts or we would all shoot ourselves! :D [01:08:36] Hm... [01:09:05] Keep an eye on this one [01:10:06] springle: huh, << curl --head 'https://en.wiktionary.org/wiki/Special:WhatLinksHere/Module:languages/data3/w?namespace=0' >> gives me the expected 504 [01:10:19] (03PS1) 10Springle: ops ldap group too restrictive [operations/puppet] - 10https://gerrit.wikimedia.org/r/117361 [01:10:54] James_F: Did you set wmgULSCompactLinks anywhere? [01:11:08] It may be that I failed to gate properly on that patchset. [01:11:11] rdwrer: Yes… [01:11:20] James_F: Where? [01:11:36] Argh. [01:11:42] In InitialSettings.php. [01:11:46] Which didn't get into the commit. [01:11:48] (03CR) 10Springle: [C: 032] ops ldap group too restrictive [operations/puppet] - 10https://gerrit.wikimedia.org/r/117361 (owner: 10Springle) [01:11:51] * James_F sighs. [01:12:12] James_F: Excellent, well, at least we only need to sync one file [01:12:29] James_F: Do you want to fix it or should I? [01:12:46] ^d: seriously, it took ~3 minutes to restore all 1486 primary shards and it is taking that long for just 100 secondary ones.... [01:12:57] (03PS1) 10Jforrester: Follow-up: Icf0bef96306661 – missing file(!) from commit [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117363 [01:13:00] <^d> :\ [01:13:01] rdwrer: ^^^ [01:13:18] (03CR) 10MarkTraceur: [C: 032] Follow-up: Icf0bef96306661 – missing file(!) from commit [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117363 (owner: 10Jforrester) [01:13:25] Let's not leave things borked for too long, eh [01:13:26] (03Merged) 10jenkins-bot: Follow-up: Icf0bef96306661 – missing file(!) from commit [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117363 (owner: 10Jforrester) [01:13:29] rdwrer: Sorry, was moving too quickly and didn't check. :-( [01:13:37] whatever, we can wait a few minutes. I think it might think that it is back in service so it doesn't want to swamp the disk [01:13:38] James_F: Ditto [01:13:46] No errors yet knockonwood [01:13:48] PROBLEM - Disk space on snapshot1003 is CRITICAL: DISK CRITICAL - free space: / 563 MB (2% inode=69%): [01:14:08] PROBLEM - Disk space on snapshot1001 is CRITICAL: DISK CRITICAL - free space: / 281 MB (1% inode=69%): [01:14:43] rdwrer: You weren't watching closely then :) snapshot1004 is out of disk space still [01:15:07] Errors in the scap, but you told me not to worry about those [01:15:12] ^d: 1500 [01:15:14] No errors on the site [01:15:14] no* errors* [01:15:53] (03CR) 10Cmjohnson: [C: 032] Adding dns entries for db1061-63 [operations/dns] - 10https://gerrit.wikimedia.org/r/117358 (owner: 10Cmjohnson) [01:16:16] jgonera: The sync to apaches is done, waiting on cdbs [01:16:28] rdwrer: *blushes* Thanks for the save. [01:16:41] James_F: It's fine, that's how LDs go :) [01:16:47] ^d: might want to try some stuff with the url parameter [01:16:53] rdwrer: (Clearly this was just part of my plan to rise up the total-commits-to-`mediawiki-config` table :-)) [01:17:03] Hah, obvs [01:17:06] yoyo manybubbles, checking in [01:17:08] PROBLEM - Disk space on snapshot1001 is CRITICAL: DISK CRITICAL - free space: / 1048 MB (3% inode=69%): [01:17:10] how goes it? [01:17:20] ottomata: everything is wonderful [01:17:34] yeah!? yeah! 
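A check that would have caught the file missing from the ULS change before it was synced: look at what the merged commit actually touched once it lands in the config checkout. /a/common is the path from the !log lines above; the rest is plain git:

    cd /a/common
    git log -1 --stat                    # files changed by the commit just merged
    git diff --name-only HEAD@{1} HEAD   # everything picked up by the last update, per the reflog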
[01:17:34] we're coming back online [01:17:36] awesome [01:17:55] it'll take another 20 minutes for it to be fully full of redundancy but we're pretty good so far as I can tell [01:18:00] * greg-g has to run [01:18:12] greg-g: We should be fine, just one more sync-file to run [01:18:18] greg-g: Have a nice night [01:18:19] !log mholmquist Finished scap: (no message) (duration: 10m 55s) [01:18:24] !log mholmquist updated /a/common to {{Gerrit|I48e98d28f}}: Follow-up: Icf0bef96306661 – missing file(!) from commit [01:18:32] Logged the message, Master [01:18:38] Logged the message, Master [01:18:52] (03CR) 10Dzahn: "auth_name was still Ops" [operations/puppet] - 10https://gerrit.wikimedia.org/r/117361 (owner: 10Springle) [01:19:14] !log mholmquist synchronized wmf-config/InitialiseSettings.php 'Fix James_F's commit, follow-up, should gate ULS beta feature' [01:19:21] Logged the message, Master [01:19:26] jgonera: OK everything should be set now [01:19:40] James_F: Want to make sure things are working for you too? [01:20:08] rdwrer: Hmm. Damn it. [01:20:11] Damn it? [01:20:32] I don't like damn it [01:20:47] rdwrer: I didn't hide the BF hook in the commit, just the BF action. [01:20:56] Ah. [01:21:03] rdwrer: So it'll be exposed but not work on loginwiki and votewiki. [01:21:11] |log heading bed, have a nice evening folks. Jenkins seems happy right now. [01:21:16] *waves* [01:21:19] James_F: As mistakes go that seems pretty much OK [01:21:22] rdwrer: Oh well, it's not too bad. Not worth fixing in the LD. [01:21:30] Fix it in another commit, it'll go out next week [01:21:30] rdwrer: Will do a follow-up commit to ULS. [01:21:33] Yeah. [01:21:36] Follow-follow-up [01:22:12] thanks rdwrer [01:22:28] Let's call this the end of the LD from hell [01:22:45] Yeah. [01:23:05] rdwrer: Wanna do the 1.23wmf18 deploy next week :) [01:23:13] It's super fun [01:23:25] bd808: Hm, I think I'm going to be sick that day [01:23:50] Maybe someday [01:24:00] heh. It's actually not that bad but lots of little steps [01:24:13] rdwrer: Argh, no. [01:24:15] greg-g, I forgot we still have one bug and I need to touch and sync MobileFrontend [01:24:19] rdwrer: I'm an actual moron. [01:24:20] can I do that now? [01:24:27] night hashar [01:24:29] rdwrer: Forgotten global statement. [01:25:22] jgonera: greg-g stepped out. If you've got enough folks around to help in case of breakage and rdwrer and James_F are really done… be bold [01:25:24] James_F: Where's that? [01:25:30] * bd808 has to split [01:25:45] jgonera: Go nuts, I'll keep an eye on you [01:26:02] bd808, I'm literally only going to do find -type f -exec touch {} \; [01:26:10] and sync [01:26:24] Oh, in ULS. [01:26:28] Shit, I hate submodules so much [01:26:32] rdwrer: Yeah. [01:26:38] James_F: Fixing? [01:26:43] Yeah. [01:27:08] PROBLEM - Disk space on snapshot1001 is CRITICAL: DISK CRITICAL - free space: / 1058 MB (3% inode=69%): [01:27:17] This is what happens when you have someone not used to PHP reviewing, I guess... [01:27:37] Now, one thing I don't get is why Jenkins merged your initial confgig patch that had undefined errors all over [01:27:40] rdwrer: https://gerrit.wikimedia.org/r/117365 [01:28:02] rdwrer: Indeed. Maybe. [01:28:19] <^d> manybubbles: Seems to be working for me. [01:29:33] James_F: I'll take it now. 
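The "forgotten global statement" mentioned above is a classic PHP gating bug: inside a hook function, a $wg.../$wmg... variable is a fresh local unless it is declared global, so the gate quietly evaluates to false. A toy reproduction; the variable name is borrowed from the discussion, the functions are invented:

    php -r '
    $wmgULSCompactLinks = true;                        # stands in for the config global
    function gateBroken() {
        return isset( $wmgULSCompactLinks ) && $wmgULSCompactLinks;   # new local, always false
    }
    function gateFixed() {
        global $wmgULSCompactLinks;                    # bind the real global
        return (bool)$wmgULSCompactLinks;
    }
    var_dump( gateBroken(), gateFixed() );             # bool(false), bool(true)
    '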
[01:29:40] !log jgonera synchronized php-1.23wmf16/extensions/MobileFrontend/ 'Touch MobileFrontend.i18n.php to update RL cache' [01:29:48] Logged the message, Master [01:29:53] !log [Elasticsearch upgrade] temporarily raising recovery speed [01:30:00] Logged the message, Master [01:30:41] lets see if that does anything [01:32:13] jgonera: Whenever you're done I'd like to push the Actually Last patch to ULS [01:32:21] Luckily it's just a sync-file [01:32:43] there we go [01:32:49] ^demon|away: you can see the bump: http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=es_query_time&s=by+name&c=Elasticsearch+cluster+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=2&z=small&hc=4 [01:32:59] apparently chaning the recovery throttling takes some time [01:33:13] maybe it sets the throttle for the duration of the index recovering [01:33:14] who knows [01:33:28] <^demon|away> This is why we don't like full cluster restarts :\ [01:33:52] rdwrer, done [01:34:21] 'kay [01:34:24] well, that and the outage [01:34:47] (03PS1) 10Springle: use correct auth_name [operations/puppet] - 10https://gerrit.wikimedia.org/r/117369 [01:34:58] <^demon|away> k, running to the store to pick up a few dinner things. [01:35:02] <^demon|away> should be back before you miss me. [01:36:35] (03CR) 10Springle: [C: 032] use correct auth_name [operations/puppet] - 10https://gerrit.wikimedia.org/r/117369 (owner: 10Springle) [01:39:41] Syncing [01:39:42] !log mholmquist synchronized php-1.23wmf17/extensions/UniversalLanguageSelector/UniversalLanguageSelector.hooks.php 'Actually gate the beta feature for ULS' [01:39:49] Done [01:39:51] Logged the message, Master [01:39:53] Actually The End [01:40:11] Fin [01:40:32] I think I'm going to go for the big Asahi now. [01:48:12] ^demon|away: I'm pretty sure it doesn't bother recovering the secondaries from the disk they are on.... as soon as it got to secondaries the network spiked [01:48:30] I've temporarily raised the throttles so it is going faster [01:48:37] I wonder why it doesn't just recover from the disk though [01:48:45] probably because it can't be sure nothing has changed [01:52:20] (03PS1) 10Manybubbles: Revert "Remove cirrus jobs from priority list" [operations/puppet] - 10https://gerrit.wikimedia.org/r/117372 [01:52:45] (03CR) 10Manybubbles: [C: 04-1] "Merge when we're sure cirrus is working." [operations/puppet] - 10https://gerrit.wikimedia.org/r/117372 (owner: 10Manybubbles) [01:52:59] ^demon|away: 2009 [01:53:00] 209 [01:53:16] 180 [01:56:08] 75 [01:56:24] (03Abandoned) 10Chad: Revert "Remove cirrus jobs from priority list" [operations/puppet] - 10https://gerrit.wikimedia.org/r/117360 (owner: 10Chad) [01:58:06] (03Abandoned) 10Manybubbles: Revert "Remove cirrus jobs from priority list" [operations/puppet] - 10https://gerrit.wikimedia.org/r/117372 (owner: 10Manybubbles) [01:58:47] ... [01:58:52] you do it [01:59:57] <^d> lol. [02:00:00] <^d> we both abandoned. [02:00:04] <^d> our dupes. [02:00:16] race condition [02:00:26] you guys need zookeeper [02:00:27] (03Restored) 10Chad: Revert "Remove cirrus jobs from priority list" [operations/puppet] - 10https://gerrit.wikimedia.org/r/117360 (owner: 10Chad) [02:00:57] <^d> We need edit conflict handling. 
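"Temporarily raising recovery speed" presumably means bumping the transient recovery throttles through the cluster settings API; something along these lines, with the values being guesses rather than what was actually used:

    curl -s -XPUT http://elastic1001:9200/_cluster/settings -d '
    {"transient":{
        "indices.recovery.max_bytes_per_sec": "200mb",
        "cluster.routing.allocation.node_concurrent_recoveries": 6
    }}'
    # the same call with smaller values restores a "more sane" speed afterwards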
[02:02:49] ^d: ^^^^ [02:02:49] 36 [02:04:17] !log [Elasticsearch upgrade] restoring more sane recovery speed [02:04:26] Logged the message, Master [02:04:41] ^d: we both abandoned out puppet reverts [02:05:08] <^d> manybubbles: I know, so I restored mine :) [02:05:18] I see [02:05:44] seven shards left to restore [02:05:51] we have at least two copies of everything ready [02:06:09] and none of the shards we're restoring we are the primary backend for [02:06:13] I say we turn it back on [02:06:17] maybe just the search first? [02:06:55] (03CR) 10Manybubbles: [C: 031] Elasticsearch upgrade ending [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117251 (owner: 10Chad) [02:07:05] we're probably ok just turning it all on [02:07:54] all shards are back [02:07:58] we're green again [02:08:05] so icinga should really recover... [02:08:08] RECOVERY - ElasticSearch health check on elastic1008 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1486: active_shards: 4397: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [02:08:08] RECOVERY - ElasticSearch health check on elastic1005 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1486: active_shards: 4397: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [02:08:08] RECOVERY - ElasticSearch health check on elastic1015 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1486: active_shards: 4397: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [02:08:09] (03PS4) 10Chad: Elasticsearch upgrade ending [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117251 [02:08:12] there we go [02:08:19] RECOVERY - ElasticSearch health check on elastic1003 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1486: active_shards: 4397: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [02:08:19] RECOVERY - ElasticSearch health check on elastic1001 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1486: active_shards: 4397: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [02:08:20] RECOVERY - ElasticSearch health check on elastic1011 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1486: active_shards: 4397: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [02:08:20] RECOVERY - ElasticSearch health check on elastic1002 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1486: active_shards: 4397: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [02:08:20] RECOVERY - ElasticSearch health check on elastic1006 is OK: OK - elasticsearch (production-search-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1486: active_shards: 4397: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [02:08:20] RECOVERY - ElasticSearch health check on elastic1016 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1486: active_shards: 4397: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [02:08:20] RECOVERY - ElasticSearch health check on elastic1010 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1486: active_shards: 4397: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [02:08:20] RECOVERY - ElasticSearch health check on elastic1014 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1486: active_shards: 4397: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [02:08:21] RECOVERY - ElasticSearch health check on elastic1013 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1486: active_shards: 4397: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [02:08:21] RECOVERY - ElasticSearch health check on elastic1004 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1486: active_shards: 4397: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [02:08:22] RECOVERY - ElasticSearch health check on elastic1012 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1486: active_shards: 4397: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [02:08:26] I hate that thing [02:08:28] RECOVERY - ElasticSearch health check on elastic1009 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1486: active_shards: 4397: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [02:08:32] so dense [02:08:32] (03CR) 10Chad: [C: 032] Elasticsearch upgrade ending [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117251 (owner: 10Chad) [02:08:39] (03Merged) 10jenkins-bot: Elasticsearch upgrade ending [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117251 (owner: 10Chad) [02:09:27] !log demon synchronized wmf-config/jobqueue-eqiad.php 'Turn Cirrus jobs back on' [02:09:36] Logged the message, Master [02:09:38] jobs completing "good" [02:09:42] and no more errors then before [02:10:16] !log demon synchronized wmf-config/InitialiseSettings.php 'Turn Cirrus back on for all wikis how it was before' [02:10:30] Logged the message, Master [02:10:43] we're 200,000 low priority updates behind on commons [02:11:02] 11,000 high priority behind on enwiki [02:11:20] but we're going down [02:12:22] <^d> We'll catch up [02:12:32] yup [02:12:35] I'm not worried [02:12:43] it might take some time on the lower priority jobs [02:12:47] but that isn't a big deal [02:12:51] because they are low priority! 
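The backlog mentioned above (~200,000 low-priority updates behind on commons, ~11,000 high-priority behind on enwiki) can be watched draining with MediaWiki's standard showJobs.php maintenance script; a small sketch assuming the usual mwscript wrapper:

    for wiki in commonswiki enwiki; do
        echo "== $wiki =="
        mwscript showJobs.php --wiki=$wiki --group | grep -i cirrus
    done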
[02:13:24] in an hour we'll be caught up on the high priority ones I think [02:14:50] !log [Elasticsearch upgrade] done. we'll take a while to catch up on jobs that piled up during the upgrade, but we'll get them in time. [02:14:58] Logged the message, Master [02:21:46] !log LocalisationUpdate completed (1.23wmf16) at 2014-03-07 02:21:46+00:00 [02:21:53] Logged the message, Master [02:23:19] (03PS2) 10coren: Labs: Disable FSC on all NFS mounts [operations/puppet] - 10https://gerrit.wikimedia.org/r/117347 [02:35:08] PROBLEM - Disk space on snapshot1001 is CRITICAL: DISK CRITICAL - free space: / 1064 MB (3% inode=69%): [02:36:58] (03PS1) 10Ori.livneh: Emit GeoIP cookie using dedicated Set-Cookie header [operations/puppet] - 10https://gerrit.wikimedia.org/r/117375 [02:37:48] (03CR) 10coren: [C: 032] "Bleh. I wish this would have worked without." [operations/puppet] - 10https://gerrit.wikimedia.org/r/117347 (owner: 10coren) [02:43:40] !log LocalisationUpdate completed (1.23wmf17) at 2014-03-07 02:43:40+00:00 [02:43:48] Logged the message, Master [02:48:10] (03PS1) 10MarkTraceur: WIP Add MMV feature flags for beta and pilot sites [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117376 [02:48:33] (03CR) 10MarkTraceur: [C: 04-2] "Needs list of wikis before merging." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117376 (owner: 10MarkTraceur) [03:13:59] (03CR) 10Manybubbles: [C: 031] "Now would be ok." [operations/puppet] - 10https://gerrit.wikimedia.org/r/117360 (owner: 10Chad) [03:14:12] (03CR) 10Manybubbles: "But it can wait until the morning too." [operations/puppet] - 10https://gerrit.wikimedia.org/r/117360 (owner: 10Chad) [03:24:54] (03PS1) 10Jalexander: Add logo for legalteamwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117377 [03:31:43] !log LocalisationUpdate ResourceLoader cache refresh completed at 2014-03-07 03:31:43+00:00 [03:31:52] Logged the message, Master [03:35:36] (03CR) 10Tobias Gritschacher: [C: 031] "any objectives against this?" [operations/apache-config] - 10https://gerrit.wikimedia.org/r/113972 (owner: 10Thiemo Mättig (WMDE)) [04:37:08] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [04:41:15] (03CR) 10Chad: [C: 031] Revert "Remove cirrus jobs from priority list" [operations/puppet] - 10https://gerrit.wikimedia.org/r/117360 (owner: 10Chad) [06:08:43] (03CR) 10Nemo bis: "AFAIK there is none, which is why you/we added the global job queue check to hume (I4b67f60a). 
This could be moved to eqiad when noc.wikim" [operations/puppet] - 10https://gerrit.wikimedia.org/r/117250 (owner: 10Nemo bis) [06:47:11] (03CR) 10Nemo bis: "This is a followup of the ULS change Ia268c3a49b5aa14b6a00e33c7f01a61eba48e776" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117344 (owner: 10Jforrester) [08:57:19] PROBLEM - MySQL Replication Heartbeat on db1042 is CRITICAL: CRIT replication delay 316 seconds [08:57:28] PROBLEM - MySQL Slave Delay on db1042 is CRITICAL: CRIT replication delay 325 seconds [09:10:28] RECOVERY - MySQL Slave Delay on db1042 is OK: OK replication delay 0 seconds [09:11:20] RECOVERY - MySQL Replication Heartbeat on db1042 is OK: OK replication delay -0 seconds [12:12:31] (03CR) 10Alexandros Kosiaris: [C: 032] Removed notes_url from nagios host extra info [operations/puppet] - 10https://gerrit.wikimedia.org/r/112441 (owner: 10Alexandros Kosiaris) [13:20:54] (03CR) 10Mark Bergsma: [C: 032] Allocate /27 for public1-d-eqiad [operations/dns] - 10https://gerrit.wikimedia.org/r/114738 (owner: 10Mark Bergsma) [13:23:14] akosiaris: seems like reverse DNS for row D wasn't done yet? [13:29:38] who could review https://gerrit.wikimedia.org/r/113755 ? ("Make labs' sql command work with -v and remove cruft") [13:31:15] (03PS1) 10Mark Bergsma: Add reverse DNS for row D subnets [operations/dns] - 10https://gerrit.wikimedia.org/r/117409 [13:33:25] mark: yes. We had not even thought about it tbh [13:33:39] if you could do a quick review of that change ^ [13:33:45] i'm about to add more records for lvs [13:33:56] we should probably store that todo list for next time :) [13:33:59] as a check list [13:36:35] PROBLEM - Host mw31 is DOWN: PING CRITICAL - Packet loss = 100% [13:37:15] RECOVERY - Host mw31 is UP: PING OK - Packet loss = 0%, RTA = 35.58 ms [13:39:16] PROBLEM - Apache HTTP on mw31 is CRITICAL: Connection refused [13:39:47] (03PS1) 10coren: Tool Labs: package addition to exec environ [operations/puppet] - 10https://gerrit.wikimedia.org/r/117410 [13:41:19] (03CR) 10coren: [C: 032] Tool Labs: package addition to exec environ [operations/puppet] - 10https://gerrit.wikimedia.org/r/117410 (owner: 10coren) [13:41:25] (03CR) 10Alexandros Kosiaris: [C: 04-1] Add reverse DNS for row D subnets (031 comment) [operations/dns] - 10https://gerrit.wikimedia.org/r/117409 (owner: 10Mark Bergsma) [13:41:52] ah yep, will correct that [13:47:01] (03PS1) 10coren: Tool Labs: package tweak [operations/puppet] - 10https://gerrit.wikimedia.org/r/117411 [13:50:26] (03CR) 10coren: [C: 032] Tool Labs: package tweak [operations/puppet] - 10https://gerrit.wikimedia.org/r/117411 (owner: 10coren) [13:51:15] PROBLEM - NTP on mw31 is CRITICAL: NTP CRITICAL: Offset unknown [13:55:44] (03CR) 10Alexandros Kosiaris: [C: 04-2] "We merged the commit mentioned above, which makes this obsolete." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/112019 (owner: 10Matanya) [13:56:15] RECOVERY - NTP on mw31 is OK: NTP OK: Offset 0.003237247467 secs [13:59:21] (03PS2) 10Mark Bergsma: Add reverse DNS for row D subnets [operations/dns] - 10https://gerrit.wikimedia.org/r/117409 [13:59:23] (03PS1) 10Mark Bergsma: Add LVS subnet IPs for row D [operations/dns] - 10https://gerrit.wikimedia.org/r/117412 [14:04:05] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [14:04:07] (03PS3) 10Mark Bergsma: Add reverse DNS for row D subnets [operations/dns] - 10https://gerrit.wikimedia.org/r/117409 [14:04:09] (03PS2) 10Mark Bergsma: Add LVS subnet IPs for row D [operations/dns] - 10https://gerrit.wikimedia.org/r/117412 [14:05:25] (03PS1) 10Mark Bergsma: Setup eqiad LVS servers for row D real servers [operations/puppet] - 10https://gerrit.wikimedia.org/r/117413 [14:41:19] (03PS1) 10QChris: Make wikimetrics' database setup depend on alembic [operations/puppet] - 10https://gerrit.wikimedia.org/r/117416 [14:50:11] If lab instance is down, whom I can poke here? [14:50:30] kart_: try #wikimetrics-labs [14:50:33] sorry [14:50:33] ha [14:50:34] kart_: for wikimedia labs, go to #wikimedia-labs [14:50:35] i mean [14:50:36] yeah [14:50:38] what hoo said [14:50:46] i have wikimetrics on the brain :p [14:50:55] :) [14:51:30] Thank you :) [15:01:05] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [15:18:26] (03CR) 10Alexandros Kosiaris: "Seems like this will make manutius unhappy. All the /etc/ganglia/aggregators/*.conf files have their locations changed to eqiad from pmtpa" [operations/puppet] - 10https://gerrit.wikimedia.org/r/112889 (owner: 10Matanya) [15:32:05] (03PS1) 10Alexandros Kosiaris: Rename rt $site variable to $website [operations/puppet] - 10https://gerrit.wikimedia.org/r/117418 [15:34:06] (03CR) 10Alexandros Kosiaris: "Daniel, Andrew, I added you because git blame on that file named you. I am was wrong in doing that, sorry." [operations/puppet] - 10https://gerrit.wikimedia.org/r/117418 (owner: 10Alexandros Kosiaris) [15:36:04] (03CR) 10Dzahn: "hey Alex, Andrew, also see Change-Id: Ib0e7476a612b25f37905f9f475521606d3bb73d7 where I'm converting RT to module and I was already doing" [operations/puppet] - 10https://gerrit.wikimedia.org/r/117418 (owner: 10Alexandros Kosiaris) [15:37:21] mutante: damn... Ok I will abandon that then. You are already solving it better [15:38:29] akosiaris: well, unless it's urgent and the module takes too long to get in there, then yes:) [15:38:36] (03Abandoned) 10Alexandros Kosiaris: Rename rt $site variable to $website [operations/puppet] - 10https://gerrit.wikimedia.org/r/117418 (owner: 10Alexandros Kosiaris) [15:39:08] if you wanna review that one though...:) [15:40:29] you still have a WIP in the topic, which is why I have skipped it. If you want a review, a review it is :-) [15:41:38] fair, yea [15:42:10] well, it already went through initial matanya check:) [15:47:47] mutante: define ganglia_new::monitor::aggregator::instance($site) [15:47:49] argh [15:47:53] the same case here :( [15:48:24] akosiaris: and in my pending change for planet->module i used manifests/site.pp, i should change that too i suppose [15:48:32] even if it works, just for style [15:48:51] and of course matanya's 112889 change would have broken things up [15:49:05] meh.. I hate puppet scoping :( [15:49:14] mutante: gerrit change ? 
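The scoping headache here is that a define parameter called $site shadows the top-scope $::site, so a file mixing the two can silently pick up the wrong value. A quick audit of the repo for ambiguous uses, a variant of the git grep suggested a few lines further down:

    cd operations/puppet
    git grep -nE '\$(::)?site\b' -- modules/ganglia_new modules/planet manifests | head -n 40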
[15:49:27] ooh, i see, that's why you started fixing that [15:49:52] akosiaris: https://gerrit.wikimedia.org/r/#/c/108674/ [15:50:02] yeap. I am slowly uncovering various issues here and there thanks to matanya's changes [15:50:06] it's just the file name in that case, but i figured site.pp isn't a good name [15:50:15] define planet::site { [15:50:24] $sites_directory = '/etc/apache2/sites-available' 4 [15:50:37] that becomes site.pp in module structure then, per convention [15:50:52] well it is way less worse that variables [15:51:14] i don't think it is that bad [15:51:38] if you feel like renaming it, do it, but I don't think it is needed [15:52:20] git grep -E '\$(::)?site[[:space:]]*=' works wonders to avoid those issues [15:52:59] ok. cool [15:53:18] hehe that file is even worse. It actually references both $site and $::site [15:53:39] ganglia? [15:53:41] yeah [15:53:54] hmm, there was also ganglia_new , right [15:54:06] modules/ganglia_new/manifests/monitor/aggregator/instance.pp [15:54:07] what is even the status of ganglia_(notnew) then [15:54:18] oh, ok [15:54:38] but you touched a wound with ganglia{,_new} i think [15:55:10] I hope jgage is going to help solve all this. [15:58:44] i was thinking maybe ganglia_new should just become ganglia? [15:58:48] but dunno yet [15:58:56] yep:) [15:59:10] as if it is going to be that easy [15:59:19] but yeah that is the goal [16:04:12] removes the "just" part [16:08:10] "title": "decommissioned_eqiad top level scope is: pmtpa and lower level scope is: eqiad" [16:08:15] yey! [16:18:50] (03CR) 10Hashar: [C: 031] "Most probably fine :-] The fatal.log is world readable so the script will be able to process it even when run as user 'nobody'." [operations/puppet] - 10https://gerrit.wikimedia.org/r/116146 (owner: 10Cmcmahon) [16:23:38] (03CR) 10Cmcmahon: [C: 031] "would like to have Ops merge this whenever convenient" [operations/puppet] - 10https://gerrit.wikimedia.org/r/116146 (owner: 10Cmcmahon) [16:27:27] hashar: changes on modules/beta can't influence prod, right [16:32:24] mutante: usually :] [16:33:09] mutante: that really depends on the part being changes. Some classes are only used on beta so they are prod safe [16:33:31] mutante: some other are configuration tweaks to class used in both labs and prod (for example the varnish conf) [16:33:54] hashar: modules/beta/files/monitor_fatals.rb [16:34:01] yeah that one is fine [16:34:02] i would have merged that , but path conflict now..sigh [16:34:03] rebasing it [16:34:06] tried :p [16:34:35] (03CR) 10Alexandros Kosiaris: [C: 04-1] "So this is going to indeed create a problem. The initial version referenced the top scope variable called $::site. 
However the define usin" [operations/puppet] - 10https://gerrit.wikimedia.org/r/112889 (owner: 10Matanya) [16:36:41] (03PS5) 10Hashar: update to read the fatals log and report problem extensions [operations/puppet] - 10https://gerrit.wikimedia.org/r/116146 (owner: 10Cmcmahon) [16:38:33] mutante: you can get https://gerrit.wikimedia.org/r/116146 in :] [16:39:30] (03CR) 10Dzahn: [C: 032] update to read the fatals log and report problem extensions [operations/puppet] - 10https://gerrit.wikimedia.org/r/116146 (owner: 10Cmcmahon) [16:39:55] done [16:41:18] (03PS1) 10Hashar: beta: run fatalmonitor only on pmtpa [operations/puppet] - 10https://gerrit.wikimedia.org/r/117425 [16:41:32] mutante: and that one as well while you are at it [16:41:37] only on ptmpa or NOT on pmtpa:) [16:42:00] only on pmtpa [16:42:03] the current cluster [16:42:10] I am rebuilding beta from scratch on eqiad [16:42:18] i see the FIXME, ok [16:42:40] thx :) [16:42:41] (03CR) 10Dzahn: [C: 032] beta: run fatalmonitor only on pmtpa [operations/puppet] - 10https://gerrit.wikimedia.org/r/117425 (owner: 10Hashar) [16:45:37] (03CR) 10Alexandros Kosiaris: [C: 032] solr: remove lookupvar and replace with top scope @ var [operations/puppet] - 10https://gerrit.wikimedia.org/r/112892 (owner: 10Matanya) [17:22:59] (03CR) 10Alexandros Kosiaris: [C: 032] bots: minor lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/111181 (owner: 10Matanya) [17:33:57] <^d> ottomata: Yo, can we get https://gerrit.wikimedia.org/r/#/c/117360/ merged now? [17:34:05] <^d> Upgrade's long over, prioritizing jobs would be nice :) [17:37:33] (03CR) 10Dzahn: [C: 031] Remove pmtpa apaches from site.pp,dhcpd [operations/puppet] - 10https://gerrit.wikimedia.org/r/117206 (owner: 10Reedy) [17:38:18] (03CR) 10Dzahn: "path conflict" [operations/puppet] - 10https://gerrit.wikimedia.org/r/117206 (owner: 10Reedy) [17:41:41] oh yes [17:41:57] (03PS2) 10Chad: Revert "Remove cirrus jobs from priority list" [operations/puppet] - 10https://gerrit.wikimedia.org/r/117360 [17:42:02] (03CR) 10Ottomata: [C: 032 V: 032] Revert "Remove cirrus jobs from priority list" [operations/puppet] - 10https://gerrit.wikimedia.org/r/117360 (owner: 10Chad) [17:42:19] ^d done [17:42:27] <^d> Thanks! [17:43:56] (03PS2) 10Dzahn: Remove pmtpa apaches from site.pp,dhcpd [operations/puppet] - 10https://gerrit.wikimedia.org/r/117206 (owner: 10Reedy) [17:44:15] PROBLEM - Squid on brewster is CRITICAL: Connection timed out [17:47:05] RECOVERY - Squid on brewster is OK: TCP OK - 0.035 second response time on port 8080 [17:51:15] PROBLEM - Squid on brewster is CRITICAL: Connection timed out [17:51:31] squid on brewster? eh [17:53:05] RECOVERY - Squid on brewster is OK: TCP OK - 0.035 second response time on port 8080 [17:53:12] !log restarted squid on brewster [17:53:20] Logged the message, Master [18:00:12] so guys, I've removed all the srv* and mw* entries from pybal for the following pmtpa groups: apache api rendering I'm about to [18:00:33] save these changes, it's possible that something will whine or page that has been overlooked [18:00:49] if there's any issues, we can put the files with full content back right away [18:01:02] so... 3... 2... 1... [18:01:34] and saved, let's see what explodes [18:04:25] PROBLEM - LVS HTTP IPv4 on appservers.svc.pmtpa.wmnet is CRITICAL: Connection refused [18:04:29] heh [18:04:35] im gettin paged [18:04:35] PROBLEM - LVS HTTP IPv4 on api.svc.pmtpa.wmnet is CRITICAL: Connection refused [18:04:43] we didnt turn that off before turning off the servers? 
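One ordering that would have avoided the pmtpa LVS pages here: tell icinga to stop notifying for those services before emptying the pybal pools. A hedged sketch using Icinga's external command interface, run on the icinga host; the command-file path is the Debian default and may differ on neon, while the host and service names are the ones that page in this log:

    now=$(date +%s)
    for svc_host in appservers.svc.pmtpa.wmnet api.svc.pmtpa.wmnet rendering.svc.pmtpa.wmnet; do
        printf '[%s] DISABLE_SVC_NOTIFICATIONS;%s;LVS HTTP IPv4\n' "$now" "$svc_host"
    done > /var/lib/icinga/rw/icinga.cmd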
[18:04:53] the servers are on [18:04:59] (i think that wakes up some folks, just fyi ;) [18:05:04] grrrr [18:05:15] PROBLEM - LVS HTTP IPv4 on rendering.svc.pmtpa.wmnet is CRITICAL: Connection refused [18:05:22] mutante: shall I put em back? I thought they'd be out of monitoring [18:05:23] actually, i dunno what paging sean is on [18:05:33] apergos: i can also fix the icinga [18:05:36] if its just sending pages to eu folks who cares [18:05:39] hold on [18:05:42] (its not ideal but not waking them up) [18:05:46] if you can do it fast, go to town [18:05:57] i just know both mark and faidon do 24/7 paging, but i now realize if apergos is up, so are they heh [18:06:09] so the pages prolly not a big deal, disregard my comment =] [18:06:17] well I hope they don't have to run into the channel for something that's not actually broken [18:06:29] they should be able to read its api in tampa and disregard [18:06:52] good point [18:06:54] ACKNOWLEDGEMENT - LVS HTTP IPv4 on appservers.svc.pmtpa.wmnet is CRITICAL: Connection refused daniel_zahn thats ok. we are shutting down Tampa. disabling notifications [18:06:56] neither of us are on 24/7 paging [18:07:08] oh, i thought mark was [18:08:02] (03PS1) 10Odder: Add Laura Hale Sochi blog to English Planet [operations/puppet] - 10https://gerrit.wikimedia.org/r/117434 [18:08:20] fixed [18:08:27] disabled notifications for the other 2 [18:08:37] api rendering and appservers all hushed? [18:08:42] sweet [18:09:05] so yea, ariel brandon and andrew o get 24/7 [18:09:09] no one else is anymore, heh [18:09:17] (except some devs for varios things, and tim is also 24/7) [18:09:40] RobH, except I seem to only get emails for analytics related alerts…. [18:09:44] haven't looked into why [18:09:48] tim doesn't get pages [18:09:59] apergos: yes, thanks for doing the pybal part!:) [18:09:59] ottomata: Ahh, your contact group isnt right, lets take a look [18:10:10] sure! [18:12:17] (03PS1) 10RobH: adds andrew otto to ops paging group [operations/puppet] - 10https://gerrit.wikimedia.org/r/117436 [18:12:43] ottomata: ^ so that will add you to the normal ops paging group [18:13:06] changing your paging hours is done via the private repo [18:13:16] OH sms? [18:13:18] do I want SMS? [18:13:26] i don't think it has my number a anyway? [18:13:30] well, yer in the ops team, so i dont think you really ahve a choice do you? [18:13:33] we need to fix that then =] [18:13:35] hahaha [18:13:37] rats! [18:13:41] hehe [18:13:48] you can have your own custom timezone if you like [18:13:52] there are other opsens that don't get SMS too [18:14:22] but it'd be nice if ottomata could be on the paging list [18:14:25] puppet/files/icinga/timeperiods.cfg [18:14:35] ottomata: I have it open now so i can make the changes to the private repo for you [18:14:38] to add your own zone.. then use that in contacts.cfg in private repo [18:14:42] i just have to pull some info off officewiki [18:14:55] oh true, it may not have timezone you want [18:15:24] there is EDT_awake_early_hours [18:15:25] RECOVERY - Apache HTTP on mw31 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.759 second response time [18:15:33] but ..it's early [18:16:04] ottomata: Also, who is your cell provider? (I have your number, just need to know what sms gateway to use ;) [18:16:27] paravoid: huh, i think all opsen should be on the paging list for their time zone, but thats indeed outside scope of this discussion i guess [18:16:49] AT&T [18:17:02] RobH yeah I am EDT, what is early hours? [18:17:02] ha [18:17:05] EDT? 
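For reference on the paging-hours machinery: timeperiods like EDT_awake_early_hours live in puppet/files/icinga/timeperiods.cfg in the public repo, and contacts pick one up from the private repo. Per the description just below (0800 till midnight, 7 days a week), such a period looks roughly like this in standard Nagios/Icinga syntax (illustrative, not the literal file contents):

    define timeperiod{
        timeperiod_name  edt_awake_hours
        alias            0800 till midnight US Eastern, 7 days a week
        sunday           08:00-24:00
        monday           08:00-24:00
        tuesday          08:00-24:00
        wednesday        08:00-24:00
        thursday         08:00-24:00
        friday           08:00-24:00
        saturday         08:00-24:00
    }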
[18:17:06] wait [18:17:14] EST? EDT? Whatevr [18:17:19] alias EDT 0800 till midnight, 7 days a week [18:17:20] !log Restored pre Ic56177a versions of wmf-config/*pmtpa* config files to mw31 again. Something wiped them out since 20:23Z yesterday even though "mw31" is not found in any dsh group files on tin. [18:17:29] Logged the message, Master [18:17:30] ottomata: but you can totally make another one by just copying that [18:17:59] the periods themselves are in the public repo [18:18:00] ahhh, whatever that's fine, if it is annoying i'll change it [18:18:42] Anybody know what's special about mw31 that would cause it to sync common-local outside of scap asking it to? [18:18:52] it rebooted (because it's broken) [18:18:57] so it syncs again [18:19:02] puppet does that I think [18:19:17] Ah. ok. And it probably pulls from tin when it does that [18:19:26] I don't remember the details [18:19:28] (03CR) 10RobH: [C: 032] adds andrew otto to ops paging group [operations/puppet] - 10https://gerrit.wikimedia.org/r/117436 (owner: 10RobH) [18:19:51] but something does make sure that a server that was down doesn't have an outdated copy when it gets up again [18:19:55] bd808: it should not scap to it anymore though and also be gone from pybal since ..like now [18:19:55] before apache starts [18:20:18] we can just shut it down ? [18:20:21] I had heard from Ryan that there was an eventual consistency trigger but I haven't tracked it down in puppet [18:20:27] ottomata: Ok, I've added you to the group for ops paging and set it to the only eastern time zone option available. (Its as mentioned, an hour ealier than maybe you'd like) [18:20:33] but you can add more zones if you hate [18:20:36] neon is updating now [18:20:43] so, welcome to the paging group ;] [18:21:50] um, yay. [18:21:50] :) [18:23:02] bd808: shall we just shutdown -h ? i see you on the box [18:23:35] (03PS1) 10RobH: fixing ordering and spacing for contact list [operations/puppet] - 10https://gerrit.wikimedia.org/r/117439 [18:23:39] mutante: I logged out. If it's ready to die I won't stop it [18:25:39] mutante: This same issue (missing db.php config) is going to happen to any pmtpa mw box that reboots. [18:26:08] I wonder if we should put the config files back on tin for scap/rsync until the boxes are really gone [18:26:09] (03CR) 10RobH: [C: 032] fixing ordering and spacing for contact list [operations/puppet] - 10https://gerrit.wikimedia.org/r/117439 (owner: 10RobH) [18:26:20] bd808: i'll just continue with removing them all from puppet anyways [18:26:31] now that apergos removed them all from pybal [18:26:39] * bd808 nods [18:26:42] so, it's going away real soon [18:27:12] Coolio. I'll stop worrying and get on with life then [18:28:08] PROBLEM - Disk space on snapshot1001 is CRITICAL: DISK CRITICAL - free space: / 1054 MB (3% inode=69%): [18:28:21] wait, but we removed those too [18:28:32] oh, 1001 [18:29:43] bd808: ^ old mw versions that can be removed? [18:31:07] mutante: There probably are. We have branches back to early december on tin at the moment [18:31:26] I don't know the procedure for pruning them... 
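[Editor's note: the thread above ends with nobody knowing the procedure for pruning old MediaWiki branches from tin. As a rough illustration only — this is not any actual Wikimedia tool — a version-aware listing in PHP could look like the sketch below. It assumes the branches sit as php-1.23wmfNN directories under /a/common, as mentioned elsewhere in this log, and that only the newest $keep of them are retained; seven matches the "N..N-6 branches for cache" rule of thumb discussed a little later.]

    <?php
    // Illustrative sketch only: list branch directories, order them as version
    // numbers (a plain string sort would put wmf9 after wmf10), and print
    // everything older than the newest $keep as a candidate for removal.
    $branchDirs = glob( '/a/common/php-*wmf*' );

    usort( $branchDirs, function ( $a, $b ) {
        // version_compare() splits "php-1.23wmf9" into numeric and text parts,
        // so wmf9 < wmf10 compares the way humans expect.
        return version_compare( basename( $a ), basename( $b ) );
    } );

    $keep = 7; // assumed retention count; adjust to match the real cache policy
    $prune = array_slice( $branchDirs, 0, max( 0, count( $branchDirs ) - $keep ) );

    foreach ( $prune as $dir ) {
        echo "candidate for removal: $dir\n";
    }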
[18:32:00] I'll see if I can find that out [18:33:07] RECOVERY - Disk space on snapshot1001 is OK: DISK OK [18:33:47] RECOVERY - Disk space on snapshot1003 is OK: DISK OK [18:34:07] RECOVERY - Disk space on snapshot1002 is OK: DISK OK [18:40:58] bd808: also I notice that wmf13 and up have an extra gb in the cache/l10n, from the file php-1.23wmf17/cache/l10n/upstream, with the json versions, so that's almost double the size for a branch now on the scap [18:42:06] l10n cache for those isn't needed anymore [18:42:23] probably only static stuff in skins/ is still needed, the rest could probably die [18:43:07] apergos: Hmmm… yes. I think the l10n json should have been around for longer but Sam may have been cleaning up [18:43:31] they can't have been around for that much longer or the snaps would have already run out of space [18:43:42] anyways, whatever you can do woudl be great! [18:43:52] How many days does bits still need to have access to the old branches that are not being used? [18:44:10] last live date + 30? [18:44:18] bd808: Varnish caching period [18:44:24] greg-g: Hate to ask, but could we have a deploy for a VE cherry-pick? Corruption bug in production that we fixed for wmf17 but didn't realise was also present in wmf16 until it went live. :-( https://gerrit.wikimedia.org/r/#/c/117445/ [18:44:32] 30 sounds right, but I'd have to look it up (no memory) [18:46:48] yeah role/cache and the apache expires all seem to be 20 days [18:46:49] 30! [18:48:18] apergos: ok 30. I'll see if I can do some cleanup after lunch and save everyone some space [18:48:40] thanks! [18:49:51] At the very least I can prune out the l10n json for all non-live branches [18:50:01] But I think I can do much better than that [18:52:13] wasn't the l10n json supposed to be much rsync/gzip friendlier [18:52:55] James_F: ek, yeah [18:52:57] Nemo_bis: It is, yes. But bigger on disk than cdbs [18:53:20] Oh. [18:53:43] Nemo_bis: We only send json during the scap which is faster but then the nodes builld cdb and keep the json too [18:54:07] paravoid, I have a quick question re https://rt.wikimedia.org/Ticket/Display.html?id=6980: do you think it would be easier to proxy the web service rather than assigning a public IP to the rt test server? [18:54:22] So the servers end up with more than twice as much l10n data. Trading disk for sync speed. [18:55:01] gwicke: what is the web interface used for? [18:55:18] But then you can't benefit from rsync diffing between the json files :/ [18:55:36] maybe you could make them bzip2 them and bunzip2 before scap :P [18:55:47] paravoid, it reports the results of round-trip tests [18:55:56] currently at http://parsoid.wmflabs.org:8001/ (but super slow) [18:55:57] but not used for them? [18:56:11] not used to run the tests, yes [18:56:28] Nemo_bis: Or just remove them after the branch is no longer active :) [18:56:37] we check the web interface to see how it went [18:56:54] so, yes, we could put it behind misc-web-lb if you want [18:57:15] sounds easier and maybe more secure than a full public IP [18:57:47] i wasnt sure if it would work so i wasnt going to commit =] [18:58:02] (or somethign you'd be cool with us rolling into misc-web-lb) [18:58:35] should I try to prepare a puppet patch for that? [18:58:57] if you really want to, sure [18:59:02] otherwise, maybe RobH can help you? [18:59:15] I recently did the doc and integration stuffs with folks [18:59:17] greg-g: Thanks. When will be a good moment? Don't want to break Search's work. 
[18:59:32] paravoid, ok [18:59:47] so im familiar with it [19:00:06] gwicke: ask away when you hit it im happy to help (just glad its something i know how to help with) [19:00:22] (or yea if you get really annoyed or too busy i can hack at it) [19:00:42] RobH, let me give it a quick try (currently grepping for it) [19:00:49] will ping you for help as needed [19:01:03] paravoid, thanks! [19:01:12] no worries [19:01:56] James_F: I think they're ok? [19:02:14] greg-g: OK; will ask Roan to do it at 11:30 after his meeting then. [19:02:19] cool [19:17:14] (03PS1) 10MaxSem: Redirect Kindle search requests to mobile [operations/puppet] - 10https://gerrit.wikimedia.org/r/117449 [19:18:04] (03PS3) 10Dzahn: Remove pmtpa apaches from site.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/117206 (owner: 10Reedy) [19:21:35] (03CR) 10Dzahn: "Reedy, i split that into 2 separate things, puppet and DHCP, and just remove from puppet for now. Robh/Chris might want to use DHCP to mas" [operations/puppet] - 10https://gerrit.wikimedia.org/r/117206 (owner: 10Reedy) [19:22:54] (03CR) 10Dzahn: [C: 032] Remove pmtpa apaches from site.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/117206 (owner: 10Reedy) [19:30:04] (03PS1) 10Bene: Enable Extension:GuidedTour on testwikidatawiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117452 [19:30:24] hoo ^ :D [19:30:39] Do you guys want that *now*? [19:31:20] (03CR) 10John F. Lewis: [C: 031] "Would be useful pending discussions about enabling on Wikidata itself." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117452 (owner: 10Bene) [19:32:32] anyone deploying? [19:32:50] hoo, nobody should: Friday! [19:32:52] Does lists.wikimedia.org go thorugh Google, or directly to mailman, possibly via spamassassin? [19:33:03] MaxSem: I will... testwiki only change ;) [19:34:22] bd808: Are you deploying atm? [19:34:39] Would like to push a testwikidata change [19:34:56] hoo: Nope. I'm just looking around and thinking about cleanup. [19:35:07] ok, will do that quickly then ;) [19:35:16] You need flight deck clearance from greg-g [19:35:29] If he's around at this time [19:36:01] hoo: Just https://gerrit.wikimedia.org/r/#/c/117452 ? [19:36:09] bd808: Exactly [19:36:42] hoo: LGTM [19:37:42] csteipp is on tin atm... but not deploying I guess [19:38:18] hello? [19:38:34] I'm not deploying [19:38:35] greg-g: hey :) [19:38:38] what's up? [19:38:57] greg-g: Volunteeers want me to push a smallish change to testwikidatawiki [19:39:02] enable guided tours there [19:39:09] .... [19:39:27] bug report? [19:39:41] we don't have one, I guess [19:39:46] no worries, just curious [19:39:48] (03CR) 10Addshore: [C: 031] "see no problem here" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117452 (owner: 10Bene) [19:39:59] (03PS1) 10Dzahn: remove some Tampa LVS monitoring [operations/puppet] - 10https://gerrit.wikimedia.org/r/117454 [19:40:08] ... I would love for it to wait for monday, i don't like getting ops made at me [19:40:14] s/made/mad/ [19:40:59] mh, I don't want to block our volunteers for a whole weekend, that's their most productive time, after all [19:41:38] is there a plan for pushing it out on wikidata on a specific date? 
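[Editor's note: for readers unfamiliar with the change being debated above (Gerrit 117452), enabling an extension on a single wiki is normally a per-database switch in the shared wmf-config settings array. The sketch below is illustrative only; the setting name wmgUseGuidedTour is an assumption for the example, not copied from the actual patch.]

    <?php
    // Illustrative only: a per-wiki switch in the shared settings array.
    // In wmf-config the $wgConf object already exists; it is created here
    // just so the fragment is self-contained.
    $wgConf = new SiteConfiguration();

    $wgConf->settings['wmgUseGuidedTour'] = array(
        'default'          => false, // off everywhere by default...
        'testwikidatawiki' => true,  // ...on only for the wiki named in the change
    );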
[19:42:02] greg-g: No, they want to get community consensus there by demoing it from testwikidata AFAIK [19:42:36] There's a thread on the village pump on wikidata, though [19:42:50] k, well, I don't want to be a jerk, but we currently have a "no deploys on friday except in extreme circumstances" policy, and once I compromise once, everyone will start puppy dogging me [19:43:14] "puppy dogging me" meaning "asking nicely with puppy dog eyes" [19:43:20] :) [19:43:21] heh [19:43:31] Ok, we'll let this wait till monday then [19:43:47] thanks, and sorry [19:43:59] greg-g: On a related note... [19:44:03] uh oh [19:44:05] see?! [19:44:30] * hoo hides [19:44:30] apergos askd that I clean up some cruft on tin that is running the shanpshot servers out of disk [19:44:47] that's a puppet change, right? [19:45:01] oh, no, the rm of old mw versions? [19:45:09] No it would be removing old wmf branches and a full scap [19:45:14] * greg-g nods [19:45:15] ugh [19:45:17] heh [19:45:19] fi you would prefer to wait, you can wait [19:45:28] but no new scaps of new branches til an old one goes away [19:45:34] it's essentially a no-op + l10n [19:45:43] In theory [19:45:45] but, honestly [19:45:51] this is tricky [19:45:56] I understand this is a friday etc so [19:46:02] If apergos can wait I can wait [19:46:02] the caching thing has bit us in the butt multiple times since I've been here [19:46:10] * bd808 nods [19:46:11] heh, did you really mean puppet class mediawiki::selfdestroy($version) .. ensure absent? [19:46:16] :) [19:46:19] just... seriously, no new branches til old ones get tossed. [19:46:35] apergos: Next is Thursday [19:46:43] apergos: Agreed. I'll write up a plan for clean up and circulated it today [19:46:45] so it's safe to let this sit till Monday [19:47:00] We can do the cleanup on Monday/Tuesday [19:47:04] cool [19:47:25] ok, thanks! (and if the number of live branches can be kept to a good size all the time we'll not have this problem til the code size gets a loooot bigger) [19:47:27] it could be a cron that uses find to find all files older than X [19:47:57] apergos: Yeah. We need to make cleanup a weekly thing, not just when Sam thinks to do it [19:48:01] heh [19:48:21] mutante: We have a version per week, so it should be ok to count versions... plus one or two for safety (as we sometimes skip a week) [19:48:22] once we figure out X [19:48:37] *should* [19:48:38] Seven. [19:49:03] We need N..N-6 branches for cache [19:49:09] that thing with doublig the size of the branch though... ouch [19:49:17] if that happens again all bets are off! [19:49:25] resize all the LVMs [19:49:33] We can dump the l10n cruft faster [19:49:42] much faster, N-3 there [19:49:49] oohhh now that's an idea I could get behind [19:49:53] bd808: until we change somethind :/ [19:49:58] mtime might be a better metric [19:50:22] all right, well I'm going to mentally check out, it's veg time, I'll drop back in later most likely [19:50:36] greg-g: When we change the release cadence we'll have to rethink many things [19:51:07] bd808: so why increase those things by one unneedly? :P [19:51:09] but I hear ya [19:51:32] I'd just like to decouple it as much as possible/reasonable [19:52:03] mtime doesn't cut it for knowing when a delete is ok. It's bounded by "last page served" time [19:52:38] N versions ago is better for that? 
[19:54:03] We can actually find out from git history +/-scap time when a version was last deployed [19:54:40] * greg-g nods [19:54:47] I like that thinking better ;) [19:54:57] just make a !killmw trigger on IRC and whenever Icinga complains about disk space, you kill :) [19:55:19] N versions is a convenient proxy for thinking about that with an assumption of the cadence [19:56:11] I'll write an email [19:56:13] bd808: yeah, I just assumed mtime would have been reasonable, but if that's not easy, then this is fine [19:56:18] s/easy/right/ [19:56:50] (03CR) 10Dzahn: [C: 032] remove some Tampa LVS monitoring [operations/puppet] - 10https://gerrit.wikimedia.org/r/117454 (owner: 10Dzahn) [19:59:41] hoo: You should add your extension enable to https://wikitech.wikimedia.org/wiki/Deployments for Monday before anyone forgets [20:01:47] right [20:13:59] !log catrope updated /a/common/php-1.23wmf16 to {{Gerrit|I4a10768ec}}: Update VisualEditor to wmf16 branch for cherry-pick [20:14:07] Logged the message, Master [20:14:46] !log catrope synchronized php-1.23wmf16/extensions/VisualEditor/modules/ve-mw/dm/nodes/ve.dm.MWBlockImageNode.js 'Fix image corruption bug' [20:14:54] Logged the message, Master [20:17:50] greg-g: Deployment over, no drama. [20:22:37] bd808: fyi.. https://gerrit.wikimedia.org/r/#/c/117206/3/manifests/site.pp [20:23:13] W00t. Down with pmtpa [20:26:21] !log revoking puppet certs for Tampa appservers [20:26:29] Logged the message, Master [20:27:56] !log killing mw1-16 from puppet stored configs, icinga,.. [20:28:05] Logged the message, Master [20:38:00] hashar: around? [20:38:10] hoo: not really [20:38:37] hashar: mh... quick one... if I change a -labs file... how do I deploy it to labs? [20:38:45] just merge it in gerrit and care no more? [20:39:02] beta cluster, I mean [20:39:21] hoo: any change to operations/mediawiki-config.git has to be reviewed since it might impact prod as well [20:39:44] hashar: Sure, I guess I can find someone for that [20:39:45] hoo: once merged, a jenkins job will update the beta cluster for you. doc is at https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/How_code_is_updated [20:39:57] hoo: the job is https://integration.wikimedia.org/ci/job/beta-mediawiki-config-update/ [20:40:06] ok, and that doesn't need to be synced in production for sanity reasons or whatever? [20:40:25] hoo: and you can find the job by browsing http://integration.wikimedia.org/ then click [Dashboard] in the top menu [20:40:36] hoo: usually one should sync it on production as well :D [20:40:38] I know that job, just wasn't sure [20:40:48] and there's our friday deploy problem again [20:40:56] ;-D [20:41:22] there's a long reddit about friday deploys as well [20:41:37] mutante: :D [20:42:09] (03PS1) 10Hoo man: Enable guided tours on beta's Wikidata [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117467 [20:43:05] (03PS2) 10Hoo man: Enable guided tours on beta's Wikidata [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117467 [20:43:18] (03PS1) 10Bene: Enable Extension:GuidedTour on betalabs wikidatawiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117468 [20:43:23] my current screen is like [20:43:24] Killing mw18.pmtpa.wmnet...done. [20:43:24] Killing mw19.pmtpa.wmnet...done. [20:43:25] Killing mw20.pmtpa.wmnet...done. 
[20:43:28] and keeps going:) [20:44:18] (03CR) 10Hoo man: [C: 04-1] "Also redundant with https://gerrit.wikimedia.org/r/117467" (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117468 (owner: 10Bene) [20:44:39] hashar: Could you have a look, please? :P I can sync it on monday... [20:44:49] hoo: he's not here:) [20:44:51] or pull it onto tin today and the sync it on Monday [20:45:03] (03Abandoned) 10Bene: Enable Extension:GuidedTour on betalabs wikidatawiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117468 (owner: 10Bene) [20:45:53] icinga puppet restart fails but it works anyways [20:46:16] meh, or not, looking [20:47:46] ok, so... that's not going to happen until Monday either? [20:49:44] greg-g: ^ any opinion on that? [20:50:35] hoo: beta cluster? that can happen at any time [20:50:55] greg-g: Ok, it's ok if I just pull it onto tin? [20:51:18] no... [20:51:25] leave it? [20:51:40] oh, right, lemme actually look [20:52:18] ok, yeah, you can pull that on tin, it's a no-op there [20:52:21] https://gerrit.wikimedia.org/r/#/c/117467/ [20:52:23] greg-g: hashar> hoo: usually one should sync it on production as well [20:52:34] that [20:52:36] yeah [20:52:54] greg-g: If it's ok, I can sync it out, really a no-op [20:54:43] ottomata: asking you as RT duty person ;) : how do we figure out what's supposed to be the eqiad maintenance/whatever host with a MediaWiki install? I need it for https://gerrit.wikimedia.org/r/#/c/117250/ [20:54:49] hoo: yeah [20:55:04] (and for other things that will have to be migrated fenari and hume) [20:55:36] (03PS1) 10CSteipp: Use Type 'B' Passwords on CentralAuth Wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117469 [20:56:00] greg-g: ah, thanks :) Will just wait for someone to +1 or +2 the change, so that I have enough confidence [20:56:28] * greg-g nods [20:59:54] (03CR) 10Greg Grossmeier: Enable guided tours on beta's Wikidata (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117467 (owner: 10Hoo man) [21:00:44] why does everyone think this? [21:01:18] $wgConf->settings[$key] = array_merge( $wgConf->settings[$key], $value ); [21:02:00] greg-g: Am I totally wrong on this? AFAIS those are just overrides, so the default from the production file stays in place? [21:04:28] (03CR) 10Hoo man: "re" (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117467 (owner: 10Hoo man) [21:04:30] hoo: The InitialiseSettings-labs.php file gets loaded instead if InitialiseSettings.php [21:04:38] s/if/of/ [21:04:40] bd808: since then? [21:04:42] wtf [21:04:54] That's the way multiversion works [21:05:17] It used to be different [21:05:33] and there's still stuff in the production one around to load the -labs [21:05:49] line 12971 and following in the production one [21:05:59] hoo: bd808 my comment was an honest question, not a socratic method thing :) but I trust bd808 [21:06:23] RobH, are there any rules on the host name to use? [21:06:25] but, honestly, even if it sources from production, it's nice to be explicit [21:06:40] gwicke: uh, as in what, dont rename them. [21:06:40] I'm considering parsoid.tests.wikimedia.org or parsoid.qa.wikimedia.org [21:06:51] oh, you mean cosmetic urls [21:06:54] eh, domain [21:06:55] not actual system hostnames [21:07:00] yeah ;) [21:07:06] greg-g: well, ok... 
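[Editor's note: the disagreement above — and its resolution just below — hinges on the difference between PHP's + operator and array_merge() for string-keyed arrays such as per-wiki settings. A minimal illustration with made-up values; nothing here is taken from the real config files.]

    <?php
    // Made-up per-wiki values, keyed the way $wgConf->settings entries are.
    $production   = array( 'default' => false, 'testwiki' => true );
    $labsOverride = array( 'default' => true );

    // '+' keeps the left-hand value for any key present in both arrays,
    // so the override is silently ignored:
    $plus = $production + $labsOverride;
    // array( 'default' => false, 'testwiki' => true )

    // array_merge() lets the right-hand array win for string keys, which is
    // what the quoted "array_merge( $wgConf->settings[$key], $value )" relies on:
    $merged = array_merge( $production, $labsOverride );
    // array( 'default' => true, 'testwiki' => true )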
[21:07:10] but I still want this to be clear [21:07:14] so you cannot do sub.sub.wikimedia.org on misc-web-lb [21:07:25] it doesnt have the certificates to support it [21:07:30] it can only do *.wikimedia.org [21:07:34] ah, ok [21:07:45] so parsoid-tests.wikimedia.org [21:07:52] sounds the best to me yep [21:07:53] hoo: You're right. Sorry. [21:08:19] hoo: I thought it used the realmSpecificFile stuff but it doesn't [21:08:21] even our primary varnish cannot handle sub.wildard.wikimedia.org [21:08:22] RobH, do you know off-hand in which file the DNS for that lives? [21:08:24] no problem [21:08:35] bd808: can you +1, then maybe? [21:08:38] yep, so there is the dns git project operations/dns [21:08:48] and then you'd put this into templates/wikimedia.org [21:09:01] ah, ok [21:09:12] basically copy doc.wikimedia.org [21:09:15] and you'll be good =] [21:09:24] (for dns) [21:09:35] also, thanks for taking the time to learn and do this stuff [21:09:47] np [21:09:49] cuz you could get away with just dumping this stuff in an RT tikcet and waiting on some opsen [21:09:56] I appreciate it =] [21:10:18] re the backend in varnish, do I need to configure that separately or can I just drop in ruthenium in there? [21:11:08] (03CR) 10Addshore: [C: 031] Enable guided tours on beta's Wikidata (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117467 (owner: 10Hoo man) [21:12:25] So for varnish, you have to modify a couple of things [21:12:29] lemme find them [21:12:59] in operations puppet repo its manifests/role/cache.pp [21:13:21] (03CR) 10BryanDavis: [C: 032] Enable guided tours on beta's Wikidata (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117467 (owner: 10Hoo man) [21:13:27] have to add the backend host there, as well as someplace else... lemme find the config file [21:13:33] (03Merged) 10jenkins-bot: Enable guided tours on beta's Wikidata [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117467 (owner: 10Hoo man) [21:13:47] bd808: Thanks, will sync it [21:14:20] thanks all [21:14:55] (03CR) 10Addshore: Enable guided tours on beta's Wikidata (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117467 (owner: 10Hoo man) [21:14:58] (03PS1) 10MaxSem: Always output Content-Length on mobile redirect [operations/puppet] - 10https://gerrit.wikimedia.org/r/117471 [21:15:04] gwicke: So for varnish backend configuration, you have to add your host to both templates/varnish/misc.inc.vcl.erb and manifests/role/cache.pp [21:15:12] PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 3917 MB (10% inode=95%): /srv 540561 MB (37% inode=99%): [21:15:42] RobH: k, got that [21:15:43] !log hoo synchronized wmf-config/InitialiseSettings-labs.php 'Syncing beta-only change for consistency' [21:15:46] anyone want an easy l10nupdate review? Needs +2 in puppet: https://gerrit.wikimedia.org/r/#/c/116718/ [21:15:51] Logged the message, Master [21:15:52] gwicke: if i recall correctly, those are the two spots to change for that, then the dns and you should be set. [21:16:04] feel free to add me as reviewer if you'd like [21:16:34] we had a lot more issues iwth integration/doc migration than that, but i think its due to odd caching headers sent by one of those [21:16:43] greg-g: Thanks for all your help and patience! 
:) [21:16:48] also bd808 [21:17:24] yw hoo [21:18:40] hoo: yw/ty [21:19:43] (03PS1) 10GWicke: Add parsoid-tests.wikimedia.org proxy forward to misc [operations/puppet] - 10https://gerrit.wikimedia.org/r/117472 [21:20:12] PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 3917 MB (10% inode=95%): /srv 540561 MB (37% inode=99%): [21:20:36] RobH, please add anybody else who should look at this [21:20:52] doing the DNS next [21:22:40] gwicke: well, im confident enough with the change to merge it for you without other folks reviewing. this is basically what we've been doing with misc-web-lb since it started [21:22:50] since paravoid already agreed that it was ok for it to exist there [21:22:58] im comfortable with the implementation [21:23:32] once zuul does its thing. [21:24:11] that ps looks good to me btw. [21:24:31] (03CR) 10RobH: [C: 032] Add parsoid-tests.wikimedia.org proxy forward to misc [operations/puppet] - 10https://gerrit.wikimedia.org/r/117472 (owner: 10GWicke) [21:25:12] PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 3916 MB (10% inode=95%): /srv 540561 MB (37% inode=99%): [21:25:36] RobH, ok; thanks! [21:26:20] its merged [21:26:34] on palladium, so cp1043 is running puppet update now [21:26:45] will kick cp1044 after im sure it didnt break cp1043 [21:26:49] (paranoia!) [21:27:43] there is nothing listening on that port yet btw [21:27:52] (on ruthenium) [21:27:58] yea, thats fine [21:28:06] i just wanna make sure the cp servers like the config, they should. [21:28:34] and it was fine [21:29:01] misc-web-lb varnish systems dont care if the backends they point to are online [21:29:16] i just am paranoid about varnish config changes and babysitting them. [21:29:58] (03PS1) 10Hashar: contint: deny access of node history via RSS [operations/puppet] - 10https://gerrit.wikimedia.org/r/117475 [21:30:12] PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 3916 MB (10% inode=95%): /srv 540561 MB (37% inode=99%): [21:30:47] (03PS1) 10GWicke: Add parsoid-tests to wikimedia.org zone [operations/dns] - 10https://gerrit.wikimedia.org/r/117476 [21:31:10] can someone please merge https://gerrit.wikimedia.org/r/117475 ? That tweak some apache deny rule for the Jenkins server. [21:32:44] (03CR) 10Dzahn: [C: 032] "looks reasonable, per hashar" [operations/puppet] - 10https://gerrit.wikimedia.org/r/117475 (owner: 10Hashar) [21:32:48] thanks [21:33:08] eh, needs verified [21:33:23] mutante: https://gerrit.wikimedia.org/r/#/c/116718/ :) :) [21:35:06] PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 3915 MB (10% inode=95%): /srv 540561 MB (37% inode=99%): [21:35:47] is it just me, or is disk space an issue lately? 
:) [21:37:51] (03PS3) 10Greg Grossmeier: Log length of l10nupdate to SAL and Graphite [operations/puppet] - 10https://gerrit.wikimedia.org/r/116718 [21:38:27] guys, no jenkins, needs rebase, Friday afternoon, "didnt test" comments, what's up ?:) [21:38:50] mutante: :P there's only one place to test production code [21:39:30] i'll do it if jenkins verifies [21:39:37] i dont feel like even overriding that [21:39:50] k [21:41:02] !log disabling puppet agent on all tampa appservers [21:41:10] Logged the message, Master [21:41:58] (03PS2) 10Hashar: contint: deny access of node history via RSS [operations/puppet] - 10https://gerrit.wikimedia.org/r/117475 [21:45:18] !log killing all tampa appservers from puppetstoredconfigs [21:45:27] Logged the message, Master [21:48:39] please see #wikimedia-tech - another instance of private IPs editing articles [21:49:27] Krenair: mind reporting and cc'ing people you think might know? [21:49:28] !log restarting Jenkins it is broken [21:49:32] I know this is a recurring problem... [21:49:36] Logged the message, Master [21:50:03] RobH, https://gerrit.wikimedia.org/r/#/c/117476/ [21:50:20] actually reading through the times, it seems this occured in the early hours of thursday morning (utc), probably no longer relevant... [21:50:31] (03CR) 10RobH: [C: 032] Add parsoid-tests to wikimedia.org zone [operations/dns] - 10https://gerrit.wikimedia.org/r/117476 (owner: 10GWicke) [21:50:49] RobH, thx! [21:50:59] quite welcome, its going live on dns now [21:51:02] (03CR) 10DamianZaremba: "This change has broken ircecho on labs bz #62407." [operations/puppet] - 10https://gerrit.wikimedia.org/r/104504 (owner: 10Matanya) [21:51:10] gwicke: its live [21:51:20] you should be all set now for everything outside your local server setup [21:51:30] (afaik for varnish-web-lb that is) [21:52:56] yup, I'll pick one of the cassandra servers as a worker then [21:54:06] !log Jenkins restarted [21:54:14] Logged the message, Master [21:55:31] from #-tech: [22:46] https://en.wikipedia.org/wiki/Special:Contributions/10.4.1.65 [21:55:42] [22:46] how is that possible? [21:56:03] are/were some internal proxies messed up? [21:56:05] MatmaRex, some host not configured as a proxy/whatever [21:56:12] Labs/tools are 10.x.x.x ips [21:56:36] (03CR) 10Dzahn: [C: 032] contint: deny access of node history via RSS [operations/puppet] - 10https://gerrit.wikimedia.org/r/117475 (owner: 10Hashar) [21:57:12] RobH, I think configuring ruthenium.wikimedia.org did not work as that does not seem to be in DNS [21:58:14] bahhhhhhhhh [21:58:21] yea should have been eqiad.wmnet [21:58:23] ruthenium.eqiad.wmnet [21:58:26] *nod* [21:58:27] i missed it [21:58:39] you fixing or shall i? [21:58:45] Damianz, all internal IPs are 10.*.*.* in our infrastructure [21:59:02] RobH, fixing it [21:59:10] cool, i'll merge when ya finish [21:59:10] MaxSem: True... but not much else edits from there [21:59:29] (03PS1) 10GWicke: Use ruthenium.eqiad.wmnet, .wikimedia.org won't work [operations/puppet] - 10https://gerrit.wikimedia.org/r/117602 [21:59:53] well, edits from labs wouldn't be coming over internal network, for obvious reasons it's impossible [22:00:06] yea, that'll work a bit better. 
[22:00:08] (03CR) 10Dzahn: [C: 032] Log length of l10nupdate to SAL and Graphite [operations/puppet] - 10https://gerrit.wikimedia.org/r/116718 (owner: 10Greg Grossmeier) [22:00:14] ;) [22:00:26] (03CR) 10RobH: [C: 032] Use ruthenium.eqiad.wmnet, .wikimedia.org won't work [operations/puppet] - 10https://gerrit.wikimedia.org/r/117602 (owner: 10GWicke) [22:00:37] greg-g: ^ [22:00:51] dear opsen, WTF is that 10.4.1.65? it's not in DNS [22:01:24] MaxSem: It would be a node in ptmpa labs I think [22:01:24] I need to know whether it should be added to MW config or there's some other problem that needs fixing [22:01:31] ewwww [22:01:34] mutante: weeee [22:01:38] mutante: thank you kindly [22:01:42] routing fail? [22:02:14] MaxSem: pmtpa labs is 10.4.0.0/24 at least internal to labs [22:02:29] 65.1.4.10.in-addr.arpa domain name pointer tools-exec-02.pmtpa.wmflabs. [22:03:06] Those edits were at the same time that varnish frontends were mangling the cookies [22:03:08] ; 10.4.0.0/21 - guest VMs subnet [22:03:08] ; 10.4.16.0/24 - VM host compute nodes subnet [22:03:59] hurry up zuul [22:04:03] greg-g: np [22:04:53] Coren, ^^^^ [22:04:56] MaxSem: must have been a bot on labs instance (that was already moved) [22:04:59] i think [22:05:14] that is about explaining edits from that IP right [22:05:21] * Coren reads backscroll [22:05:44] It was probably logged out due to https://bugzilla.wikimedia.org/show_bug.cgi?id=62288 [22:05:46] that IP shows up in MW history [22:06:12] Ah, yes. Bad bot that keeps on editing when logged off. Point me at an edit so I can kill it? [22:06:24] Coren: it's over, it seems [22:06:29] I think bd808 is being ignored :) [22:06:42] Coren, the problem is not in logging of but that it's editing from an internal IP [22:06:49] * bd808 is used to that [22:06:58] Coren: https://en.wikipedia.org/wiki/Special:Contributions/10.4.1.65 [22:07:10] MaxSem: Bots aren't allowed to edit while logged off, at least per enwiki polict; they should be using &assert=bot [22:07:28] MaxSem: It doesn't have an external ip does it? Do we route labs access to prod out via SNAT and back in again? [22:07:52] maxsem@bastion1:~$ ping mw1001.eqiad.wmnet [22:07:52] PING mw1001.eqiad.wmnet (10.64.0.31) 56(84) bytes of data. [22:07:52] From ae0-105.cr1-sdtpa.wikimedia.org (10.4.16.252) icmp_seq=1 Packet filtered [22:07:56] bd808: We don't. Bots normally edit from internal IPs. [22:08:05] * bd808 nods [22:08:11] Well, bots in labs anyways. [22:08:21] eh, so there's no bug? [22:08:38] MaxSem: https://bugzilla.wikimedia.org/show_bug.cgi?id=62288 [22:08:41] Fixed now [22:08:53] MaxSem: right, not anymore [22:08:59] as long as we never reuse that IP for a different bot? [22:09:08] I mean, that bots edit from internal IPs [22:09:16] because it's a deleted instance now, right [22:09:33] I'll just anon-block that range [22:09:42] MaxSem: That works. [22:09:49] Hm. Wait. [22:10:03] Couldn't that cause trouble with XFF blocks? 
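[Editor's note: a quick sanity check of the subnet question above — whether 10.4.1.65 really falls inside the 10.4.0.0/21 guest-VM range. The helper below is hypothetical, written only for this note; it is not part of MediaWiki or any Wikimedia config.]

    <?php
    // Hypothetical helper: true if $ip lies within the CIDR range $cidr.
    function ipInCidr( $ip, $cidr ) {
        list( $net, $bits ) = explode( '/', $cidr );
        $mask = -1 << ( 32 - (int)$bits );
        return ( ip2long( $ip ) & $mask ) === ( ip2long( $net ) & $mask );
    }

    var_dump( ipInCidr( '10.4.1.65', '10.4.0.0/21' ) );  // bool(true)  - labs guest VM range
    var_dump( ipInCidr( '10.64.0.31', '10.4.0.0/21' ) ); // bool(false) - mw1001, eqiad production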
[22:10:43] Coren, as long as we block labs range only [22:10:51] ...should be ok [22:13:06] https://en.wikipedia.org/w/index.php?title=Special:Log&type=block&user=&page=User:10.4.1.0%2F24 [22:13:56] !log deleting salt keys for all Tampa app servers removed from puppet [22:14:04] Logged the message, Master [22:15:06] RECOVERY - check_disk on lutetium is OK: DISK OK - free space: / 23658 MB (66% inode=95%): /srv 540714 MB (37% inode=99%): [22:18:09] Jenkins is happy so I will happily rush to bed *wave* [22:18:17] night hashar [22:23:28] RobH, https://gerrit.wikimedia.org/r/#/c/117602/ might need to be touched to convince Jenkins [22:23:29] greg-g: Hmm, your roadmap didn't note the roll-out of the new ULS Beta Feature (it'll go with the train). [22:24:01] yea i was waiting on it then got distracted, merging now [22:24:43] merged, pushing to cp1043 [22:26:39] k, ready to test [22:26:59] success [22:27:05] thanks! [22:29:43] heh, you hit cp1043 i bet [22:29:47] im rolling on 1044 now [22:29:57] (but puppet may have auto ran in those minutes, its possible) [22:30:13] misc-web-lb is just load balacned between those two varnish servers [22:30:14] James_F: probably because it wasn't on the calendar at all... [22:30:27] greg-g: Yeah, sorry, I should have thought to add it after the fact. [22:30:32] no worries [22:30:32] gwicke: glad it worked, its live on cp1044 now so you should be all set =] [22:30:32] * James_F grumbles. [22:30:37] the BF stuff is hard to track [22:30:45] … [22:30:51] * James_F avoids swearing. Just. [22:30:58] :) [22:30:59] Let's go with "yes". [22:31:12] James_F: I'm open to suggestions if you have any. [22:32:29] greg-g: We could try to re-engineer it to a whitelist rather than a hook, but… [22:32:43] greg-g: rdwrer might have some thoughts. [22:33:04] rdwrer: what say you, good sir? [22:33:09] (03CR) 10Ori.livneh: [C: 032] Adding submodule update support to git::clone [operations/puppet] - 10https://gerrit.wikimedia.org/r/117229 (owner: 10Ottomata) [22:35:32] (03PS2) 10Ori.livneh: Emit GeoIP cookie using dedicated Set-Cookie header [operations/puppet] - 10https://gerrit.wikimedia.org/r/117375 [22:36:47] (03CR) 10BryanDavis: "Added a post-merge comment about the possible need for `--recursive` to deal with sub-submodules." (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/117229 (owner: 10Ottomata) [22:38:26] (03PS1) 10Ori.livneh: Re-enable GeoIP Set-Cookie on Labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/117612 [22:39:51] * Damianz gets twitchy [22:45:50] greg-g: hey! Added myself https://wikitech.wikimedia.org/wiki/Deployments#Week_of_March_10th (very first row) [22:45:54] I hope that's fine [22:49:02] James_F, greg-g, what are we talking about? [22:51:33] rdwrer: Some way of controlling what BFs just magically appear in the train due to someone, somewhere merging a BF hook in a deployed extension. [22:52:52] James_F: Oh, totally. Whitelist would be relatively easy to do based on ID [22:53:35] rdwrer: James_F so then it'd be easy to plan/annouce them instead of them just appearing (is my main concern, I don't care tech wise ;) ) [22:54:05] rdwrer: So in the BF hook handler we just have a manually-updated whitelist, you're thinking? [22:54:29] rdwrer: Feels like that mixes code (for anyone) with config (for WMF) though. :-( [22:54:35] James_F: We'd have a whitelist configured in wmf-config, I'd imagine [22:55:03] And we'd probably want to default to pass-through, given that that's how people expect MW extensions to work. 
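[Editor's note: the "whitelist configured in wmf-config" idea above never gets code in this log, so here is one hypothetical shape of it, purely as a sketch. The feature IDs and the filter function are invented for illustration, and wiring it into the BetaFeatures registration hook is deliberately left out.]

    <?php
    // Hypothetical sketch of a wmf-config controlled Beta Features whitelist.
    // $prefs is assumed to be an array of preference definitions keyed by
    // feature ID; $whitelist would come from wmf-config.
    function wmfFilterBetaFeatures( array $prefs, $whitelist ) {
        if ( $whitelist === null ) {
            // Default to pass-through, as suggested above: no whitelist
            // configured means every registered feature still appears.
            return $prefs;
        }
        // Otherwise only whitelisted feature IDs survive to the train.
        return array_intersect_key( $prefs, array_flip( $whitelist ) );
    }

    // Example with invented IDs:
    $prefs = array(
        'some-new-feature'   => array( /* ... */ ),
        'another-experiment' => array( /* ... */ ),
    );
    $visible = wmfFilterBetaFeatures( $prefs, array( 'some-new-feature' ) );
    // $visible now contains only 'some-new-feature'.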
[22:57:09] (11:45:50 PM) hoo: greg-g: hey! Added myself https://wikitech.wikimedia.org/wiki/Deployments#Week_of_March_10th (very first row) [22:57:46] hoo: should be fine [22:57:57] there's enough opsen awake then, and hashar is too [22:58:18] ok :) Thanks again [22:58:29] James_F: hehe, wrong week, I'll fix [22:59:06] James_F: wait, no, misread [22:59:16] James_F: but I'll include it in next weeks mw deploy notes [22:59:22] on the cal [22:59:29] greg-g: Kk. [22:59:31] up to you on sending out to wikitech/ambassadors [22:59:44] greg-g: I told Pau to do so, and he said he would. [22:59:53] greg-g: It's in Tech/News, which is a good start. [23:00:20] cool [23:03:45] all tampa appservers out of icinga, cya later [23:03:54] nice [23:14:41] recentchanges is broken on beta [23:14:48] PHP fatal error in /data/project/apache/common-local/php-master/extensions/Flow/includes/Formatter/RecentChanges.php line 85: [23:14:48] Call to a member function getId() on a non-object [23:15:11] jackmcbarn: join me in #wikimedia-corefeatures and say the same thing [23:27:13] (03PS1) 10BryanDavis: Order branch directories as version numbers [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117618 [23:36:30] (03CR) 10Ori.livneh: [C: 032] Order branch directories as version numbers [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117618 (owner: 10BryanDavis) [23:36:37] (03Merged) 10jenkins-bot: Order branch directories as version numbers [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117618 (owner: 10BryanDavis) [23:36:48] !log ori updated /a/common to {{Gerrit|If2530281b}}: Order branch directories as version numbers [23:36:56] Logged the message, Master
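[Editor's note: the beta fatal reported above — "Call to a member function getId() on a non-object" — is the classic shape of dereferencing a value that can be null. The snippet below is not the actual Flow fix; it only sketches the generic guard, with invented method names, for readers following along.]

    <?php
    // Generic guard, not the real Flow patch: check that the object exists
    // before calling getId() on it, and degrade gracefully otherwise.
    function describeRevisionUser( $revision ) {
        $user = $revision ? $revision->getUser() : null; // getUser() is assumed here
        if ( !is_object( $user ) ) {
            // Without the guard, this is where the fatal would occur.
            return '(unknown user)';
        }
        return $user->getName() . ' (id ' . $user->getId() . ')';
    }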