[00:00:19] hoo: Psh. :-) [00:00:24] (03CR) 10Chad: [C: 032] Elasticsearch upgrade starting [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117095 (owner: 10Manybubbles) [00:00:31] <^d> In that case, lez go. [00:00:31] Fine [00:00:32] thanks ^d [00:00:38] (03Merged) 10jenkins-bot: Elasticsearch upgrade starting [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117095 (owner: 10Manybubbles) [00:00:50] I have the conch [00:02:32] !log Starting Elasticsearch upgrade [00:02:41] Logged the message, Master [00:02:41] !log manybubbles synchronized wmf-config/jobqueue-eqiad.php 'Pausing Cirrus jobs for the duration of the upgrade.' [00:02:49] Logged the message, Master [00:03:07] !log manybubbles synchronized wmf-config/InitialiseSettings.php 'Turn Cirrus of for the duration of the upgrade' [00:03:15] Logged the message, Master [00:04:05] greg-g: I'm done syncing for now [00:04:14] I'm not sure if I want to do both submodule updates at once...probably faster that way though [00:04:58] ^d: cirrus jobs are still running. I imagine the puppet change hasn't hit all the job runners yet [00:05:01] manybubbles: cool [00:05:16] rdwrer: you ready? or should I pass the conch to jgonera / kaldari? [00:05:37] <^d> manybubbles: puppet change just removed them from prioritized queue. [00:05:48] Give it to jgonera [00:05:49] <^d> It was the wmf-config change that added them to excluded from default. [00:05:50] and that config change removed them from the default queue [00:06:00] so they should stop [00:06:56] jgonera: you ready? [00:07:02] maybe they are still running because the job runners have to finish their loops? [00:07:11] <^d> manybubbles: Could be. [00:07:12] greg-g, kaldari will start and I will continue [00:07:25] I'd like to go before you guys scap though [00:07:37] Or I can scap I guess [00:07:39] you are scapping again? give me a sec [00:07:58] jgonera: can you have kaldari join this channel please? [00:08:11] yep [00:08:12] about to remove snapshot hosts from dsh [00:08:33] if i could load the change .. [00:08:34] scap's log output looks a little funky right now [00:08:50] It's in a state of flux [00:08:59] So don't panic! [00:09:21] Also you will get failures from shapshot[1234] [00:09:23] (03CR) 10Dzahn: [C: 032] Re-remove snapshot[1234] from mediawiki-installation dsh group [operations/puppet] - 10https://gerrit.wikimedia.org/r/117326 (owner: 10BryanDavis) [00:09:33] Or maybe not :) [00:09:55] ^d: confirmed that cirrus is not available as a betafeature on enwiki [00:09:58] bd808: there you go, i saw it, i saw scap.. [00:10:11] mutante: My hero [00:10:20] <^d> manybubbles: Yeah, we've failed back. [00:10:30] <^d> You can see the drop off in traffic on Elastic. [00:10:30] that part worked [00:10:40] holy mother of god [00:10:44] you can [00:10:48] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=es_query_time&s=by+name&c=Elasticsearch+cluster+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=2&z=small&hc=4 [00:11:11] so the load just generated by the jobs is ~5% which is cool to know [00:11:57] OK I have three commits up and ready to go [00:12:02] Whenever [00:13:20] manybubbles, rdwrer: is it OK for me to sync things or are you guys still deploying? [00:13:29] you may go any time [00:13:33] I have given up the conch [00:13:46] kaldari: I haven't started, nobody's said go [00:14:05] ^d: I'm thinking we wait 20 minutes after the sync before we start going "why hasn't it stopped yet?" [00:14:08] <^d> manybubbles: So, how to shut these jobs off. 
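Two separate switches are in play in the exchange above: puppet's jobs-loop.sh controls which job types the runners treat as prioritized, while wmf-config's jobqueue-eqiad.php controls which types are excluded from the default queue. A minimal way to eyeball both from tin, assuming the host and paths named elsewhere in this log (mw1002, /usr/local/bin/jobs-loop.sh, /a/common):

    ssh mw1002 'grep -i cirrus /usr/local/bin/jobs-loop.sh'   # cirrus still listed => runner has the old puppet config
    grep -ni cirrus /a/common/wmf-config/jobqueue-eqiad.php   # the exclusion list the sync just changed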
[00:14:14] <^d> Heh [00:14:16] well [00:14:29] the loop has times in it like: lpmaxdelay=600 [00:15:31] if you can shell into the job runners and take a look then be my guest. [00:15:54] if they have outdated versions of /usr/local/bin/jobs-loop.sh then we'll be here a while [00:16:01] basically if that lists cirrus [00:16:54] be back in a minute [00:18:06] kaldari: Are you going now? [00:18:11] <^d> manybubbles|away: Worst case: we could set $wgDisableSearchUpdate to true. It'll turn all our jobs into no-ops. [00:18:12] yep [00:18:15] 'k [00:18:30] ^d: yeah. we can recover from that by reindexing a wide swath of things [00:19:19] I don't see the jobs slacking.... [00:19:22] <^d> mw1002 (chosen at random) has the up to date version of jobs-loop [00:19:38] Reedy: around by chance? (not that you should at midnight) [00:19:44] I am [00:19:55] jamesofur: You don't know Reedy very well do you [00:20:02] heh [00:20:12] rdwrer: I do ;) I just feel the need to put a caveat ;) [00:20:13] * James_F grins. [00:20:50] Reedy: I'm trying to figure out what's up with legalteamwiki, it seems like everything was done but so far the dns seems to be going elsewhere. https://legalteam.wikimedia.org/ (for me) goes to a wikimedia.org type site and https://legalteam.wikimedia.org/wiki redirects to foundation wiki [00:21:03] is it just apache's catching up [00:21:05] (03CR) 10Dzahn: "we need to find a replacement host for this either or, we are really trying to get away from fenari instead of adding new things. so the q" [operations/puppet] - 10https://gerrit.wikimedia.org/r/117250 (owner: 10Nemo bis) [00:21:06] ? [00:21:33] Nope... [00:21:43] We were having a couple of issues earlier, but they should've been fixed [00:21:59] ^d: https://gist.github.com/nik9000/8514c609412bcbe745a3 [00:22:03] got it, I think [00:22:35] that list doesn't have cirrusSearchLinksUpdate which is the only one I see still running [00:23:10] springle: is it just me or do we have unnecessarily unique indexes on the _links tables? [00:23:13] like something is stripping it [00:23:15] <^d> Also doesn't have delete. [00:23:32] what happened! [00:23:40] CREATE UNIQUE INDEX /*i*/il_from ON /*_*/imagelinks (il_from,il_to); [00:23:41] CREATE UNIQUE INDEX /*i*/il_to ON /*_*/imagelinks (il_to,il_from); [00:24:15] * AaronSchulz prefers the first being unique (random preference) [00:25:25] ^d: maybe replace += with array_merge? [00:26:54] The + operator returns the right-hand array appended to the left-hand array; for keys that exist in both arrays, the elements from the left-hand array will be used, and the matching elements from the right-hand array will be ignored. [00:26:57] <^d> Yeah. [00:27:48] <^d> Fixing now. [00:28:11] (03PS1) 10Manybubbles: Shut off cirrus jobs for real this time [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117353 [00:28:34] greg-g: gonna want the conch back when we can get it [00:28:46] Apparently we're going over time. [00:28:49] ^d: the other += is probably wrong too [00:28:53] AaronSchulz: yeah, don't need both strictly speaking [00:28:56] <^d> Yeah [00:28:59] <^d> I'm fixing it all now. [00:29:07] jgonera: kaldari how ya'll doing? [00:29:09] springle: I should check if we have those on prod [00:29:15] rdwrer: technically I have all night..... 
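For context on the fix ^d is preparing: $wgJobTypesExcludedFromDefaultQueue is a numerically indexed list, so PHP's + operator keeps the left-hand element for every colliding key and the appended Cirrus job types are silently dropped, exactly as the manual text quoted above says; array_merge() renumbers the keys and keeps everything. A tiny demonstration with illustrative job names (not the real config):

    php -r '
    $existing = array( "AssembleUploadChunks", "PublishStashedFile" );
    $cirrus   = array( "cirrusSearchLinksUpdate", "cirrusSearchDeletePages" );
    var_export( $existing + $cirrus );                # keys 0 and 1 collide: the cirrus entries vanish
    echo "\n";
    var_export( array_merge( $existing, $cirrus ) );  # all four job types survive
    '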
[00:29:18] AaronSchulz: still need the reverse as normal key though [00:29:28] yeah, don't much worry about manybubbles's stuff [00:29:41] mine can wait for you though [00:29:41] right [00:30:04] (03PS1) 10Chad: Don't use += with $wgJobTypesExcludedFromDefaultQueue [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117354 [00:30:09] greg-g, fine, if I don't run out of disk space updating all the submodules [00:30:22] (03Abandoned) 10Manybubbles: Shut off cirrus jobs for real this time [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117353 (owner: 10Manybubbles) [00:30:24] Hahahah [00:30:26] (03CR) 10Manybubbles: [C: 031] Don't use += with $wgJobTypesExcludedFromDefaultQueue [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117354 (owner: 10Chad) [00:30:45] rdwrer: how prep'd are you? [00:30:47] we can merge that when we have the conch [00:30:54] | this | prepped, greg-g [00:31:06] (ready to pull code onto tin and sync) [00:31:22] Well, merge things first, then that [00:31:29] * greg-g nods [00:32:55] Jamesofur: seems it's a few rogue apaches [00:33:21] !log graceful mw1040 [00:33:26] that's good, do they generally just clean out? I know I've seen similar things for new wikis before [00:33:30] Logged the message, Master [00:33:35] that was one of them [00:33:39] ^d: gonna get what I can moving [00:33:42] AaronSchulz: although, thinking about it, we should benchmark the second unique index vs non-unique. the optimizer can use the uniqueness for choosing a query plan, but whether it's of more value that the overhead of maintaining uniqueness... [00:33:54] jamesofur: for some reason a random server did not get restarted [00:33:56] * springle needs more coffee [00:33:59] !log [Elasticsearch upgrade] Running puppet everywhere to make sure we have the newest config [00:34:06] Logged the message, Master [00:34:08] jamesofur: after normal graceful that also redirects [00:34:45] springle: what time is it on your side of the world? [00:35:00] mid-morning [00:35:08] ah, perfect time for coffee [00:35:15] but coffee is an all-day requirement ;) [00:35:17] heh [00:35:26] It's always 06:15 somewhere [00:35:26] I could use some, now that I think about it [00:35:44] * manybubbles goes to start brewing coffee [00:36:04] (03CR) 10Chad: [C: 032] Don't use += with $wgJobTypesExcludedFromDefaultQueue [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117354 (owner: 10Chad) [00:36:08] mutante: redirects to foundation wiki? [00:36:45] (03Merged) 10jenkins-bot: Don't use += with $wgJobTypesExcludedFromDefaultQueue [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117354 (owner: 10Chad) [00:36:46] jamesofur: no, to legalteam. i said redirect because i see it as a 301 to the https version [00:36:49] 301 Moved Permanently https://legalteam.wikimedia.org/wiki/ [00:37:09] springle: I doubt it would help considering that uniqueness means poor change buffer usage and all the write queries are either inserts or DELETE WHERE *_from = X [00:37:15] ahh, yeah.. but so far I've always going from there to https://wikimediafoundation.org/wiki/Main_Page (3 different servers so far) [00:37:36] !log demon synchronized wmf-config/jobqueue-eqiad.php 'Fixing $wgJobTypesExcludedFromDefaultQueue config' [00:37:44] Logged the message, Master [00:38:41] <^d> manybubbles: Slowing down now. [00:38:55] <^d> And stopped. 
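Chasing a "rogue apache" like mw1040 usually comes down to reloading its config and then hitting it directly with the right Host header; a sketch only, since the backend FQDN and the assumption that it answers plain HTTP are mine, not the log's:

    ssh mw1040.eqiad.wmnet 'sudo apache2ctl graceful'   # reload vhosts without dropping in-flight requests
    curl -sI -H 'Host: legalteam.wikimedia.org' http://mw1040.eqiad.wmnet/wiki/ | head -n 3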
[00:39:40] indeed [00:40:01] cool [00:40:16] we look dormant [00:40:24] I'm still getting one last puppet run in on the machine [00:40:26] machines [00:40:56] !log gracefull'ing rogue apaches, mw1070,mw1089,mw1104,mw1111 [00:40:57] <^d> I think the only person who noticed we failed back was Dan because he was still playing with that delete bug. [00:40:59] <^d> :) [00:41:03] ha [00:41:04] Logged the message, Master [00:41:14] I'll be back in a moment. one kid is sick [00:41:34] (03PS1) 10Cmjohnson: Adding dns entries for db1061-63 [operations/dns] - 10https://gerrit.wikimedia.org/r/117358 [00:42:49] <^d> Heh, even with all those wikis going back to lsearchd lsearchd is still bored. [00:43:40] !log gracefull'ing rogue apaches, mw1131,mw1189,mw1190,mw1215 [00:43:47] Logged the message, Master [00:43:52] kaldari, jgonera: Are you guys walking the updates to the datacenter over there? :P [00:43:58] AaronSchulz: probably correct, but difficult to predict for all traffic patterns. technically insert buffer can slow down reads on non-unique indexes if many unmerged records, but of course you'll say reads are easier to scale than writes :) [00:44:06] rdwrer, almost there [00:44:19] back [00:44:26] Reedy: jamesofur ^ that is more of that [00:44:28] !log [Elasticsearch upgrade] Elasticsearch is now quiescent [00:44:28] heh [00:44:36] Logged the message, Master [00:44:41] thanks much mutante [00:44:46] something must have gone wrong with apache-graceful-all [00:44:50] when that was added [00:45:17] mutante: as I find other servers do you want them? :) [00:45:41] jamesofur: we got the list with apache-fast-test pybal option that goes through all [00:45:43] rdwrer, we had to update both wmf16 and wmf17 and jenkins is taking it's time merging [00:45:52] jamesofur: so _should_ be all now, Reedy checked [00:45:53] wmf16 is done, waiting for 17 [00:46:14] Oh, good, I have that to look forward to then [00:46:44] * greg-g ponders a deployment-only jenkins [00:47:34] !log [Elasticsearch upgrade] Disabling puppet so it doesn't restart Elasticsearch while we're upgrading it [00:47:37] Smart plan [00:47:42] Logged the message, Master [00:47:58] (03PS3) 10Chad: Elasticsearch upgrade ending [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117251 [00:47:58] mutante: I just had it on mw1078 mw1066 and mw1173 (if the comment is to be believed) haven't yet gotten to the wiki :( [00:48:08] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (213572) [00:48:11] rdwrer: there's already some thoughts on a 'private' gerrit/jenkins for dealing with eg security patches [00:48:16] (03PS2) 10Cmjohnson: Adding dns entries for db1061-63 [operations/dns] - 10https://gerrit.wikimedia.org/r/117358 [00:48:16] Hm [00:48:18] yeah yeah yeah [00:48:19] Interesting [00:48:22] we're going to have lots of jobs [00:48:23] surprise surprise [00:48:30] we're just sucking them up while we do the upgrade [00:48:43] !log [Elasticsearch upgrade] Turning off shard reallocation so we don't thrash while Elasticsearch shuts down [00:48:51] Logged the message, Master [00:49:22] !log [Elasticsearch upgrade] Shutting down Elasticsearch [00:49:30] Logged the message, Master [00:49:55] rdwrer: we're all done [00:50:00] rdwrer, we're done, just need you to scap after yo deploy [00:50:02] Righto [00:50:07] * rdwrer does this thing [00:50:12] (03CR) 10MarkTraceur: [C: 032 V: 032] Enable ULS 'compact language links' Beta Feature on all normal wikis 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117344 (owner: 10Jforrester) [00:50:24] Two more and I'll start pulling [00:50:38] in a few minutes icinga will warn about elasticsearch being down.... [00:50:45] <^d> manybubbles: got your e-mail, ack. [00:50:52] cool [00:50:58] half elasticsearch nodes down [00:51:03] springle: https://en.wiktionary.org/wiki/Special:WhatLinksHere/Module:languages/data3/w?namespace=0 how did this not give me a gateway timeout? [00:51:05] (03Merged) 10jenkins-bot: Enable ULS 'compact language links' Beta Feature on all normal wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117344 (owner: 10Jforrester) [00:51:11] jamesofur: Reedy apache-fast-test legalteam.url mw1078 [00:51:11] By the power of Thor and Mjolnir, I beseech the gods to guide this lightening deploy. [00:51:20] that's been spinning for like 30min in FF ;) [00:51:24] jamesofur: Reedy * 301 Moved Permanently https://legalteam.wikimedia.org/wiki/ [00:51:44] (03PS1) 10Chad: Revert "Remove cirrus jobs from priority list" [operations/puppet] - 10https://gerrit.wikimedia.org/r/117360 [00:51:58] !log [Elasticsearch upgrade] Upgrading Elasticsearch [00:52:05] (03CR) 10Chad: [C: 04-1] "Not yet, just prepping." [operations/puppet] - 10https://gerrit.wikimedia.org/r/117360 (owner: 10Chad) [00:52:06] Logged the message, Master [00:52:08] PROBLEM - ElasticSearch health check on elastic1005 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.112 [00:52:08] PROBLEM - ElasticSearch health check on elastic1008 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.140 [00:52:08] PROBLEM - ElasticSearch health check on elastic1015 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.12 [00:52:12] jamesofur: so.. what we fixed is making them all the same state [00:52:13] !log mholmquist updated /a/common to {{Gerrit|Iad8c84a7d}}: Don't use += with $wgJobTypesExcludedFromDefaultQueue [00:52:18] PROBLEM - ElasticSearch health check on elastic1003 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.110 [00:52:19] PROBLEM - ElasticSearch health check on elastic1004 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.111 [00:52:19] PROBLEM - ElasticSearch health check on elastic1002 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.109 [00:52:19] PROBLEM - ElasticSearch health check on elastic1011 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.143 [00:52:19] PROBLEM - ElasticSearch health check on elastic1001 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.108 [00:52:19] PROBLEM - ElasticSearch health check on elastic1006 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.113 [00:52:19] PROBLEM - ElasticSearch health check on elastic1010 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.142 [00:52:20] PROBLEM - ElasticSearch health check on elastic1014 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.11 [00:52:20] PROBLEM - ElasticSearch health check on elastic1016 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.13 [00:52:21] PROBLEM - ElasticSearch health check on elastic1013 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.10 [00:52:21] Logged the message, Master [00:52:21] PROBLEM - ElasticSearch health check on elastic1012 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.144 [00:52:27] jamesofur: but i also never saw the wiki , so maybe that was reverted [00:52:28] PROBLEM - ElasticSearch health check on elastic1009 is CRITICAL: CRITICAL - 
Could not connect to server 10.64.32.141 [00:52:34] rdwrer I'm buying you a drink sometime for being the only one who follows the directions. [00:52:35] surprise! [00:52:43] Hehehe [00:53:08] PROBLEM - LVS HTTP IPv4 on search.svc.eqiad.wmnet is CRITICAL: Connection refused [00:53:25] mutante: Reedy yeah, they are certainly all doing the same thing but for me that means 'https://legalteam.wikimedia.org/' is showing wikimedia.org (though not actually redirecting) and /wiki is going to wikimediafoundation.org [00:54:28] uh, it isn't downloading the deb.... [00:54:35] 0% [1 elasticsearch 49.2 kB/18.5 MB 0%] [00:55:07] well... [00:55:07] there it goes [00:55:10] it was stuck..... [00:55:13] jamesofur: yea, uhm, i wasn't involved in adding the actual redirect, i just heard that something was reverted [00:55:23] Jamesofur: How about now? [00:55:32] \o/ [00:55:34] works [00:55:35] I purged some stuffs from apache/varnish [00:55:43] you've got shell, right? [00:55:47] I do yes [00:55:47] ah, varnish ? [00:55:53] AaronSchulz: there are a bunch of /* SpecialWhatLinksHere::showIndirectLinks Aaron Schulz */ queries logged all getting sniped at 5m, then reappear. would it retry endlessly? [00:55:53] I can create an account for myself [00:55:58] If pages show weirdly.. [00:55:59] echo "http://legalteam.wikimedia.org" | mwscript purgeList.php aawiki [00:56:07] Reedy: cool! works [00:56:10] nice [00:56:23] good old aawiki [00:56:23] try using the url in there (full path), it'll make sure it's purged [00:56:39] There's always a few that get caught up when we have a few servers playing up [00:57:19] needs logo [00:57:25] yup, I need to upload [00:57:30] :) [00:57:39] thank you very much both of you [00:57:40] hi [00:58:10] yw, i wonder how the "legal" logo will look like [00:58:25] !log Killing a duplicate Jenkins java process on gallium (init.d script sucks, I really need to get it fixed one day) [00:58:33] Logged the message, Master [00:58:36] With my choice? I can tell you, it'll be warm and fuzzy and look like a tiger ;) [00:58:40] we'll see if that lasts [00:58:43] rdwrer, let me know when scap is finished [00:59:28] jgonera: Still waiting on slowass^WJenkins [00:59:29] !log [Elasticsearch upgrade] Starting Elasticsearch [00:59:34] jamesofur: oooh, that makes sense, you also still have those mail redirects, his name in a bunch of spelling variants [00:59:37] Logged the message, Master [00:59:38] rdwrer: hah [00:59:46] rdwrer, no worries, I just need to test things when it's done [00:59:50] 'kay [00:59:53] !log [Elasticsearch upgrade] Verifying versions [01:00:00] Logged the message, Master [01:00:09] https://integration.wikimedia.org/zuul/ I'm waiting on 117350 [01:00:09] !log killed wrong jenkins process (+1 for 2am fix up). Restarting jenkins [01:00:12] mutante: heheh, when we get email at those it's great, usually someone takes lunch or free time and writes a response in character (with correct legal info and all) [01:00:15] Oh fuck [01:00:18] Logged the message, Master [01:00:36] surely handling a jenkins issue at 2am with X beers consumed is not the smartest idea [01:00:39] springle: I'm not refreshing on my end [01:00:39] our internal lca tools already use a picture of him [01:00:52] hashar: During the lightning deploy, while we're waiting for things to merge? maybe not [01:00:57] rdwrer: I killed Jenkins sorry. Will be back short [01:00:59] * rdwrer grumbles [01:00:59] greg-g: Do you have shell access? [01:01:02] !log [Elasticsearch upgrade] Wait for the cluster to recover. 
[01:01:06] meta-jenkins, go validate the status of the jenkis admin :) [01:01:06] rdwrer: sorry :/ Jenkins went wild btw [01:01:08] RECOVERY - LVS HTTP IPv4 on search.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 387 bytes in 0.002 second response time [01:01:09] Logged the message, Master [01:01:24] rdwrer: luckily it is restarting fast nowadays [01:01:27] Yeah [01:01:27] hashar: !:) hi, thanks, i think it already fixed itself after ^d restarted [01:01:40] bd808: yeah [01:01:57] hashar: Jenkins has been a slow jerk all day. ^d restarted it earlier [01:02:15] greg-g: on fluorine you can do this `tail -50f /a/mw-log/scap.log | python ~bd808/scaplog.py` [01:02:25] And watch scap happen irl [01:02:30] ^d: yeah so when restarting jenkins, it emits a kill to the jenkins process but for some interesting reason the java process never died and a SECOND Jenkins is started while the first slowly consume all the memory (ping mutante) [01:02:41] <^d> Blegh. [01:02:44] Jenkins piss me off [01:03:00] to be honest that is merely become I have Zero Java knowledge or I would fix i [01:03:00] !log [Elasticsearch upgrade] All primary shards have started. Waiting on secondary. [01:03:00] t [01:03:10] Logged the message, Master [01:03:26] !log Jenkins backup [01:03:30] !log Jenkins back up [01:03:31] !log [Elasticsearch upgrade] Reenabling puppet [01:03:33] Logged the message, Master [01:03:40] Logged the message, Master [01:03:45] rdwrer: sorry for the trouble during lightning deplot [01:03:47] Logged the message, Master [01:03:52] Welcome to the second hour of the lightning deploy, with your DJs greg-g and marktraceur... [01:04:02] ^d: https://test2.wikipedia.org/w/index.php?search=asdf&title=Special%3ASearch&go=Go&fulltext=1&srbackend=CirrusSearch worked [01:04:07] * hashar get yet another booze and listen to the DJs [01:04:11] And now a word from our sponsors: Greased Lightning [01:04:44] ^d: bd808: I have tricked some third parties in implementing a python job runner which should be able to replace most of Jenkins functionalities :-] [01:05:20] ^d: also good: https://commons.wikimedia.org/w/index.php?search=file:space&title=Special%3ASearch&go=Go&fulltext=1&srbackend=CirrusSearch [01:05:28] hashar: sweet. [01:05:46] FINALLY [01:05:48] Fuck [01:06:03] <^d> manybubbles: How many nodes are we on? [01:06:52] ^d: all six nodes are up. we have about 2000 shards left to start [01:06:58] but all primary shards have started [01:07:02] OK, scapping [01:07:03] mutante: a good thing is that the Icinga alarms for contint works properly :] [01:07:04] Like a boss [01:07:18] bd808: I thought I did, I haven't used it in a long time, and I'm getting permission denied :/ [01:07:21] LIGHTNING DEPLOY!!!!111!!!!!!ONE!!!!!ELEVEN!!!!12111 [01:07:21] mutante: thank you very much. You have saved my friday morning! [01:07:23] !log mholmquist Started scap: (no message) [01:07:32] Logged the message, Master [01:07:32] ...oh, I'm bad. [01:07:35] rdwrer: append BBQ !!111HOLY [01:07:36] yes.... [01:07:38] what a n00b [01:07:49] hashar: haha, nice how you always find the positive part, but yea, good to know you got mail [01:07:50] !log That scap was for ULS, VE, and MobileFrontend fixes and updates. [01:07:59] Logged the message, Master [01:07:59] hashar: cheers [01:08:00] ^d: I think it restores the primary shards at a higher priority [01:08:07] <^d> I want bbq now. 
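The !log steps above (turn off shard reallocation, shut down, upgrade, restart, wait for recovery) map onto Elasticsearch's cluster settings API roughly as below. Treat it as a sketch: the exact setting name depends on the version (0.90 used cluster.routing.allocation.disable_allocation, 1.x uses cluster.routing.allocation.enable), and elastic1001 is just an arbitrary node:

    ES=http://elastic1001:9200
    # 1. stop shard reallocation so the cluster doesn't thrash while nodes restart
    curl -s -XPUT "$ES/_cluster/settings" -d '{"transient":{"cluster.routing.allocation.enable":"none"}}'
    # 2. upgrade and restart elasticsearch on each node (via puppet/dsh)
    # 3. re-enable allocation, then watch recovery until the status goes green
    curl -s -XPUT "$ES/_cluster/settings" -d '{"transient":{"cluster.routing.allocation.enable":"all"}}'
    watch -n 5 "curl -s '$ES/_cluster/health?pretty'"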
[01:08:07] the last 2000 are going slower then the first 1000 [01:08:16] mutante: given our workload and all the crazy projects we have to handle, we better have to find the good parts or we would all shoot ourselves! :D [01:08:36] Hm... [01:09:05] Keep an eye on this one [01:10:06] springle: huh, << curl --head 'https://en.wiktionary.org/wiki/Special:WhatLinksHere/Module:languages/data3/w?namespace=0' >> gives me the expected 504 [01:10:19] (03PS1) 10Springle: ops ldap group too restrictive [operations/puppet] - 10https://gerrit.wikimedia.org/r/117361 [01:10:54] James_F: Did you set wmgULSCompactLinks anywhere? [01:11:08] It may be that I failed to gate properly on that patchset. [01:11:11] rdwrer: Yes… [01:11:20] James_F: Where? [01:11:36] Argh. [01:11:42] In InitialSettings.php. [01:11:46] Which didn't get into the commit. [01:11:48] (03CR) 10Springle: [C: 032] ops ldap group too restrictive [operations/puppet] - 10https://gerrit.wikimedia.org/r/117361 (owner: 10Springle) [01:11:51] * James_F sighs. [01:12:12] James_F: Excellent, well, at least we only need to sync one file [01:12:29] James_F: Do you want to fix it or should I? [01:12:46] ^d: seriously, it took ~3 minutes to restore all 1486 primary shards and it is taking that long for just 100 secondary ones.... [01:12:57] (03PS1) 10Jforrester: Follow-up: Icf0bef96306661 – missing file(!) from commit [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117363 [01:13:00] <^d> :\ [01:13:01] rdwrer: ^^^ [01:13:18] (03CR) 10MarkTraceur: [C: 032] Follow-up: Icf0bef96306661 – missing file(!) from commit [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117363 (owner: 10Jforrester) [01:13:25] Let's not leave things borked for too long, eh [01:13:26] (03Merged) 10jenkins-bot: Follow-up: Icf0bef96306661 – missing file(!) from commit [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117363 (owner: 10Jforrester) [01:13:29] rdwrer: Sorry, was moving too quickly and didn't check. :-( [01:13:37] whatever, we can wait a few minutes. I think it might think that it is back in service so it doesn't want to swamp the disk [01:13:38] James_F: Ditto [01:13:46] No errors yet knockonwood [01:13:48] PROBLEM - Disk space on snapshot1003 is CRITICAL: DISK CRITICAL - free space: / 563 MB (2% inode=69%): [01:14:08] PROBLEM - Disk space on snapshot1001 is CRITICAL: DISK CRITICAL - free space: / 281 MB (1% inode=69%): [01:14:43] rdwrer: You weren't watching closely then :) snapshot1004 is out of disk space still [01:15:07] Errors in the scap, but you told me not to worry about those [01:15:12] ^d: 1500 [01:15:14] No errors on the site [01:15:14] no* errors* [01:15:53] (03CR) 10Cmjohnson: [C: 032] Adding dns entries for db1061-63 [operations/dns] - 10https://gerrit.wikimedia.org/r/117358 (owner: 10Cmjohnson) [01:16:16] jgonera: The sync to apaches is done, waiting on cdbs [01:16:28] rdwrer: *blushes* Thanks for the save. [01:16:41] James_F: It's fine, that's how LDs go :) [01:16:47] ^d: might want to try some stuff with the url parameter [01:16:53] rdwrer: (Clearly this was just part of my plan to rise up the total-commits-to-`mediawiki-config` table :-)) [01:17:03] Hah, obvs [01:17:06] yoyo manybubbles, checking in [01:17:08] PROBLEM - Disk space on snapshot1001 is CRITICAL: DISK CRITICAL - free space: / 1048 MB (3% inode=69%): [01:17:10] how goes it? [01:17:20] ottomata: everything is wonderful [01:17:34] yeah!? yeah! 
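A check that would have caught the file missing from the ULS change before it was synced: look at what the merged commit actually touched once it lands in the config checkout. /a/common is the path from the !log lines above; the rest is plain git:

    cd /a/common
    git log -1 --stat                    # files changed by the commit just merged
    git diff --name-only HEAD@{1} HEAD   # everything picked up by the last update, per the reflog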
[01:17:34] we're coming back online [01:17:36] awesome [01:17:55] it'll take another 20 minutes for it to be fully full of redundancy but we're pretty good so far as I can tell [01:18:00] * greg-g has to run [01:18:12] greg-g: We should be fine, just one more sync-file to run [01:18:18] greg-g: Have a nice night [01:18:19] !log mholmquist Finished scap: (no message) (duration: 10m 55s) [01:18:24] !log mholmquist updated /a/common to {{Gerrit|I48e98d28f}}: Follow-up: Icf0bef96306661 – missing file(!) from commit [01:18:32] Logged the message, Master [01:18:38] Logged the message, Master [01:18:52] (03CR) 10Dzahn: "auth_name was still Ops" [operations/puppet] - 10https://gerrit.wikimedia.org/r/117361 (owner: 10Springle) [01:19:14] !log mholmquist synchronized wmf-config/InitialiseSettings.php 'Fix James_F's commit, follow-up, should gate ULS beta feature' [01:19:21] Logged the message, Master [01:19:26] jgonera: OK everything should be set now [01:19:40] James_F: Want to make sure things are working for you too? [01:20:08] rdwrer: Hmm. Damn it. [01:20:11] Damn it? [01:20:32] I don't like damn it [01:20:47] rdwrer: I didn't hide the BF hook in the commit, just the BF action. [01:20:56] Ah. [01:21:03] rdwrer: So it'll be exposed but not work on loginwiki and votewiki. [01:21:11] |log heading bed, have a nice evening folks. Jenkins seems happy right now. [01:21:16] *waves* [01:21:19] James_F: As mistakes go that seems pretty much OK [01:21:22] rdwrer: Oh well, it's not too bad. Not worth fixing in the LD. [01:21:30] Fix it in another commit, it'll go out next week [01:21:30] rdwrer: Will do a follow-up commit to ULS. [01:21:33] Yeah. [01:21:36] Follow-follow-up [01:22:12] thanks rdwrer [01:22:28] Let's call this the end of the LD from hell [01:22:45] Yeah. [01:23:05] rdwrer: Wanna do the 1.23wmf18 deploy next week :) [01:23:13] It's super fun [01:23:25] bd808: Hm, I think I'm going to be sick that day [01:23:50] Maybe someday [01:24:00] heh. It's actually not that bad but lots of little steps [01:24:13] rdwrer: Argh, no. [01:24:15] greg-g, I forgot we still have one bug and I need to touch and sync MobileFrontend [01:24:19] rdwrer: I'm an actual moron. [01:24:20] can I do that now? [01:24:27] night hashar [01:24:29] rdwrer: Forgotten global statement. [01:25:22] jgonera: greg-g stepped out. If you've got enough folks around to help in case of breakage and rdwrer and James_F are really done… be bold [01:25:24] James_F: Where's that? [01:25:30] * bd808 has to split [01:25:45] jgonera: Go nuts, I'll keep an eye on you [01:26:02] bd808, I'm literally only going to do find -type f -exec touch {} \; [01:26:10] and sync [01:26:24] Oh, in ULS. [01:26:28] Shit, I hate submodules so much [01:26:32] rdwrer: Yeah. [01:26:38] James_F: Fixing? [01:26:43] Yeah. [01:27:08] PROBLEM - Disk space on snapshot1001 is CRITICAL: DISK CRITICAL - free space: / 1058 MB (3% inode=69%): [01:27:17] This is what happens when you have someone not used to PHP reviewing, I guess... [01:27:37] Now, one thing I don't get is why Jenkins merged your initial confgig patch that had undefined errors all over [01:27:40] rdwrer: https://gerrit.wikimedia.org/r/117365 [01:28:02] rdwrer: Indeed. Maybe. [01:28:19] <^d> manybubbles: Seems to be working for me. [01:29:33] James_F: I'll take it now. 
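The "forgotten global statement" mentioned above is a classic PHP gating bug: inside a hook function, a $wg.../$wmg... variable is a fresh local unless it is declared global, so the gate quietly evaluates to false. A toy reproduction; the variable name is borrowed from the discussion, the functions are invented:

    php -r '
    $wmgULSCompactLinks = true;                        # stands in for the config global
    function gateBroken() {
        return isset( $wmgULSCompactLinks ) && $wmgULSCompactLinks;   # new local, always false
    }
    function gateFixed() {
        global $wmgULSCompactLinks;                    # bind the real global
        return (bool)$wmgULSCompactLinks;
    }
    var_dump( gateBroken(), gateFixed() );             # bool(false), bool(true)
    '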
[01:29:40] !log jgonera synchronized php-1.23wmf16/extensions/MobileFrontend/ 'Touch MobileFrontend.i18n.php to update RL cache' [01:29:48] Logged the message, Master [01:29:53] !log [Elasticsearch upgrade] temporarily raising recovery speed [01:30:00] Logged the message, Master [01:30:41] lets see if that does anything [01:32:13] jgonera: Whenever you're done I'd like to push the Actually Last patch to ULS [01:32:21] Luckily it's just a sync-file [01:32:43] there we go [01:32:49] ^demon|away: you can see the bump: http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=es_query_time&s=by+name&c=Elasticsearch+cluster+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=2&z=small&hc=4 [01:32:59] apparently chaning the recovery throttling takes some time [01:33:13] maybe it sets the throttle for the duration of the index recovering [01:33:14] who knows [01:33:28] <^demon|away> This is why we don't like full cluster restarts :\ [01:33:52] rdwrer, done [01:34:21] 'kay [01:34:24] well, that and the outage [01:34:47] (03PS1) 10Springle: use correct auth_name [operations/puppet] - 10https://gerrit.wikimedia.org/r/117369 [01:34:58] <^demon|away> k, running to the store to pick up a few dinner things. [01:35:02] <^demon|away> should be back before you miss me. [01:36:35] (03CR) 10Springle: [C: 032] use correct auth_name [operations/puppet] - 10https://gerrit.wikimedia.org/r/117369 (owner: 10Springle) [01:39:41] Syncing [01:39:42] !log mholmquist synchronized php-1.23wmf17/extensions/UniversalLanguageSelector/UniversalLanguageSelector.hooks.php 'Actually gate the beta feature for ULS' [01:39:49] Done [01:39:51] Logged the message, Master [01:39:53] Actually The End [01:40:11] Fin [01:40:32] I think I'm going to go for the big Asahi now. [01:48:12] ^demon|away: I'm pretty sure it doesn't bother recovering the secondaries from the disk they are on.... as soon as it got to secondaries the network spiked [01:48:30] I've temporarily raised the throttles so it is going faster [01:48:37] I wonder why it doesn't just recover from the disk though [01:48:45] probably because it can't be sure nothing has changed [01:52:20] (03PS1) 10Manybubbles: Revert "Remove cirrus jobs from priority list" [operations/puppet] - 10https://gerrit.wikimedia.org/r/117372 [01:52:45] (03CR) 10Manybubbles: [C: 04-1] "Merge when we're sure cirrus is working." [operations/puppet] - 10https://gerrit.wikimedia.org/r/117372 (owner: 10Manybubbles) [01:52:59] ^demon|away: 2009 [01:53:00] 209 [01:53:16] 180 [01:56:08] 75 [01:56:24] (03Abandoned) 10Chad: Revert "Remove cirrus jobs from priority list" [operations/puppet] - 10https://gerrit.wikimedia.org/r/117360 (owner: 10Chad) [01:58:06] (03Abandoned) 10Manybubbles: Revert "Remove cirrus jobs from priority list" [operations/puppet] - 10https://gerrit.wikimedia.org/r/117372 (owner: 10Manybubbles) [01:58:47] ... [01:58:52] you do it [01:59:57] <^d> lol. [02:00:00] <^d> we both abandoned. [02:00:04] <^d> our dupes. [02:00:16] race condition [02:00:26] you guys need zookeeper [02:00:27] (03Restored) 10Chad: Revert "Remove cirrus jobs from priority list" [operations/puppet] - 10https://gerrit.wikimedia.org/r/117360 (owner: 10Chad) [02:00:57] <^d> We need edit conflict handling. 
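"Temporarily raising recovery speed" presumably means bumping the transient recovery throttles through the cluster settings API; something along these lines, with the values being guesses rather than what was actually used:

    curl -s -XPUT http://elastic1001:9200/_cluster/settings -d '
    {"transient":{
        "indices.recovery.max_bytes_per_sec": "200mb",
        "cluster.routing.allocation.node_concurrent_recoveries": 6
    }}'
    # the same call with smaller values restores a "more sane" speed afterwards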
[02:02:49] ^d: ^^^^ [02:02:49] 36 [02:04:17] !log [Elasticsearch upgrade] restoring more sane recovery speed [02:04:26] Logged the message, Master [02:04:41] ^d: we both abandoned out puppet reverts [02:05:08] <^d> manybubbles: I know, so I restored mine :) [02:05:18] I see [02:05:44] seven shards left to restore [02:05:51] we have at least two copies of everything ready [02:06:09] and none of the shards we're restoring we are the primary backend for [02:06:13] I say we turn it back on [02:06:17] maybe just the search first? [02:06:55] (03CR) 10Manybubbles: [C: 031] Elasticsearch upgrade ending [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117251 (owner: 10Chad) [02:07:05] we're probably ok just turning it all on [02:07:54] all shards are back [02:07:58] we're green again [02:08:05] so icinga should really recover... [02:08:08] RECOVERY - ElasticSearch health check on elastic1008 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1486: active_shards: 4397: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [02:08:08] RECOVERY - ElasticSearch health check on elastic1005 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1486: active_shards: 4397: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [02:08:08] RECOVERY - ElasticSearch health check on elastic1015 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1486: active_shards: 4397: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [02:08:09] (03PS4) 10Chad: Elasticsearch upgrade ending [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117251 [02:08:12] there we go [02:08:19] RECOVERY - ElasticSearch health check on elastic1003 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1486: active_shards: 4397: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [02:08:19] RECOVERY - ElasticSearch health check on elastic1001 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1486: active_shards: 4397: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [02:08:20] RECOVERY - ElasticSearch health check on elastic1011 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1486: active_shards: 4397: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [02:08:20] RECOVERY - ElasticSearch health check on elastic1002 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1486: active_shards: 4397: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [02:08:20] RECOVERY - ElasticSearch health check on elastic1006 is OK: OK - elasticsearch (production-search-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1486: active_shards: 4397: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [02:08:20] RECOVERY - ElasticSearch health check on elastic1016 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1486: active_shards: 4397: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [02:08:20] RECOVERY - ElasticSearch health check on elastic1010 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1486: active_shards: 4397: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [02:08:20] RECOVERY - ElasticSearch health check on elastic1014 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1486: active_shards: 4397: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [02:08:21] RECOVERY - ElasticSearch health check on elastic1013 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1486: active_shards: 4397: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [02:08:21] RECOVERY - ElasticSearch health check on elastic1004 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1486: active_shards: 4397: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [02:08:22] RECOVERY - ElasticSearch health check on elastic1012 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1486: active_shards: 4397: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [02:08:26] I hate that thing [02:08:28] RECOVERY - ElasticSearch health check on elastic1009 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1486: active_shards: 4397: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [02:08:32] so dense [02:08:32] (03CR) 10Chad: [C: 032] Elasticsearch upgrade ending [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117251 (owner: 10Chad) [02:08:39] (03Merged) 10jenkins-bot: Elasticsearch upgrade ending [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117251 (owner: 10Chad) [02:09:27] !log demon synchronized wmf-config/jobqueue-eqiad.php 'Turn Cirrus jobs back on' [02:09:36] Logged the message, Master [02:09:38] jobs completing "good" [02:09:42] and no more errors then before [02:10:16] !log demon synchronized wmf-config/InitialiseSettings.php 'Turn Cirrus back on for all wikis how it was before' [02:10:30] Logged the message, Master [02:10:43] we're 200,000 low priority updates behind on commons [02:11:02] 11,000 high priority behind on enwiki [02:11:20] but we're going down [02:12:22] <^d> We'll catch up [02:12:32] yup [02:12:35] I'm not worried [02:12:43] it might take some time on the lower priority jobs [02:12:47] but that isn't a big deal [02:12:51] because they are low priority! 
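The backlog mentioned above (~200,000 low-priority updates behind on commons, ~11,000 high-priority behind on enwiki) can be watched draining with MediaWiki's standard showJobs.php maintenance script; a small sketch assuming the usual mwscript wrapper:

    for wiki in commonswiki enwiki; do
        echo "== $wiki =="
        mwscript showJobs.php --wiki=$wiki --group | grep -i cirrus
    done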
[02:13:24] in an hour we'll be caught up on the high priority ones I think [02:14:50] !log [Elasticsearch upgrade] done. we'll take a while to catch up on jobs that piled up during the upgrade, but we'll get them in time. [02:14:58] Logged the message, Master [02:21:46] !log LocalisationUpdate completed (1.23wmf16) at 2014-03-07 02:21:46+00:00 [02:21:53] Logged the message, Master [02:23:19] (03PS2) 10coren: Labs: Disable FSC on all NFS mounts [operations/puppet] - 10https://gerrit.wikimedia.org/r/117347 [02:35:08] PROBLEM - Disk space on snapshot1001 is CRITICAL: DISK CRITICAL - free space: / 1064 MB (3% inode=69%): [02:36:58] (03PS1) 10Ori.livneh: Emit GeoIP cookie using dedicated Set-Cookie header [operations/puppet] - 10https://gerrit.wikimedia.org/r/117375 [02:37:48] (03CR) 10coren: [C: 032] "Bleh. I wish this would have worked without." [operations/puppet] - 10https://gerrit.wikimedia.org/r/117347 (owner: 10coren) [02:43:40] !log LocalisationUpdate completed (1.23wmf17) at 2014-03-07 02:43:40+00:00 [02:43:48] Logged the message, Master [02:48:10] (03PS1) 10MarkTraceur: WIP Add MMV feature flags for beta and pilot sites [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117376 [02:48:33] (03CR) 10MarkTraceur: [C: 04-2] "Needs list of wikis before merging." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117376 (owner: 10MarkTraceur) [03:13:59] (03CR) 10Manybubbles: [C: 031] "Now would be ok." [operations/puppet] - 10https://gerrit.wikimedia.org/r/117360 (owner: 10Chad) [03:14:12] (03CR) 10Manybubbles: "But it can wait until the morning too." [operations/puppet] - 10https://gerrit.wikimedia.org/r/117360 (owner: 10Chad) [03:24:54] (03PS1) 10Jalexander: Add logo for legalteamwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117377 [03:31:43] !log LocalisationUpdate ResourceLoader cache refresh completed at 2014-03-07 03:31:43+00:00 [03:31:52] Logged the message, Master [03:35:36] (03CR) 10Tobias Gritschacher: [C: 031] "any objectives against this?" [operations/apache-config] - 10https://gerrit.wikimedia.org/r/113972 (owner: 10Thiemo Mättig (WMDE)) [04:37:08] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [04:41:15] (03CR) 10Chad: [C: 031] Revert "Remove cirrus jobs from priority list" [operations/puppet] - 10https://gerrit.wikimedia.org/r/117360 (owner: 10Chad) [06:08:43] (03CR) 10Nemo bis: "AFAIK there is none, which is why you/we added the global job queue check to hume (I4b67f60a). 
This could be moved to eqiad when noc.wikim" [operations/puppet] - 10https://gerrit.wikimedia.org/r/117250 (owner: 10Nemo bis) [06:47:11] (03CR) 10Nemo bis: "This is a followup of the ULS change Ia268c3a49b5aa14b6a00e33c7f01a61eba48e776" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117344 (owner: 10Jforrester) [08:57:19] PROBLEM - MySQL Replication Heartbeat on db1042 is CRITICAL: CRIT replication delay 316 seconds [08:57:28] PROBLEM - MySQL Slave Delay on db1042 is CRITICAL: CRIT replication delay 325 seconds [09:10:28] RECOVERY - MySQL Slave Delay on db1042 is OK: OK replication delay 0 seconds [09:11:20] RECOVERY - MySQL Replication Heartbeat on db1042 is OK: OK replication delay -0 seconds [12:12:31] (03CR) 10Alexandros Kosiaris: [C: 032] Removed notes_url from nagios host extra info [operations/puppet] - 10https://gerrit.wikimedia.org/r/112441 (owner: 10Alexandros Kosiaris) [13:20:54] (03CR) 10Mark Bergsma: [C: 032] Allocate /27 for public1-d-eqiad [operations/dns] - 10https://gerrit.wikimedia.org/r/114738 (owner: 10Mark Bergsma) [13:23:14] akosiaris: seems like reverse DNS for row D wasn't done yet? [13:29:38] who could review https://gerrit.wikimedia.org/r/113755 ? ("Make labs' sql command work with -v and remove cruft") [13:31:15] (03PS1) 10Mark Bergsma: Add reverse DNS for row D subnets [operations/dns] - 10https://gerrit.wikimedia.org/r/117409 [13:33:25] mark: yes. We had not even thought about it tbh [13:33:39] if you could do a quick review of that change ^ [13:33:45] i'm about to add more records for lvs [13:33:56] we should probably store that todo list for next time :) [13:33:59] as a check list [13:36:35] PROBLEM - Host mw31 is DOWN: PING CRITICAL - Packet loss = 100% [13:37:15] RECOVERY - Host mw31 is UP: PING OK - Packet loss = 0%, RTA = 35.58 ms [13:39:16] PROBLEM - Apache HTTP on mw31 is CRITICAL: Connection refused [13:39:47] (03PS1) 10coren: Tool Labs: package addition to exec environ [operations/puppet] - 10https://gerrit.wikimedia.org/r/117410 [13:41:19] (03CR) 10coren: [C: 032] Tool Labs: package addition to exec environ [operations/puppet] - 10https://gerrit.wikimedia.org/r/117410 (owner: 10coren) [13:41:25] (03CR) 10Alexandros Kosiaris: [C: 04-1] Add reverse DNS for row D subnets (031 comment) [operations/dns] - 10https://gerrit.wikimedia.org/r/117409 (owner: 10Mark Bergsma) [13:41:52] ah yep, will correct that [13:47:01] (03PS1) 10coren: Tool Labs: package tweak [operations/puppet] - 10https://gerrit.wikimedia.org/r/117411 [13:50:26] (03CR) 10coren: [C: 032] Tool Labs: package tweak [operations/puppet] - 10https://gerrit.wikimedia.org/r/117411 (owner: 10coren) [13:51:15] PROBLEM - NTP on mw31 is CRITICAL: NTP CRITICAL: Offset unknown [13:55:44] (03CR) 10Alexandros Kosiaris: [C: 04-2] "We merged the commit mentioned above, which makes this obsolete." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/112019 (owner: 10Matanya) [13:56:15] RECOVERY - NTP on mw31 is OK: NTP OK: Offset 0.003237247467 secs [13:59:21] (03PS2) 10Mark Bergsma: Add reverse DNS for row D subnets [operations/dns] - 10https://gerrit.wikimedia.org/r/117409 [13:59:23] (03PS1) 10Mark Bergsma: Add LVS subnet IPs for row D [operations/dns] - 10https://gerrit.wikimedia.org/r/117412 [14:04:05] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [14:04:07] (03PS3) 10Mark Bergsma: Add reverse DNS for row D subnets [operations/dns] - 10https://gerrit.wikimedia.org/r/117409 [14:04:09] (03PS2) 10Mark Bergsma: Add LVS subnet IPs for row D [operations/dns] - 10https://gerrit.wikimedia.org/r/117412 [14:05:25] (03PS1) 10Mark Bergsma: Setup eqiad LVS servers for row D real servers [operations/puppet] - 10https://gerrit.wikimedia.org/r/117413 [14:41:19] (03PS1) 10QChris: Make wikimetrics' database setup depend on alembic [operations/puppet] - 10https://gerrit.wikimedia.org/r/117416 [14:50:11] If lab instance is down, whom I can poke here? [14:50:30] kart_: try #wikimetrics-labs [14:50:33] sorry [14:50:33] ha [14:50:34] kart_: for wikimedia labs, go to #wikimedia-labs [14:50:35] i mean [14:50:36] yeah [14:50:38] what hoo said [14:50:46] i have wikimetrics on the brain :p [14:50:55] :) [14:51:30] Thank you :) [15:01:05] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [15:18:26] (03CR) 10Alexandros Kosiaris: "Seems like this will make manutius unhappy. All the /etc/ganglia/aggregators/*.conf files have their locations changed to eqiad from pmtpa" [operations/puppet] - 10https://gerrit.wikimedia.org/r/112889 (owner: 10Matanya) [15:32:05] (03PS1) 10Alexandros Kosiaris: Rename rt $site variable to $website [operations/puppet] - 10https://gerrit.wikimedia.org/r/117418 [15:34:06] (03CR) 10Alexandros Kosiaris: "Daniel, Andrew, I added you because git blame on that file named you. I am was wrong in doing that, sorry." [operations/puppet] - 10https://gerrit.wikimedia.org/r/117418 (owner: 10Alexandros Kosiaris) [15:36:04] (03CR) 10Dzahn: "hey Alex, Andrew, also see Change-Id: Ib0e7476a612b25f37905f9f475521606d3bb73d7 where I'm converting RT to module and I was already doing" [operations/puppet] - 10https://gerrit.wikimedia.org/r/117418 (owner: 10Alexandros Kosiaris) [15:37:21] mutante: damn... Ok I will abandon that then. You are already solving it better [15:38:29] akosiaris: well, unless it's urgent and the module takes too long to get in there, then yes:) [15:38:36] (03Abandoned) 10Alexandros Kosiaris: Rename rt $site variable to $website [operations/puppet] - 10https://gerrit.wikimedia.org/r/117418 (owner: 10Alexandros Kosiaris) [15:39:08] if you wanna review that one though...:) [15:40:29] you still have a WIP in the topic, which is why I have skipped it. If you want a review, a review it is :-) [15:41:38] fair, yea [15:42:10] well, it already went through initial matanya check:) [15:47:47] mutante: define ganglia_new::monitor::aggregator::instance($site) [15:47:49] argh [15:47:53] the same case here :( [15:48:24] akosiaris: and in my pending change for planet->module i used manifests/site.pp, i should change that too i suppose [15:48:32] even if it works, just for style [15:48:51] and of course matanya's 112889 change would have broken things up [15:49:05] meh.. I hate puppet scoping :( [15:49:14] mutante: gerrit change ? 
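The scoping headache here is that a define parameter called $site shadows the top-scope $::site, so a file mixing the two can silently pick up the wrong value. A quick audit of the repo for ambiguous uses, a variant of the git grep suggested a few lines further down:

    cd operations/puppet
    git grep -nE '\$(::)?site\b' -- modules/ganglia_new modules/planet manifests | head -n 40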
[15:49:27] ooh, i see, that's why you started fixing that [15:49:52] akosiaris: https://gerrit.wikimedia.org/r/#/c/108674/ [15:50:02] yeap. I am slowly uncovering various issues here and there thanks to matanya's changes [15:50:06] it's just the file name in that case, but i figured site.pp isn't a good name [15:50:15] define planet::site { [15:50:24] $sites_directory = '/etc/apache2/sites-available' 4 [15:50:37] that becomes site.pp in module structure then, per convention [15:50:52] well it is way less worse that variables [15:51:14] i don't think it is that bad [15:51:38] if you feel like renaming it, do it, but I don't think it is needed [15:52:20] git grep -E '\$(::)?site[[:space:]]*=' works wonders to avoid those issues [15:52:59] ok. cool [15:53:18] hehe that file is even worse. It actually references both $site and $::site [15:53:39] ganglia? [15:53:41] yeah [15:53:54] hmm, there was also ganglia_new , right [15:54:06] modules/ganglia_new/manifests/monitor/aggregator/instance.pp [15:54:07] what is even the status of ganglia_(notnew) then [15:54:18] oh, ok [15:54:38] but you touched a wound with ganglia{,_new} i think [15:55:10] I hope jgage is going to help solve all this. [15:58:44] i was thinking maybe ganglia_new should just become ganglia? [15:58:48] but dunno yet [15:58:56] yep:) [15:59:10] as if it is going to be that easy [15:59:19] but yeah that is the goal [16:04:12] removes the "just" part [16:08:10] "title": "decommissioned_eqiad top level scope is: pmtpa and lower level scope is: eqiad" [16:08:15] yey! [16:18:50] (03CR) 10Hashar: [C: 031] "Most probably fine :-] The fatal.log is world readable so the script will be able to process it even when run as user 'nobody'." [operations/puppet] - 10https://gerrit.wikimedia.org/r/116146 (owner: 10Cmcmahon) [16:23:38] (03CR) 10Cmcmahon: [C: 031] "would like to have Ops merge this whenever convenient" [operations/puppet] - 10https://gerrit.wikimedia.org/r/116146 (owner: 10Cmcmahon) [16:27:27] hashar: changes on modules/beta can't influence prod, right [16:32:24] mutante: usually :] [16:33:09] mutante: that really depends on the part being changes. Some classes are only used on beta so they are prod safe [16:33:31] mutante: some other are configuration tweaks to class used in both labs and prod (for example the varnish conf) [16:33:54] hashar: modules/beta/files/monitor_fatals.rb [16:34:01] yeah that one is fine [16:34:02] i would have merged that , but path conflict now..sigh [16:34:03] rebasing it [16:34:06] tried :p [16:34:35] (03CR) 10Alexandros Kosiaris: [C: 04-1] "So this is going to indeed create a problem. The initial version referenced the top scope variable called $::site. 
However the define usin" [operations/puppet] - 10https://gerrit.wikimedia.org/r/112889 (owner: 10Matanya) [16:36:41] (03PS5) 10Hashar: update to read the fatals log and report problem extensions [operations/puppet] - 10https://gerrit.wikimedia.org/r/116146 (owner: 10Cmcmahon) [16:38:33] mutante: you can get https://gerrit.wikimedia.org/r/116146 in :] [16:39:30] (03CR) 10Dzahn: [C: 032] update to read the fatals log and report problem extensions [operations/puppet] - 10https://gerrit.wikimedia.org/r/116146 (owner: 10Cmcmahon) [16:39:55] done [16:41:18] (03PS1) 10Hashar: beta: run fatalmonitor only on pmtpa [operations/puppet] - 10https://gerrit.wikimedia.org/r/117425 [16:41:32] mutante: and that one as well while you are at it [16:41:37] only on ptmpa or NOT on pmtpa:) [16:42:00] only on pmtpa [16:42:03] the current cluster [16:42:10] I am rebuilding beta from scratch on eqiad [16:42:18] i see the FIXME, ok [16:42:40] thx :) [16:42:41] (03CR) 10Dzahn: [C: 032] beta: run fatalmonitor only on pmtpa [operations/puppet] - 10https://gerrit.wikimedia.org/r/117425 (owner: 10Hashar) [16:45:37] (03CR) 10Alexandros Kosiaris: [C: 032] solr: remove lookupvar and replace with top scope @ var [operations/puppet] - 10https://gerrit.wikimedia.org/r/112892 (owner: 10Matanya) [17:22:59] (03CR) 10Alexandros Kosiaris: [C: 032] bots: minor lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/111181 (owner: 10Matanya) [17:33:57] <^d> ottomata: Yo, can we get https://gerrit.wikimedia.org/r/#/c/117360/ merged now? [17:34:05] <^d> Upgrade's long over, prioritizing jobs would be nice :) [17:37:33] (03CR) 10Dzahn: [C: 031] Remove pmtpa apaches from site.pp,dhcpd [operations/puppet] - 10https://gerrit.wikimedia.org/r/117206 (owner: 10Reedy) [17:38:18] (03CR) 10Dzahn: "path conflict" [operations/puppet] - 10https://gerrit.wikimedia.org/r/117206 (owner: 10Reedy) [17:41:41] oh yes [17:41:57] (03PS2) 10Chad: Revert "Remove cirrus jobs from priority list" [operations/puppet] - 10https://gerrit.wikimedia.org/r/117360 [17:42:02] (03CR) 10Ottomata: [C: 032 V: 032] Revert "Remove cirrus jobs from priority list" [operations/puppet] - 10https://gerrit.wikimedia.org/r/117360 (owner: 10Chad) [17:42:19] ^d done [17:42:27] <^d> Thanks! [17:43:56] (03PS2) 10Dzahn: Remove pmtpa apaches from site.pp,dhcpd [operations/puppet] - 10https://gerrit.wikimedia.org/r/117206 (owner: 10Reedy) [17:44:15] PROBLEM - Squid on brewster is CRITICAL: Connection timed out [17:47:05] RECOVERY - Squid on brewster is OK: TCP OK - 0.035 second response time on port 8080 [17:51:15] PROBLEM - Squid on brewster is CRITICAL: Connection timed out [17:51:31] squid on brewster? eh [17:53:05] RECOVERY - Squid on brewster is OK: TCP OK - 0.035 second response time on port 8080 [17:53:12] !log restarted squid on brewster [17:53:20] Logged the message, Master [18:00:12] so guys, I've removed all the srv* and mw* entries from pybal for the following pmtpa groups: apache api rendering I'm about to [18:00:33] save these changes, it's possible that something will whine or page that has been overlooked [18:00:49] if there's any issues, we can put the files with full content back right away [18:01:02] so... 3... 2... 1... [18:01:34] and saved, let's see what explodes [18:04:25] PROBLEM - LVS HTTP IPv4 on appservers.svc.pmtpa.wmnet is CRITICAL: Connection refused [18:04:29] heh [18:04:35] im gettin paged [18:04:35] PROBLEM - LVS HTTP IPv4 on api.svc.pmtpa.wmnet is CRITICAL: Connection refused [18:04:43] we didnt turn that off before turning off the servers? 
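One ordering that would have avoided the pmtpa LVS pages here: tell icinga to stop notifying for those services before emptying the pybal pools. A hedged sketch using Icinga's external command interface, run on the icinga host; the command-file path is the Debian default and may differ on neon, while the host and service names are the ones that page in this log:

    now=$(date +%s)
    for svc_host in appservers.svc.pmtpa.wmnet api.svc.pmtpa.wmnet rendering.svc.pmtpa.wmnet; do
        printf '[%s] DISABLE_SVC_NOTIFICATIONS;%s;LVS HTTP IPv4\n' "$now" "$svc_host"
    done > /var/lib/icinga/rw/icinga.cmd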
[18:04:53] the servers are on [18:04:59] (i think that wakes up some folks, just fyi ;) [18:05:04] grrrr [18:05:15] PROBLEM - LVS HTTP IPv4 on rendering.svc.pmtpa.wmnet is CRITICAL: Connection refused [18:05:22] mutante: shall I put em back? I thought they'd be out of monitoring [18:05:23] actually, i dunno what paging sean is on [18:05:33] apergos: i can also fix the icinga [18:05:36] if its just sending pages to eu folks who cares [18:05:39] hold on [18:05:42] (its not ideal but not waking them up) [18:05:46] if you can do it fast, go to town [18:05:57] i just know both mark and faidon do 24/7 paging, but i now realize if apergos is up, so are they heh [18:06:09] so the pages prolly not a big deal, disregard my comment =] [18:06:17] well I hope they don't have to run into the channel for something that's not actually broken [18:06:29] they should be able to read its api in tampa and disregard [18:06:52] good point [18:06:54] ACKNOWLEDGEMENT - LVS HTTP IPv4 on appservers.svc.pmtpa.wmnet is CRITICAL: Connection refused daniel_zahn thats ok. we are shutting down Tampa. disabling notifications [18:06:56] neither of us are on 24/7 paging [18:07:08] oh, i thought mark was [18:08:02] (03PS1) 10Odder: Add Laura Hale Sochi blog to English Planet [operations/puppet] - 10https://gerrit.wikimedia.org/r/117434 [18:08:20] fixed [18:08:27] disabled notifications for the other 2 [18:08:37] api rendering and appservers all hushed? [18:08:42] sweet [18:09:05] so yea, ariel brandon and andrew o get 24/7 [18:09:09] no one else is anymore, heh [18:09:17] (except some devs for varios things, and tim is also 24/7) [18:09:40] RobH, except I seem to only get emails for analytics related alerts…. [18:09:44] haven't looked into why [18:09:48] tim doesn't get pages [18:09:59] apergos: yes, thanks for doing the pybal part!:) [18:09:59] ottomata: Ahh, your contact group isnt right, lets take a look [18:10:10] sure! [18:12:17] (03PS1) 10RobH: adds andrew otto to ops paging group [operations/puppet] - 10https://gerrit.wikimedia.org/r/117436 [18:12:43] ottomata: ^ so that will add you to the normal ops paging group [18:13:06] changing your paging hours is done via the private repo [18:13:16] OH sms? [18:13:18] do I want SMS? [18:13:26] i don't think it has my number a anyway? [18:13:30] well, yer in the ops team, so i dont think you really ahve a choice do you? [18:13:33] we need to fix that then =] [18:13:35] hahaha [18:13:37] rats! [18:13:41] hehe [18:13:48] you can have your own custom timezone if you like [18:13:52] there are other opsens that don't get SMS too [18:14:22] but it'd be nice if ottomata could be on the paging list [18:14:25] puppet/files/icinga/timeperiods.cfg [18:14:35] ottomata: I have it open now so i can make the changes to the private repo for you [18:14:38] to add your own zone.. then use that in contacts.cfg in private repo [18:14:42] i just have to pull some info off officewiki [18:14:55] oh true, it may not have timezone you want [18:15:24] there is EDT_awake_early_hours [18:15:25] RECOVERY - Apache HTTP on mw31 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.759 second response time [18:15:33] but ..it's early [18:16:04] ottomata: Also, who is your cell provider? (I have your number, just need to know what sms gateway to use ;) [18:16:27] paravoid: huh, i think all opsen should be on the paging list for their time zone, but thats indeed outside scope of this discussion i guess [18:16:49] AT&T [18:17:02] RobH yeah I am EDT, what is early hours? [18:17:02] ha [18:17:05] EDT? 
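For reference on the paging-hours machinery: timeperiods like EDT_awake_early_hours live in puppet/files/icinga/timeperiods.cfg in the public repo, and contacts pick one up from the private repo. Per the description just below (0800 till midnight, 7 days a week), such a period looks roughly like this in standard Nagios/Icinga syntax (illustrative, not the literal file contents):

    define timeperiod{
        timeperiod_name  edt_awake_hours
        alias            0800 till midnight US Eastern, 7 days a week
        sunday           08:00-24:00
        monday           08:00-24:00
        tuesday          08:00-24:00
        wednesday        08:00-24:00
        thursday         08:00-24:00
        friday           08:00-24:00
        saturday         08:00-24:00
    }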
[18:17:06] wait [18:17:14] EST? EDT? Whatevr [18:17:19] alias EDT 0800 till midnight, 7 days a week [18:17:20] !log Restored pre Ic56177a versions of wmf-config/*pmtpa* config files to mw31 again. Something wiped them out since 20:23Z yesterday even though "mw31" is not found in any dsh group files on tin. [18:17:29] Logged the message, Master [18:17:30] ottomata: but you can totally make another one by just copying that [18:17:59] the periods themselves are in the public repo [18:18:00] ahhh, whatever that's fine, if it is annoying i'll change it [18:18:42] Anybody know what's special about mw31 that would cause it to sync common-local outside of scap asking it to? [18:18:52] it rebooted (because it's broken) [18:18:57] so it syncs again [18:19:02] puppet does that I think [18:19:17] Ah. ok. And it probably pulls from tin when it does that [18:19:26] I don't remember the details [18:19:28] (03CR) 10RobH: [C: 032] adds andrew otto to ops paging group [operations/puppet] - 10https://gerrit.wikimedia.org/r/117436 (owner: 10RobH) [18:19:51] but something does make sure that a server that was down doesn't have an outdated copy when it gets up again [18:19:55] bd808: it should not scap to it anymore though and also be gone from pybal since ..like now [18:19:55] before apache starts [18:20:18] we can just shut it down ? [18:20:21] I had heard from Ryan that there was an eventual consistency trigger but I haven't tracked it down in puppet [18:20:27] ottomata: Ok, I've added you to the group for ops paging and set it to the only eastern time zone option available. (Its as mentioned, an hour ealier than maybe you'd like) [18:20:33] but you can add more zones if you hate [18:20:36] neon is updating now [18:20:43] so, welcome to the paging group ;] [18:21:50] um, yay. [18:21:50] :) [18:23:02] bd808: shall we just shutdown -h ? i see you on the box [18:23:35] (03PS1) 10RobH: fixing ordering and spacing for contact list [operations/puppet] - 10https://gerrit.wikimedia.org/r/117439 [18:23:39] mutante: I logged out. If it's ready to die I won't stop it [18:25:39] mutante: This same issue (missing db.php config) is going to happen to any pmtpa mw box that reboots. [18:26:08] I wonder if we should put the config files back on tin for scap/rsync until the boxes are really gone [18:26:09] (03CR) 10RobH: [C: 032] fixing ordering and spacing for contact list [operations/puppet] - 10https://gerrit.wikimedia.org/r/117439 (owner: 10RobH) [18:26:20] bd808: i'll just continue with removing them all from puppet anyways [18:26:31] now that apergos removed them all from pybal [18:26:39] * bd808 nods [18:26:42] so, it's going away real soon [18:27:12] Coolio. I'll stop worrying and get on with life then [18:28:08] PROBLEM - Disk space on snapshot1001 is CRITICAL: DISK CRITICAL - free space: / 1054 MB (3% inode=69%): [18:28:21] wait, but we removed those too [18:28:32] oh, 1001 [18:29:43] bd808: ^ old mw versions that can be removed? [18:31:07] mutante: There probably are. We have branches back to early december on tin at the moment [18:31:26] I don't know the procedure for pruning them... 
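[Editor's note: the thread above ends with nobody knowing the procedure for pruning old MediaWiki branches from tin. As a rough illustration only — this is not any actual Wikimedia tool — a version-aware listing in PHP could look like the sketch below. It assumes the branches sit as php-1.23wmfNN directories under /a/common, as mentioned elsewhere in this log, and that only the newest $keep of them are retained; seven matches the "N..N-6 branches for cache" rule of thumb discussed a little later.]

    <?php
    // Illustrative sketch only: list branch directories, order them as version
    // numbers (a plain string sort would put wmf9 after wmf10), and print
    // everything older than the newest $keep as a candidate for removal.
    $branchDirs = glob( '/a/common/php-*wmf*' );

    usort( $branchDirs, function ( $a, $b ) {
        // version_compare() splits "php-1.23wmf9" into numeric and text parts,
        // so wmf9 < wmf10 compares the way humans expect.
        return version_compare( basename( $a ), basename( $b ) );
    } );

    $keep = 7; // assumed retention count; adjust to match the real cache policy
    $prune = array_slice( $branchDirs, 0, max( 0, count( $branchDirs ) - $keep ) );

    foreach ( $prune as $dir ) {
        echo "candidate for removal: $dir\n";
    }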
[18:32:00] I'll see if I can find that out [18:33:07] RECOVERY - Disk space on snapshot1001 is OK: DISK OK [18:33:47] RECOVERY - Disk space on snapshot1003 is OK: DISK OK [18:34:07] RECOVERY - Disk space on snapshot1002 is OK: DISK OK [18:40:58] bd808: also I notice that wmf13 and up have an extra gb in the cache/l10n, from the file php-1.23wmf17/cache/l10n/upstream, with the json versions, so that's almost double the size for a branch now on the scap [18:42:06] l10n cache for those isn't needed anymore [18:42:23] probably only static stuff in skins/ is still needed, the rest could probably die [18:43:07] apergos: Hmmm… yes. I think the l10n json should have been around for longer but Sam may have been cleaning up [18:43:31] they can't have been around for that much longer or the snaps would have already run out of space [18:43:42] anyways, whatever you can do woudl be great! [18:43:52] How many days does bits still need to have access to the old branches that are not being used? [18:44:10] last live date + 30? [18:44:18] bd808: Varnish caching period [18:44:24] greg-g: Hate to ask, but could we have a deploy for a VE cherry-pick? Corruption bug in production that we fixed for wmf17 but didn't realise was also present in wmf16 until it went live. :-( https://gerrit.wikimedia.org/r/#/c/117445/ [18:44:32] 30 sounds right, but I'd have to look it up (no memory) [18:46:48] yeah role/cache and the apache expires all seem to be 20 days [18:46:49] 30! [18:48:18] apergos: ok 30. I'll see if I can do some cleanup after lunch and save everyone some space [18:48:40] thanks! [18:49:51] At the very least I can prune out the l10n json for all non-live branches [18:50:01] But I think I can do much better than that [18:52:13] wasn't the l10n json supposed to be much rsync/gzip friendlier [18:52:55] James_F: ek, yeah [18:52:57] Nemo_bis: It is, yes. But bigger on disk than cdbs [18:53:20] Oh. [18:53:43] Nemo_bis: We only send json during the scap which is faster but then the nodes builld cdb and keep the json too [18:54:07] paravoid, I have a quick question re https://rt.wikimedia.org/Ticket/Display.html?id=6980: do you think it would be easier to proxy the web service rather than assigning a public IP to the rt test server? [18:54:22] So the servers end up with more than twice as much l10n data. Trading disk for sync speed. [18:55:01] gwicke: what is the web interface used for? [18:55:18] But then you can't benefit from rsync diffing between the json files :/ [18:55:36] maybe you could make them bzip2 them and bunzip2 before scap :P [18:55:47] paravoid, it reports the results of round-trip tests [18:55:56] currently at http://parsoid.wmflabs.org:8001/ (but super slow) [18:55:57] but not used for them? [18:56:11] not used to run the tests, yes [18:56:28] Nemo_bis: Or just remove them after the branch is no longer active :) [18:56:37] we check the web interface to see how it went [18:56:54] so, yes, we could put it behind misc-web-lb if you want [18:57:15] sounds easier and maybe more secure than a full public IP [18:57:47] i wasnt sure if it would work so i wasnt going to commit =] [18:58:02] (or somethign you'd be cool with us rolling into misc-web-lb) [18:58:35] should I try to prepare a puppet patch for that? [18:58:57] if you really want to, sure [18:59:02] otherwise, maybe RobH can help you? [18:59:15] I recently did the doc and integration stuffs with folks [18:59:17] greg-g: Thanks. When will be a good moment? Don't want to break Search's work. 
[18:59:32] paravoid, ok [18:59:47] so im familiar with it [19:00:06] gwicke: ask away when you hit it im happy to help (just glad its something i know how to help with) [19:00:22] (or yea if you get really annoyed or too busy i can hack at it) [19:00:42] RobH, let me give it a quick try (currently grepping for it) [19:00:49] will ping you for help as needed [19:01:03] paravoid, thanks! [19:01:12] no worries [19:01:56] James_F: I think they're ok? [19:02:14] greg-g: OK; will ask Roan to do it at 11:30 after his meeting then. [19:02:19] cool [19:17:14] (03PS1) 10MaxSem: Redirect Kindle search requests to mobile [operations/puppet] - 10https://gerrit.wikimedia.org/r/117449 [19:18:04] (03PS3) 10Dzahn: Remove pmtpa apaches from site.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/117206 (owner: 10Reedy) [19:21:35] (03CR) 10Dzahn: "Reedy, i split that into 2 separate things, puppet and DHCP, and just remove from puppet for now. Robh/Chris might want to use DHCP to mas" [operations/puppet] - 10https://gerrit.wikimedia.org/r/117206 (owner: 10Reedy) [19:22:54] (03CR) 10Dzahn: [C: 032] Remove pmtpa apaches from site.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/117206 (owner: 10Reedy) [19:30:04] (03PS1) 10Bene: Enable Extension:GuidedTour on testwikidatawiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117452 [19:30:24] hoo ^ :D [19:30:39] Do you guys want that *now*? [19:31:20] (03CR) 10John F. Lewis: [C: 031] "Would be useful pending discussions about enabling on Wikidata itself." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117452 (owner: 10Bene) [19:32:32] anyone deploying? [19:32:50] hoo, nobody should: Friday! [19:32:52] Does lists.wikimedia.org go thorugh Google, or directly to mailman, possibly via spamassassin? [19:33:03] MaxSem: I will... testwiki only change ;) [19:34:22] bd808: Are you deploying atm? [19:34:39] Would like to push a testwikidata change [19:34:56] hoo: Nope. I'm just looking around and thinking about cleanup. [19:35:07] ok, will do that quickly then ;) [19:35:16] You need flight deck clearance from greg-g [19:35:29] If he's around at this time [19:36:01] hoo: Just https://gerrit.wikimedia.org/r/#/c/117452 ? [19:36:09] bd808: Exactly [19:36:42] hoo: LGTM [19:37:42] csteipp is on tin atm... but not deploying I guess [19:38:18] hello? [19:38:34] I'm not deploying [19:38:35] greg-g: hey :) [19:38:38] what's up? [19:38:57] greg-g: Volunteeers want me to push a smallish change to testwikidatawiki [19:39:02] enable guided tours there [19:39:09] .... [19:39:27] bug report? [19:39:41] we don't have one, I guess [19:39:46] no worries, just curious [19:39:48] (03CR) 10Addshore: [C: 031] "see no problem here" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117452 (owner: 10Bene) [19:39:59] (03PS1) 10Dzahn: remove some Tampa LVS monitoring [operations/puppet] - 10https://gerrit.wikimedia.org/r/117454 [19:40:08] ... I would love for it to wait for monday, i don't like getting ops made at me [19:40:14] s/made/mad/ [19:40:59] mh, I don't want to block our volunteers for a whole weekend, that's their most productive time, after all [19:41:38] is there a plan for pushing it out on wikidata on a specific date? 
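[Editor's note: for readers unfamiliar with the change being debated above (Gerrit 117452), enabling an extension on a single wiki is normally a per-database switch in the shared wmf-config settings array. The sketch below is illustrative only; the setting name wmgUseGuidedTour is an assumption for the example, not copied from the actual patch.]

    <?php
    // Illustrative only: a per-wiki switch in the shared settings array.
    // In wmf-config the $wgConf object already exists; it is created here
    // just so the fragment is self-contained.
    $wgConf = new SiteConfiguration();

    $wgConf->settings['wmgUseGuidedTour'] = array(
        'default'          => false, // off everywhere by default...
        'testwikidatawiki' => true,  // ...on only for the wiki named in the change
    );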
[19:42:02] greg-g: No, they want to get community consensus there by demoing it from testwikidata AFAIK [19:42:36] There's a thread on the village pump on wikidata, though [19:42:50] k, well, I don't want to be a jerk, but we currently have a "no deploys on friday except in extreme circumstances" policy, and once I compromise once, everyone will start puppy dogging me [19:43:14] "puppy dogging me" meaning "asking nicely with puppy dog eyes" [19:43:20] :) [19:43:21] heh [19:43:31] Ok, we'll let this wait till monday then [19:43:47] thanks, and sorry [19:43:59] greg-g: On a related note... [19:44:03] uh oh [19:44:05] see?! [19:44:30] * hoo hides [19:44:30] apergos askd that I clean up some cruft on tin that is running the shanpshot servers out of disk [19:44:47] that's a puppet change, right? [19:45:01] oh, no, the rm of old mw versions? [19:45:09] No it would be removing old wmf branches and a full scap [19:45:14] * greg-g nods [19:45:15] ugh [19:45:17] heh [19:45:19] fi you would prefer to wait, you can wait [19:45:28] but no new scaps of new branches til an old one goes away [19:45:34] it's essentially a no-op + l10n [19:45:43] In theory [19:45:45] but, honestly [19:45:51] this is tricky [19:45:56] I understand this is a friday etc so [19:46:02] If apergos can wait I can wait [19:46:02] the caching thing has bit us in the butt multiple times since I've been here [19:46:10] * bd808 nods [19:46:11] heh, did you really mean puppet class mediawiki::selfdestroy($version) .. ensure absent? [19:46:16] :) [19:46:19] just... seriously, no new branches til old ones get tossed. [19:46:35] apergos: Next is Thursday [19:46:43] apergos: Agreed. I'll write up a plan for clean up and circulated it today [19:46:45] so it's safe to let this sit till Monday [19:47:00] We can do the cleanup on Monday/Tuesday [19:47:04] cool [19:47:25] ok, thanks! (and if the number of live branches can be kept to a good size all the time we'll not have this problem til the code size gets a loooot bigger) [19:47:27] it could be a cron that uses find to find all files older than X [19:47:57] apergos: Yeah. We need to make cleanup a weekly thing, not just when Sam thinks to do it [19:48:01] heh [19:48:21] mutante: We have a version per week, so it should be ok to count versions... plus one or two for safety (as we sometimes skip a week) [19:48:22] once we figure out X [19:48:37] *should* [19:48:38] Seven. [19:49:03] We need N..N-6 branches for cache [19:49:09] that thing with doublig the size of the branch though... ouch [19:49:17] if that happens again all bets are off! [19:49:25] resize all the LVMs [19:49:33] We can dump the l10n cruft faster [19:49:42] much faster, N-3 there [19:49:49] oohhh now that's an idea I could get behind [19:49:53] bd808: until we change somethind :/ [19:49:58] mtime might be a better metric [19:50:22] all right, well I'm going to mentally check out, it's veg time, I'll drop back in later most likely [19:50:36] greg-g: When we change the release cadence we'll have to rethink many things [19:51:07] bd808: so why increase those things by one unneedly? :P [19:51:09] but I hear ya [19:51:32] I'd just like to decouple it as much as possible/reasonable [19:52:03] mtime doesn't cut it for knowing when a delete is ok. It's bounded by "last page served" time [19:52:38] N versions ago is better for that? 
[19:54:03] We can actually find out from git history +/-scap time when a version was last deployed [19:54:40] * greg-g nods [19:54:47] I like that thinking better ;) [19:54:57] just make a !killmw trigger on IRC and whenever Icinga complains about disk space, you kill :) [19:55:19] N versions is a convenient proxy for thinking about that with an assumption of the cadence [19:56:11] I'll write an email [19:56:13] bd808: yeah, I just assumed mtime would have been reasonable, but if that's not easy, then this is fine [19:56:18] s/easy/right/ [19:56:50] (03CR) 10Dzahn: [C: 032] remove some Tampa LVS monitoring [operations/puppet] - 10https://gerrit.wikimedia.org/r/117454 (owner: 10Dzahn) [19:59:41] hoo: You should add your extension enable to https://wikitech.wikimedia.org/wiki/Deployments for Monday before anyone forgets [20:01:47] right [20:13:59] !log catrope updated /a/common/php-1.23wmf16 to {{Gerrit|I4a10768ec}}: Update VisualEditor to wmf16 branch for cherry-pick [20:14:07] Logged the message, Master [20:14:46] !log catrope synchronized php-1.23wmf16/extensions/VisualEditor/modules/ve-mw/dm/nodes/ve.dm.MWBlockImageNode.js 'Fix image corruption bug' [20:14:54] Logged the message, Master [20:17:50] greg-g: Deployment over, no drama. [20:22:37] bd808: fyi.. https://gerrit.wikimedia.org/r/#/c/117206/3/manifests/site.pp [20:23:13] W00t. Down with pmtpa [20:26:21] !log revoking puppet certs for Tampa appservers [20:26:29] Logged the message, Master [20:27:56] !log killing mw1-16 from puppet stored configs, icinga,.. [20:28:05] Logged the message, Master [20:38:00] hashar: around? [20:38:10] hoo: not really [20:38:37] hashar: mh... quick one... if I change a -labs file... how do I deploy it to labs? [20:38:45] just merge it in gerrit and care no more? [20:39:02] beta cluster, I mean [20:39:21] hoo: any change to operations/mediawiki-config.git has to be reviewed since it might impact prod as well [20:39:44] hashar: Sure, I guess I can find someone for that [20:39:45] hoo: once merged, a jenkins job will update the beta cluster for you. doc is at https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/How_code_is_updated [20:39:57] hoo: the job is https://integration.wikimedia.org/ci/job/beta-mediawiki-config-update/ [20:40:06] ok, and that doesn't need to be synced in production for sanity reasons or whatever? [20:40:25] hoo: and you can find the job by browsing http://integration.wikimedia.org/ then click [Dashboard] in the top menu [20:40:36] hoo: usually one should sync it on production as well :D [20:40:38] I know that job, just wasn't sure [20:40:48] and there's our friday deploy problem again [20:40:56] ;-D [20:41:22] there's a long reddit about friday deploys as well [20:41:37] mutante: :D [20:42:09] (03PS1) 10Hoo man: Enable guided tours on beta's Wikidata [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117467 [20:43:05] (03PS2) 10Hoo man: Enable guided tours on beta's Wikidata [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117467 [20:43:18] (03PS1) 10Bene: Enable Extension:GuidedTour on betalabs wikidatawiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117468 [20:43:23] my current screen is like [20:43:24] Killing mw18.pmtpa.wmnet...done. [20:43:24] Killing mw19.pmtpa.wmnet...done. [20:43:25] Killing mw20.pmtpa.wmnet...done. 
[20:43:28] and keeps going:) [20:44:18] (03CR) 10Hoo man: [C: 04-1] "Also redundant with https://gerrit.wikimedia.org/r/117467" (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117468 (owner: 10Bene) [20:44:39] hashar: Could you have a look, please? :P I can sync it on monday... [20:44:49] hoo: he's not here:) [20:44:51] or pull it onto tin today and the sync it on Monday [20:45:03] (03Abandoned) 10Bene: Enable Extension:GuidedTour on betalabs wikidatawiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117468 (owner: 10Bene) [20:45:53] icinga puppet restart fails but it works anyways [20:46:16] meh, or not, looking [20:47:46] ok, so... that's not going to happen until Monday either? [20:49:44] greg-g: ^ any opinion on that? [20:50:35] hoo: beta cluster? that can happen at any time [20:50:55] greg-g: Ok, it's ok if I just pull it onto tin? [20:51:18] no... [20:51:25] leave it? [20:51:40] oh, right, lemme actually look [20:52:18] ok, yeah, you can pull that on tin, it's a no-op there [20:52:21] https://gerrit.wikimedia.org/r/#/c/117467/ [20:52:23] greg-g: hashar> hoo: usually one should sync it on production as well [20:52:34] that [20:52:36] yeah [20:52:54] greg-g: If it's ok, I can sync it out, really a no-op [20:54:43] ottomata: asking you as RT duty person ;) : how do we figure out what's supposed to be the eqiad maintenance/whatever host with a MediaWiki install? I need it for https://gerrit.wikimedia.org/r/#/c/117250/ [20:54:49] hoo: yeah [20:55:04] (and for other things that will have to be migrated fenari and hume) [20:55:36] (03PS1) 10CSteipp: Use Type 'B' Passwords on CentralAuth Wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117469 [20:56:00] greg-g: ah, thanks :) Will just wait for someone to +1 or +2 the change, so that I have enough confidence [20:56:28] * greg-g nods [20:59:54] (03CR) 10Greg Grossmeier: Enable guided tours on beta's Wikidata (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117467 (owner: 10Hoo man) [21:00:44] why does everyone think this? [21:01:18] $wgConf->settings[$key] = array_merge( $wgConf->settings[$key], $value ); [21:02:00] greg-g: Am I totally wrong on this? AFAIS those are just overrides, so the default from the production file stays in place? [21:04:28] (03CR) 10Hoo man: "re" (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117467 (owner: 10Hoo man) [21:04:30] hoo: The InitialiseSettings-labs.php file gets loaded instead if InitialiseSettings.php [21:04:38] s/if/of/ [21:04:40] bd808: since then? [21:04:42] wtf [21:04:54] That's the way multiversion works [21:05:17] It used to be different [21:05:33] and there's still stuff in the production one around to load the -labs [21:05:49] line 12971 and following in the production one [21:05:59] hoo: bd808 my comment was an honest question, not a socratic method thing :) but I trust bd808 [21:06:23] RobH, are there any rules on the host name to use? [21:06:25] but, honestly, even if it sources from production, it's nice to be explicit [21:06:40] gwicke: uh, as in what, dont rename them. [21:06:40] I'm considering parsoid.tests.wikimedia.org or parsoid.qa.wikimedia.org [21:06:51] oh, you mean cosmetic urls [21:06:54] eh, domain [21:06:55] not actual system hostnames [21:07:00] yeah ;) [21:07:06] greg-g: well, ok... 
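[Editor's note: the disagreement above — and its resolution just below — hinges on the difference between PHP's + operator and array_merge() for string-keyed arrays such as per-wiki settings. A minimal illustration with made-up values; nothing here is taken from the real config files.]

    <?php
    // Made-up per-wiki values, keyed the way $wgConf->settings entries are.
    $production   = array( 'default' => false, 'testwiki' => true );
    $labsOverride = array( 'default' => true );

    // '+' keeps the left-hand value for any key present in both arrays,
    // so the override is silently ignored:
    $plus = $production + $labsOverride;
    // array( 'default' => false, 'testwiki' => true )

    // array_merge() lets the right-hand array win for string keys, which is
    // what the quoted "array_merge( $wgConf->settings[$key], $value )" relies on:
    $merged = array_merge( $production, $labsOverride );
    // array( 'default' => true, 'testwiki' => true )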
[21:07:10] but I still want this to be clear [21:07:14] so you cannot do sub.sub.wikimedia.org on misc-web-lb [21:07:25] it doesnt have the certificates to support it [21:07:30] it can only do *.wikimedia.org [21:07:34] ah, ok [21:07:45] so parsoid-tests.wikimedia.org [21:07:52] sounds the best to me yep [21:07:53] hoo: You're right. Sorry. [21:08:19] hoo: I thought it used the realmSpecificFile stuff but it doesn't [21:08:21] even our primary varnish cannot handle sub.wildard.wikimedia.org [21:08:22] RobH, do you know off-hand in which file the DNS for that lives? [21:08:24] no problem [21:08:35] bd808: can you +1, then maybe? [21:08:38] yep, so there is the dns git project operations/dns [21:08:48] and then you'd put this into templates/wikimedia.org [21:09:01] ah, ok [21:09:12] basically copy doc.wikimedia.org [21:09:15] and you'll be good =] [21:09:24] (for dns) [21:09:35] also, thanks for taking the time to learn and do this stuff [21:09:47] np [21:09:49] cuz you could get away with just dumping this stuff in an RT tikcet and waiting on some opsen [21:09:56] I appreciate it =] [21:10:18] re the backend in varnish, do I need to configure that separately or can I just drop in ruthenium in there? [21:11:08] (03CR) 10Addshore: [C: 031] Enable guided tours on beta's Wikidata (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117467 (owner: 10Hoo man) [21:12:25] So for varnish, you have to modify a couple of things [21:12:29] lemme find them [21:12:59] in operations puppet repo its manifests/role/cache.pp [21:13:21] (03CR) 10BryanDavis: [C: 032] Enable guided tours on beta's Wikidata (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117467 (owner: 10Hoo man) [21:13:27] have to add the backend host there, as well as someplace else... lemme find the config file [21:13:33] (03Merged) 10jenkins-bot: Enable guided tours on beta's Wikidata [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117467 (owner: 10Hoo man) [21:13:47] bd808: Thanks, will sync it [21:14:20] thanks all [21:14:55] (03CR) 10Addshore: Enable guided tours on beta's Wikidata (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117467 (owner: 10Hoo man) [21:14:58] (03PS1) 10MaxSem: Always output Content-Length on mobile redirect [operations/puppet] - 10https://gerrit.wikimedia.org/r/117471 [21:15:04] gwicke: So for varnish backend configuration, you have to add your host to both templates/varnish/misc.inc.vcl.erb and manifests/role/cache.pp [21:15:12] PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 3917 MB (10% inode=95%): /srv 540561 MB (37% inode=99%): [21:15:42] RobH: k, got that [21:15:43] !log hoo synchronized wmf-config/InitialiseSettings-labs.php 'Syncing beta-only change for consistency' [21:15:46] anyone want an easy l10nupdate review? Needs +2 in puppet: https://gerrit.wikimedia.org/r/#/c/116718/ [21:15:51] Logged the message, Master [21:15:52] gwicke: if i recall correctly, those are the two spots to change for that, then the dns and you should be set. [21:16:04] feel free to add me as reviewer if you'd like [21:16:34] we had a lot more issues iwth integration/doc migration than that, but i think its due to odd caching headers sent by one of those [21:16:43] greg-g: Thanks for all your help and patience! 
:) [21:16:48] also bd808 [21:17:24] yw hoo [21:18:40] hoo: yw/ty [21:19:43] (03PS1) 10GWicke: Add parsoid-tests.wikimedia.org proxy forward to misc [operations/puppet] - 10https://gerrit.wikimedia.org/r/117472 [21:20:12] PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 3917 MB (10% inode=95%): /srv 540561 MB (37% inode=99%): [21:20:36] RobH, please add anybody else who should look at this [21:20:52] doing the DNS next [21:22:40] gwicke: well, im confident enough with the change to merge it for you without other folks reviewing. this is basically what we've been doing with misc-web-lb since it started [21:22:50] since paravoid already agreed that it was ok for it to exist there [21:22:58] im comfortable with the implementation [21:23:32] once zuul does its thing. [21:24:11] that ps looks good to me btw. [21:24:31] (03CR) 10RobH: [C: 032] Add parsoid-tests.wikimedia.org proxy forward to misc [operations/puppet] - 10https://gerrit.wikimedia.org/r/117472 (owner: 10GWicke) [21:25:12] PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 3916 MB (10% inode=95%): /srv 540561 MB (37% inode=99%): [21:25:36] RobH, ok; thanks! [21:26:20] its merged [21:26:34] on palladium, so cp1043 is running puppet update now [21:26:45] will kick cp1044 after im sure it didnt break cp1043 [21:26:49] (paranoia!) [21:27:43] there is nothing listening on that port yet btw [21:27:52] (on ruthenium) [21:27:58] yea, thats fine [21:28:06] i just wanna make sure the cp servers like the config, they should. [21:28:34] and it was fine [21:29:01] misc-web-lb varnish systems dont care if the backends they point to are online [21:29:16] i just am paranoid about varnish config changes and babysitting them. [21:29:58] (03PS1) 10Hashar: contint: deny access of node history via RSS [operations/puppet] - 10https://gerrit.wikimedia.org/r/117475 [21:30:12] PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 3916 MB (10% inode=95%): /srv 540561 MB (37% inode=99%): [21:30:47] (03PS1) 10GWicke: Add parsoid-tests to wikimedia.org zone [operations/dns] - 10https://gerrit.wikimedia.org/r/117476 [21:31:10] can someone please merge https://gerrit.wikimedia.org/r/117475 ? That tweak some apache deny rule for the Jenkins server. [21:32:44] (03CR) 10Dzahn: [C: 032] "looks reasonable, per hashar" [operations/puppet] - 10https://gerrit.wikimedia.org/r/117475 (owner: 10Hashar) [21:32:48] thanks [21:33:08] eh, needs verified [21:33:23] mutante: https://gerrit.wikimedia.org/r/#/c/116718/ :) :) [21:35:06] PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 3915 MB (10% inode=95%): /srv 540561 MB (37% inode=99%): [21:35:47] is it just me, or is disk space an issue lately? 
:) [21:37:51] (03PS3) 10Greg Grossmeier: Log length of l10nupdate to SAL and Graphite [operations/puppet] - 10https://gerrit.wikimedia.org/r/116718 [21:38:27] guys, no jenkins, needs rebase, Friday afternoon, "didnt test" comments, what's up ?:) [21:38:50] mutante: :P there's only one place to test production code [21:39:30] i'll do it if jenkins verifies [21:39:37] i dont feel like even overriding that [21:39:50] k [21:41:02] !log disabling puppet agent on all tampa appservers [21:41:10] Logged the message, Master [21:41:58] (03PS2) 10Hashar: contint: deny access of node history via RSS [operations/puppet] - 10https://gerrit.wikimedia.org/r/117475 [21:45:18] !log killing all tampa appservers from puppetstoredconfigs [21:45:27] Logged the message, Master [21:48:39] please see #wikimedia-tech - another instance of private IPs editing articles [21:49:27] Krenair: mind reporting and cc'ing people you think might know? [21:49:28] !log restarting Jenkins it is broken [21:49:32] I know this is a recurring problem... [21:49:36] Logged the message, Master [21:50:03] RobH, https://gerrit.wikimedia.org/r/#/c/117476/ [21:50:20] actually reading through the times, it seems this occured in the early hours of thursday morning (utc), probably no longer relevant... [21:50:31] (03CR) 10RobH: [C: 032] Add parsoid-tests to wikimedia.org zone [operations/dns] - 10https://gerrit.wikimedia.org/r/117476 (owner: 10GWicke) [21:50:49] RobH, thx! [21:50:59] quite welcome, its going live on dns now [21:51:02] (03CR) 10DamianZaremba: "This change has broken ircecho on labs bz #62407." [operations/puppet] - 10https://gerrit.wikimedia.org/r/104504 (owner: 10Matanya) [21:51:10] gwicke: its live [21:51:20] you should be all set now for everything outside your local server setup [21:51:30] (afaik for varnish-web-lb that is) [21:52:56] yup, I'll pick one of the cassandra servers as a worker then [21:54:06] !log Jenkins restarted [21:54:14] Logged the message, Master [21:55:31] from #-tech: [22:46] https://en.wikipedia.org/wiki/Special:Contributions/10.4.1.65 [21:55:42] [22:46] how is that possible? [21:56:03] are/were some internal proxies messed up? [21:56:05] MatmaRex, some host not configured as a proxy/whatever [21:56:12] Labs/tools are 10.x.x.x ips [21:56:36] (03CR) 10Dzahn: [C: 032] contint: deny access of node history via RSS [operations/puppet] - 10https://gerrit.wikimedia.org/r/117475 (owner: 10Hashar) [21:57:12] RobH, I think configuring ruthenium.wikimedia.org did not work as that does not seem to be in DNS [21:58:14] bahhhhhhhhh [21:58:21] yea should have been eqiad.wmnet [21:58:23] ruthenium.eqiad.wmnet [21:58:26] *nod* [21:58:27] i missed it [21:58:39] you fixing or shall i? [21:58:45] Damianz, all internal IPs are 10.*.*.* in our infrastructure [21:59:02] RobH, fixing it [21:59:10] cool, i'll merge when ya finish [21:59:10] MaxSem: True... but not much else edits from there [21:59:29] (03PS1) 10GWicke: Use ruthenium.eqiad.wmnet, .wikimedia.org won't work [operations/puppet] - 10https://gerrit.wikimedia.org/r/117602 [21:59:53] well, edits from labs wouldn't be coming over internal network, for obvious reasons it's impossible [22:00:06] yea, that'll work a bit better. 
[22:00:08] (03CR) 10Dzahn: [C: 032] Log length of l10nupdate to SAL and Graphite [operations/puppet] - 10https://gerrit.wikimedia.org/r/116718 (owner: 10Greg Grossmeier) [22:00:14] ;) [22:00:26] (03CR) 10RobH: [C: 032] Use ruthenium.eqiad.wmnet, .wikimedia.org won't work [operations/puppet] - 10https://gerrit.wikimedia.org/r/117602 (owner: 10GWicke) [22:00:37] greg-g: ^ [22:00:51] dear opsen, WTF is that 10.4.1.65? it's not in DNS [22:01:24] MaxSem: It would be a node in ptmpa labs I think [22:01:24] I need to know whether it should be added to MW config or there's some other problem that needs fixing [22:01:31] ewwww [22:01:34] mutante: weeee [22:01:38] mutante: thank you kindly [22:01:42] routing fail? [22:02:14] MaxSem: pmtpa labs is 10.4.0.0/24 at least internal to labs [22:02:29] 65.1.4.10.in-addr.arpa domain name pointer tools-exec-02.pmtpa.wmflabs. [22:03:06] Those edits were at the same time that varnish frontends were mangling the cookies [22:03:08] ; 10.4.0.0/21 - guest VMs subnet [22:03:08] ; 10.4.16.0/24 - VM host compute nodes subnet [22:03:59] hurry up zuul [22:04:03] greg-g: np [22:04:53] Coren, ^^^^ [22:04:56] MaxSem: must have been a bot on labs instance (that was already moved) [22:04:59] i think [22:05:14] that is about explaining edits from that IP right [22:05:21] * Coren reads backscroll [22:05:44] It was probably logged out due to https://bugzilla.wikimedia.org/show_bug.cgi?id=62288 [22:05:46] that IP shows up in MW history [22:06:12] Ah, yes. Bad bot that keeps on editing when logged off. Point me at an edit so I can kill it? [22:06:24] Coren: it's over, it seems [22:06:29] I think bd808 is being ignored :) [22:06:42] Coren, the problem is not in logging of but that it's editing from an internal IP [22:06:49] * bd808 is used to that [22:06:58] Coren: https://en.wikipedia.org/wiki/Special:Contributions/10.4.1.65 [22:07:10] MaxSem: Bots aren't allowed to edit while logged off, at least per enwiki polict; they should be using &assert=bot [22:07:28] MaxSem: It doesn't have an external ip does it? Do we route labs access to prod out via SNAT and back in again? [22:07:52] maxsem@bastion1:~$ ping mw1001.eqiad.wmnet [22:07:52] PING mw1001.eqiad.wmnet (10.64.0.31) 56(84) bytes of data. [22:07:52] From ae0-105.cr1-sdtpa.wikimedia.org (10.4.16.252) icmp_seq=1 Packet filtered [22:07:56] bd808: We don't. Bots normally edit from internal IPs. [22:08:05] * bd808 nods [22:08:11] Well, bots in labs anyways. [22:08:21] eh, so there's no bug? [22:08:38] MaxSem: https://bugzilla.wikimedia.org/show_bug.cgi?id=62288 [22:08:41] Fixed now [22:08:53] MaxSem: right, not anymore [22:08:59] as long as we never reuse that IP for a different bot? [22:09:08] I mean, that bots edit from internal IPs [22:09:16] because it's a deleted instance now, right [22:09:33] I'll just anon-block that range [22:09:42] MaxSem: That works. [22:09:49] Hm. Wait. [22:10:03] Couldn't that cause trouble with XFF blocks? 
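[Editor's note: a quick sanity check of the subnet question above — whether 10.4.1.65 really falls inside the 10.4.0.0/21 guest-VM range. The helper below is hypothetical, written only for this note; it is not part of MediaWiki or any Wikimedia config.]

    <?php
    // Hypothetical helper: true if $ip lies within the CIDR range $cidr.
    function ipInCidr( $ip, $cidr ) {
        list( $net, $bits ) = explode( '/', $cidr );
        $mask = -1 << ( 32 - (int)$bits );
        return ( ip2long( $ip ) & $mask ) === ( ip2long( $net ) & $mask );
    }

    var_dump( ipInCidr( '10.4.1.65', '10.4.0.0/21' ) );  // bool(true)  - labs guest VM range
    var_dump( ipInCidr( '10.64.0.31', '10.4.0.0/21' ) ); // bool(false) - mw1001, eqiad production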
[22:10:43] Coren, as long as we block labs range only [22:10:51] ...should be ok [22:13:06] https://en.wikipedia.org/w/index.php?title=Special:Log&type=block&user=&page=User:10.4.1.0%2F24 [22:13:56] !log deleting salt keys for all Tampa app servers removed from puppet [22:14:04] Logged the message, Master [22:15:06] RECOVERY - check_disk on lutetium is OK: DISK OK - free space: / 23658 MB (66% inode=95%): /srv 540714 MB (37% inode=99%): [22:18:09] Jenkins is happy so I will happily rush to bed *wave* [22:18:17] night hashar [22:23:28] RobH, https://gerrit.wikimedia.org/r/#/c/117602/ might need to be touched to convince Jenkins [22:23:29] greg-g: Hmm, your roadmap didn't note the roll-out of the new ULS Beta Feature (it'll go with the train). [22:24:01] yea i was waiting on it then got distracted, merging now [22:24:43] merged, pushing to cp1043 [22:26:39] k, ready to test [22:26:59] success [22:27:05] thanks! [22:29:43] heh, you hit cp1043 i bet [22:29:47] im rolling on 1044 now [22:29:57] (but puppet may have auto ran in those minutes, its possible) [22:30:13] misc-web-lb is just load balacned between those two varnish servers [22:30:14] James_F: probably because it wasn't on the calendar at all... [22:30:27] greg-g: Yeah, sorry, I should have thought to add it after the fact. [22:30:32] no worries [22:30:32] gwicke: glad it worked, its live on cp1044 now so you should be all set =] [22:30:32] * James_F grumbles. [22:30:37] the BF stuff is hard to track [22:30:45] … [22:30:51] * James_F avoids swearing. Just. [22:30:58] :) [22:30:59] Let's go with "yes". [22:31:12] James_F: I'm open to suggestions if you have any. [22:32:29] greg-g: We could try to re-engineer it to a whitelist rather than a hook, but… [22:32:43] greg-g: rdwrer might have some thoughts. [22:33:04] rdwrer: what say you, good sir? [22:33:09] (03CR) 10Ori.livneh: [C: 032] Adding submodule update support to git::clone [operations/puppet] - 10https://gerrit.wikimedia.org/r/117229 (owner: 10Ottomata) [22:35:32] (03PS2) 10Ori.livneh: Emit GeoIP cookie using dedicated Set-Cookie header [operations/puppet] - 10https://gerrit.wikimedia.org/r/117375 [22:36:47] (03CR) 10BryanDavis: "Added a post-merge comment about the possible need for `--recursive` to deal with sub-submodules." (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/117229 (owner: 10Ottomata) [22:38:26] (03PS1) 10Ori.livneh: Re-enable GeoIP Set-Cookie on Labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/117612 [22:39:51] * Damianz gets twitchy [22:45:50] greg-g: hey! Added myself https://wikitech.wikimedia.org/wiki/Deployments#Week_of_March_10th (very first row) [22:45:54] I hope that's fine [22:49:02] James_F, greg-g, what are we talking about? [22:51:33] rdwrer: Some way of controlling what BFs just magically appear in the train due to someone, somewhere merging a BF hook in a deployed extension. [22:52:52] James_F: Oh, totally. Whitelist would be relatively easy to do based on ID [22:53:35] rdwrer: James_F so then it'd be easy to plan/annouce them instead of them just appearing (is my main concern, I don't care tech wise ;) ) [22:54:05] rdwrer: So in the BF hook handler we just have a manually-updated whitelist, you're thinking? [22:54:29] rdwrer: Feels like that mixes code (for anyone) with config (for WMF) though. :-( [22:54:35] James_F: We'd have a whitelist configured in wmf-config, I'd imagine [22:55:03] And we'd probably want to default to pass-through, given that that's how people expect MW extensions to work. 
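[Editor's note: the "whitelist configured in wmf-config" idea above never gets code in this log, so here is one hypothetical shape of it, purely as a sketch. The feature IDs and the filter function are invented for illustration, and wiring it into the BetaFeatures registration hook is deliberately left out.]

    <?php
    // Hypothetical sketch of a wmf-config controlled Beta Features whitelist.
    // $prefs is assumed to be an array of preference definitions keyed by
    // feature ID; $whitelist would come from wmf-config.
    function wmfFilterBetaFeatures( array $prefs, $whitelist ) {
        if ( $whitelist === null ) {
            // Default to pass-through, as suggested above: no whitelist
            // configured means every registered feature still appears.
            return $prefs;
        }
        // Otherwise only whitelisted feature IDs survive to the train.
        return array_intersect_key( $prefs, array_flip( $whitelist ) );
    }

    // Example with invented IDs:
    $prefs = array(
        'some-new-feature'   => array( /* ... */ ),
        'another-experiment' => array( /* ... */ ),
    );
    $visible = wmfFilterBetaFeatures( $prefs, array( 'some-new-feature' ) );
    // $visible now contains only 'some-new-feature'.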
[22:57:09] (11:45:50 PM) hoo: greg-g: hey! Added myself https://wikitech.wikimedia.org/wiki/Deployments#Week_of_March_10th (very first row) [22:57:46] hoo: should be fine [22:57:57] there's enough opsen awake then, and hashar is too [22:58:18] ok :) Thanks again [22:58:29] James_F: hehe, wrong week, I'll fix [22:59:06] James_F: wait, no, misread [22:59:16] James_F: but I'll include it in next weeks mw deploy notes [22:59:22] on the cal [22:59:29] greg-g: Kk. [22:59:31] up to you on sending out to wikitech/ambassadors [22:59:44] greg-g: I told Pau to do so, and he said he would. [22:59:53] greg-g: It's in Tech/News, which is a good start. [23:00:20] cool [23:03:45] all tampa appservers out of icinga, cya later [23:03:54] nice [23:14:41] recentchanges is broken on beta [23:14:48] PHP fatal error in /data/project/apache/common-local/php-master/extensions/Flow/includes/Formatter/RecentChanges.php line 85: [23:14:48] Call to a member function getId() on a non-object [23:15:11] jackmcbarn: join me in #wikimedia-corefeatures and say the same thing [23:27:13] (03PS1) 10BryanDavis: Order branch directories as version numbers [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117618 [23:36:30] (03CR) 10Ori.livneh: [C: 032] Order branch directories as version numbers [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117618 (owner: 10BryanDavis) [23:36:37] (03Merged) 10jenkins-bot: Order branch directories as version numbers [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117618 (owner: 10BryanDavis) [23:36:48] !log ori updated /a/common to {{Gerrit|If2530281b}}: Order branch directories as version numbers [23:36:56] Logged the message, Master
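[Editor's note: the beta fatal reported above — "Call to a member function getId() on a non-object" — is the classic shape of dereferencing a value that can be null. The snippet below is not the actual Flow fix; it only sketches the generic guard, with invented method names, for readers following along.]

    <?php
    // Generic guard, not the real Flow patch: check that the object exists
    // before calling getId() on it, and degrade gracefully otherwise.
    function describeRevisionUser( $revision ) {
        $user = $revision ? $revision->getUser() : null; // getUser() is assumed here
        if ( !is_object( $user ) ) {
            // Without the guard, this is where the fatal would occur.
            return '(unknown user)';
        }
        return $user->getName() . ' (id ' . $user->getId() . ')';
    }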