[00:00:15] LD time, I'd like to deploy first
[00:01:46] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Server Error - 1703 bytes in 7.627 second response time
[00:05:45] !log reedy Finished scap: no op scap is no op? (duration: 39m 23s)
[00:05:53] Logged the message, Master
[00:06:25] (03CR) 10Hashar: "Pinged ops list to find out whether someone is available on monday morning :]" [operations/puppet] - 10https://gerrit.wikimedia.org/r/111917 (owner: 10BryanDavis)
[00:07:04] !log maxsem synchronized php-1.23wmf15/extensions/MobileFrontend 'https://gerrit.wikimedia.org/r/114644'
[00:07:12] Logged the message, Master
[00:08:22] GOD DAMN IT
[00:09:24] :)
[00:09:46] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 474404 bytes in 8.214 second response time
[00:10:13] greg-g: can we get into the LD today? we have one patch for 14 and 15 which fixes an issue where sometimes less than the requested number of items are returned in RecentChanges
[00:10:15] !log maxsem synchronized php-1.23wmf14/includes/api/ApiCreateAccount.php
[00:10:22] Logged the message, Master
[00:10:24] (03PS1) 10Reedy: Tampa is silly and should go away [operations/puppet] - 10https://gerrit.wikimedia.org/r/114664
[00:10:48] didn't pmtpa go away long ago?
[00:11:06] No
[00:11:10] It's still breathing
[00:11:16] its as gone as wikitext :P
[00:11:19] ebernhardson: yeah, after MaxSem is done
[00:11:23] greg-g: excellent, thanks
[00:11:29] didn't get enough kicking in the kidneys
[00:11:40] (03CR) 10Reedy: [C: 031] Remove old Tampa srv* and mw* apaches from dsh groups [operations/puppet] - 10https://gerrit.wikimedia.org/r/108070 (owner: 10Chad)
[00:11:54] (03Abandoned) 10Reedy: Tampa is silly and should go away [operations/puppet] - 10https://gerrit.wikimedia.org/r/114664 (owner: 10Reedy)
[00:12:14] !log maxsem synchronized php-1.23wmf14/extensions/ConfirmEdit/
[00:12:21] greg-g, I'm done
[00:12:22] Logged the message, Master
[00:12:28] <^d> Reedy: I think the people in Tampa might take issue with your commit summary.
[00:12:29] MaxSem: tested?
[00:12:34] <^d> If you're just taking out all of Tampa as a city :p
[00:13:38] greg-g, https://en.wikipedia.org/wiki/Special:ApiSandbox#action=paraminfo&format=json&modules=createaccount
[00:15:16] (03CR) 10Reedy: "The whole scap process includes the time taken to sync to tampa." [operations/puppet] - 10https://gerrit.wikimedia.org/r/108070 (owner: 10Chad)
[00:15:56] ebernhardson: go forth
[00:17:40] !log Reloading zuul to deploy I37ce89455724ed15
[00:17:48] Logged the message, Master
[00:23:08] !log ebernhardson synchronized php-1.23wmf14/extensions/Flow/
[00:23:16] Logged the message, Master
[00:24:15] gotta run
[00:25:32] !log ebernhardson synchronized php-1.23wmf15/extensions/Flow/
[00:25:40] Logged the message, Master
[00:26:43] (03PS1) 10coren: Revert "Reenable redis for keystone in eqiad" [operations/puppet] - 10https://gerrit.wikimedia.org/r/114670
[00:26:49] (03PS2) 10coren: Revert "Reenable redis for keystone in eqiad" [operations/puppet] - 10https://gerrit.wikimedia.org/r/114670
[00:27:58] during that last sync-dir, a new error i've never seen from sync-dir before:
[00:28:02] snapshot3: rsync: mkdir "/usr/local/apache/common-local/php-1.23wmf15/extensions/Flow" failed: No such file or directory (2)
[00:28:10] repeated for snapshot{1,2,3,4}
[00:28:15] (03CR) 10coren: [C: 032] "Revert." [operations/puppet] - 10https://gerrit.wikimedia.org/r/114670 (owner: 10coren)
[00:28:39] and : snapshot3: rsync error: error in file IO (code 11) at main.c(595) [Receiver=3.0.7]
[00:28:49] ignore those
[00:28:56] ok, excellent
[00:36:17] (all done, btw)
[00:36:38] (03CR) 10Chad: [C: 031] "Everything Reedy said. Plus I'll expand on the "falling back trivially" bit." [operations/puppet] - 10https://gerrit.wikimedia.org/r/108070 (owner: 10Chad)
[09:51:08] i'm going to be very naughty and sync a trivial js fix
[09:51:17] don't tell anyone
[09:53:19] !log ori synchronized php-1.23wmf14/extensions/MultimediaViewer/resources/mmv/mmv.performance.js 'I41b6e975353: Backport fix for stats.bandwidth == Infinity'
[09:53:26] se4598: just saw your ping from earlier; i'll look now
[09:53:27] Logged the message, Master
[10:04:35] (03PS1) 10Ori.livneh: Set $wmfExtendedVersionNumber = $wmfVersionNumber [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/114718
[10:04:49] (03CR) 10Ori.livneh: [C: 032] Set $wmfExtendedVersionNumber = $wmfVersionNumber [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/114718 (owner: 10Ori.livneh)
[10:05:05] (03Merged) 10jenkins-bot: Set $wmfExtendedVersionNumber = $wmfVersionNumber [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/114718 (owner: 10Ori.livneh)
[10:05:35] !log ori updated /a/common to {{Gerrit|I10170d77c}}: Set $wmfExtendedVersionNumber = $wmfVersionNumber
[10:05:42] Logged the message, Master
[10:06:40] se4598: fixed; sorry about that
[10:06:53] thanks
[10:07:02] (03PS1) 10Yuvipanda: Add GlobalCssJs extension to betalabs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/114719
[10:07:14] legoktm: Gloria ^
[10:57:36] (03PS1) 10Alexandros Kosiaris: Add asw-d-eqiad to rancid [operations/puppet] - 10https://gerrit.wikimedia.org/r/114724
[11:06:39] (03CR) 10Alexandros Kosiaris: [C: 032] Add asw-d-eqiad to rancid [operations/puppet] - 10https://gerrit.wikimedia.org/r/114724 (owner: 10Alexandros Kosiaris)
[11:11:55] (03PS1) 10Alexandros Kosiaris: Add asw-d-eqiad to torrus [operations/puppet] - 10https://gerrit.wikimedia.org/r/114727
[11:13:29] (03CR) 10Alexandros Kosiaris: [C: 032] Add asw-d-eqiad to torrus [operations/puppet] - 10https://gerrit.wikimedia.org/r/114727 (owner: 10Alexandros Kosiaris)
[12:03:51] (03PS1) 10Tim Landscheidt: Fix indentation in role::labs::instance [operations/puppet] - 10https://gerrit.wikimedia.org/r/114734
[12:24:12] (03PS1) 10Alexandros Kosiaris: Populate network.pp with eqiad row D networks [operations/puppet] - 10https://gerrit.wikimedia.org/r/114735
[12:26:24] (03CR) 10Alexandros Kosiaris: "I took the liberty of allocating networks for Row D @eqiad. I have not proceded with any changes to the equipment yet. Please verify or re" [operations/puppet] - 10https://gerrit.wikimedia.org/r/114735 (owner: 10Alexandros Kosiaris)
[12:42:06] * mark checks that :)
[12:43:27] akosiaris: you just used the eqiad labs prefix
[12:55:21] (03CR) 10Mark Bergsma: [C: 04-2] "That's the eqiad labs floating IP prefix..." [operations/puppet] - 10https://gerrit.wikimedia.org/r/114735 (owner: 10Alexandros Kosiaris)
[12:57:19] :-(
[13:01:19] eqiad ip space is pretty full
[13:01:29] seems like a /27 is the best we can do until we move some stuff around
[13:02:03] do we document the IP space somewhere ?
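A note on the snapshot failures above: the "No such file or directory" mkdir error means the php-1.23wmf15 parent tree never made it to those hosts, and rsync exit code 11 (file IO error) is just its generic wrapper. A quick way to see which hosts are missing the checkout, sketched with dsh against the same group scap uses (group name per files/dsh/group in operations/puppet; path taken from the error itself):

    # print MISSING for any host without the wmf15 staging tree
    dsh -g mediawiki-installation -M -- \
        'ls -d /usr/local/apache/common-local/php-1.23wmf15 >/dev/null 2>&1 || echo MISSING'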
[13:02:16] in dns
[13:02:21] ahaha
[13:02:21] and in private outlines
[13:02:34] (03PS1) 10Tim Landscheidt: Fix paths in comments after modularization [operations/puppet] - 10https://gerrit.wikimedia.org/r/114736
[13:02:48] hmm I consulted observium before picking those. Seems like it is not enough
[13:02:50] ?
[13:02:54] (03PS1) 10Mark Bergsma: Allocate /27 for public1-d-eqiad [operations/dns] - 10https://gerrit.wikimedia.org/r/114738
[13:02:58] dns is pretty authoritative
[13:03:15] yeah but too verbose as well
[13:03:50] i have always used an outliner for planning this hierarchically
[13:03:53] but it doesn't distribute well
[13:04:16] wiki tables are not great either
[13:04:30] but installing specialized software is a bit overkill too
[13:04:37] like an IPAM ?
[13:04:42] yes
[13:05:15] there isn't a ton of ip space to manage ;)
[13:06:48] (03PS1) 10Tim Landscheidt: Fix manage-keys-nfs misnomer in usage message [operations/puppet] - 10https://gerrit.wikimedia.org/r/114739
[13:06:48] observium says 208.80.155.65/26 Subnet sandbox1-b-eqiad, DNS says 208.80.155.64/28 Sandbox1-b-eqiad subnet
[13:06:55] something is amiss here
[13:07:01] better check the routers
[13:07:11] if we're really using a /26 for sandbox, we're going to change that right now :)
[13:08:17] seems like we do
[13:08:47] (03PS1) 10Tim Landscheidt: ldap: Fix typo in usage messages [operations/puppet] - 10https://gerrit.wikimedia.org/r/114740
[13:09:06] fcs
[13:09:21] ok
[13:09:25] can you just change that on the routers
[13:09:32] then ask the freenode people to adjust their one server later?
[13:09:36] it shouldn't harm much
[13:10:14] ok, doing it now then
[13:10:29] perhaps even /29 is
[13:10:30] but yeah
[13:12:54] ;208.80.155.64/26 Sandbox subnet
[13:12:54] ;208.80.155.64/28 Sandbox1-b-eqiad subnet
[13:13:10] seems like the idea was to have a big one and split it up ?
[13:13:14] overkill though
[13:16:44] maybe the idea was sandbox subnets across DCs
[13:18:43] i think the idea was sandbox subnets for multiple rows
[13:18:53] but we don't have the ip space for that, we wouldn't do that
[13:29:03] !log just resized 208.80.155.64/26 to 208.80.155.64/28. This is Sandbox1-b-eqiad subnet. dickson.freenode.net needs to have its netmask changed. I will talk with coren, mutante
[13:29:12] Logged the message, Master
[13:31:53] they need to do ip changes anyway
[13:31:58] for using the extra service ip
[13:33:15] the extra service ip will be on the same /28 or another one ?
[13:33:38] i just mailed coren, mutante about it. I also mentioned the extra service ip as well
[13:34:54] within the same /28
[13:35:01] they indicated that they wanted to keep the irc address the current one
[13:35:05] mark: the v6s and the private on https://gerrit.wikimedia.org/r/#/c/114735/1/manifests/network.pp were ok ? Can I just amend the public IPv4 with 208.80.155.96/27?
[13:35:06] and move the "system ip" to a new ip
[13:35:33] hmm
[13:35:58] (03CR) 10Mark Bergsma: Populate network.pp with eqiad row D networks (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/114735 (owner: 10Alexandros Kosiaris)
[13:35:59] I am not sure what this will save them from. It will be the same ethernet interface that will be saturated on the next attack
[13:36:17] it's easier for us to filter that
[13:36:31] we can easily filter almost all traffic to the irc ip
[13:36:40] without blocking their own access and stuff
[13:37:27] * Coren wakes.
[13:37:51] mark: Exactly. You can literally block absolutely everything but the irc ports on the irc IP
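For the record, the filtering mark and Coren describe boils down to a default-deny on just that one address. A rough Linux-side sketch (the real filters would live on the routers; 192.0.2.10 and the port list are placeholders, not dickson's actual configuration):

    # permit only IRC on the dedicated IRC service IP, drop everything else to it
    iptables -A INPUT -d 192.0.2.10 -p tcp -m multiport --dports 6665:6667,6697 -j ACCEPT
    iptables -A INPUT -d 192.0.2.10 -m state --state ESTABLISHED,RELATED -j ACCEPT
    iptables -A INPUT -d 192.0.2.10 -j DROP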
You cal literally block absolutely everything but the irc ports on the irc IP [13:38:04] yeah true. Still we will be a choke point but it is an improvement [13:39:32] (03CR) 10Alexandros Kosiaris: "Subnet analytics1-a-eqiad" (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/114735 (owner: 10Alexandros Kosiaris) [13:41:56] sure [13:42:32] akosiaris: sigh. alright. then it's fine :) [13:44:20] (03PS2) 10Alexandros Kosiaris: Populate network.pp with eqiad row D networks [operations/puppet] - 10https://gerrit.wikimedia.org/r/114735 [13:54:32] (03CR) 10Alexandros Kosiaris: [C: 032] Populate network.pp with eqiad row D networks [operations/puppet] - 10https://gerrit.wikimedia.org/r/114735 (owner: 10Alexandros Kosiaris) [13:55:09] hmmm a patch with a +2 and a -2 ... gerrit seems to want some massaging to allow me to merge this [14:01:18] just remove my -2? [14:06:09] yeah it worked :-) [14:27:46] PROBLEM - Varnish HTCP daemon on cp1067 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:27:46] PROBLEM - Varnish HTTP text-backend on cp1067 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:27:56] PROBLEM - Varnish traffic logger on cp1067 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:28:56] RECOVERY - Varnish traffic logger on cp1067 is OK: PROCS OK: 2 processes with command name varnishncsa [14:29:37] RECOVERY - Varnish HTCP daemon on cp1067 is OK: PROCS OK: 1 process with UID = 111 (vhtcpd), args vhtcpd [14:29:37] RECOVERY - Varnish HTTP text-backend on cp1067 is OK: HTTP OK: HTTP/1.1 200 OK - 188 bytes in 0.001 second response time [14:44:12] mark -- the other day we talked about allowing labs<->labs communication… is that still something that might be possible? [14:58:16] andrewbogott: yes, on my task list [14:58:31] good enough for me :) [15:00:15] (03CR) 10Ottomata: "It will! This will be a git-submodule at that place in the ops/puppet repo." [operations/puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/110650 (owner: 10Ottomata) [15:10:06] PROBLEM - Host virt1002 is DOWN: PING CRITICAL - Packet loss = 100% [15:14:37] RECOVERY - Host virt1002 is UP: PING OK - Packet loss = 0%, RTA = 0.65 ms [15:30:16] !log dist-upgrade and reboot boron [15:30:23] Logged the message, Master [15:37:52] garg. that didn't go well... [16:21:25] http://lists.wikimedia.org/pipermail/wikimedia-l/ gives 403... 
[17:07:13] uuups
[17:18:03] (03PS1) 10Ryan Lane: Use the Token keystone redis driver rather than the TokenNoList driver [operations/puppet] - 10https://gerrit.wikimedia.org/r/114756
[17:19:18] (03PS1) 10Ryan Lane: Revert "Revert "Reenable redis for keystone in eqiad"" [operations/puppet] - 10https://gerrit.wikimedia.org/r/114757
[17:20:45] (03CR) 10Ryan Lane: [C: 032] Use the Token keystone redis driver rather than the TokenNoList driver [operations/puppet] - 10https://gerrit.wikimedia.org/r/114756 (owner: 10Ryan Lane)
[17:22:07] (03CR) 10Ryan Lane: [C: 032] Revert "Revert "Reenable redis for keystone in eqiad"" [operations/puppet] - 10https://gerrit.wikimedia.org/r/114757 (owner: 10Ryan Lane)
[17:26:16] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Fri Feb 21 17:26:13 UTC 2014
[17:28:46] RECOVERY - Varnish HTTP text-backend on cp1054 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.011 second response time
[17:31:38] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000
[17:31:46] PROBLEM - Varnish HTTP text-backend on cp1054 is CRITICAL: Connection refused
[17:31:56] bblack: ^ it died again
[17:32:29] magical anti-varnish fairies at work?
[17:32:45] Feb 21 17:28:54 cp1054 varnishd[9281]: Child (15891) Panic message: Assert error in BAN_RefBan(), cache_ban.c line 481:#012 Condition(t1 == t0) not true.#012thread = (persistence)#012ident = Linux,3.2.0-48-generic,x86_64,-spersistent,-spersistent,-spersistent,-spersistent,-smalloc,-hcritbit,epoll#012Backtrace:#012 0x4337b5: /usr/sbin/varnishd() [0x4337b5]#012 0x416446: /usr/sbin/varnishd(BAN_RefBan+0x96) [0x416446]#012 0x4545bf: /usr/
[17:32:54] actually, looks like probably a corrupt persistent store
[17:34:32] ottomata: an1021 has a Kafka Broker Messages In CRITICAL alert
[17:34:59] (in case it wasn't obvious by me pinging people, I just opened Icinga's unhandled problems page :)
[17:36:26] oo hmm thanks
[17:37:04] hm!
[17:37:09] an22 is the leader for all topics
[17:37:10] WHY!?
[17:39:21] manybubbles: hello
[17:40:19] ottomata: btw, I did finish auditing the rdkafka code, but I didn't turn up anything substantial. One one-liner bugfix to an assert(), basically. There was lots of sloppy linenoise that made static analysis confusing, but once you get through that, most of the warnings didn't amount to real, actionable runtime bugs.
[17:41:17] (that's not to say it's bug-free, but it certainly survives substantial analysis at the source level :) )
[17:41:27] bblack: if the cache is corrupted, then let's just rm it
[17:41:27] aye ok, thanks bblack
[17:41:29] much appreciated
[17:41:56] paravoid: I am, but XFS and/or varnish is being retarded - the varnish proc is stuck and XFS is very very very slowly deallocating space for the old files, etc
[17:42:02] I may just have to reboot + cleanup
[17:42:15] oh, right, we've seen this before
[17:42:33] last time I think we just mkfs'ed it again :)
[17:42:59] well right now I can't even unmount, because the varnish proc is holding open the FS and can't be killed :P
[17:43:06] nice
[17:43:28] haha
[17:43:33] nice
[17:43:40] btw, maybe upgrade to 3.0.5 while we're at it?
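The fix bblack lands later (see the 18:19 !log) is rebuilding the persistent cache from scratch. In the clean case, where the varnishd child can actually be stopped, that would look roughly like the following; the device and mount point are hypothetical, since the real storage layout isn't shown in the log:

    service varnish stop
    umount /srv/sda3
    mkfs.xfs -f /dev/sda3    # "last time I think we just mkfs'ed it again"
    mount /srv/sda3
    service varnish start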
[17:43:40] on the other hand, it's about 25% done deallocating, letting it run might be less trouble than the reboot
[17:44:06] paravoid: yeah, I started that two days ago late at night and hit some snags (5xx spikes), and aborted and downgraded the ones I had already hit
[17:44:18] ooops
[17:44:20] it's quite possible the cp1054 thing was related and the upgrades would've otherwise been fine
[17:44:45] perhaps the corruption happened during the stop->upgrade->start
[17:44:56] anyways, I'll retry again after sorting these out
[17:46:10] re: cp4009, ever seen this before?
[17:46:13] ERROR: Timeout while waiting for server to perform requested power action.
[17:46:17] lol
[17:46:21] ^ does that on racadm powercycle and racadm hardreset
[17:46:22] I just saw that with another box
[17:46:28] like literally a few minutes ago
[17:47:04] * bblack tries racadm serveraction pull-plug-and-kick-machine
[17:47:12] packages used in deployment discussion going on in -office :)
[17:51:37] RECOVERY - NTP on bast4001 is OK: NTP OK: Offset 0.00358748436 secs
[17:59:20] (03CR) 10PleaseStand: Fix CDB file generation (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/114686 (owner: 10Ori.livneh)
[18:00:40] (03CR) 10PleaseStand: Swiched from using dat to json files for wikiversions (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/114687 (owner: 10Reedy)
[18:11:46] RECOVERY - Varnish HTTP text-backend on cp1054 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.007 second response time
[18:12:30] paravoid: re: racadm power issues - the crux in my case seems to be "Description: The system board fail-safe voltage is outside of range.
[18:12:36] " in the SEL
[18:12:44] oops
[18:12:53] RT ticket under the ulsfo queue
[18:13:10] some dell manuals recommend either pulling the hard AC power from the machine for 10s, or clearing the SEL (as if the presence of the SEL entry might persistently prevent it from powering up)
[18:13:16] lol
[18:13:22] I tried racadm racreset [hard], no avail
[18:13:31] yeah I did the same with ms-be1005
[18:13:39] with the same result
[18:13:43] tried clrsel -> racreset as well, and it regenerates the voltage message after racreset hard - so the condition persists
[18:14:01] what's the command to read the SEL?
[18:14:06] racadm getsel
[18:14:12] heh
[18:14:59] so either (a) after a glitchy power fail event, RAC can suck and need a true "pull the power plug" to reset something, or (b) the power is still borked somehow on these machines (PS damage?)
[18:15:57] either way, I'm guessing site visit required
[18:16:08] huh, no I can't access ms-be1005's mgmt at all
[18:16:37] oh now I got in
[18:16:37] Severity: Critical
[18:16:37] Description: CPU 1 has a thermal trip (over-temperature) event.
[18:16:41] fun
[18:17:35] it sounds like, at least in some cases, critical events in the SEL prevent powerup, and "racadm clrsel" (+ maybe racadm racreset?) might allow it to try again....
[18:17:41] but then, you lose the SEL for history :P
[18:19:12] !log cp1054 healthy now, rebuilding persistent cache from scratch there...
[18:19:20] Logged the message, Master
[18:19:35] bblack: RT ticket under ulsfo
[18:19:46] working on it! :)
[18:19:55] and the magical DC fairy will fix it
[18:19:55] :P
[18:19:57] isn't this cp1054, so eqiad?
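Collected for reference, the DRAC incantations traded above are all stock racadm subcommands (remote-access options such as -r/-u/-p omitted here):

    racadm getsel                    # dump the System Event Log
    racadm clrsel                    # clear it -- losing the history, as bblack notes
    racadm racreset hard             # reset the DRAC itself
    racadm serveraction powercycle   # then retry the power action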
[18:20:08] different box
[18:20:12] ok
[18:20:16] two issues at once :)
[18:20:29] the other one is cp4009
[18:20:33] down since the power outage
[18:23:16] https://rt.wikimedia.org/Ticket/Display.html?id=6890
[18:24:49] paravoid: you pinged? sorry, I was out and didn't |away myself
[18:24:55] !log initiating kafka preferred replica election to rebalance partition leaders
[18:24:57] I did
[18:25:02] Logged the message, Master
[18:25:04] got a sec?
[18:26:46] RECOVERY - Kafka Broker Messages In on analytics1021 is OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate OKAY: 1868.80318074
[18:27:58] manybubbles: my question was, IIRC way before we even installed ElasticSearch, there was the stated goal of using it for TTM too; is this still the plan? if so, is there any progress or roadmap for it?
[18:28:26] paravoid: TTM is translation memory?
[18:28:29] yes
[18:28:47] yeah, that is the goal but it is something I imagined I'd help with rather than implement myself
[18:29:00] no progress, so far as I know
[18:29:19] manybubbles: I was just about to reply to your email, which I didn't understand much
[18:29:33] https://www.mediawiki.org/wiki/Mentorship_programs/Possible_projects#One_stop_translation_search
[18:30:09] This project is also about superseding solr, but Nikerabbit would take that part out unless there is you or Chad co-mentoring
[18:30:24] (03PS1) 10Brion VIBBER: Revert low-res .ogv transcode enable; player is picking too-small version by default [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/114761
[18:31:11] (03PS2) 10Brion VIBBER: Revert low-res .ogv transcode enable; player is picking too-small version by default [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/114761
[18:31:30] at least we know they recode fast :D
[18:32:36] Nemo_bis: my email? I'm not sure which one.
[18:32:43] brion: this is temporary though, right?
[18:32:50] in the sense that you'll reenable them?
[18:32:50] also, I added myself as a co-mentor for the elasticsearch stuff
[18:32:55] paravoid: that's the plan yeah
[18:32:58] it shouldn't remove the existing ones
[18:33:04] what I'm asking is whether we need to clean up the store or not
[18:33:16] no leave them there and they'll get picked up again later
[18:33:19] once we resolve the player issue
[18:33:21] okay
[18:34:03] manybubbles: great! I meant your email on Date: Tue, 11 Feb 2014 08:24:22 -0500
[18:34:09] hmm
[18:34:31] manybubbles: so, I'm thinking of mailing Nikerabbit & you to move this forward (hopefully with subsequent comm in a public medium), unless you have a better idea?
[18:34:39] basically the reason I'm pinging you now about this is
[18:34:47] Solr
[18:34:48] CRITICAL
[18:34:53] Average request time is 1166.436 (gt 1000)
[18:35:10] if avg req time is 1.1s, might just as well replace the damn thing :)
[18:35:13] Nemo_bis: oh yeah, that one. I can describe in a bit
[18:35:21] paravoid: huh
[18:35:37] well, I mean, we should replace it but we don't have code for that
[18:35:40] 40 days and counting, noone really cares about this service apparently
[18:35:53] !log Forced update of /svr/scap to 6203585 across cluster
[18:36:01] Logged the message, Master
[18:36:09] bd808: backdoor and everything?
[18:36:10] :P
[18:36:33] (kidding)
[18:36:37] paravoid: I'm saving that for the weekend
[18:36:50] greg-g: ima gonna no-op scap now unless you changed your mind
[18:36:55] paravoid: what do you mean nobody cares? nobody with the power to set an alert has set up one that goes to someone who cares
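The preferred-replica election in the !log above is triggered with a stock Kafka 0.8 admin script; a sketch, with the ZooKeeper connect string as a placeholder:

    # ask the controller to hand partition leadership back to the preferred replicas
    kafka-preferred-replica-election.sh --zookeeper zk1.example.org:2181/kafka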
[18:37:03] is my guess
[18:37:16] paravoid: so is this average request time for ttmserver solr?
[18:37:23] yes
[18:37:28] just the ttm server?
[18:37:35] what do you mean?
[18:37:38] it's a solr instance
[18:37:58] isn't there solr also for geocoordinates or something
[18:38:02] different solr
[18:38:25] which doesn't have the alert
[18:38:28] paravoid: my personal monitoring system tells me the search has somehow worked in the last few months https://toolserver.org/~nemobis/tmp/SearchTranslations.log
[18:38:55] last timeout in 2013-05-08 17.10
[18:38:59] ttmserver isn't that sensitive to slow queries, as long as it is not affecting anything else running on the server
[18:39:38] so the problem is that ttmserver is building extremely inefficient queries
[18:39:46] paravoid: off topic, but I'm reenabling puppet on elastic1007. It was disabled when it crashed and I ran it manually when it came back but I never reenabled it.
[18:39:50] but yes it is a known issue that it has issues [huh] and if it is causing troubles [is it?] it should be given more priority to address it
[18:39:54] manybubbles: ack
[18:40:19] if it's not causing troubles, the alert is wrong
[18:40:47] paravoid: I was going to say, if we're sure that the thing is working "ok" then can we bump up the alert time?
[18:40:52] like 10 seconds or something silly?
[18:40:56] but moreover, there's an underlying issue that the service isn't well-operated or properly supported/designed (just runs on one server...)
[18:40:57] !log bd808 Started scap: no-diff scap to test script changes; expect l10n updates
[18:41:04] Logged the message, Master
[18:41:13] if there's a critical alert for 40 days and noone blinks an eye
[18:41:41] so it could be that it's entirely our fault, but I doubt we'll ever fix this by caring more
[18:41:48] it is unclear who is supposed to be monitoring that
[18:41:55] I think we need to fix it to pass it on to our beloved search team ;)
[18:42:04] *by passing it on
[18:42:08] one of the reasons I wish to move to elasticsearch... less unlikely to go abandoned
[18:42:13] yeah exactly
[18:42:15] *likely
[18:42:18] (03PS1) 10MaxSem: Okay, so TTM wants to do slow queries?:) [operations/puppet] - 10https://gerrit.wikimedia.org/r/114765
[18:42:31] I'm happy to help with the move.
[18:42:38] I'm not sure beyond that what I should do though
[18:43:27] it looks like to me that we're a bit deadlocked
[18:43:31] one waiting for the other
[18:43:45] Nikerabbit: what do you think should happen to move this forward?
[18:44:06] paravoid: well, I haven't really been waiting, it has just been on the table given other priorities
[18:44:09] (03CR) 10Nemo bis: "Do we know at what point it gets so slow as to cause timeouts for Special:SearchTranslations? Or are we just throwipng a random number bec" [operations/puppet] - 10https://gerrit.wikimedia.org/r/114765 (owner: 10MaxSem)
[18:45:09] (03PS2) 10Faidon Liambotis: Bump TTM's avg req time alert to 5 seconds(!) [operations/puppet] - 10https://gerrit.wikimedia.org/r/114765 (owner: 10MaxSem)
[18:45:11] I should likely ask some time to work on this issue and pull some people including manybubbles to replace solarium with elastica and think of ways to make it faster in general
[18:45:37] (03CR) 10Faidon Liambotis: [C: 032] Bump TTM's avg req time alert to 5 seconds(!) [operations/puppet] - 10https://gerrit.wikimedia.org/r/114765 (owner: 10MaxSem)
[18:45:46] ok
[18:45:51] how can I help you with that?
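The "Average request time" figure the alert quotes is Solr's own avgTimePerRequest statistic, which can be pulled by hand over HTTP; a sketch (host from the later "Solr on zinc" RECOVERY line; the port and single-core admin path are assumptions):

    # query-handler stats, including avgTimePerRequest, as JSON
    curl -s 'http://zinc:8983/solr/admin/mbeans?stats=true&cat=QUERYHANDLER&wt=json'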
[18:47:05] e.g. I can drop an email saying that we (ops) really suck in supporting this setup and we'd rather see it replaced so it can be better supported with the help of the search team
[18:47:14] (03PS1) 10BBlack: regularly compact memory on varnish caches [operations/puppet] - 10https://gerrit.wikimedia.org/r/114766
[18:47:32] would that help you with your internal prioritization?
[18:47:41] Nikerabbit: I've blocked some time out on Monday for me to review the translation extension and put together a proposed plan of attack
[18:48:02] I figure we can go from there
[18:48:03] Ryan_Lane2: ping
[18:48:16] paravoid: yes, I can quote you for that but of course it is good if it comes directly from you
[18:48:28] Ryan_Lane2: your redis changeset is not merged, can I merge?
[18:48:29] (03PS2) 10BBlack: regularly compact memory on varnish caches [operations/puppet] - 10https://gerrit.wikimedia.org/r/114766
[18:48:59] manybubbles: Monday is not my WMF-day, but I hope to be available if you have quick questions
[18:49:25] Nikerabbit: I'll send something off on Monday. which days are wmf?
[18:49:43] !log bd808 scap-1 failed on 4 hosts
[18:49:48] Nikerabbit: okay, where should I mail that?
[18:49:51] Logged the message, Master
[18:50:28] paravoid: localisation-team@lists.wikimedia.org I think (it's moderated but reaches the whole team)
[18:50:33] !log The 4 hosts that failed scap-1 were snapshot[1234]; all have old/bad python installs
[18:50:35] awesome
[18:50:38] manybubbles: I'll Cc you
[18:50:40] Logged the message, Master
[18:50:49] thanks to both
[18:50:54] of you ;)
[18:53:59] manybubbles: it's my evening on your regular worktimes (like now) so I'm likely to be around
[18:54:28] !log bd808 scap-rebuild-cdbs failed on 4 hosts
[18:54:36] !log bd808 Finished scap: no-diff scap to test script changes; expect l10n updates (duration: 13m 38s)
[18:54:36] Logged the message, Master
[18:54:44] Logged the message, Master
[18:55:33] !log The 4 hosts that failed scap-rebuild-cdbs were snapshot[1234]; can we pull them from mediawiki-installation dsh group?
[18:55:41] Logged the message, Master
[18:56:33] !log catrope synchronized php-1.23wmf14/extensions/VisualEditor/modules/ve-mw/init/ve.init.mw.Target.js 'touch'
[18:56:37] PROBLEM - Apache HTTP on mw1047 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - 50376 bytes in 0.023 second response time
[18:56:40] Logged the message, Master
[18:56:42] bd808: My understanding is they can't be. Talk to apergos for details
[18:56:46] PROBLEM - Apache HTTP on mw1079 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - 50376 bytes in 0.006 second response time
[18:56:51] !log catrope synchronized php-1.23wmf14/extensions/VisualEditor/modules/ve-mw/init/targets/ve.init.mw.ViewPageTarget.js 'touch'
[18:57:03] Logged the message, Master
[18:57:07] bd808: Although if snapshot 10NN did work, then maybe we can; but ask Ariel
[18:57:09] not yet, I'll get em gone soon though
[18:57:10] !log catrope synchronized php-1.23wmf15/extensions/VisualEditor/modules/ve-mw/init/ve.init.mw.Target.js 'touch'
[18:57:17] Logged the message, Master
[18:57:26] the sn100xs work, there's a few misc jobs I still need to get off
[18:57:32] !log catrope synchronized php-1.23wmf15/extensions/VisualEditor/modules/ve-mw/init/targets/ve.init.mw.ViewPageTarget.js 'touch'
[18:57:40] Logged the message, Master
[18:57:50] apergos: snapshot1: rsync: change_dir#3 "/usr/local/apache/common-local/php-1.23wmf15/extensions/VisualEditor/modules/ve-mw/init" failed: No such file or directory (2)
[18:57:51] apergos: The new scap scripts are failing there so snapshot[1234] aren't getting MW updates
[18:58:01] Well, maybe that's because they don't have MW directories
[18:58:26] they do, but maybe not there
[18:58:30] RoanKattouw: The rsyncs have been failing for them since we changed scap to be python guts
[18:58:34] Right
[18:58:45] apergos: Is it OK if we stop deploying MW updates to the pmtpa snapshot hosts?
[18:58:54] well this will just be the motivation I need to get em gone
[18:58:57] Or should we fix them to receive those updates?
[18:59:00] yeah go ahead and nuke em
[18:59:04] Alright
[18:59:14] \o/
[18:59:27] bd808: Eradicate snapshot{1..4} from mediawiki-installation then. You may need to touch puppet for that, I forget how this is managed these days
[18:59:44] But leave the eqiad snapshot hosts (snapshot1001 and beyond) alone
[19:00:04] files/dsh/group/.. in operations/puppet
[19:00:09] RoanKattouw: Will do. I'm pretty sure it's a puppet change, but I will submit
[19:00:12] Sweet
[19:00:18] Thanks ori
[19:00:23] Also, cool nick :)
[19:00:24] (03CR) 10Faidon Liambotis: [C: 032] regularly compact memory on varnish caches [operations/puppet] - 10https://gerrit.wikimedia.org/r/114766 (owner: 10BBlack)
[19:00:52] In other news: scap has a progress bar now. Will send an email to ops-l today telling people what to expect.
[19:01:03] bd808: my hero
[19:01:42] haha
[19:01:50] greg-g: ori wrote the hard parts, i just put some lipstick on it
[19:02:11] mark: what you missed from the core team channel yesterday:
[19:02:11] 18:36 < greg-g> I loooove progress bars, it's what got me into computers
[19:02:15] 18:36 greg-g wishes he was kidding
[19:02:23] haha
[19:02:28] http://progressquest.com/
[19:02:47] It looks like: scap-1: 0% (ok: 1; fail: 0; left: 426)
[19:02:48] i do recall programming my first progress bars early in, in my basic programs
[19:02:57] and getting off by one mistakes and all
[19:03:03] … scap-1: 91% (ok: 386; fail: 4; left: 37)
[19:03:26] ori: :) :)
[19:03:36] bd808: yeah, that's awesome
[19:03:51] does it also have a message "last eqiad/ulsfo host done!"
[19:04:11] RECOVERY - Solr on zinc is OK: All OK
[19:04:14] or non-pmtpa for short
[19:04:38] Nope. Little known fact we update hosts in random order during the scap
[19:05:09] In theory to avoid thundering herd on the rsync slaves
[19:05:11] Ah. At some point that had been changed to leave pmtpa last, I guess it didn't work out
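The eradication RoanKattouw asks for is a four-line deletion in the dsh group file ori points to. A sketch of previewing it in an operations/puppet checkout (the exact hostname spellings in the file are an assumption):

    grep -vE '^snapshot[1-4]$' files/dsh/group/mediawiki-installation > /tmp/mw-inst.new
    diff files/dsh/group/mediawiki-installation /tmp/mw-inst.new   # should show only snapshot1-4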
[19:05:36] The file is sorted that way but then shuffled before running the batch
[19:05:44] (03PS3) 10BBlack: regularly compact memory on varnish caches [operations/puppet] - 10https://gerrit.wikimedia.org/r/114766
[19:07:08] (03CR) 10BBlack: [C: 032 V: 032] regularly compact memory on varnish caches [operations/puppet] - 10https://gerrit.wikimedia.org/r/114766 (owner: 10BBlack)
[19:10:41] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000
[19:10:45] (03PS1) 10BryanDavis: Remove snapshot[1234] from mediawiki-installation dsh group [operations/puppet] - 10https://gerrit.wikimedia.org/r/114768
[19:11:38] RoanKattouw, apergos: ^^
[19:12:30] yup
[19:12:33] do it
[19:12:47] * bd808 is not a root
[19:13:40] (03CR) 10Ori.livneh: [C: 032] Remove snapshot[1234] from mediawiki-installation dsh group [operations/puppet] - 10https://gerrit.wikimedia.org/r/114768 (owner: 10BryanDavis)
[19:14:13] ah. woops
[19:15:03] sorry, most of my mind was elsewhere
[19:15:56] apergos: No worries. Looks like Ori is on it
[19:16:00] try this trick and spin it, yeah
[19:16:07] yeah, thanks
[19:16:19] * bd808 thought ori was on vacation today
[19:16:53] heh, i am sick
[19:16:59] but yes
[19:17:05] ori? on vacation?
[19:17:06] impossible.
[19:17:20] Apparently
[19:17:20] cold turkey didn't work, going to taper
[19:19:30] mark, paravoid: got time for https://gerrit.wikimedia.org/r/#/c/113900/ ?
[19:22:02] it looks okay to me on a first glance (apart from a probably useless "inline" keyword), but maybe bblack could do a proper review and deploy next week?
[19:23:34] speaking of deployment systems, there is definite room in the varnish ecosystem for a VCL deployment tool. the ability to load VCL by string using varnishadm is neat. the tool i envision depools a varnish, pushes vcl with the commit short sha1 as the config name, runs some asserts, repools
[19:26:46] rollback as easy as varnishadm vcl.load I10170d77c
[19:27:05] something like that, anyways
[19:27:45] we still haven't fixed the issue of automatically depooling/pooling servers; this is a manual process
[19:27:50] and for varnish, into multiple places
[19:28:01] I have a few ideas, I was hoping to work with one of the three hires we're expecting on that
[19:28:21] would be cool
[19:28:52] basically: zookeeper or etcd (hopefully etcd), maybe integrate it with pybal and other stuff
[19:29:14] i should learn zookeeper one day
[19:29:38] etcd is prettier, but I'm not sure if it's stable enough yet
[19:30:02] heh. "build: failing"
[19:30:11] yeah, for example :)
[19:30:37] so pybal right now fetches files over HTTP from noc.wm.org to pool/depool
[19:30:40] that we edit using vi
[19:30:45] and it just interprets them as python
[19:30:57] it would be a simple change to switch to etcd or something like it
[19:43:11] PROBLEM - Puppet freshness on virt1000 is CRITICAL: Last successful Puppet run was Fri 21 Feb 2014 04:42:42 PM UTC
[19:55:03] (03CR) 10Dzahn: [C: 031] "per matanya's links above. it says it can "probably be removed"" [operations/puppet] - 10https://gerrit.wikimedia.org/r/114136 (owner: 10Matanya)
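The varnishadm mechanics ori sketches are real CLI verbs; run on the cache host, the load/activate/rollback cycle would look roughly like this (the VCL path and the sha1-derived config names are illustrative):

    varnishadm vcl.load deploy-I10170d77c /etc/varnish/wikimedia.vcl  # compile under a new name
    varnishadm vcl.use deploy-I10170d77c                              # switch traffic to it
    varnishadm vcl.list                                               # rollback: vcl.use the previous name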
it says it can "probably be removed"" [operations/puppet] - 10https://gerrit.wikimedia.org/r/114136 (owner: 10Matanya) [19:56:08] (03CR) 10Dzahn: [C: 032] "nice URL" [operations/puppet] - 10https://gerrit.wikimedia.org/r/114498 (owner: 10Odder) [19:57:24] (03PS2) 10Dzahn: swift: remove lookupvar and replace with fact @ var [operations/puppet] - 10https://gerrit.wikimedia.org/r/112885 (owner: 10Matanya) [20:01:22] (03CR) 10Legoktm: [C: 04-1] "If this is supposed to mimic the production config, it should be only enabled on CA wikis that aren't loginwiki." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/114719 (owner: 10Yuvipanda) [20:05:30] logmsgbot: aren't all wikis CA wikis? [20:10:53] yuvipanda: no... all private + fishbowl wikis aren't [20:11:00] legoktm: aah [20:11:09] legoktm: hmm, so that's a bit more complicated, I presume. [20:11:21] legoktm: Need to find out if those are on betalabs, and then how to exclude them [20:12:26] yuvipanda: you can do a 'wmgUseGlobalCssJs' => array( 'default' => true, 'private' => false, 'fishbowl' => false, 'loginwiki' => false, ), [20:12:30] that should work [20:12:39] legoktm: update patch? :) [20:14:32] sure [20:24:15] Mr. Liambotis, you around? [20:24:27] I am [20:24:44] can you spare 7 minutes on hangout? or at least on irc? [20:25:02] I can, what is this about? [20:25:19] NOT(hosting) [20:25:28] hm? [20:26:21] it's about the approach for showing contributory features on mdot/zerodot. wanted to run an idea ori and i talked about yesterday by you to see if it's passable. [20:26:38] okay, sure [20:26:45] k, i'll call you [20:26:54] do we need ori too? [20:27:23] probably not, although i try add him. ori, heads up. [20:28:30] stupid hangouts [20:28:34] greg-g: Can I get flight deck clearance for a no-op scap so I can record a screencast as Ori suggested? [20:30:43] bd808: ENTHUSIASTIC YES [20:30:52] sorry for yelling [20:30:58] neat [20:31:46] !log bd808 Started scap: no-diff scap; recording asciicast [20:31:55] Logged the message, Master [20:32:41] yuvipanda: does betalabs have a bits.wm.o equivalent? or does it just use the per-wiki load.php's? [20:32:56] /bits.beta.wmflabs.org/meta.wikimedia.beta.wmflabs.org/load.php [20:32:58] great. [20:33:05] legoktm: it does have bits [20:33:07] right [20:33:55] (03PS2) 10Legoktm: Add GlobalCssJs extension to betalabs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/114719 (owner: 10Yuvipanda) [20:34:52] yuvipanda: fixed ^ [20:34:58] legoktm: woo! :) [20:34:59] !log bd808 Finished scap: no-diff scap; recording asciicast (duration: 03m 13s) [20:35:03] legoktm: now to get hashar to merge it :) [20:35:07] Logged the message, Master [20:35:17] legoktm: I'm going to get some sleep now. have to wake up early tomorrow :( can you get that merged? [20:35:21] yuvipanda: legoktm I am not op :-D [20:35:31] hashar: it's to betalabs :D [20:35:51] yuvipanda: gnite. who normally "approves" extensions for beta labs? [20:35:55] legoktm: hashar [20:36:12] yuvipanda: well anyone should be able to review / +2 the change. 
[20:36:35] yuvipanda: legoktm: also want to make sure GlobalCssJs is planned for production
[20:36:45] hashar: yeah, it changes only labs-specific files, so should be ok
[20:36:52] hashar: it is https://bugzilla.wikimedia.org/show_bug.cgi?id=57891
[20:38:16] legoktm: great :-]
[20:38:23] legoktm: might want to poke Dan Garry about it
[20:38:42] hashar: Dan responded on that bug saying it's ok :)
[20:38:45] just today
[20:38:55] anyway busy with other things
[20:39:05] ok
[20:39:06] so get dan to +1 change then any deployer can review and +2
[20:42:50] (03CR) 10Greg Grossmeier: [C: 031] "I'm not Dan, but I can play one if I need to. :)" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/114719 (owner: 10Yuvipanda)
[20:43:58] lol
[20:44:02] greg-g: :D
[20:44:48] :D
[20:56:48] (03CR) 10Deskana: [C: 031] "Let the tests begin." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/114719 (owner: 10Yuvipanda)
[21:02:06] (03CR) 10MarkTraceur: [C: 032] "Good luck, all :)" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/114719 (owner: 10Yuvipanda)
[21:02:14] (03Merged) 10jenkins-bot: Add GlobalCssJs extension to betalabs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/114719 (owner: 10Yuvipanda)
[21:02:14] legoktm, yuvipanda ^^
[21:02:22] :D
[21:02:37] * rdwrer watches betalabs
[21:02:47] rdwrer: do you also need to sync it so there are no missing commits in the prod repo?
[21:02:57] I don't think so.
[21:03:02] okay
[21:03:17] So, naive product manager question. When does this actually become live on Beta Labs for us to test?
[21:03:19] rdwrer: \o/
[21:03:28] Deskana: in about 20-30s
[21:03:28] Deskana: in a few minutes
[21:03:29] Soon!
[21:03:38] Deskana: maybe minutes
[21:03:39] * rdwrer puts on Barcelona accent
[21:03:41] Eventually!
[21:04:00] Deskana: just keep refreshing http://meta.wikimedia.beta.wmflabs.org/wiki/Special:Version :P
[21:04:40] i.love.long.urls.beta.wmflabs.org
[21:05:44] Deskana: https://bits.beta.wmflabs.org/meta.wikimedia.beta.wmflabs.org/load.php is a real url ;)
[21:05:57] Deskana: its live
[21:06:06]
[21:06:08] Oops
[21:06:13] hmmm
[21:06:33] I blame caching
[21:06:34] Profiling error: in(Wikibase\PropertyParserFunction::getRenderer), out(Wikibase\PropertyParserFunction::doRender)
[21:06:57] Reedy: not sure exactly where the bug is, probably Wikibae
[21:07:01] *Wikibase
[21:07:18] * hoo heart Wikibase
[21:07:20] * hoo runs
[21:08:47] yurik: you around?
[21:09:10] it works!
[21:09:31] wooo!
[21:12:38] legoktm: Where in Special:Preferences is the preference?
[21:12:45] there is no preference?
[21:12:53] it got axed
[21:13:04] I'll update the extension instructions then. :)
[21:13:09] ah
[21:13:12] I need to re-write that page
[21:15:09] (03PS2) 10Jeremyb: close ukwikimedia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113878
[21:17:04] legoktm: Looks pretty nifty so far. :)
[21:17:13] :D
[21:19:02] legoktm: SSL cert for bits.beta.wmflabs.org/meta.wikimedia.beta.wmflabs.org ?:)
[21:19:47] mh, I'm the only one who wonders why there is no en-message for tog-enableglobalcssj? https://git.wikimedia.org/blob/mediawiki%2Fextensions%2FGlobalCssJs/0a52548ce76cbdf12e94fe5de8a6ac65d5e90159/GlobalCssJs.i18n.php
[21:20:03] se4598: that message was removed, let me find the patch
[21:20:05] i know it's not the entire thing:), but still was just wondering how that discussion turned out
[21:20:08] The certificate is only valid for *.wmflabs.org
[21:20:19] mutante: oh, I think I whitelisted it in my browser :/
[21:20:31] se4598: https://gerrit.wikimedia.org/r/#/c/101626/
[21:20:35] i suppose *.beta.wmflabs.org is needed
[21:20:44] because can't have *.*.
[21:20:45] thanks
[21:21:19] certs etc.: https://bugzilla.wikimedia.org/show_bug.cgi?id=48501
[21:21:28] mutante^
[21:23:00] se4598: thanks, yea, i knew there was a lengthy ticket, i wondered about the end of it.. and yea. i see so we want but don't have..
[21:24:14] Coren: are you pokable for that one maybe? there is the special project handling the certs i know, i just haven't been involved so far
[21:24:18] ^
[21:24:34] Can someone do a graceful restart of apache on mw1047 and mw1079? Those 2 hosts are throwing a PHP error that looks to be an APC problem. Line numbers in exception don't match files on disk.
[21:24:42] * Coren reads scrollback.
[21:24:51] Coren: it's about getting *.beta.wmflabs.org SSL cert
[21:25:00] as opposed to just *.wmflabs.org
[21:25:07] "I think we do want it, on a limited set of subdomains to keep the cost down." <- do it half right, so it's still broken for everyone else on a non-en-wiki? No money for a wildcard?
[21:25:10] and how we handle it in the special restricted project
[21:25:22] where the keys live etc...
[21:25:46] Doable, but I haven't been in the buy-cert-chain before. I'll need to involve Rob anyways.
[21:25:52] !log mw1047 and mw1079 throwing PHP exception that looks like APC corruption
[21:26:00] Logged the message, Master
[21:26:12] Is there an RT ticket or bz associated with this for context?
[21:26:39] Coren: bug says
[21:26:39] Description Antoine "hashar" Musso 2013-05-15 10:00:13 CEST
[21:26:39] We now have nginx SSL proxies in front of the beta caches (Bug 36648). We still have to fix the certificate (that is *.wmflabs.org for now).
[21:26:39] We need certificates generated by 'Labs CA' for the entries listed in role::protoproxy::ssl::beta and some more. I guess the easiest would be to create *.beta.wmflabs.org cert that will also contains the following DNS entries:
[21:26:39] *.wikimedia.beta.wmflabs.org
[21:26:49] Coren: se4598: i would even argue we can use cacert.org for that because it's - no money but also not self signed, and people who test can be expected to import a CA root cert once, but paravoid said cacert is kind of dead.. well ..i still use it
[21:27:32] sorry for the flood, something strange happened
[21:27:32] if they are generated by Labs CA, then that's a different thing
[21:27:44] and is not related to buying them
[21:28:14] se4598: i'm not sure if you can have a *. that contains another *, but maybe
[21:28:21] you can't have a *.* though
[21:28:23] there was a bug for beta ssl certs, can't find it right now though
[21:28:26] Ah, okay, point me at it then. I'll handle it over the weekend unless it's an even bigger rush than this?
[21:28:32] apergos: You should be asleep, but topic has your name. Apache graceful needed for mw1047 and mw1079 to clear APC corruption
[21:28:51] bd808: doing it
[21:28:59] mutante: Thanks
[21:29:15] mutante: apache-graceful is only on fenari? Does that still work?
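Which certificate an endpoint really serves (the *.wmflabs.org-versus-*.beta.wmflabs.org question above) is quick to check from the client side; a sketch:

    # -servername sends SNI, so any name-based certificate selection is exercised too
    echo | openssl s_client -connect bits.beta.wmflabs.org:443 \
        -servername bits.beta.wmflabs.org 2>/dev/null | openssl x509 -noout -subject -issuer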
[21:29:41] RECOVERY - Apache HTTP on mw1047 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.072 second response time
[21:29:42] !log graceful'ing apache on mw1047 and mw1079 by request
[21:29:50] Logged the message, Master
[21:29:51] RECOVERY - Apache HTTP on mw1079 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.088 second response time
[21:30:05] AaronSchulz: in this case i just did them manually on the boxes.. but last time i looked. yes
[21:30:32] Coren: bugs say there's something at ticket https://rt.wikimedia.org/Ticket/Display.html?id=6116
[21:30:33] AaronSchulz: yes to both, still on fenari and needs to move, and yes, still worked.. but it's been a while
[21:32:54] se4598: What BZ is this? I just read the RT
[21:34:29] Coren: https://bugzilla.wikimedia.org/show_bug.cgi?id=48501
[21:36:31] topic has my name and it's not even my week
[21:36:55] but if I had been here when pinged I woulda done it (thanks mutante for being willing and in a good tz)
[21:38:04] se4598: mutante: The next step forward is not clear; self signed certs or non-default CAs are not an option because of the automation and the inability to distribute certs.
[21:38:34] The BZ mentions the possibility of using only a few (selected) hostnames rather than wildcards.
[21:40:15] !log mw1047 and mw1079 errors cleared after apache-graceful
[21:40:23] Logged the message, Master
[21:40:36] * bd808 continues to <3 logstash
[21:49:13] Coren: gotcha, yea, you pretty much summed it up i think. so the one that made me bring it up was https://bits.beta.wmflabs.org/ having bits. seems reasonable
[22:00:57] it seems that haproxy is installed on db10{23,33,56}, although not running. anyone know why?
[22:08:17] jgage: probably sean experimenting -- he mentioned haproxy the other day on the list
[22:08:20] Coren: ping?
[22:08:37] jgage: (not in the context of installing it, just that it'd be a good idea to use it maybe)
[22:12:17] cool, thanks paravoid.
[22:16:13] How fast do the logs used by the logstash web interface (https://logstash.wikimedia.org/#/dashboard/elasticsearch/default) rotate?
[22:16:23] Does anyone know if it's size or time-based?
[22:16:30] 30 days
[22:17:22] We write to a new elasticsearch index every day. There's a cron that runs each morning to drop indices that are 31 days old
[22:17:34] Okay, thanks bd808.
[22:17:39] yw
[22:17:49] It turns out the count I thought was since rotation was every 30 seconds, which is... bad.
[22:17:52] * bd808 needs to update the wikitech page on this stuff
[22:18:38] hmm, logstash->elasticsearch. sounds like graylog2?
[22:18:53] or actually, is there a wiki page for the new logstash stuff?
[22:19:09] * gwicke listens up on hearing graylog2
[22:19:19] ebernhar1son: yes. https://wikitech.wikimedia.org/wiki/Logstash
[22:19:20] I don't really get it, but I should probably go read the docs.
[22:19:38] The "count per 30s" and "count per 1d" are very close.
[22:19:56] bd808: excellent, thanks
[22:19:57] I'd be glad to do hangout or something to talk through the UI
[22:20:57] bd808, is our logstash instance prepared to receive graylog2 logs?
[22:21:16] It is not, but that could be easily added
[22:21:21] bd808, thanks. phuedx recommended I watch the video at http://logstash.net/ , so I'll finish what I'm working on, check that out, then ask if it still doesn't make sense.
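Given the index-per-day scheme bd808 describes, the 31-day cleanup cron only needs a date-stamped DELETE against elasticsearch; a minimal sketch (the host and the conventional logstash-YYYY.MM.DD index prefix are assumptions):

    # drop the index that just turned 31 days old
    curl -s -XDELETE "http://localhost:9200/logstash-$(date -d '31 days ago' +%Y.%m.%d)"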
[22:22:08] superm401: If the count you are talking about is the (nnnnn hits) that is across the total time window you are searching
[22:22:31] the "count per 30s" or whatever is the width of each bar in the histogram
[22:22:42] I see, so the interval only affects the graph.
[22:22:42] bd808: ok, we'll likely have a gelf backend for parsoid logging soon
[22:22:47] if you hove a bar it should show you the count in that interval
[22:23:00] *hover
[22:23:14] gwicke: I'd be glad to help figure out how to get that into logstash
[22:23:21] bd808, how do I check the total time window I'm searching?
[22:23:40] Never mind, I found it (green filtering bar).
[22:23:50] superm401: it's in the top middle of the page "an hour to a few seconds ago" or some such
[22:24:16] Thanks, that looks easier to change.
[22:24:22] bd808: cool, will ping you when we are ready
[22:25:02] gwicke: excellent. The labs project would probably be the place to start. Route your beta traffic into it until we get all the kinks worked out
[22:26:21] bd808, we can also log both to a gelf backend and a local file for a bit
[22:26:56] so it doesn't need to be stable right from the start
[22:27:05] That sounds like a smart idea :)
[22:27:37] It's just easier to piddle with the labs instance since it's not controlled by operations/puppet.git
[22:27:52] yeah, very true
[22:28:07] We can even spin up one just for you to dork around with
[22:28:29] we could point our rt clients to it, those are currently in labs
[22:28:51] but in any case the logging infrastructure patch needs to land before that
[22:28:56] will ping you
[22:29:02] * bd808 nods
[22:37:00] paravoid: Ping?
[22:37:08] hey
[22:37:15] What be up?
[22:37:22] (Or down, as the case may be)
[22:37:35] can you have a look at the ecehttps://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=virt0&service=Certificate+expiration
[22:37:43] at that :)
[22:37:45] or this https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=virt1000&service=Certificate+expiration
[22:37:58] this check has its one month anniversary
[22:40:15] Deskana: I updated https://www.mediawiki.org/wiki/Extension:GlobalCssJs
[22:40:42] legoktm: Great. :)
[22:41:59] (03PS5) 10Matanya: removed orion shell account [operations/puppet] - 10https://gerrit.wikimedia.org/r/113637
[22:42:06] Coren: ack?
[22:42:13] paravoid: virt0 is hearing the wail of the banshee anyways (and this isn't the canonical name); virt1000 will get the wikitech cert once we switch in ~2weeks.
[22:42:37] paravoid: So I don't think renewing those is worthwhile.
[22:42:44] what do you mean renewing?
[22:42:47] the check is broken
[22:42:51] it spews a traceback
[22:42:53] for starters :)
[22:43:00] Oh!
[22:43:06] D'oh! Ignore me. :-)
[22:43:14] I'll look into it.
[22:43:16] then, virt0 has a proper certificate too
[22:43:29] I mean https://wikitech works for me
[22:43:32] Does it? I thought only the wikitech name had one.
[22:43:33] so maybe we should fix the check?
[22:43:40] Coren: paravoid it's using a different check command
[22:43:47] check_command => "check_cert!${fqdn}!636!${ca_name
[22:43:50] well, yes, then we should fix the check to check for wikitech then maybe?
[22:43:56] i think it can just use the normal check_http
[22:44:01] I mean, we /do/ care if wikitech's certificate expires
[22:44:02] like it was used elsewhere
[22:44:11] PROBLEM - Puppet freshness on virt1000 is CRITICAL: Last successful Puppet run was Fri 21 Feb 2014 04:42:42 PM UTC
[22:44:44] Coren: more like $USER1$/check_http -H $ARG1$ -I $HOSTADDRESS$ --ssl --certificate=90
[22:44:48] paravoid: I didn't understand you the first time; I thought you wanted me to address the certs, not the check /itself/ :-)
[22:44:52] templates/icinga/checkcommands.cfg.erb
[22:46:14] (03CR) 10Dzahn: [C: 032] removed orion shell account [operations/puppet] - 10https://gerrit.wikimedia.org/r/113637 (owner: 10Matanya)
[22:47:12] (03PS4) 10Matanya: remove smerritt shell account [operations/puppet] - 10https://gerrit.wikimedia.org/r/113638
[22:48:34] mutante: Yeah, check_http would work just as well. We already have command_name check_ssl_cert for that which we should use rather than check_cert
[22:48:44] * Coren fixies.
[22:49:10] Coren: +1 for not having separate ones
[22:51:53] Ah, no, wait. That's the SSL cert for the /LDAP/ server!
[22:52:38] that's why port 636
[22:52:46] (which explains the virt0 vs wikitech too
[22:52:48] )
[22:52:50] yea.. hrmm.. but check_http also has option for ports
[22:53:21] mutante: Yeah, but then it'll try to do a GET which is going to fail when talking to LDAP
[22:53:47] So I need to figure out why the check_ssl is teh failz.
[22:55:24] cd /etc/icinga
[22:55:30] bah.
[22:55:38] Coren: i see, well that would explain the separate script
[22:56:06] (03CR) 10Dzahn: [C: 032] remove smerritt shell account [operations/puppet] - 10https://gerrit.wikimedia.org/r/113638 (owner: 10Matanya)
[22:59:40] (03CR) 10Dzahn: "please fix path conflict" [operations/puppet] - 10https://gerrit.wikimedia.org/r/113636 (owner: 10Matanya)
[23:04:31] Aha! The plugin sucks at error reporting, but now I know why it fails: LDAP is still trying to use the star certificate.
[23:06:23] RobH: When you ordered replacement for wikitech, did you also do virt0 and virt1000?
[23:06:26] * Coren guesses no.
[23:09:44] We need our own CA for internal junk, so that we can simply distribute our root via puppet.
[23:09:55] (03PS6) 10Dzahn: remove shell access and key for mgrover [operations/puppet] - 10https://gerrit.wikimedia.org/r/113636 (owner: 10Matanya)
[23:10:02] mutante: What's our current process for only-used-internally certificates?
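The check_http line paravoid quotes can be run by hand with the macros filled in, to see exactly what icinga would see; a sketch using the stock plugin path (HTTPS on 443 -- as mutante and Coren note, the LDAP check on 636 is a separate problem):

    # warn if the certificate served on 443 expires within 30 days
    /usr/lib/nagios/plugins/check_http -H wikitech.wikimedia.org \
        -I wikitech.wikimedia.org --ssl --certificate=30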
[23:10:32] (03CR) 10jenkins-bot: [V: 04-1] remove shell access and key for mgrover [operations/puppet] - 10https://gerrit.wikimedia.org/r/113636 (owner: 10Matanya)
[23:12:29] Coren: for virt0 and virt1000 they have their own cert
[23:12:39] its installed via the nova whatevs iirc
[23:12:45] i even updated them to use it
[23:12:57] for ldap
[23:13:09] again, iirc, it was weeks ago
[23:13:35] https://rt.wikimedia.org/Ticket/Display.html?id=6592
[23:15:22] Coren: i dunno about the "internal" part
[23:17:36] (03PS7) 10Dzahn: remove shell access and key for mgrover [operations/puppet] - 10https://gerrit.wikimedia.org/r/113636 (owner: 10Matanya)
[23:18:14] (03CR) 10jenkins-bot: [V: 04-1] remove shell access and key for mgrover [operations/puppet] - 10https://gerrit.wikimedia.org/r/113636 (owner: 10Matanya)
[23:19:00] bah
[23:23:46] (03PS8) 10Dzahn: remove shell access and key for mgrover [operations/puppet] - 10https://gerrit.wikimedia.org/r/113636 (owner: 10Matanya)
[23:26:28] (03CR) 10Dzahn: [C: 032] remove shell access and key for mgrover [operations/puppet] - 10https://gerrit.wikimedia.org/r/113636 (owner: 10Matanya)
[23:30:16] RobH: It didn't take then, LDAP responds with the star certificate still.
[23:30:40] well, someone undid it!
[23:30:45] s:/serialNumber=3Te2KNVS3beWLBffkE0QtVQ4qxo3Ix10/C=US/O=*.wikimedia.org/OU=GT11518520/OU=See www.rapidssl.com/resources/cps (c)10/OU=Domain Control Validated - RapidSSL(R)/CN=*.wikimedia.org
[23:30:49] cuz i shredded the wildcard certs and ensured puppet wasnt putting it on
[23:30:52] so wtf...
[23:30:55] lets see.
[23:31:15] Coren: uhh
[23:31:24] there is no star.wikimedia.org cert even on virt0
[23:31:29] ...!
[23:31:37] where are you pulling that?
[23:31:42] Where in blazes is it fishing that certificate from?
[23:31:46] i have no idea
[23:31:53] virt0.wikimedia.org:636
[23:31:55] well, atleast there isnt in /etc/ssl
[23:32:03] if ldap shoves it someplace odd....
[23:32:35] yea, there is no star certs on the /etc/ssl on those
[23:32:37] * Coren hunts it down.
[23:32:52] is it in LDAP itself?
[23:33:41] i see ldap config refer to the hostname configs
[23:33:51] (03CR) 10Dzahn: "notice: /Stage[main]/Accounts::Mgrover/Ssh_authorized_key[norbertgrover@Norberts-MacBook-Pro.local]/ensure: removed" [operations/puppet] - 10https://gerrit.wikimedia.org/r/113636 (owner: 10Matanya)
[23:33:57] it says they live in /etc/ssl
[23:34:08] TLS_CACERTDIR /etc/ssl/certs
[23:35:43] Eeeeew. opendj is written in Java. The cert might be in the funky java keystore thing.
[23:36:04] ack
[23:36:18] keytool -list ?
[23:36:23] yea, i just put the new cert in, and did a full restart but i didnt run anything else
[23:36:40] well, restarted the services i know, restarted system prolly not
[23:36:46] but admin log would have it for that day
[23:36:51] keytool -list -v -keystore keystore.jks
[23:36:52] config.ldif:ds-cfg-ssl-cert-nickname: star.wikimedia.org
[23:36:55] eww.
[23:37:21] so yea, i gave both virt0 and virt1000 their own hostname based certificates
[23:37:43] RobH: Okay, all that's left is fixing opendj to actually use those.
[23:37:59] that in puppet or via command line?
[23:38:22] RobH: I'm guessing that has to be done via command line, I'm pretty sure we don't manage the keystore with puppet
[23:38:47] so where does it get the info in the first place?
[23:39:12] The keystore is an implicit java mechanism/hack. But now that I look in it I see it's empty. Hm...
[23:39:20] Oh! The config is /in ldap/
The config is /in ldap/ [23:40:06] cn=LDAPS Connection Handler,cn=Connection Handlers,cn=config [23:43:56] There's some stuff in puppet about certificates for the LDAP servers, but it's not clear how that actually updates the server. [23:47:46] RobH: Ah, yes, opendj does indeed use the keystore. [bleep] [bleep]ing java [bleep].