[00:00:15] LD time, I'd like to deploy first
[00:01:46] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Server Error - 1703 bytes in 7.627 second response time
[00:05:45] !log reedy Finished scap: no op scap is no op? (duration: 39m 23s)
[00:05:53] Logged the message, Master
[00:06:25] (03CR) 10Hashar: "Pinged ops list to find out whether someone is available on monday morning :]" [operations/puppet] - 10https://gerrit.wikimedia.org/r/111917 (owner: 10BryanDavis)
[00:07:04] !log maxsem synchronized php-1.23wmf15/extensions/MobileFrontend 'https://gerrit.wikimedia.org/r/114644'
[00:07:12] Logged the message, Master
[00:08:22] GOD DAMN IT
[00:09:24] :)
[00:09:46] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 474404 bytes in 8.214 second response time
[00:10:13] greg-g: can we get into the LD today? we have one patch for 14 and 15 which fixes an issue where sometimes less than the requested number of items are returned in RecentChanges
[00:10:15] !log maxsem synchronized php-1.23wmf14/includes/api/ApiCreateAccount.php
[00:10:22] Logged the message, Master
[00:10:24] (03PS1) 10Reedy: Tampa is silly and should go away [operations/puppet] - 10https://gerrit.wikimedia.org/r/114664
[00:10:48] didn't pmtpa go away long ago?
[00:11:06] No
[00:11:10] It's still breathing
[00:11:16] its as gone as wikitext :P
[00:11:19] ebernhardson: yeah, after MaxSem is done
[00:11:23] greg-g: excellent, thanks
[00:11:29] didn't get enough kicking in the kidneys
[00:11:40] (03CR) 10Reedy: [C: 031] Remove old Tampa srv* and mw* apaches from dsh groups [operations/puppet] - 10https://gerrit.wikimedia.org/r/108070 (owner: 10Chad)
[00:11:54] (03Abandoned) 10Reedy: Tampa is silly and should go away [operations/puppet] - 10https://gerrit.wikimedia.org/r/114664 (owner: 10Reedy)
[00:12:14] !log maxsem synchronized php-1.23wmf14/extensions/ConfirmEdit/
[00:12:21] greg-g, I'm done
[00:12:22] Logged the message, Master
[00:12:28] <^d> Reedy: I think the people in Tampa might take issue with your commit summary.
[00:12:29] MaxSem: tested?
[00:12:34] <^d> If you're just taking out all of Tampa as a city :p
[00:13:38] greg-g, https://en.wikipedia.org/wiki/Special:ApiSandbox#action=paraminfo&format=json&modules=createaccount
[00:15:16] (03CR) 10Reedy: "The whole scap process includes the time taken to sync to tampa." [operations/puppet] - 10https://gerrit.wikimedia.org/r/108070 (owner: 10Chad)
[00:15:56] ebernhardson: go forth
[00:17:40] !log Reloading zuul to deploy I37ce89455724ed15
[00:17:48] Logged the message, Master
[00:23:08] !log ebernhardson synchronized php-1.23wmf14/extensions/Flow/
[00:23:16] Logged the message, Master
[00:24:15] gotta run
[00:25:32] !log ebernhardson synchronized php-1.23wmf15/extensions/Flow/
[00:25:40] Logged the message, Master
[00:26:43] (03PS1) 10coren: Revert "Reenable redis for keystone in eqiad" [operations/puppet] - 10https://gerrit.wikimedia.org/r/114670
[00:26:49] (03PS2) 10coren: Revert "Reenable redis for keystone in eqiad" [operations/puppet] - 10https://gerrit.wikimedia.org/r/114670
[00:27:58] during that last sync-dir, a new error i've never seen from sync-dir before:
[00:28:02] snapshot3: rsync: mkdir "/usr/local/apache/common-local/php-1.23wmf15/extensions/Flow" failed: No such file or directory (2)
[00:28:10] repeated for snapshot{1,2,3,4}
[00:28:15] (03CR) 10coren: [C: 032] "Revert." [operations/puppet] - 10https://gerrit.wikimedia.org/r/114670 (owner: 10coren)
[00:28:39] and : snapshot3: rsync error: error in file IO (code 11) at main.c(595) [Receiver=3.0.7]
[00:28:49] ignore those
[00:28:56] ok, excellent
[00:36:17] (all done, btw)
[00:36:38] (03CR) 10Chad: [C: 031] "Everything Reedy said. Plus I'll expand on the "falling back trivially" bit." [operations/puppet] - 10https://gerrit.wikimedia.org/r/108070 (owner: 10Chad)
[09:51:08] i'm going to be very naughty and sync a trivial js fix
[09:51:17] don't tell anyone
[09:53:19] !log ori synchronized php-1.23wmf14/extensions/MultimediaViewer/resources/mmv/mmv.performance.js 'I41b6e975353: Backport fix for stats.bandwidth == Infinity'
[09:53:26] se4598: just saw your ping from earlier; i'll look now
[09:53:27] Logged the message, Master
[10:04:35] (03PS1) 10Ori.livneh: Set $wmfExtendedVersionNumber = $wmfVersionNumber [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/114718
[10:04:49] (03CR) 10Ori.livneh: [C: 032] Set $wmfExtendedVersionNumber = $wmfVersionNumber [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/114718 (owner: 10Ori.livneh)
[10:05:05] (03Merged) 10jenkins-bot: Set $wmfExtendedVersionNumber = $wmfVersionNumber [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/114718 (owner: 10Ori.livneh)
[10:05:35] !log ori updated /a/common to {{Gerrit|I10170d77c}}: Set $wmfExtendedVersionNumber = $wmfVersionNumber
[10:05:42] Logged the message, Master
[10:06:40] se4598: fixed; sorry about that
[10:06:53] thanks
[10:07:02] (03PS1) 10Yuvipanda: Add GlobalCssJs extension to betalabs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/114719
[10:07:14] legoktm: Gloria ^
[10:57:36] (03PS1) 10Alexandros Kosiaris: Add asw-d-eqiad to rancid [operations/puppet] - 10https://gerrit.wikimedia.org/r/114724
[11:06:39] (03CR) 10Alexandros Kosiaris: [C: 032] Add asw-d-eqiad to rancid [operations/puppet] - 10https://gerrit.wikimedia.org/r/114724 (owner: 10Alexandros Kosiaris)
[11:11:55] (03PS1) 10Alexandros Kosiaris: Add asw-d-eqiad to torrus [operations/puppet] - 10https://gerrit.wikimedia.org/r/114727
[11:13:29] (03CR) 10Alexandros Kosiaris: [C: 032] Add asw-d-eqiad to torrus [operations/puppet] - 10https://gerrit.wikimedia.org/r/114727 (owner: 10Alexandros Kosiaris)
[12:03:51] (03PS1) 10Tim Landscheidt: Fix indentation in role::labs::instance [operations/puppet] - 10https://gerrit.wikimedia.org/r/114734
[12:24:12] (03PS1) 10Alexandros Kosiaris: Populate network.pp with eqiad row D networks [operations/puppet] - 10https://gerrit.wikimedia.org/r/114735
[12:26:24] (03CR) 10Alexandros Kosiaris: "I took the liberty of allocating networks for Row D @eqiad. I have not proceded with any changes to the equipment yet. Please verify or re" [operations/puppet] - 10https://gerrit.wikimedia.org/r/114735 (owner: 10Alexandros Kosiaris)
[12:42:06] * mark checks that :)
[12:43:27] akosiaris: you just used the eqiad labs prefix
[12:55:21] (03CR) 10Mark Bergsma: [C: 04-2] "That's the eqiad labs floating IP prefix..." [operations/puppet] - 10https://gerrit.wikimedia.org/r/114735 (owner: 10Alexandros Kosiaris)
[12:57:19] :-(
[13:01:19] eqiad ip space is pretty full
[13:01:29] seems like a /27 is the best we can do until we move some stuff around
[13:02:03] do we document the IP space somewhere ?
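A note on the snapshot failures above: the "No such file or directory" mkdir error means the php-1.23wmf15 parent tree never made it to those hosts, and rsync exit code 11 (file IO error) is just its generic wrapper. A quick way to see which hosts are missing the checkout, sketched with dsh against the same group scap uses (group name per files/dsh/group in operations/puppet; path taken from the error itself):

    # print MISSING for any host without the wmf15 staging tree
    dsh -g mediawiki-installation -M -- \
        'ls -d /usr/local/apache/common-local/php-1.23wmf15 >/dev/null 2>&1 || echo MISSING'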
[13:02:16] in dns
[13:02:21] ahaha
[13:02:21] and in private outlines
[13:02:34] (03PS1) 10Tim Landscheidt: Fix paths in comments after modularization [operations/puppet] - 10https://gerrit.wikimedia.org/r/114736
[13:02:48] hmm I consulted observium before picking those. Seems like it is not enough
[13:02:50] ?
[13:02:54] (03PS1) 10Mark Bergsma: Allocate /27 for public1-d-eqiad [operations/dns] - 10https://gerrit.wikimedia.org/r/114738
[13:02:58] dns is pretty authoritative
[13:03:15] yeah but too verbose as well
[13:03:50] i have always used an outliner for planning this hierarchically
[13:03:53] but it doesn't distribute well
[13:04:16] wiki tables are not great either
[13:04:30] but installing specialized software is a bit overkill too
[13:04:37] like an IPAM ?
[13:04:42] yes
[13:05:15] there isn't a ton of ip space to manage ;)
[13:06:48] (03PS1) 10Tim Landscheidt: Fix manage-keys-nfs misnomer in usage message [operations/puppet] - 10https://gerrit.wikimedia.org/r/114739
[13:06:48] observium says 208.80.155.65/26 Subnet sandbox1-b-eqiad, DNS says 208.80.155.64/28 Sandbox1-b-eqiad subnet
[13:06:55] something is amiss here
[13:07:01] better check the routers
[13:07:11] if we're really using a /26 for sandbox, we're going to change that right now :)
[13:08:17] seems like we do
[13:08:47] (03PS1) 10Tim Landscheidt: ldap: Fix typo in usage messages [operations/puppet] - 10https://gerrit.wikimedia.org/r/114740
[13:09:06] fcs
[13:09:21] ok
[13:09:25] can you just change that on the routers
[13:09:32] then ask the freenode people to adjust their one server later?
[13:09:36] it shouldn't harm much
[13:10:14] ok, doing it now then
[13:10:29] perhaps even /29 is
[13:10:30] but yeah
[13:12:54] ;208.80.155.64/26 Sandbox subnet
[13:12:54] ;208.80.155.64/28 Sandbox1-b-eqiad subnet
[13:13:10] seems like the idea was to have a big one and split it up ?
[13:13:14] overkill though
[13:16:44] maybe the idea was sandbox subnets across DCs
[13:18:43] i think the idea was sandbox subnets for multiple rows
[13:18:53] but we don't have the ip space for that, we wouldn't do that
[13:29:03] !log just resized 208.80.155.64/26 to 208.80.155.64/28. This is Sandbox1-b-eqiad subnet. dickson.freenode.net needs to have its netmask changed. I will talk with coren, mutante
[13:29:12] Logged the message, Master
[13:31:53] they need to do ip changes anyway
[13:31:58] for using the extra service ip
[13:33:15] the extra service ip will be on the same /28 or another one ?
[13:33:38] i just mailed coren, mutante about it. I also mentioned the extra service ip as well
[13:34:54] within the same /28
[13:35:01] they indicated that they wanted to keep the irc address the current one
[13:35:05] mark: the v6s and the private on https://gerrit.wikimedia.org/r/#/c/114735/1/manifests/network.pp were ok ? Can I just amend the public IPv4 with 208.80.155.96/27?
[13:35:06] and move the "system ip" to a new ip
[13:35:33] hmm
[13:35:58] (03CR) 10Mark Bergsma: Populate network.pp with eqiad row D networks (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/114735 (owner: 10Alexandros Kosiaris)
[13:35:59] I am not sure what this will save them from. It will be the same ethernet interface that will be saturated on the next attack
[13:36:17] it's easier for us to filter that
[13:36:31] we can easily filter almost all traffic to the irc ip
[13:36:40] without blocking their own access and stuff
[13:37:27] * Coren wakes.
[13:37:51] mark: Exactly. You can literally block absolutely everything but the irc ports on the irc IP
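For the record, the filtering mark and Coren describe boils down to a default-deny on just that one address. A rough Linux-side sketch (the real filters would live on the routers; 192.0.2.10 and the port list are placeholders, not dickson's actual configuration):

    # permit only IRC on the dedicated IRC service IP, drop everything else to it
    iptables -A INPUT -d 192.0.2.10 -p tcp -m multiport --dports 6665:6667,6697 -j ACCEPT
    iptables -A INPUT -d 192.0.2.10 -m state --state ESTABLISHED,RELATED -j ACCEPT
    iptables -A INPUT -d 192.0.2.10 -j DROP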
You cal literally block absolutely everything but the irc ports on the irc IP [13:38:04] yeah true. Still we will be a choke point but it is an improvement [13:39:32] (03CR) 10Alexandros Kosiaris: "Subnet analytics1-a-eqiad" (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/114735 (owner: 10Alexandros Kosiaris) [13:41:56] sure [13:42:32] akosiaris: sigh. alright. then it's fine :) [13:44:20] (03PS2) 10Alexandros Kosiaris: Populate network.pp with eqiad row D networks [operations/puppet] - 10https://gerrit.wikimedia.org/r/114735 [13:54:32] (03CR) 10Alexandros Kosiaris: [C: 032] Populate network.pp with eqiad row D networks [operations/puppet] - 10https://gerrit.wikimedia.org/r/114735 (owner: 10Alexandros Kosiaris) [13:55:09] hmmm a patch with a +2 and a -2 ... gerrit seems to want some massaging to allow me to merge this [14:01:18] just remove my -2? [14:06:09] yeah it worked :-) [14:27:46] PROBLEM - Varnish HTCP daemon on cp1067 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:27:46] PROBLEM - Varnish HTTP text-backend on cp1067 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:27:56] PROBLEM - Varnish traffic logger on cp1067 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:28:56] RECOVERY - Varnish traffic logger on cp1067 is OK: PROCS OK: 2 processes with command name varnishncsa [14:29:37] RECOVERY - Varnish HTCP daemon on cp1067 is OK: PROCS OK: 1 process with UID = 111 (vhtcpd), args vhtcpd [14:29:37] RECOVERY - Varnish HTTP text-backend on cp1067 is OK: HTTP OK: HTTP/1.1 200 OK - 188 bytes in 0.001 second response time [14:44:12] mark -- the other day we talked about allowing labs<->labs communication… is that still something that might be possible? [14:58:16] andrewbogott: yes, on my task list [14:58:31] good enough for me :) [15:00:15] (03CR) 10Ottomata: "It will! This will be a git-submodule at that place in the ops/puppet repo." [operations/puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/110650 (owner: 10Ottomata) [15:10:06] PROBLEM - Host virt1002 is DOWN: PING CRITICAL - Packet loss = 100% [15:14:37] RECOVERY - Host virt1002 is UP: PING OK - Packet loss = 0%, RTA = 0.65 ms [15:30:16] !log dist-upgrade and reboot boron [15:30:23] Logged the message, Master [15:37:52] garg. that didn't go well... [16:21:25] http://lists.wikimedia.org/pipermail/wikimedia-l/ gives 403... 
[17:07:13] uuups
[17:18:03] (03PS1) 10Ryan Lane: Use the Token keystone redis driver rather than the TokenNoList driver [operations/puppet] - 10https://gerrit.wikimedia.org/r/114756
[17:19:18] (03PS1) 10Ryan Lane: Revert "Revert "Reenable redis for keystone in eqiad"" [operations/puppet] - 10https://gerrit.wikimedia.org/r/114757
[17:20:45] (03CR) 10Ryan Lane: [C: 032] Use the Token keystone redis driver rather than the TokenNoList driver [operations/puppet] - 10https://gerrit.wikimedia.org/r/114756 (owner: 10Ryan Lane)
[17:22:07] (03CR) 10Ryan Lane: [C: 032] Revert "Revert "Reenable redis for keystone in eqiad"" [operations/puppet] - 10https://gerrit.wikimedia.org/r/114757 (owner: 10Ryan Lane)
[17:26:16] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Fri Feb 21 17:26:13 UTC 2014
[17:28:46] RECOVERY - Varnish HTTP text-backend on cp1054 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.011 second response time
[17:31:38] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000
[17:31:46] PROBLEM - Varnish HTTP text-backend on cp1054 is CRITICAL: Connection refused
[17:31:56] bblack: ^ it died again
[17:32:29] magical anti-varnish fairies at work?
[17:32:45] Feb 21 17:28:54 cp1054 varnishd[9281]: Child (15891) Panic message: Assert error in BAN_RefBan(), cache_ban.c line 481:#012 Condition(t1 == t0) not true.#012thread = (persistence)#012ident = Linux,3.2.0-48-generic,x86_64,-spersistent,-spersistent,-spersistent,-spersistent,-smalloc,-hcritbit,epoll#012Backtrace:#012 0x4337b5: /usr/sbin/varnishd() [0x4337b5]#012 0x416446: /usr/sbin/varnishd(BAN_RefBan+0x96) [0x416446]#012 0x4545bf: /usr/
[17:32:54] actually, looks like probably a corrupt persistent store
[17:34:32] ottomata: an1021 has a Kafka Broker Messages In CRITICAL alert
[17:34:59] (in case it wasn't obvious by me pinging people, I just opened Icinga's unhandled problems page :)
[17:36:26] oo hmm thanks
[17:37:04] hm!
[17:37:09] an22 is the leader for all topics
[17:37:10] WHY!?
[17:39:21] manybubbles: hello
[17:40:19] ottomata: btw, I did finish auditing the rdkafka code, but I didn't turn up anything substantial. One one-liner bugfix to an assert(), basically. There was lots of sloppy linenoise that made static analysis confusing, but once you get through that, most of the warnings didn't amount to real, actionable runtime bugs.
[17:41:17] (that's not to say it's bug-free, but it certainly survives substantial analysis at the source level :) )
[17:41:27] bblack: if the cache is corrupted, then let's just rm it
[17:41:27] aye ok, thanks bblack
[17:41:29] much appreciated
[17:41:56] paravoid: I am, but XFS and/or varnish is being retarded - the varnish proc is stuck and XFS is very very very slowly deallocating space for the old files, etc
[17:42:02] I may just have to reboot + cleanup
[17:42:15] oh, right, we've seen this before
[17:42:33] last time I think we just mkfs'ed it again :)
[17:42:59] well right now I can't even unmount, because the varnish proc is holding open the FS and can't be killed :P
[17:43:06] nice
[17:43:28] haha
[17:43:33] nice
[17:43:40] btw, maybe upgrade to 3.0.5 while we're at it?
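The fix bblack lands later (see the 18:19 !log) is rebuilding the persistent cache from scratch. In the clean case, where the varnishd child can actually be stopped, that would look roughly like the following; the device and mount point are hypothetical, since the real storage layout isn't shown in the log:

    service varnish stop
    umount /srv/sda3
    mkfs.xfs -f /dev/sda3    # "last time I think we just mkfs'ed it again"
    mount /srv/sda3
    service varnish start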
[17:43:40] on the other hand, it's about 25% done deallocating, letting it run might be less trouble than the reboot
[17:44:06] paravoid: yeah, I started that two days ago late at night and hit some snags (5xx spikes), and aborted and downgraded the ones I had already hit
[17:44:18] ooops
[17:44:20] it's quite possible the cp1054 thing was related and the upgrades would've otherwise been fine
[17:44:45] perhaps the corruption happened during the stop->upgrade->start
[17:44:56] anyways, I'll retry again after sorting these out
[17:46:10] re: cp4009, ever seen this before?
[17:46:13] ERROR: Timeout while waiting for server to perform requested power action.
[17:46:17] lol
[17:46:21] ^ does that on racadm powercycle and racadm hardreset
[17:46:22] I just saw that with another box
[17:46:28] like literally a few minutes ago
[17:47:04] * bblack tries racadm serveraction pull-plug-and-kick-machine
[17:47:12] packages used in deployment discussion going on in -office :)
[17:51:37] RECOVERY - NTP on bast4001 is OK: NTP OK: Offset 0.00358748436 secs
[17:59:20] (03CR) 10PleaseStand: Fix CDB file generation (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/114686 (owner: 10Ori.livneh)
[18:00:40] (03CR) 10PleaseStand: Swiched from using dat to json files for wikiversions (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/114687 (owner: 10Reedy)
[18:11:46] RECOVERY - Varnish HTTP text-backend on cp1054 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.007 second response time
[18:12:30] paravoid: re: racadm power issues - the crux in my case seems to be "Description: The system board fail-safe voltage is outside of range.
[18:12:36] " in the SEL
[18:12:44] oops
[18:12:53] RT ticket under the ulsfo queue
[18:13:10] some dell manuals recommend either pulling the hard AC power from the machine for 10s, or clearing the SEL (as if the presence of the SEL entry might persistently prevent it from powering up)
[18:13:16] lol
[18:13:22] I tried racadm racreset [hard], no avail
[18:13:31] yeah I did the same with ms-be1005
[18:13:39] with the same result
[18:13:43] tried clrsel -> racreset as well, and it regenerates the voltage message after racreset hard - so the condition persists
[18:14:01] what's the command to read the SEL?
[18:14:06] racadm getsel
[18:14:12] heh
[18:14:59] so either (a) after a glitchy power fail event, RAC can suck and need a true "pull the power plug" to reset something, or (b) the power is still borked somehow on these machines (PS damage?)
[18:15:57] either way, I'm guessing site visit required
[18:16:08] huh, no I can't access ms-be1005's mgmt at all
[18:16:37] oh now I got in
[18:16:37] Severity: Critical
[18:16:37] Description: CPU 1 has a thermal trip (over-temperature) event.
[18:16:41] fun
[18:17:35] it sounds like, at least in some cases, critical events in the SEL prevent powerup, and "racadm clrsel" (+ maybe racadm racreset?) might allow it to try again....
[18:17:41] but then, you lose the SEL for history :P
[18:19:12] !log cp1054 healthy now, rebuilding persistent cache from scratch there...
[18:19:20] Logged the message, Master
[18:19:35] bblack: RT ticket under ulsfo
[18:19:46] working on it! :)
[18:19:55] and the magical DC fairy will fix it
[18:19:55] :P
[18:19:57] isn't this cp1054, so eqiad?
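Collected for reference, the DRAC incantations traded above are all stock racadm subcommands (remote-access options such as -r/-u/-p omitted here):

    racadm getsel                    # dump the System Event Log
    racadm clrsel                    # clear it -- losing the history, as bblack notes
    racadm racreset hard             # reset the DRAC itself
    racadm serveraction powercycle   # then retry the power action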
[18:20:08] different box
[18:20:12] ok
[18:20:16] two issues at once :)
[18:20:29] the other one is cp4009
[18:20:33] down since the power outage
[18:23:16] https://rt.wikimedia.org/Ticket/Display.html?id=6890
[18:24:49] paravoid: you pinged? sorry, I was out and didn't |away myself
[18:24:55] !log initiating kafka preferred replica election to rebalance partition leaders
[18:24:57] I did
[18:25:02] Logged the message, Master
[18:25:04] got a sec?
[18:26:46] RECOVERY - Kafka Broker Messages In on analytics1021 is OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate OKAY: 1868.80318074
[18:27:58] manybubbles: my question was, IIRC way before we even installed ElasticSearch, there was the stated goal of using it for TTM too; is this still the plan? if so, is there any progress or roadmap for it?
[18:28:26] paravoid: TTM is translation memory?
[18:28:29] yes
[18:28:47] yeah, that is the goal but it is something I imagined I'd help with rather than implement myself
[18:29:00] no progress, so far as I know
[18:29:19] manybubbles: I was just about to reply to your email, which I didn't understand much
[18:29:33] https://www.mediawiki.org/wiki/Mentorship_programs/Possible_projects#One_stop_translation_search
[18:30:09] This project is also about superseding solr, but Nikerabbit would take that part out unless there is you or Chad co-mentoring
[18:30:24] (03PS1) 10Brion VIBBER: Revert low-res .ogv transcode enable; player is picking too-small version by default [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/114761
[18:31:11] (03PS2) 10Brion VIBBER: Revert low-res .ogv transcode enable; player is picking too-small version by default [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/114761
[18:31:30] at least we know they recode fast :D
[18:32:36] Nemo_bis: my email? I'm not sure which one.
[18:32:43] brion: this is temporary though, right?
[18:32:50] in the sense that you'll reenable them?
[18:32:50] also, I added myself as a co-mentor for the elasticsearch stuff
[18:32:55] paravoid: that's the plan yeah
[18:32:58] it shouldn't remove the existing ones
[18:33:04] what I'm asking is whether we need to clean up the store or not
[18:33:16] no leave them there and they'll get picked up again later
[18:33:19] once we resolve the player issue
[18:33:21] okay
[18:34:03] manybubbles: great! I meant your email on Date: Tue, 11 Feb 2014 08:24:22 -0500
[18:34:09] hmm
[18:34:31] manybubbles: so, I'm thinking of mailing Nikerabbit & you to move this forward (hopefully with subsequent comm in a public medium), unless you have a better idea?
[18:34:39] basically the reason I'm pinging you now about this is
[18:34:47] Solr
[18:34:48] CRITICAL
[18:34:53] Average request time is 1166.436 (gt 1000)
[18:35:10] if avg req time is 1.1s, might just as well replace the damn thing :)
[18:35:13] Nemo_bis: oh yeah, that one. I can describe in a bit
[18:35:21] paravoid: huh
[18:35:37] well, I mean, we should replace it but we don't have code for that
[18:35:40] 40 days and counting, noone really cares about this service apparently
[18:35:53] !log Forced update of /svr/scap to 6203585 across cluster
[18:36:01] Logged the message, Master
[18:36:09] bd808: backdoor and everything?
[18:36:10] :P
[18:36:33] (kidding)
[18:36:37] paravoid: I'm saving that for the weekend
[18:36:50] greg-g: ima gonna no-op scap now unless you changed your mind
[18:36:55] paravoid: what do you mean nobody cares? nobody with the power to set an alert has set up one that goes to someone who cares
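The preferred-replica election in the !log above is triggered with a stock Kafka 0.8 admin script; a sketch, with the ZooKeeper connect string as a placeholder:

    # ask the controller to hand partition leadership back to the preferred replicas
    kafka-preferred-replica-election.sh --zookeeper zk1.example.org:2181/kafka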
[18:37:03] is my guess
[18:37:16] paravoid: so is this average request time for ttmserver solr?
[18:37:23] yes
[18:37:28] just the ttm server?
[18:37:35] what do you mean?
[18:37:38] it's a solr instance
[18:37:58] isn't there solr also for geocoordinates or something
[18:38:02] different solr
[18:38:25] which doesn't have the alert
[18:38:28] paravoid: my personal monitoring system tells me the search has somehow worked in the last few months https://toolserver.org/~nemobis/tmp/SearchTranslations.log
[18:38:55] last timeout in 2013-05-08 17.10
[18:38:59] ttmserver isn't that sensitive to slow queries, as long as it is not affecting anything else running on the server
[18:39:38] so the problem is that ttmserver is building extremely inefficient queries
[18:39:46] paravoid: off topic, but I'm reenabling puppet on elastic1007. It was disabled when it crashed and I ran it manually when it came back but I never reenabled it.
[18:39:50] but yes it is a known issue that it has issues [huh] and if it is causing troubles [is it?] it should be given more priority to address it
[18:39:54] manybubbles: ack
[18:40:19] if it's not causing troubles, the alert is wrong
[18:40:47] paravoid: I was going to say, if we're sure that the thing is working "ok" then can we bump up the alert time?
[18:40:52] like 10 seconds or something silly?
[18:40:56] but moreover, there's an underlying issue that the service isn't well-operated or properly supported/designed (just runs on one server...)
[18:40:57] !log bd808 Started scap: no-diff scap to test script changes; expect l10n updates
[18:41:04] Logged the message, Master
[18:41:13] if there's a critical alert for 40 days and noone blinks an eye
[18:41:41] so it could be that it's entirely our fault, but I doubt we'll ever fix this by caring more
[18:41:48] it is unclear who is supposed to be monitoring that
[18:41:55] I think we need to fix it to pass it on to our beloved search team ;)
[18:42:04] *by passing it on
[18:42:08] one of the reasons I wish to move to elasticsearch... less unlikely to go abandoned
[18:42:13] yeah exactly
[18:42:15] *likely
[18:42:18] (03PS1) 10MaxSem: Okay, so TTM wants to do slow queries?:) [operations/puppet] - 10https://gerrit.wikimedia.org/r/114765
[18:42:31] I'm happy to help with the move.
[18:42:38] I'm not sure beyond that what I should do though
[18:43:27] it looks like to me that we're a bit deadlocked
[18:43:31] one waiting for the other
[18:43:45] Nikerabbit: what do you think should happen to move this forward?
[18:44:06] paravoid: well, I haven't really been waiting, it has just been on the table given other priorities
[18:44:09] (03CR) 10Nemo bis: "Do we know at what point it gets so slow as to cause timeouts for Special:SearchTranslations? Or are we just throwipng a random number bec" [operations/puppet] - 10https://gerrit.wikimedia.org/r/114765 (owner: 10MaxSem)
[18:45:09] (03PS2) 10Faidon Liambotis: Bump TTM's avg req time alert to 5 seconds(!) [operations/puppet] - 10https://gerrit.wikimedia.org/r/114765 (owner: 10MaxSem)
[18:45:11] I should likely ask some time to work on this issue and pull some people including manybubbles to replace solarium with elastica and think of ways to make it faster in general
[18:45:37] (03CR) 10Faidon Liambotis: [C: 032] Bump TTM's avg req time alert to 5 seconds(!) [operations/puppet] - 10https://gerrit.wikimedia.org/r/114765 (owner: 10MaxSem)
[18:45:46] ok
[18:45:51] how can I help you with that?
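The "Average request time" figure the alert quotes is Solr's own avgTimePerRequest statistic, which can be pulled by hand over HTTP; a sketch (host from the later "Solr on zinc" RECOVERY line; the port and single-core admin path are assumptions):

    # query-handler stats, including avgTimePerRequest, as JSON
    curl -s 'http://zinc:8983/solr/admin/mbeans?stats=true&cat=QUERYHANDLER&wt=json'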
[18:47:05] e.g. I can drop an email saying that we (ops) really suck in supporting this setup and we'd rather see it replaced so it can be better supported with the help of the search team
[18:47:14] (03PS1) 10BBlack: regularly compact memory on varnish caches [operations/puppet] - 10https://gerrit.wikimedia.org/r/114766
[18:47:32] would that help you with your internal prioritization?
[18:47:41] Nikerabbit: I've blocked some time out on Monday for me to review the translation extension and put together a proposed plan of attack
[18:48:02] I figure we can go from there
[18:48:03] Ryan_Lane2: ping
[18:48:16] paravoid: yes, I can quote you for that but of course it is good if it comes directly from you
[18:48:28] Ryan_Lane2: your redis changeset is not merged, can I merge?
[18:48:29] (03PS2) 10BBlack: regularly compact memory on varnish caches [operations/puppet] - 10https://gerrit.wikimedia.org/r/114766
[18:48:59] manybubbles: Monday is not my WMF-day, but I hope to be available if you have quick questions
[18:49:25] Nikerabbit: I'll send something off on Monday. which days are wmf?
[18:49:43] !log bd808 scap-1 failed on 4 hosts
[18:49:48] Nikerabbit: okay, where should I mail that?
[18:49:51] Logged the message, Master
[18:50:28] paravoid: localisation-team@lists.wikimedia.org I think (it's moderated but reaches the whole team)
[18:50:33] !log The 4 hosts that failed scap-1 were snapshot[1234]; all have old/bad python installs
[18:50:35] awesome
[18:50:38] manybubbles: I'll Cc you
[18:50:40] Logged the message, Master
[18:50:49] thanks to both
[18:50:54] of you ;)
[18:53:59] manybubbles: it's my evening on your regular worktimes (like now) so I'm likely to be around
[18:54:28] !log bd808 scap-rebuild-cdbs failed on 4 hosts
[18:54:36] !log bd808 Finished scap: no-diff scap to test script changes; expect l10n updates (duration: 13m 38s)
[18:54:36] Logged the message, Master
[18:54:44] Logged the message, Master
[18:55:33] !log The 4 hosts that failed scap-rebuild-cdbs were snapshot[1234]; can we pull them from mediawiki-installation dsh group?
[18:55:41] Logged the message, Master
[18:56:33] !log catrope synchronized php-1.23wmf14/extensions/VisualEditor/modules/ve-mw/init/ve.init.mw.Target.js 'touch'
[18:56:37] PROBLEM - Apache HTTP on mw1047 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - 50376 bytes in 0.023 second response time
[18:56:40] Logged the message, Master
[18:56:42] bd808: My understanding is they can't be. Talk to apergos for details
[18:56:46] PROBLEM - Apache HTTP on mw1079 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - 50376 bytes in 0.006 second response time
[18:56:51] !log catrope synchronized php-1.23wmf14/extensions/VisualEditor/modules/ve-mw/init/targets/ve.init.mw.ViewPageTarget.js 'touch'
[18:57:03] Logged the message, Master
[18:57:07] bd808: Although if snapshot 10NN did work, then maybe we can; but ask Ariel
[18:57:09] not yet, I'll get em gone soon though
[18:57:10] !log catrope synchronized php-1.23wmf15/extensions/VisualEditor/modules/ve-mw/init/ve.init.mw.Target.js 'touch'
[18:57:17] Logged the message, Master
[18:57:26] the sn100xs work, there's a few misc jobs I still need to get off
[18:57:32] !log catrope synchronized php-1.23wmf15/extensions/VisualEditor/modules/ve-mw/init/targets/ve.init.mw.ViewPageTarget.js 'touch'
[18:57:40] Logged the message, Master
[18:57:50] apergos: snapshot1: rsync: change_dir#3 "/usr/local/apache/common-local/php-1.23wmf15/extensions/VisualEditor/modules/ve-mw/init" failed: No such file or directory (2)
[18:57:51] apergos: The new scap scripts are failing there so snapshot[1234] aren't getting MW updates
[18:58:01] Well, maybe that's because they don't have MW directories
[18:58:26] they do, but maybe not there
[18:58:30] RoanKattouw: The rsyncs have been failing for them since we changed scap to be python guts
[18:58:34] Right
[18:58:45] apergos: Is it OK if we stop deploying MW updates to the pmtpa snapshot hosts?
[18:58:54] well this will just be the motivation I need to get em gone
[18:58:57] Or should we fix them to receive those updates?
[18:59:00] yeah go ahead and nuke em
[18:59:04] Alright
[18:59:14] \o/
[18:59:27] bd808: Eradicate snapshot{1..4} from mediawiki-installation then. You may need to touch puppet for that, I forget how this is managed these days
[18:59:44] But leave the eqiad snapshot hosts (snapshot1001 and beyond) alone
[19:00:04] files/dsh/group/.. in operations/puppet
[19:00:09] RoanKattouw: Will do. I'm pretty sure it's a puppet change, but I will submit
[19:00:12] Sweet
[19:00:18] Thanks ori
[19:00:23] Also, cool nick :)
[19:00:24] (03CR) 10Faidon Liambotis: [C: 032] regularly compact memory on varnish caches [operations/puppet] - 10https://gerrit.wikimedia.org/r/114766 (owner: 10BBlack)
[19:00:52] In other news: scap has a progress bar now. Will send an email to ops-l today telling people what to expect.
[19:01:03] bd808: my hero
[19:01:42] haha
[19:01:50] greg-g: ori wrote the hard parts, i just put some lipstick on it
[19:02:11] mark: what you missed from the core team channel yesterday:
[19:02:11] 18:36 < greg-g> I loooove progress bars, it's what got me into computers
[19:02:15] 18:36 greg-g wishes he was kidding
[19:02:23] haha
[19:02:28] http://progressquest.com/
[19:02:47] It looks like: scap-1: 0% (ok: 1; fail: 0; left: 426)
[19:02:48] i do recall programming my first progress bars early in, in my basic programs
[19:02:57] and getting off by one mistakes and all
[19:03:03] … scap-1: 91% (ok: 386; fail: 4; left: 37)
[19:03:26] ori: :) :)
[19:03:36] bd808: yeah, that's awesome
[19:03:51] does it also have a message "last eqiad/ulsfo host done!"
[19:04:11] RECOVERY - Solr on zinc is OK: All OK
[19:04:14] or non-pmtpa for short
[19:04:38] Nope. Little known fact we update hosts in random order during the scap
[19:05:09] In theory to avoid thundering herd on the rsync slaves
[19:05:11] Ah. At some point that had been changed to leave pmtpa last, I guess it didn't work out
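The eradication RoanKattouw asks for is a four-line deletion in the dsh group file ori points to. A sketch of previewing it in an operations/puppet checkout (the exact hostname spellings in the file are an assumption):

    grep -vE '^snapshot[1-4]$' files/dsh/group/mediawiki-installation > /tmp/mw-inst.new
    diff files/dsh/group/mediawiki-installation /tmp/mw-inst.new   # should show only snapshot1-4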
[19:05:36] The file is sorted that way but then shuffled before running the batch
[19:05:44] (03PS3) 10BBlack: regularly compact memory on varnish caches [operations/puppet] - 10https://gerrit.wikimedia.org/r/114766
[19:07:08] (03CR) 10BBlack: [C: 032 V: 032] regularly compact memory on varnish caches [operations/puppet] - 10https://gerrit.wikimedia.org/r/114766 (owner: 10BBlack)
[19:10:41] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000
[19:10:45] (03PS1) 10BryanDavis: Remove snapshot[1234] from mediawiki-installation dsh group [operations/puppet] - 10https://gerrit.wikimedia.org/r/114768
[19:11:38] RoanKattouw, apergos: ^^
[19:12:30] yup
[19:12:33] do it
[19:12:47] * bd808 is not a root
[19:13:40] (03CR) 10Ori.livneh: [C: 032] Remove snapshot[1234] from mediawiki-installation dsh group [operations/puppet] - 10https://gerrit.wikimedia.org/r/114768 (owner: 10BryanDavis)
[19:14:13] ah. woops
[19:15:03] sorry, most of my mind was elsewhere
[19:15:56] apergos: No worries. Looks like Ori is on it
[19:16:00] try this trick and spin it, yeah
[19:16:07] yeah, thanks
[19:16:19] * bd808 thought ori was on vacation today
[19:16:53] heh, i am sick
[19:16:59] but yes
[19:17:05] ori? on vacation?
[19:17:06] impossible.
[19:17:20] Apparently
[19:17:20] cold turkey didn't work, going to taper
[19:19:30] mark, paravoid: got time for https://gerrit.wikimedia.org/r/#/c/113900/ ?
[19:22:02] it looks okay to me on a first glance (apart from a probably useless "inline" keyword), but maybe bblack could do a proper review and deploy next week?
[19:23:34] speaking of deployment systems, there is definite room in the varnish ecosystem for a VCL deployment tool. the ability to load VCL by string using varnishadm is neat. the tool i envision depools a varnish, pushes vcl with the commit short sha1 as the config name, runs some asserts, repools
[19:26:46] rollback as easy as varnishadm vcl.load I10170d77c
[19:27:05] something like that, anyways
[19:27:45] we still haven't fixed the issue of automatically depooling/pooling servers; this is a manual process
[19:27:50] and for varnish, into multiple places
[19:28:01] I have a few ideas, I was hoping to work with one of the three hires we're expecting on that
[19:28:21] would be cool
[19:28:52] basically: zookeeper or etcd (hopefully etcd), maybe integrate it with pybal and other stuff
[19:29:14] i should learn zookeeper one day
[19:29:38] etcd is prettier, but I'm not sure if it's stable enough yet
[19:30:02] heh. "build: failing"
[19:30:11] yeah, for example :)
[19:30:37] so pybal right now fetches files over HTTP from noc.wm.org to pool/depool
[19:30:40] that we edit using vi
[19:30:45] and it just interprets them as python
[19:30:57] it would be a simple change to switch to etcd or something like it
[19:43:11] PROBLEM - Puppet freshness on virt1000 is CRITICAL: Last successful Puppet run was Fri 21 Feb 2014 04:42:42 PM UTC
[19:55:03] (03CR) 10Dzahn: [C: 031] "per matanya's links above. it says it can "probably be removed"" [operations/puppet] - 10https://gerrit.wikimedia.org/r/114136 (owner: 10Matanya)
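The varnishadm mechanics ori sketches are real CLI verbs; run on the cache host, the load/activate/rollback cycle would look roughly like this (the VCL path and the sha1-derived config names are illustrative):

    varnishadm vcl.load deploy-I10170d77c /etc/varnish/wikimedia.vcl  # compile under a new name
    varnishadm vcl.use deploy-I10170d77c                              # switch traffic to it
    varnishadm vcl.list                                               # rollback: vcl.use the previous name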
it says it can "probably be removed"" [operations/puppet] - 10https://gerrit.wikimedia.org/r/114136 (owner: 10Matanya) [19:56:08] (03CR) 10Dzahn: [C: 032] "nice URL" [operations/puppet] - 10https://gerrit.wikimedia.org/r/114498 (owner: 10Odder) [19:57:24] (03PS2) 10Dzahn: swift: remove lookupvar and replace with fact @ var [operations/puppet] - 10https://gerrit.wikimedia.org/r/112885 (owner: 10Matanya) [20:01:22] (03CR) 10Legoktm: [C: 04-1] "If this is supposed to mimic the production config, it should be only enabled on CA wikis that aren't loginwiki." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/114719 (owner: 10Yuvipanda) [20:05:30] logmsgbot: aren't all wikis CA wikis? [20:10:53] yuvipanda: no... all private + fishbowl wikis aren't [20:11:00] legoktm: aah [20:11:09] legoktm: hmm, so that's a bit more complicated, I presume. [20:11:21] legoktm: Need to find out if those are on betalabs, and then how to exclude them [20:12:26] yuvipanda: you can do a 'wmgUseGlobalCssJs' => array( 'default' => true, 'private' => false, 'fishbowl' => false, 'loginwiki' => false, ), [20:12:30] that should work [20:12:39] legoktm: update patch? :) [20:14:32] sure [20:24:15] Mr. Liambotis, you around? [20:24:27] I am [20:24:44] can you spare 7 minutes on hangout? or at least on irc? [20:25:02] I can, what is this about? [20:25:19] NOT(hosting) [20:25:28] hm? [20:26:21] it's about the approach for showing contributory features on mdot/zerodot. wanted to run an idea ori and i talked about yesterday by you to see if it's passable. [20:26:38] okay, sure [20:26:45] k, i'll call you [20:26:54] do we need ori too? [20:27:23] probably not, although i try add him. ori, heads up. [20:28:30] stupid hangouts [20:28:34] greg-g: Can I get flight deck clearance for a no-op scap so I can record a screencast as Ori suggested? [20:30:43] bd808: ENTHUSIASTIC YES [20:30:52] sorry for yelling [20:30:58] neat [20:31:46] !log bd808 Started scap: no-diff scap; recording asciicast [20:31:55] Logged the message, Master [20:32:41] yuvipanda: does betalabs have a bits.wm.o equivalent? or does it just use the per-wiki load.php's? [20:32:56] /bits.beta.wmflabs.org/meta.wikimedia.beta.wmflabs.org/load.php [20:32:58] great. [20:33:05] legoktm: it does have bits [20:33:07] right [20:33:55] (03PS2) 10Legoktm: Add GlobalCssJs extension to betalabs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/114719 (owner: 10Yuvipanda) [20:34:52] yuvipanda: fixed ^ [20:34:58] legoktm: woo! :) [20:34:59] !log bd808 Finished scap: no-diff scap; recording asciicast (duration: 03m 13s) [20:35:03] legoktm: now to get hashar to merge it :) [20:35:07] Logged the message, Master [20:35:17] legoktm: I'm going to get some sleep now. have to wake up early tomorrow :( can you get that merged? [20:35:21] yuvipanda: legoktm I am not op :-D [20:35:31] hashar: it's to betalabs :D [20:35:51] yuvipanda: gnite. who normally "approves" extensions for beta labs? [20:35:55] legoktm: hashar [20:36:12] yuvipanda: well anyone should be able to review / +2 the change. 
[20:36:35] yuvipanda: legoktm: also want to make sure GlobalCssJs is planned for production
[20:36:45] hashar: yeah, it changes only labs-specific files, so should be ok
[20:36:52] hashar: it is https://bugzilla.wikimedia.org/show_bug.cgi?id=57891
[20:38:16] legoktm: great :-]
[20:38:23] legoktm: might want to poke Dan Garry about it
[20:38:42] hashar: Dan responded on that bug saying it's ok :)
[20:38:45] just today
[20:38:55] anyway busy with other things
[20:39:05] ok
[20:39:06] so get dan to +1 change then any deployer can review and +2
[20:42:50] (03CR) 10Greg Grossmeier: [C: 031] "I'm not Dan, but I can play one if I need to. :)" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/114719 (owner: 10Yuvipanda)
[20:43:58] lol
[20:44:02] greg-g: :D
[20:44:48] :D
[20:56:48] (03CR) 10Deskana: [C: 031] "Let the tests begin." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/114719 (owner: 10Yuvipanda)
[21:02:06] (03CR) 10MarkTraceur: [C: 032] "Good luck, all :)" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/114719 (owner: 10Yuvipanda)
[21:02:14] (03Merged) 10jenkins-bot: Add GlobalCssJs extension to betalabs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/114719 (owner: 10Yuvipanda)
[21:02:14] legoktm, yuvipanda ^^
[21:02:22] :D
[21:02:37] * rdwrer watches betalabs
[21:02:47] rdwrer: do you also need to sync it so there are no missing commits in the prod repo?
[21:02:57] I don't think so.
[21:03:02] okay
[21:03:17] So, naive product manager question. When does this actually become live on Beta Labs for us to test?
[21:03:19] rdwrer: \o/
[21:03:28] Deskana: in about 20-30s
[21:03:28] Deskana: in a few minutes
[21:03:29] Soon!
[21:03:38] Deskana: maybe minutes
[21:03:39] * rdwrer puts on Barcelona accent
[21:03:41] Eventually!
[21:04:00] Deskana: just keep refreshing http://meta.wikimedia.beta.wmflabs.org/wiki/Special:Version :P
[21:04:40] i.love.long.urls.beta.wmflabs.org
[21:05:44] Deskana: https://bits.beta.wmflabs.org/meta.wikimedia.beta.wmflabs.org/load.php is a real url ;)
[21:05:57] Deskana: its live
[21:06:06]
[21:06:08] Oops
[21:06:13] hmmm
[21:06:33] I blame caching
[21:06:34] Profiling error: in(Wikibase\PropertyParserFunction::getRenderer), out(Wikibase\PropertyParserFunction::doRender)
[21:06:57] Reedy: not sure exactly where the bug is, probably Wikibae
[21:07:01] *Wikibase
[21:07:18] * hoo heart Wikibase
[21:07:20] * hoo runs
[21:08:47] yurik: you around?
[21:09:10] it works!
[21:09:31] wooo!
[21:12:38] legoktm: Where in Special:Preferences is the preference?
[21:12:45] there is no preference?
[21:12:53] it got axed
[21:13:04] I'll update the extension instructions then. :)
[21:13:09] ah
[21:13:12] I need to re-write that page
[21:15:09] (03PS2) 10Jeremyb: close ukwikimedia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113878
[21:17:04] legoktm: Looks pretty nifty so far. :)
[21:17:13] :D
[21:19:02] legoktm: SSL cert for bits.beta.wmflabs.org/meta.wikimedia.beta.wmflabs.org ?:)
[21:19:47] mh, I'm the only one who wonders why there is no en-message for tog-enableglobalcssj? https://git.wikimedia.org/blob/mediawiki%2Fextensions%2FGlobalCssJs/0a52548ce76cbdf12e94fe5de8a6ac65d5e90159/GlobalCssJs.i18n.php
[21:20:03] se4598: that message was removed, let me find the patch
[21:20:05] i know it's not the entire thing:), but still was just wondering how that discussion turned out
[21:20:08] The certificate is only valid for *.wmflabs.org
[21:20:19] mutante: oh, I think I whitelisted it in my browser :/
[21:20:31] se4598: https://gerrit.wikimedia.org/r/#/c/101626/
[21:20:35] i suppose *.beta.wmflabs.org is needed
[21:20:44] because can't have *.*.
[21:20:45] thanks
[21:21:19] certs etc.: https://bugzilla.wikimedia.org/show_bug.cgi?id=48501
[21:21:28] mutante^
[21:23:00] se4598: thanks, yea, i knew there was a lengthy ticket, i wondered about the end of it.. and yea. i see so we want but don't have..
[21:24:14] Coren: are you pokable for that one maybe? there is the special project handling the certs i know, i just haven't been involved so far
[21:24:18] ^
[21:24:34] Can someone do a graceful restart of apache on mw1047 and mw1079? Those 2 hosts are throwing a PHP error that looks to be an APC problem. Line numbers in exception don't match files on disk.
[21:24:42] * Coren reads scrollback.
[21:24:51] Coren: it's about getting *.beta.wmflabs.org SSL cert
[21:25:00] as opposed to just *.wmflabs.org
[21:25:07] "I think we do want it, on a limited set of subdomains to keep the cost down." <- do it half right, so it's still broken for everyone else on a non-en-wiki? No money for a wildcard?
[21:25:10] and how we handle it in the special restricted project
[21:25:22] where the keys live etc...
[21:25:46] Doable, but I haven't been in the buy-cert-chain before. I'll need to involve Rob anyways.
[21:25:52] !log mw1047 and mw1079 throwing PHP exception that looks like APC corruption
[21:26:00] Logged the message, Master
[21:26:12] Is there an RT ticket or bz associated with this for context?
[21:26:39] Coren: bug says
[21:26:39] Description Antoine "hashar" Musso 2013-05-15 10:00:13 CEST
[21:26:39] We now have nginx SSL proxies in front of the beta caches (Bug 36648). We still have to fix the certificate (that is *.wmflabs.org for now).
[21:26:39] We need certificates generated by 'Labs CA' for the entries listed in role::protoproxy::ssl::beta and some more. I guess the easiest would be to create *.beta.wmflabs.org cert that will also contains the following DNS entries:
[21:26:39] *.wikimedia.beta.wmflabs.org
[21:26:49] Coren: se4598: i would even argue we can use cacert.org for that because it's - no money but also not self signed, and people who test can be expected to import a CA root cert once, but paravoid said cacert is kind of dead.. well ..i still use it
[21:27:32] sorry for the flood, something strange happened
[21:27:32] if they are generated by Labs CA, then that's a different thing
[21:27:44] and is not related to buying them
[21:28:14] se4598: i'm not sure if you can have a *. that contains another *, but maybe
[21:28:21] you can't have a *.* though
[21:28:23] there was a bug for beta ssl certs, can't find it right now though
[21:28:26] Ah, okay, point me at it then. I'll handle it over the weekend unless it's an even bigger rush than this?
[21:28:32] apergos: You should be asleep, but topic has your name. Apache graceful needed for mw1047 and mw1079 to clear APC corruption
[21:28:51] bd808: doing it
[21:28:59] mutante: Thanks
[21:29:15] mutante: apache-graceful is only on fenari? Does that still work?
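Which certificate an endpoint really serves (the *.wmflabs.org-versus-*.beta.wmflabs.org question above) is quick to check from the client side; a sketch:

    # -servername sends SNI, so any name-based certificate selection is exercised too
    echo | openssl s_client -connect bits.beta.wmflabs.org:443 \
        -servername bits.beta.wmflabs.org 2>/dev/null | openssl x509 -noout -subject -issuer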
[21:29:41] RECOVERY - Apache HTTP on mw1047 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.072 second response time
[21:29:42] !log graceful'ing apache on mw1047 and mw1079 by request
[21:29:50] Logged the message, Master
[21:29:51] RECOVERY - Apache HTTP on mw1079 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.088 second response time
[21:30:05] AaronSchulz: in this case i just did them manually on the boxes.. but last time i looked. yes
[21:30:32] Coren: bugs say there's something at ticket https://rt.wikimedia.org/Ticket/Display.html?id=6116
[21:30:33] AaronSchulz: yes to both, still on fenari and needs to move, and yes, still worked.. but it's been a while
[21:32:54] se4598: What BZ is this? I just read the RT
[21:34:29] Coren: https://bugzilla.wikimedia.org/show_bug.cgi?id=48501
[21:36:31] topic has my name and it's not even my week
[21:36:55] but if I had been here when pinged I woulda done it (thanks mutante for being willing and in a good tz)
[21:38:04] se4598: mutante: The next step forward is not clear; self signed certs or non-default CAs are not an option because of the automation and the inability to distribute certs.
[21:38:34] The BZ mentions the possibility of using only a few (selected) hostnames rather than wildcards.
[21:40:15] !log mw1047 and mw1079 errors cleared after apache-graceful
[21:40:23] Logged the message, Master
[21:40:36] * bd808 continues to <3 logstash
[21:49:13] Coren: gotcha, yea, you pretty much summed it up i think. so the one that made me bring it up was https://bits.beta.wmflabs.org/ having bits. seems reasonable
[22:00:57] it seems that haproxy is installed on db10{23,33,56}, although not running. anyone know why?
[22:08:17] jgage: probably sean experimenting -- he mentioned haproxy the other day on the list
[22:08:20] Coren: ping?
[22:08:37] jgage: (not in the context of installing it, just that it'd be a good idea to use it maybe)
[22:12:17] cool, thanks paravoid.
[22:16:13] How fast do the logs used by the logstash web interface (https://logstash.wikimedia.org/#/dashboard/elasticsearch/default) rotate?
[22:16:23] Does anyone know if it's size or time-based?
[22:16:30] 30 days
[22:17:22] We write to a new elasticsearch index every day. There's a cron that runs each morning to drop indices that are 31 days old
[22:17:34] Okay, thanks bd808.
[22:17:39] yw
[22:17:49] It turns out the count I thought was since rotation was every 30 seconds, which is... bad.
[22:17:52] * bd808 needs to update the wikitech page on this stuff
[22:18:38] hmm, logstash->elasticsearch. sounds like graylog2?
[22:18:53] or actually, is there a wiki page for the new logstash stuff?
[22:19:09] * gwicke listens up on hearing graylog2
[22:19:19] ebernhar1son: yes. https://wikitech.wikimedia.org/wiki/Logstash
[22:19:20] I don't really get it, but I should probably go read the docs.
[22:19:38] The "count per 30s" and "count per 1d" are very close.
[22:19:56] bd808: excellent, thanks
[22:19:57] I'd be glad to do hangout or something to talk through the UI
[22:20:57] bd808, is our logstash instance prepared to receive graylog2 logs?
[22:21:16] It is not, but that could be easily added
[22:21:21] bd808, thanks. phuedx recommended I watch the video at http://logstash.net/ , so I'll finish what I'm working on, check that out, then ask if it still doesn't make sense.
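Given the index-per-day scheme bd808 describes, the 31-day cleanup cron only needs a date-stamped DELETE against elasticsearch; a minimal sketch (the host and the conventional logstash-YYYY.MM.DD index prefix are assumptions):

    # drop the index that just turned 31 days old
    curl -s -XDELETE "http://localhost:9200/logstash-$(date -d '31 days ago' +%Y.%m.%d)"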
[22:22:08] superm401: If the count you are talking about is the (nnnnn hits) that is across the total time window you are searching
[22:22:31] the "count per 30s" or whatever is the width of each bar in the histogram
[22:22:42] I see, so the interval only affects the graph.
[22:22:42] bd808: ok, we'll likely have a gelf backend for parsoid logging soon
[22:22:47] if you hove a bar it should show you the count in that interval
[22:23:00] *hover
[22:23:14] gwicke: I'd be glad to help figure out how to get that into logstash
[22:23:21] bd808, how do I check the total time window I'm searching?
[22:23:40] Never mind, I found it (green filtering bar).
[22:23:50] superm401: it's in the top middle of the page "an hour to a few seconds ago" or some such
[22:24:16] Thanks, that looks easier to change.
[22:24:22] bd808: cool, will ping you when we are ready
[22:25:02] gwicke: excellent. The labs project would probably be the place to start. Route your beta traffic into it until we get all the kinks worked out
[22:26:21] bd808, we can also log both to a gelf backend and a local file for a bit
[22:26:56] so it doesn't need to be stable right from the start
[22:27:05] That sounds like a smart idea :)
[22:27:37] It's just easier to piddle with the labs instance since it's not controlled by operations/puppet.git
[22:27:52] yeah, very true
[22:28:07] We can even spin up one just for you to dork around with
[22:28:29] we could point our rt clients to it, those are currently in labs
[22:28:51] but in any case the logging infrastructure patch needs to land before that
[22:28:56] will ping you
[22:29:02] * bd808 nods
[22:37:00] paravoid: Ping?
[22:37:08] hey
[22:37:15] What be up?
[22:37:22] (Or down, as the case may be)
[22:37:35] can you have a look at the ecehttps://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=virt0&service=Certificate+expiration
[22:37:43] at that :)
[22:37:45] or this https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=virt1000&service=Certificate+expiration
[22:37:58] this check has its one month anniversary
[22:40:15] Deskana: I updated https://www.mediawiki.org/wiki/Extension:GlobalCssJs
[22:40:42] legoktm: Great. :)
[22:41:59] (03PS5) 10Matanya: removed orion shell account [operations/puppet] - 10https://gerrit.wikimedia.org/r/113637
[22:42:06] Coren: ack?
[22:42:13] paravoid: virt0 is hearing the wail of the banshee anyways (and this isn't the canonical name); virt1000 will get the wikitech cert once we switch in ~2weeks.
[22:42:37] paravoid: So I don't think renewing those is worthwhile.
[22:42:44] what do you mean renewing?
[22:42:47] the check is broken
[22:42:51] it spews a traceback
[22:42:53] for starters :)
[22:43:00] Oh!
[22:43:06] D'oh! Ignore me. :-)
[22:43:14] I'll look into it.
[22:43:16] then, virt0 has a proper certificate too
[22:43:29] I mean https://wikitech works for me
[22:43:32] Does it? I thought only the wikitech name had one.
[22:43:33] so maybe we should fix the check?
[22:43:40] Coren: paravoid it's using a different check command
[22:43:47] check_command => "check_cert!${fqdn}!636!${ca_name
[22:43:50] well, yes, then we should fix the check to check for wikitech then maybe?
[22:43:56] i think it can just use the normal check_http
[22:44:01] I mean, we /do/ care if wikitech's certificate expires
[22:44:02] like it was used elsewhere
[22:44:11] PROBLEM - Puppet freshness on virt1000 is CRITICAL: Last successful Puppet run was Fri 21 Feb 2014 04:42:42 PM UTC
[22:44:44] Coren: more like $USER1$/check_http -H $ARG1$ -I $HOSTADDRESS$ --ssl --certificate=90
[22:44:48] paravoid: I didn't understand you the first time; I thought you wanted me to address the certs, not the check /itself/ :-)
[22:44:52] templates/icinga/checkcommands.cfg.erb
[22:46:14] (03CR) 10Dzahn: [C: 032] removed orion shell account [operations/puppet] - 10https://gerrit.wikimedia.org/r/113637 (owner: 10Matanya)
[22:47:12] (03PS4) 10Matanya: remove smerritt shell account [operations/puppet] - 10https://gerrit.wikimedia.org/r/113638
[22:48:34] mutante: Yeah, check_http would work just as well. We already have command_name check_ssl_cert for that which we should use rather than check_cert
[22:48:44] * Coren fixies.
[22:49:10] Coren: +1 for not having separate ones
[22:51:53] Ah, no, wait. That's the SSL cert for the /LDAP/ server!
[22:52:38] that's why port 636
[22:52:46] (which explains the virt0 vs wikitech too
[22:52:48] )
[22:52:50] yea.. hrmm.. but check_http also has option for ports
[22:53:21] mutante: Yeah, but then it'll try to do a GET which is going to fail when talking to LDAP
[22:53:47] So I need to figure out why the check_ssl is teh failz.
[22:55:24] cd /etc/icinga
[22:55:30] bah.
[22:55:38] Coren: i see, well that would explain the separate script
[22:56:06] (03CR) 10Dzahn: [C: 032] remove smerritt shell account [operations/puppet] - 10https://gerrit.wikimedia.org/r/113638 (owner: 10Matanya)
[22:59:40] (03CR) 10Dzahn: "please fix path conflict" [operations/puppet] - 10https://gerrit.wikimedia.org/r/113636 (owner: 10Matanya)
[23:04:31] Aha! The plugin sucks at error reporting, but now I know why it fails: LDAP is still trying to use the star certificate.
[23:06:23] RobH: When you ordered replacement for wikitech, did you also do virt0 and virt1000?
[23:06:26] * Coren guesses no.
[23:09:44] We need our own CA for internal junk, so that we can simply distribute our root via puppet.
[23:09:55] (03PS6) 10Dzahn: remove shell access and key for mgrover [operations/puppet] - 10https://gerrit.wikimedia.org/r/113636 (owner: 10Matanya)
[23:10:02] mutante: What's our current process for only-used-internally certificates?
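The check_http line paravoid quotes can be run by hand with the macros filled in, to see exactly what icinga would see; a sketch using the stock plugin path (HTTPS on 443 -- as mutante and Coren note, the LDAP check on 636 is a separate problem):

    # warn if the certificate served on 443 expires within 30 days
    /usr/lib/nagios/plugins/check_http -H wikitech.wikimedia.org \
        -I wikitech.wikimedia.org --ssl --certificate=30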
[23:10:32] (03CR) 10jenkins-bot: [V: 04-1] remove shell access and key for mgrover [operations/puppet] - 10https://gerrit.wikimedia.org/r/113636 (owner: 10Matanya)
[23:12:29] Coren: for virt0 and virt1000 they have their own cert
[23:12:39] its installed via the nova whatevs iirc
[23:12:45] i even updated them to use it
[23:12:57] for ldap
[23:13:09] again, iirc, it was weeks ago
[23:13:35] https://rt.wikimedia.org/Ticket/Display.html?id=6592
[23:15:22] Coren: i dunno about the "internal" part
[23:17:36] (03PS7) 10Dzahn: remove shell access and key for mgrover [operations/puppet] - 10https://gerrit.wikimedia.org/r/113636 (owner: 10Matanya)
[23:18:14] (03CR) 10jenkins-bot: [V: 04-1] remove shell access and key for mgrover [operations/puppet] - 10https://gerrit.wikimedia.org/r/113636 (owner: 10Matanya)
[23:19:00] bah
[23:23:46] (03PS8) 10Dzahn: remove shell access and key for mgrover [operations/puppet] - 10https://gerrit.wikimedia.org/r/113636 (owner: 10Matanya)
[23:26:28] (03CR) 10Dzahn: [C: 032] remove shell access and key for mgrover [operations/puppet] - 10https://gerrit.wikimedia.org/r/113636 (owner: 10Matanya)
[23:30:16] RobH: It didn't take then, LDAP responds with the star certificate still.
[23:30:40] well, someone undid it!
[23:30:45] s:/serialNumber=3Te2KNVS3beWLBffkE0QtVQ4qxo3Ix10/C=US/O=*.wikimedia.org/OU=GT11518520/OU=See www.rapidssl.com/resources/cps (c)10/OU=Domain Control Validated - RapidSSL(R)/CN=*.wikimedia.org
[23:30:49] cuz i shredded the wildcard certs and ensured puppet wasnt putting it on
[23:30:52] so wtf...
[23:30:55] lets see.
[23:31:15] Coren: uhh
[23:31:24] there is no star.wikimedia.org cert even on virt0
[23:31:29] ...!
[23:31:37] where are you pulling that?
[23:31:42] Where in blazes is it fishing that certificate from?
[23:31:46] i have no idea
[23:31:53] virt0.wikimedia.org:636
[23:31:55] well, atleast there isnt in /etc/ssl
[23:32:03] if ldap shoves it someplace odd....
[23:32:35] yea, there is no star certs on the /etc/ssl on those
[23:32:37] * Coren hunts it down.
[23:32:52] is it in LDAP itself?
[23:33:41] i see ldap config refer to the hostname configs
[23:33:51] (03CR) 10Dzahn: "notice: /Stage[main]/Accounts::Mgrover/Ssh_authorized_key[norbertgrover@Norberts-MacBook-Pro.local]/ensure: removed" [operations/puppet] - 10https://gerrit.wikimedia.org/r/113636 (owner: 10Matanya)
[23:33:57] it says they live in /etc/ssl
[23:34:08] TLS_CACERTDIR /etc/ssl/certs
[23:35:43] Eeeeew. opendj is written in Java. The cert might be in the funky java keystore thing.
[23:36:04] ack
[23:36:18] keytool -list ?
[23:36:23] yea, i just put the new cert in, and did a full restart but i didnt run anything else
[23:36:40] well, restarted the services i know, restarted system prolly not
[23:36:46] but admin log would have it for that day
[23:36:51] keytool -list -v -keystore keystore.jks
[23:36:52] config.ldif:ds-cfg-ssl-cert-nickname: star.wikimedia.org
[23:36:55] eww.
[23:37:21] so yea, i gave both virt0 and virt1000 their own hostname based certificates
[23:37:43] RobH: Okay, all that's left is fixing opendj to actually use those.
[23:37:59] that in puppet or via command line?
[23:38:22] RobH: I'm guessing that has to be done via command line, I'm pretty sure we don't manage the keystore with puppet
[23:38:47] so where does it get the info in the first place?
[23:39:12] The keystore is an implicit java mechanism/hack. But now that I look in it I see it's empty. Hm...
[23:39:20] Oh! The config is /in ldap/
The config is /in ldap/ [23:40:06] cn=LDAPS Connection Handler,cn=Connection Handlers,cn=config [23:43:56] There's some stuff in puppet about certificates for the LDAP servers, but it's not clear how that actually updates the server. [23:47:46] RobH: Ah, yes, opendj does indeed use the keystore. [bleep] [bleep]ing java [bleep].