[00:14:44] !log reedy synchronized php-1.21wmf1/extensions/Collection/Collection.body.php [00:14:55] Logged the message, Master [00:22:42] tfinc: The mobile queue is starting to back up a little (9). I think you've been monitoring it though [00:23:01] RD: which mobile queue are you referring to ? [00:23:07] OTRS [00:25:56] RD: i can take a look at it tomorrow. i let my email dictate whats in the OTRS queue. 50% of the time its SPAM [00:25:58] more typically [00:26:21] Yeah. Can we do something about that? ;-) [00:26:26] I don't think the spam filter is working [00:26:59] Doesn't seem to have been for a while. The filters we set are, but not the regular filters. Lots of shit is going to the queues. (but the 9 tickets in there now are legitimate) [00:27:56] RD: i have no clue. i know nothing about OTRS past that its a pain to use [00:28:04] Yes. [00:28:07] We should upgrade! [00:28:25] More spam-handling features. Amongst other things, but I've been whining at others about that for a while [00:28:43] RD: you know better than I, then. go for it. i have no idea what it entails to upgrade OTRS to a new version [00:28:47] Nope [00:28:54] I don't know anything about it [00:29:37] New patchset: MaxSem; "Solr replication" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26571 [00:29:38] I'm an otrs admin, and know the current version sucks and the new one has many improvements. There is progress being made on the upgrade tho - an RT ticket is somewhere I guess and moving slowly. [00:30:38] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26571 [00:30:52] aye, last time I heard (not long ago) we do finally have the OTRS creator on board to help with non-disclosure signed etc [00:30:58] slow but they're trying [00:31:21] New review: MaxSem; "Not ready yet."
[operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/26571 [00:56:24] New patchset: Dzahn; "planet - adding Arabic, French and Chinese translations." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26573 [00:57:24] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26573 [00:58:52] New review: Dzahn; "https://meta.wikimedia.org/wiki/Planet_Wikimedia/Names thanks Nemo_bis" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/26573 [00:58:53] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26573 [01:03:08] New patchset: Dzahn; "planet - and fix Spanish translation too" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26574 [01:04:06] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26574 [01:04:39] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26574 [01:21:19] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [01:21:19] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [01:42:11] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 236 seconds [01:42:29] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 253 seconds [01:46:50] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 29 seconds [01:51:47] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 11 seconds [02:27:47] PROBLEM - Apache HTTP on srv222 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:28:12] !log LocalisationUpdate completed (1.21wmf1) at Thu Oct 4 02:28:12 UTC 2012 [02:28:14] PROBLEM - Apache HTTP on srv224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:28:26] Logged the message, Master [02:28:32] PROBLEM - Apache HTTP on srv223 is 
CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:28:59] PROBLEM - LVS HTTP IPv4 on rendering.svc.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:29:35] PROBLEM - Apache HTTP on srv220 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:29:53] PROBLEM - Apache HTTP on srv219 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:30:38] PROBLEM - Apache HTTP on srv221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:31:23] RECOVERY - Apache HTTP on srv219 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.576 second response time [02:31:59] RECOVERY - LVS HTTP IPv4 on rendering.svc.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 60696 bytes in 7.724 seconds [02:32:08] RECOVERY - Apache HTTP on srv221 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.711 second response time [02:32:44] RECOVERY - Apache HTTP on srv220 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.041 second response time [02:32:44] RECOVERY - Apache HTTP on srv222 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.934 second response time [02:32:53] RECOVERY - Apache HTTP on srv224 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.701 second response time [02:33:11] RECOVERY - Apache HTTP on srv223 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.416 second response time [02:48:38] PROBLEM - MySQL Idle Transactions on db63 is CRITICAL: CRIT longest blocking idle transaction sleeps for 1493 seconds [02:49:24] !log LocalisationUpdate completed (1.20wmf12) at Thu Oct 4 02:49:23 UTC 2012 [02:49:35] Logged the message, Master [03:05:26] PROBLEM - Apache HTTP on srv194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:06:47] RECOVERY - Apache HTTP on srv194 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.354 second response time [03:14:35] RECOVERY - Puppet freshness on knsq19 is OK: puppet ran at Thu Oct 4 03:14:30 UTC 2012 [03:25:59] RECOVERY - MySQL Idle Transactions on db63 is OK: OK longest blocking idle transaction 
sleeps for 34 seconds [03:30:29] PROBLEM - MySQL Idle Transactions on db63 is CRITICAL: CRIT longest blocking idle transaction sleeps for 4006 seconds [03:32:08] RECOVERY - MySQL Idle Transactions on db63 is OK: OK longest blocking idle transaction sleeps for 0 seconds [05:17:00] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [05:18:30] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 9.019 second response time on port 8123 [05:21:39] PROBLEM - Lucene on search1016 is CRITICAL: Connection timed out [05:23:36] RECOVERY - Lucene on search1016 is OK: TCP OK - 0.027 second response time on port 8123 [05:24:39] !log restarted lucene search on search1016 again [05:24:50] Logged the message, Master [05:46:24] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [05:46:24] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [05:46:24] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [05:46:24] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [05:46:24] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [06:20:39] PROBLEM - Puppet freshness on cp1040 is CRITICAL: Puppet has not run in the last 10 hours [06:37:01] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [07:47:56] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [08:05:56] PROBLEM - Apache HTTP on srv194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:07:17] RECOVERY - Apache HTTP on srv194 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.250 second response time [08:37:34] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [09:59:15] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 
hours [10:36:31] PROBLEM - Host cp1044 is DOWN: PING CRITICAL - Packet loss = 100% [11:10:15] New patchset: Dereckson; "(bug 40759) Namespace configuration for fi.wikisource" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/26601 [11:15:22] PROBLEM - Apache HTTP on srv194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:16:52] RECOVERY - Apache HTTP on srv194 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.326 second response time [11:22:06] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [11:22:06] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [12:25:55] PROBLEM - Apache HTTP on srv194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:27:09] New patchset: Mark Bergsma; "Mount the originals volume on all application servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26607 [12:27:16] RECOVERY - Apache HTTP on srv194 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.277 second response time [12:28:08] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26607 [12:48:59] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26607 [12:52:04] New patchset: Cmcmahon; "disabling AFTv4 completely for beta labs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/26611 [13:15:59] Ryan_Lane: kibana is now rewritten in Ruby and there's no jRuby packaging for it :-) [13:17:46] reducing weight on srv194 [13:18:30] notpeter: everything okay with the new php packages? [13:18:44] paravoid: no complaints yet! [13:18:46] thank you! 
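The "Puppet freshness" alerts scattered through this log reduce to a file-age check: the monitor flags any host whose puppet agent has not completed a run in the last 10 hours. A minimal sketch in shell, assuming the agent touches a state file on each run (the path below is a common default on the agent, not necessarily what the actual Nagios plugin reads):

```shell
#!/bin/bash
# Hypothetical freshness check: compare the age of puppet's last-run state
# file against a 10-hour threshold. Path and output format are assumptions
# modeled on the alerts in this log, not the real monitoring plugin.
puppet_freshness() {
    local state_file=$1
    local threshold=$((10 * 3600))   # 10 hours, in seconds
    local now mtime age
    now=$(date +%s)
    # A missing/unreadable state file counts as "never ran" (mtime 0).
    mtime=$(stat -c %Y "$state_file" 2>/dev/null || echo 0)
    age=$((now - mtime))
    if [ "$age" -gt "$threshold" ]; then
        echo "CRITICAL: Puppet has not run in the last 10 hours"
        return 2
    fi
    echo "OK: puppet ran ${age}s ago"
}

puppet_freshness "${1:-/var/lib/puppet/state/state.yaml}"
```

The stream of per-host CRITICALs above (virt1001-virt1004, ocg3, cp1040, ...) is just this check failing host by host as each one crosses the 10-hour mark.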
[13:18:56] yeah, all of the apaches seem to be behaving nicely [13:19:05] Change merged: Matthias Mullie; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/26611 [13:19:18] thank you very much for building kick-ass packages :) [13:33:52] !log removing srv193 from bits pool for test upgrade to precise [13:33:59] er [13:34:03] Logged the message, notpeter [13:34:11] !log removing srv191 from bits pool for test upgrade to precise [13:34:23] Logged the message, notpeter [13:34:28] errrrr [13:34:37] Why is 193 in the bits pool? :/ [13:34:43] Oh [13:36:16] even corrected SAL :) [13:36:30] Reedy: the real answer is "because I'm only half-done with my coffee" [13:50:17] New patchset: Pyoungmeister; "removing srv191 from bits backends" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26618 [13:51:29] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26618 [14:07:01] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26618 [15:23:00] DOMAS [15:23:05] whatsup! [15:23:12] do you know what is cooler than a million users?!!!?!? [15:23:15] whatsdown! [15:23:26] p.defau.lt wasn't working yesterday! [15:23:32] data about a million users? :-P [15:24:12] thats cool too, I hear wikimedia is doing impression logging for everyone [15:24:27] ONE BILLION USERS [15:24:56] on p.defau.lt? 
[15:25:05] wow, that thing scales [15:26:33] New patchset: Reedy; "(bug 40736) Lift account creation limit" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/26460 [15:26:41] oh, you mean data on a billion users [15:26:43] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/26460 [15:26:57] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/26601 [15:27:31] I wonder how many of them are sock puppets [15:29:39] mark: it seems to work now [15:29:59] I want my money back for yesterday [15:29:59] mark: must have been some lithuanian network snafu, I heard about some ddoses in that part of the world :) [15:30:02] haha [15:30:02] I needed it [15:30:09] !log reedy synchronized wmf-config/ [15:30:12] you could've used ultra efficient one from Tim! [15:30:21] Logged the message, Master [15:30:25] i'm not aware of that one [15:38:26] LeslieCarr: can you find it in your mail? [15:38:41] i know i've seen it pass by in the past few days, but... [15:38:47] mark: ? [15:38:58] missed the "it" remark from above [15:38:58] doh [15:38:59] you have it [15:39:03] the legal company info [15:39:06] you forwarded it to me [15:39:08] ok [15:39:16] okay, i can resend [15:39:22] no i have it [15:39:29] i was just thinking it came from tony or so [15:39:30] resent [15:39:53] ah :) [15:42:48] A recent extract from the Commercial Trade Register or equivalent document proving the registration of the Member with the national authorities. [15:42:50] we need that [15:44:55] hrm [15:45:02] 501c3 documentation ? [15:45:31] http://upload.wikimedia.org/wikipedia/foundation/a/aa/501%28c%29%283%29_Letter.png [15:46:49] lesliecarr: asw2-d3-sdtpa 0/38 to csw1-sdtpa:12/1 is down? ok to connect? [15:47:02] when is wikipedia moving to 5.6? 
[15:47:04] mysql [15:47:04] lemme double check cmjohnson :) [15:47:11] I hear it has all the features that we had in our 4.0 patch finally [15:47:14] well, to be honest, not all [15:47:17] :-D [15:47:28] some statistics are still missing [15:47:34] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [15:47:34] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [15:47:34] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [15:47:34] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [15:47:34] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [15:48:49] domas: I think Asher was doing some MariaDB testing.. [15:48:58] cmjohnson1: ready to connect [15:49:02] It's still no MongoDB!! [15:49:04] cool [15:49:15] leslecarr: done [15:49:22] lesliecarr: ^ [15:50:36] 5.6 seems to be a solid release [15:51:09] When did we completely get off mysql4? ;) [15:52:10] cmjohnson1: yay [15:52:46] hrm [15:52:49] i see it as still down [15:53:43] roll it ? [15:57:40] lesliecarr: chk now [15:58:06] cmjohnson1: yay [15:58:12] it's alive! [15:58:43] :-] [15:59:32] New patchset: Mark Bergsma; "std.integer uses a 32 bit signed int, not large enough" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26634 [16:00:01] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:00:32] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26634 [16:00:36] ah :) very interesting mark :) [16:01:31] grrrr [16:02:23] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26634 [16:03:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.952 seconds [16:16:14] !log reedy synchronized wmf-config/CommonSettings.php [16:16:24] Logged the message, Master [16:21:37] PROBLEM - Puppet freshness on cp1040 is CRITICAL: Puppet has not run in the last 10 hours [16:23:28] New patchset: Andrew Bogott; "A few more security fixes for labs mediawiki:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26637 [16:24:29] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26637 [16:25:28] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26637 [16:29:34] PROBLEM - Host ms-be8 is DOWN: PING CRITICAL - Packet loss = 100% [16:29:45] paravoid: why doesn't that surprise me [16:29:58] why do ops people write so many things in ruby? [16:30:07] I think it's developers masquerading as ops [16:30:39] it's "devops" silly [16:31:05] <^demon> I'm a developer masquerading as ops. [16:31:08] <^demon> But I hate ruby ;-) [16:32:16] Q: Why is this Ruby instead of PHP now? [16:32:17] A: Closer integration with logstash, Ruby is shiny. Its mostly javascript anyway. If you want it in something else, it shouldn't be too hard to port. [16:32:27] the answer kinda makes sense [16:32:35] I'm sold [16:32:43] it's an interface to logstash and logstash is in Ruby [16:33:00] <^demon> That's plausible.
"Ruby is shiny" needs factchecking though [16:33:09] heh yes [16:34:11] the answer makes sense [16:34:16] that doesn't make me any happier ;) [16:36:55] I'm secretly hoping it'll get merged into logstash and be shipped together [16:37:04] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:37:23] I mean, the talk at puppetconf was presented by logstash's author I think and he was showing kibana screenshots [16:37:31] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [16:40:21] New patchset: Jgreen; "adding ,BannerControl to bannerImpressionudp2log filter so we'll catch 404's" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26639 [16:41:18] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26639 [16:41:31] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26639 [16:46:47] paravoid Ryan_Lane as I understand it, Ruby is popular for frameworks like puppet and rails because of native support for metaprogramming that is much more difficult or impossible in other languages. you can change the language itself to create some domain specific Ruby dialect to e.g. manage systems (puppet) or easily create web applications (rails). [16:48:00] chrismcmahon: yeah. 
it's doable in python, though too [16:48:21] paravoid: I also hope they get merged [16:50:25] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:50:52] PROBLEM - Varnish traffic logger on cp1026 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:51:10] PROBLEM - Varnish traffic logger on cp1036 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:51:10] PROBLEM - Varnish traffic logger on cp1030 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:51:10] PROBLEM - Varnish traffic logger on cp1028 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:51:28] PROBLEM - Varnish traffic logger on cp1022 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:51:28] PROBLEM - Varnish traffic logger on cp1021 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:51:28] PROBLEM - Varnish traffic logger on cp1029 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:51:37] PROBLEM - Varnish traffic logger on cp1033 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:52:40] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.884 seconds [16:53:34] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 3 processes with command name varnishncsa [16:54:01] RECOVERY - Varnish traffic logger on cp1026 is OK: PROCS OK: 3 processes with command name varnishncsa [16:54:19] RECOVERY - Varnish traffic logger on cp1036 is OK: PROCS OK: 3 processes with command name varnishncsa [16:54:19] RECOVERY - Varnish traffic logger on cp1030 is OK: PROCS OK: 3 processes with command name varnishncsa [16:54:19] RECOVERY - Varnish traffic logger on cp1028 is OK: PROCS OK: 3 processes with command name varnishncsa [16:54:37] RECOVERY - Varnish traffic logger on cp1022 is OK: PROCS OK: 3 processes with command 
name varnishncsa [16:54:37] RECOVERY - Varnish traffic logger on cp1021 is OK: PROCS OK: 3 processes with command name varnishncsa [16:54:37] RECOVERY - Varnish traffic logger on cp1029 is OK: PROCS OK: 3 processes with command name varnishncsa [16:55:06] RECOVERY - Varnish traffic logger on cp1033 is OK: PROCS OK: 3 processes with command name varnishncsa [16:57:21] New patchset: Matthias Mullie; "Make abusefilter emergency disable more sensible" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/25855 [17:03:29] apergos: AaronSchulz: can't we just move to Swift-only during the maintenance window, then pick up the deltas after we're done? [17:03:40] notpeter: ping [17:03:49] notpeter: can you please merge https://gerrit.wikimedia.org/r/26640 [17:04:27] leaving us with swift-only afterwards? [17:04:52] apergos: how long is the window exactly? [17:05:07] 3 hours, to give ourselves lots and lots of time (see the email) [17:05:09] sure [17:05:36] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26640 [17:05:48] * AaronSchulz was looking at the wiki paeg [17:05:49] *page [17:05:50] so I guess captcha isn't ready for ms7 to be gone yet; I'm still not sure of the status of itmeline and math [17:06:17] preilly: pushing live now [17:06:22] I'm still seeing some requests for wikipedia/xx/math/blah.png from ms7 (prolly cached pages) [17:07:11] notpeter: thanks! [17:08:18] although cp1044 is down.... [17:12:05] New patchset: Adminxor; "This will make every lab machine throw their logs to the logstash server for testing Please note that rsyslog will continue to log messages locally new file: files/rsyslog/z-logstash.conf modified: manifests/base.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26642 [17:13:06] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26642 [17:13:35] Change abandoned: Adminxor; "Abandoning this as there is an updated one: https://gerrit.wikimedia.org/r/26642" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26407 [17:19:06] PROBLEM - Varnish traffic logger on cp1033 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:19:33] PROBLEM - Varnish traffic logger on cp1027 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:19:42] PROBLEM - Varnish traffic logger on cp1043 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:19:42] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:19:51] PROBLEM - Varnish traffic logger on cp1031 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:19:51] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:19:51] PROBLEM - Varnish traffic logger on cp1029 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:19:51] PROBLEM - Varnish traffic logger on cp1021 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:19:52] apergos: I like how there is an nfs mount called "originals" that has math/timeline files too ;) [17:20:00] PROBLEM - Varnish traffic logger on cp1035 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:20:09] PROBLEM - Varnish traffic logger on cp1025 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:20:18] PROBLEM - Varnish traffic logger on cp1026 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:20:27] PROBLEM - Varnish traffic logger on cp1022 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:20:27] PROBLEM - Varnish traffic logger on cp1024 is CRITICAL: PROCS CRITICAL: 2 processes with command name 
varnishncsa [17:20:27] PROBLEM - Varnish traffic logger on cp1028 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:20:27] PROBLEM - Varnish traffic logger on cp1034 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:20:36] PROBLEM - Varnish traffic logger on cp1032 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:20:36] PROBLEM - Varnish traffic logger on cp1030 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:20:36] PROBLEM - Varnish traffic logger on cp1036 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:20:36] PROBLEM - Varnish traffic logger on cp1042 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:23:19] we're attempting to make as many copies of them as possible :-P [17:25:49] apergos - regarding the ms7 switch tomorrow [17:26:00] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:26:01] uh huh [17:26:12] u mentioned upload will be disabled ... how long? [17:26:46] if things go smoothly, it shouldn't be that long [17:26:56] need to communicate a window [17:27:07] We'll start this on Friday Oct 5 at 11am UTC, to conclude at [17:27:08] 2pm UTC or earlier. [17:27:13] the email that went out gave a time frame of 3 hours [17:27:16] 3 hours is long [17:27:22] apergos: can't we just use syncFileBackend? [17:27:31] we expect that to be much longer than we need, but better to be too long than too short [17:27:53] syncFileBackend to do what? [17:28:14] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23681 [17:28:16] this is just about the mount point, and config changes, we're not going to be moving over any more data during that time [17:29:55] seems like we've got (at least) three options: [17:29:59] what is the expected downtime? [17:30:02] 1/2 hour?
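The "PROCS CRITICAL: 2 processes" / "PROCS OK: 3 processes" lines above come from a process-count check on the varnishncsa logger. A minimal sketch of that kind of check, using pgrep as an assumed stand-in for the real Nagios check_procs plugin (the 3-process expectation is read off the recovery messages in this log):

```shell
#!/bin/bash
# Hypothetical process-count check: CRITICAL when fewer than the expected
# number of processes with the given command name are running. Modeled on
# the alert text in this log; not the actual check_procs implementation.
check_proc_count() {
    local name=$1 want=$2
    local count
    # pgrep -c prints the match count; it exits nonzero when there are no
    # matches, so swallow that and keep the printed "0".
    count=$(pgrep -c -x "$name" 2>/dev/null || true)
    count=${count:-0}
    if [ "$count" -lt "$want" ]; then
        echo "PROCS CRITICAL: $count processes with command name $name"
        return 2
    fi
    echo "PROCS OK: $count processes with command name $name"
}

check_proc_count "${1:-varnishncsa}" "${2:-3}"
```

The flapping in the log (CRITICAL at 2, RECOVERY at 3) suggests one of the expected varnishncsa instances was briefly dying and being restarted on each affected cp10xx host.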
[17:30:43] option a) go read-only, sync NetApp, switch backend, go read-write [17:31:21] option b) go Swift-only, sync NetApp, go MultiWrite to NetApp, clean up [17:32:00] option c) hybrid approach. go with option A, then if that's taking longer than anticipated, switch to the option B strategy [17:32:31] New patchset: DamianZaremba; "Update the SNMP traps." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25758 [17:32:45] see I don't think there is a need for 'sync NetApp' as a separate step in there ... is there? [17:32:50] in option a [17:33:03] oh? [17:33:31] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25758 [17:33:49] seems like it should happen [17:33:50] metrics meeting starting... I'll be tuning out here [17:33:50] except for things that aren't served by swift, we could do one last run of those, but they aren't much [17:35:02] extdist captcha but those have already gone over once, we'd just be catching anything that's changed [17:36:43] captcha files don't change [17:36:53] so we wouldn't need it for those [17:36:56] (not unless a sysadmin changes them) [17:37:38] maybe we can switch captcha first, and then math/timeline [17:37:39] we'll be a little off sync with some images and math but since swift will have those we can deal with it after writes are back on... can't we? [17:37:45] it feels simpler to do less at once [17:38:33] these are just config var changes, right?
[17:38:36] apergos: math/timeline files are based on content hashes and thus largely immutable (just new files being added) [17:38:48] I can do a copyFileBackend (diff of two listings) for those [17:39:04] that would sync them up, I already had to do that before due to some mistake [17:39:10] didn't take too long [17:39:46] apergos: for captcha it's just a var change [17:39:55] right [17:39:57] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.991 seconds [17:40:08] for math/timeline, basically a var change plus a copy script run to catch any file changes [17:40:51] RECOVERY - Varnish traffic logger on cp1036 is OK: PROCS OK: 3 processes with command name varnishncsa [17:40:51] RECOVERY - Varnish traffic logger on cp1030 is OK: PROCS OK: 3 processes with command name varnishncsa [17:40:51] RECOVERY - Varnish traffic logger on cp1032 is OK: PROCS OK: 3 processes with command name varnishncsa [17:40:52] and the copy script should run right afterwards? or can it wait a few hours (say for normal working hours in SF)?
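The "diff of two listings" idea behind the copy script works well for content-hash-named files like math/timeline, since existing names never change and only new names appear. A generic shell sketch of the approach (an illustration only, not MediaWiki's actual copyFileBackend.php): list both stores, sort, and emit whatever the destination is missing.

```shell
#!/bin/bash
# Given one file listing per store (one path per line), print the names
# present in the source listing but absent from the destination listing,
# i.e. the files a catch-up copy pass still needs to move. Works because
# math/timeline names are content hashes: no renames or in-place edits.
missing_files() {
    local src_list=$1 dst_list=$2
    # comm -23: lines only in the first (sorted) input.
    comm -23 <(sort "$src_list") <(sort "$dst_list")
}
```

Feeding the output to a copy loop then syncs the stores without touching the files both sides already have, which is why the run "didn't take too long".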
[17:41:27] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [17:41:27] RECOVERY - Varnish traffic logger on cp1043 is OK: PROCS OK: 3 processes with command name varnishncsa [17:41:36] RECOVERY - Varnish traffic logger on cp1031 is OK: PROCS OK: 3 processes with command name varnishncsa [17:41:36] RECOVERY - Varnish traffic logger on cp1029 is OK: PROCS OK: 3 processes with command name varnishncsa [17:41:37] reads come from swift, so users won't notice if the copy script takes some time [17:41:45] RECOVERY - Varnish traffic logger on cp1021 is OK: PROCS OK: 3 processes with command name varnishncsa [17:41:54] RECOVERY - Varnish traffic logger on cp1035 is OK: PROCS OK: 3 processes with command name varnishncsa [17:41:54] RECOVERY - Varnish traffic logger on cp1025 is OK: PROCS OK: 3 processes with command name varnishncsa [17:41:54] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 3 processes with command name varnishncsa [17:42:03] RECOVERY - Varnish traffic logger on cp1024 is OK: PROCS OK: 3 processes with command name varnishncsa [17:42:05] heh, actually, the only reason timeline writes to both is a wmf branch hack (to counter the "noPushQuickOps" hack in config)... [17:42:21] RECOVERY - Varnish traffic logger on cp1042 is OK: PROCS OK: 3 processes with command name varnishncsa [17:42:21] RECOVERY - Varnish traffic logger on cp1033 is OK: PROCS OK: 3 processes with command name varnishncsa [17:42:26] * AaronSchulz should probably deal with that sometime... [17:42:30] RECOVERY - Varnish traffic logger on cp1034 is OK: PROCS OK: 3 processes with command name varnishncsa [17:42:48] RECOVERY - Varnish traffic logger on cp1027 is OK: PROCS OK: 3 processes with command name varnishncsa [17:43:43] :-) [17:43:48] New patchset: DamianZaremba; "Update the SNMP traps." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25758 [17:44:50] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25758 [17:45:02] maybe I could have thumbs write to /dev/null and disable noPushQuickOps ;) [17:45:07] hahaha [17:45:08] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25758 [17:45:10] not tomorrow! [17:45:26] sooooo [17:45:46] looking at this list [17:45:49] http://wikitech.wikimedia.org/view/Media_server#Transitioning_from_one_nfs_mount_to_another_.28ie._ms7_to_netapp.29.2C_Oct_2012 [17:45:54] if I'm not missing some things: [17:46:49] it looks like we still need to leave ms7 up for a while yet for requests for things in old cached pages that still go there, those should die off eventually [17:47:54] there's a change in filebackend.php and in extdist/svn-invoke.conf, (I don't know if that second file is used), besides the config changes [17:48:39] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [17:49:08] one change to a puppet file I think, which affects fenari, for extdist, so we don't have to wait around for that, we can do a puppet run immediately [17:49:22] so it seems to me we might as well do the whole set of changes at once [17:49:32] instead of eking out config changes one at a time [17:51:36] how hard is it to do captcha? isn't that easy? [17:52:14] yes I think so, that's why we might as well do them together [17:52:34] * AaronSchulz doesn't understand that [17:52:49] you wanted to do the switchover in separate steps I thought [17:52:57] that seems like more work to me [17:53:43] here is where I wish my virtual self could walk over to your desk [17:56:36] RECOVERY - Varnish traffic logger on cp1028 is OK: PROCS OK: 3 processes with command name varnishncsa [17:56:36] RECOVERY - Varnish traffic logger on cp1022 is OK: PROCS OK: 3 processes with command name varnishncsa [18:03:57] hi guys. how do i get on operations-l? [18:04:05] i realize it's not listed.
[18:08:48] !log catrope synchronized php-1.21wmf1/resources/mediawiki/mediawiki.feedback.js 'Ib8746ece3a34b5e577dc08596b0ff7b1b96b5d73' [18:08:59] Logged the message, Master [18:09:29] !log catrope synchronized php-1.21wmf1/languages/messages/MessagesEn.php 'Ib8746ece3a34b5e577dc08596b0ff7b1b96b5d73' [18:09:40] Logged the message, Master [18:12:06] ty [18:14:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:28:06] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.877 seconds [18:29:00] RECOVERY - Host analytics1007 is UP: PING OK - Packet loss = 0%, RTA = 26.55 ms [18:30:12] RECOVERY - Varnish traffic logger on cp1026 is OK: PROCS OK: 3 processes with command name varnishncsa [18:38:37] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [19:02:46] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:04:52] PROBLEM - Host analytics1007 is DOWN: PING CRITICAL - Packet loss = 100% [19:07:25] RECOVERY - Host srv266 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [19:10:43] PROBLEM - Memcached on srv266 is CRITICAL: Connection refused [19:18:17] New review: CSteipp; "I think it looks good. Have there been instances of rules being automatically disabled, that should ..." [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/25855 [19:18:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.035 seconds [19:25:01] New review: Matthias Mullie; "The filters for ArticleFeedbackv5 have been known to auto-disable the past couple of months - only r..." 
[operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/25855 [19:30:13] PROBLEM - NTP on srv266 is CRITICAL: NTP CRITICAL: No response from NTP server [19:30:18] New review: CSteipp; "Makes sense to me" [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/25855 [19:42:04] mutante: https://toolserver.org/~legoktm/wlm/stats.php You're no 1! [19:43:16] Reedy: woohoo. i have just been told. did not expect that:) [19:43:38] i heard just 3% of people who downloaded the app actually uploaded images using it?! that seems low [19:47:19] AaronSchulz: (when you have a minute) anything special I need to know about making the config changes or the filebackend.php changes tomorrow? I was just gonna follow the standard procedure on wikitech i.e. http://wikitech.wikimedia.org/view/How_to_do_a_configuration_change#Change_wiki_configuration [19:47:45] apergos: so you are going with the "read-only" plan? [19:48:42] um [19:48:48] I guess so [19:49:52] Reedy will be around but I hope we won't have to ask him for anything [19:51:04] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:52:00] AaronSchulz [19:52:34] RECOVERY - Memcached on srv266 is OK: TCP OK - 0.001 second response time on port 11000 [19:53:08] apergos: anyway, there is nothing special about deployments to that file [19:53:48] ok, thanks [19:56:04] apergos: still, we could dump the filejournal positions, rsync again, switch mounts, and run the php sync script [19:56:20] though the read only thing could be quick if nothing blows up [19:56:24] rsync which?
[19:56:37] PROBLEM - Apache HTTP on srv266 is CRITICAL: Connection refused [19:56:37] rsync netapp to ms7 [19:56:45] *against ms7 [19:56:47] cause running an rsync across all of ms7 takes a long time (many hours) even when it's caught up [20:00:40] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [20:00:56] that's not necessarily a problem [20:00:58] we still have swift [20:01:08] it's just as backup, right [20:01:30] well what I mean is I wouldn't want to rsync (again) then switch mounts [20:01:43] you can rsync after switching mounts [20:02:52] yes, I was expecting we would do that [20:03:00] there is the php sync script also [20:05:28] apergos: hmm, yeah, just doing rsync after should work (and you don't need the php script) [20:05:50] well [20:05:56] though... [20:06:01] we'll rsync over stuff that [20:06:23] either way, we should minimize disabling uploads if we don't need it [20:06:25] the "don't touch this if it changed" flag of rsync would have to be on [20:06:26] might get deleted or updated a tiny bit of it by users during the hours it takes [20:06:40] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.036 seconds [20:07:29] yes, if we can get by without disabling, so much the better [20:12:55] RECOVERY - Apache HTTP on srv266 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.044 second response time [20:18:24] !log olivneh synchronized php-1.20wmf12/extensions/E3Experiments [20:18:30] apergos: so does that sound ok then?
[20:18:35] Logged the message, Master [20:19:17] tell me one more time what we'd be doing AaronSchulz [20:21:44] heh, so we'd 1) switch MW to use the netapp mount 2) run rsync --delete (ms7->netapp) again with the flags set so it won't override stuff with a higher timestamp than the source [20:23:09] * apergos checks something [20:24:44] the key is that MW can tolerate netapp being out of date since a) reads come from swift b) write ops first automatically resync if needed before they happen [20:25:08] yep, on the MW side it will be ok [20:25:15] the "resyncs" will show up in the backend logs [20:25:25] just thinking carefully about the rsync [20:25:32] for sure :) [20:25:40] the flags have to be right [20:25:45] oh yeah they do [20:26:23] -u, --update skip files that are newer on the receiver [20:27:03] * AaronSchulz goes off for a while [20:27:21] hmmm [20:28:29] user uploads copyrighted file tonight, ms7 has it, it didn't make it into the rsync. tomorrow it's deleted after the mount switch. we start the rsync.... [20:28:56] "if the source has a directory where the destination has a file, the transfer would occur regardless of the timestamps" (update documentation) [20:29:45] in our case the target will have nothing, the source will have a file [20:30:22] I guess it would get copied over, which we don't want [20:30:35] AaronSchulz: when you get back what do you think about that scenario ^^ [20:36:26] hey guys, there is an external contractor working with dario who wants me to install MongoDB on stat1 for his project [20:36:44] I don't mind doing this at all, but I was wondering if there would be objections from ops when I ask for a review [20:37:01] do you have an opinion on this, mark? [20:37:43] not really, if you support it it should be fine [20:38:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:39:56] ookey dokey, cool, thanks!
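The -u/--update semantics quoted above (and the --delete behavior) can be sanity-checked locally before touching ms7 or the netapp; a minimal sketch with throwaway directories standing in for the two mounts (assumes GNU rsync and GNU touch are available — the exact flag set used in production is not shown in the log):

```shell
# Two temp dirs stand in for the source (ms7) and receiver (netapp) mounts.
src=$(mktemp -d); dst=$(mktemp -d)
echo source-version > "$src/a"
echo receiver-version > "$dst/a"
touch -d '2030-01-01' "$dst/a"        # receiver's copy is newer than source's
echo stale > "$dst/only-on-receiver"  # absent from the source
rsync -au --delete "$src/" "$dst/"
cat "$dst/a"    # prints "receiver-version": -u skipped the newer file
ls "$dst"       # only-on-receiver is gone: --delete removed it
```

Note that -u does nothing for the deletion scenario raised at 20:28: a file still present on the source but already deleted on the receiver simply gets copied back, which is why the php sync script still has to run afterwards.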
[20:41:23] !log olivneh synchronized php-1.20wmf12/extensions/ClickTracking/modules/jquery.clickTracking.js [20:41:33] Logged the message, Master [20:46:10] Eloquence: are the videos loading for you now? [20:46:52] lemme give it a try [20:51:33] trying https://commons.wikimedia.org/wiki/File:Showcase_-_San_Francisco_Wikipedia_Hackathon_2012.ogv , not playing [20:51:37] let me try HTTP [20:52:08] no dice [20:52:40] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.174 seconds [20:53:26] mhhh, actually, it does play, just not after you press the OggHandler play button [20:53:35] after pressing the browser's native HTML5 player button, it starts [20:53:47] this is not normal behavior, but it looks like it's at least getting a stream [20:56:11] apergos: file removals are trickier, since there is no tombstone to mark them, only the php sync script would catch those [20:56:35] ok [20:57:00] so I think we will have to follow the rsync with your sync script (well you will get to follow it) [20:57:18] hmm [20:57:57] mark - on the other hand, i am having problems with some videos that were running fine earlier .. [20:58:13] hmm, even that behavior is not consistent. of course with ogg theora video it's always hard to tell where the problems are coming from :p [20:58:26] nothing is consistent about this problem [21:00:31] i wrote this little script to debug the problem: http://dpaste.org/aCYGD/ and so far it's giving me good results [21:00:36] wait, seems to be a browser thing for me .. [21:00:44] well, I am no longer seeing the long load delays + 503s that I was seeing before, so that's great. [21:00:48] woosters: if you have a straight video url which is not working for you, please let me know [21:02:02] chrome and firefox behave differently ..
so let me check my plugins [21:02:06] there are also some issues with missing symlinks for old deployment slots causing the player not to load correctly on some cached pages (this is a recurring issue, we really need git-deploy or a similar system so we don't constantly change the pointers to static resources) [21:03:08] in chrome, I cannot play https://upload.wikimedia.org/wikipedia/commons/c/c2/Customizing_Wikipedia_with_Javascript_-_User_Scripts_and_Gadgets_-_San_Francisco_Wikipedia_Hackathon_2012.ogv at all [21:03:31] hm, is loading instantly for me in ff [21:03:34] but it loads fine in FF, so that may be a browser-level / encoding issue [21:04:02] bleh, ^W [21:05:44] most of these videos are now cached too of course, so not all of it is improved due to streaming [21:06:00] heh [21:06:09] I think they consume a good portion of the caches ;) [21:06:21] it's a good stress test of the other new varnish code too, the persistence layer [21:06:34] varnish instantly has to come up with 1+ GB of space for these videos to stream [21:08:38] eloquence - yep, experiencing same behavior on chrome like u [21:09:21] mark, so - FF experience seems to work pretty reliably. ogghandler seems a bit flaky, but using the native player (direct access to file) seems to always work. other issues are likely browser/codec issues. [21:09:38] strange that it would matter vs squid [21:10:08] yeah, although - it may be that chrome had the same issues and we just didn't notice. [21:10:22] I hope so [21:10:26] don't want to rollback again ;) [21:11:15] it would be nice if we had a straightforward way to test use of squid to compare transport behavior / is that at all feasible? [21:11:36] changing /etc/hosts is easiest atm [21:11:56] hmm [21:12:10] we could setup an extra hostname for testing this I guess, will need config changes though [21:12:29] locally changing /etc/hosts would work? if so, can you give me the relevant settings? 
[21:12:56] yes, update /etc/hosts to the ip addresses of either upload-lb.pmtpa.wikimedia.org or upload-lb.eqiad.wikimedia.org [21:13:14] i have right now: [21:13:14] 2620:0:862:ed1a::b upload.wikimedia.org [21:13:15] 208.80.154.235 upload.wikimedia.org [21:13:21] (so eqiad) [21:13:41] tampa v4 ip is 208.80.152.211 [21:15:32] also european users are still on squid in esams, so if we notice a difference between european users vs the rest of the world... [21:23:07] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [21:23:07] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [21:26:22] mutante: hey, I just made a change to mw-config, can you walk me through the pushing process ? [21:26:35] i forgot about UTC with the account limit and people are like "i can't make an account!" [21:26:52] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:26:53] or anyone else who is good with that [21:27:17] apergos: when did the first netapp rsync start? [21:27:36] LeslieCarr: pushing to where? [21:27:43] apaches [21:27:49] LeslieCarr: sure, go to fenari [21:28:02] done, it's already updated in /home/w/common/wmf-config [21:28:04] sync-file wmf-config/FooBarFile.php 'I changed something' [21:28:07] several days ago [21:28:20] kewlio [21:28:43] New review: Ryan Lane; "Inline comments." [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/26642 [21:28:50] AaronSchulz [21:28:58] !log lcarr synchronized wmf-config/throttle.php 'I updated a throttle' [21:29:02] yay [21:29:03] thanks :) [21:29:09] Logged the message, Master [21:29:22] no training wheels! [21:29:37] :) [21:29:56] then you can git commit, set a commit summary, then do git push origin [21:30:20] mark, the notable difference between squid/varnish is that an OGV played with FF does not show the full video length and does not permit seeks if played via varnish. [21:30:35] hm.
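For reference, the override being described is just an /etc/hosts fragment (addresses copied from the log; comment one pair in or out to pin the hostname at eqiad/varnish or Tampa/squid):

```
# Pin upload.wikimedia.org to eqiad (varnish):
2620:0:862:ed1a::b  upload.wikimedia.org
208.80.154.235      upload.wikimedia.org
# ...or to pmtpa/Tampa (squid) instead:
# 208.80.152.211    upload.wikimedia.org
```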
[21:30:39] Reedy: really? i just merge in gerrit, then git pull on fenari, and sync and that's all [21:30:50] yeah [21:30:54] you can do it either way round [21:30:56] ok thanks, i'll debug that [21:31:09] Reedy: alright [21:31:22] weird, it's not working [21:31:25] anyone have an idea why ? [21:31:25] seeks probably means "range requests" [21:31:33] I don't think that works in varnish [21:31:36] still throttling account creation [21:31:41] apergos: is there a SAL entry? [21:31:46] we really need to get a special hostname for videos... [21:31:55] um [21:32:43] yeah. this is a pretty big deal as it applies to all videos [21:32:46] I see one on sept 29 [21:32:47] * AaronSchulz checks the emails [21:32:54] 10:39 mark: Started rsync of /export/upload (respecting /etc/rsync.includes) to /mnt (nas1-a:vol/originals/) on ms7 in a screen session [21:32:58] so no seeking makes for a much worse player experience [21:33:05] range requests probably work for _cached_ videos [21:33:19] and therefore also for videos that are not streamed (< 64 MB) [21:33:21] but yes, this sucks [21:33:39] i was gonna be at VUG this weekend [21:33:50] i'm now sad I canceled it due to being busy :) [21:34:01] i'll instruct faidon though [21:34:13] VUG = varnish user/developer group meeting [21:34:32] trying with an 18MB video at http://commons.wikimedia.org/wiki/File:Jimmy_Wales_answers_the_question_about_internet_censorship_in_China.ogv I get the same issue (no video length / no seeking) [21:35:22] * AaronSchulz does 14 days for good measure [21:35:48] (works fine with squid) [21:35:50] alright [21:35:55] apergos: I'm dumping file log positions to my home dir now :) [21:36:00] i'll check this tomorrow [21:36:04] cool, thanks. [21:36:06] if it's not gonna be a quick fix i'll do a rollback then [21:36:09] I'll close the old bug and open a new one [21:36:13] thanks [21:36:16] ah for the script?
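Pulling together the steps mutante and Reedy just described, the two equivalent orderings of a wmf-config push look roughly like this (a sketch only — sync-file and the paths are fenari-specific, not runnable elsewhere):

```
# On fenari, after editing /home/w/common/wmf-config/throttle.php:
sync-file wmf-config/throttle.php 'I updated a throttle'  # push to apaches + !log
git commit -a     # set a commit summary
git push origin   # then merge in gerrit
# ...or the other way round: merge in gerrit first,
# then git pull on fenari and sync-file as above.
```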
awesome [21:39:45] hrm [21:40:10] perhaps it's because videos are nearly always fetched from the backend varnish by the frontends [21:40:53] !log olivneh synchronized php-1.20wmf12/extensions/E3Experiments [21:41:04] Logged the message, Master [21:41:33] mark, some old discussions about range requests + streaming here : http://www.gossamer-threads.com/lists/varnish/dev/21653 / http://www.mail-archive.com/varnish-dev@varnish-cache.org/msg00737.html [21:41:42] not sure the latter patch(es) landed in the streaming branch ultimately [21:42:09] yes [21:42:16] i'll check it and discuss it with the devs [21:42:37] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.034 seconds [21:42:39] :) [21:43:53] if varnish supports range requests now at all, it will only be for ranges that have already come in while fetching the entire file [21:44:16] i've read the code, there's nothing in there that supports fetching partial objects [21:44:36] so that means that in some cases, clients need to wait for a range while the entire object is being fetched [21:44:38] not ideal at all [21:46:17] anyway... i'm going to bed [21:46:22] g'night mark :) [21:46:25] good night! [21:46:51] night [21:52:10] New patchset: Adminxor; "Did the clean up modified: manifests/base.pp deleted: files/rsyslog/z-logstash.conf" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26719 [21:53:06] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/26719 [22:06:44] !log powercycling downed cp1044 [22:06:55] Logged the message, Master [22:09:50] RECOVERY - Host cp1044 is UP: PING OK - Packet loss = 0%, RTA = 26.51 ms [22:11:27] New patchset: DamianZaremba; "First hash at starting to puppetize bots in labs." 
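A quick way to see what "range requests" means here is curl's -r flag, which sends a Range header: a server that supports it answers 206 Partial Content with just the requested bytes, while one that doesn't sends the whole object (which is why seeking stalls behind a full fetch). The flag's byte semantics can be checked offline against a file:// URL:

```shell
# Fetch only bytes 0-3 (inclusive) of a local file with a range request.
printf 'abcdefgh' > /tmp/range-demo.bin
curl -s -r 0-3 "file:///tmp/range-demo.bin"
# prints "abcd"; the equivalent HTTP request is what the player issues on
# seek — against a range-capable server the headers would show
# "206 Partial Content" plus a Content-Range line.
```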
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/26441 [22:12:32] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26441 [22:13:08] PROBLEM - Varnish HTTP mobile-frontend on cp1044 is CRITICAL: Connection refused [22:13:44] PROBLEM - Varnish traffic logger on cp1044 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa [22:16:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:25:21] AaronSchulz: feel free to change the list to reflect the stuff in yer email [22:25:25] I gotta get some sleep [22:25:30] woosters: AaronSchulz, apergos and others. here's what I'm about to put on Commons Village Pump: [22:25:31] thanks for looking at it [22:25:34] "Hi everyone, we have some emergency maintenance that will require a brief period where we stop uploads. We plan to do this starting 18:00 UTC, hopefully brief, but potentially lasting up to 3 hours (until 21:00 UTC) tomorrow. Our backup NFS server is nearly full, and we're having hardware issues with our new Swift servers, so rather than run without a real-time backup, we plan to use a newer NFS server (nas1) which has more capacity. Sorry f [22:25:37] you have to sleep sometime :) [22:25:50] it's true, I do! [22:25:58] sleep? what is this [22:26:01] "Sorry f" [22:26:06] (it was truncated) [22:26:10] orry for the inconvenience, and thank you for your patience!" [22:26:26] 18:00? [22:26:28] uh [22:26:29] what if the person is impatient? [22:26:32] 11:00 utc [22:26:37] oh? [22:26:39] what are we thanking them for in that case?
[22:26:39] to 2pm utc [22:27:00] yeah mark and I and reedy [22:27:23] apergos: ah, good thing I checked [22:27:28] yep :-) [22:27:42] * apergos wonders where the 18:00 utc came from [22:27:56] faulty timezone conversion on my part [22:28:00] robla - looks good [22:28:01] heh [22:28:28] it's unclear we will actually have to stop uploads [22:28:30] but we may [22:28:31] this is why it's good for you all to put stuff here instead of leaving it to me: :) http://wikitech.wikimedia.org/view/Deployments [22:29:17] yeah I looked at that but it's software deployments, I didn't know what we do about actual maintenance [22:29:29] that's why I renamed it today [22:29:34] hahaha [22:29:38] sneaky! [22:29:56] we've been putting hardware stuff on there for a while [22:30:03] I didn't know that [22:30:24] New patchset: Adminxor; "Did the clean up modified: manifests/base.pp deleted: files/rsyslog/z-logstash.conf" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26719 [22:31:10] ok do you guys need anything more from me? cause otherwise I really am gone to bed [22:31:17] all done, thanks! [22:31:25] thank you!
night [22:32:07] AaronSchulz: we're thanking them for their patience because we're going to pretend they're patient, even if they're not :) [22:32:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.031 seconds [22:33:14] RECOVERY - Varnish HTTP mobile-frontend on cp1044 is OK: HTTP OK HTTP/1.1 200 OK - 643 bytes in 0.054 seconds [22:33:59] RECOVERY - Varnish traffic logger on cp1044 is OK: PROCS OK: 3 processes with command name varnishncsa [22:43:41] Change abandoned: Adminxor; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26642 [23:05:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:18:06] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.089 seconds [23:20:10] New patchset: Ryan Lane; "Sending logs to logstash" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26719 [23:21:05] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/26719 [23:21:34] New patchset: Ryan Lane; "Sending logs to logstash" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26719 [23:22:29] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/26719 [23:22:38] New patchset: Ryan Lane; "Sending logs to logstash" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26719 [23:23:30] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." 
[operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/26719 [23:26:22] New patchset: Ryan Lane; "Sending logs to logstash" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26719 [23:27:16] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26719 [23:27:23] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26719 [23:36:45] New patchset: Dzahn; "re-enable shell for Brion :)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26725 [23:37:44] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26725 [23:40:40] New patchset: Dzahn; "re-enable shell for Brion :)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26725 [23:41:36] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26725 [23:41:56] New patchset: Dzahn; "re-enable shell for Brion :)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26725 [23:42:47] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26725 [23:53:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds