[00:14:44] !log reedy synchronized php-1.21wmf1/extensions/Collection/Collection.body.php [00:14:55] Logged the message, Master [00:22:42] tfinc: The mobile queue is starting to back up a little (9). I think you've been monitoring it though [00:23:01] RD: which mobile queue are you referring to ? [00:23:07] OTRS [00:25:56] RD: i can take a look at it tomorrow. i let my email dictate whats in the OTRS queue. 50% of the time its SPAM [00:25:58] more typically [00:26:21] Yeah. Can we do something about that? ;-) [00:26:26] I don't think the spam filter is working [00:26:59] Doesn't seem to have been for a while. The filters we set are, but not the regular filters. Lots of shit is going to the queues. (but the 9 tickets in there now are legitimate) [00:27:56] RD: i have no clue. i know nothing about OTRS past that its a pain to use [00:28:04] Yes. [00:28:07] We should upgrade! [00:28:25] More spam-handling features. Amongst other things, but I've been whining at others about that for a while [00:28:43] RD: you know better than I, then. go for it. i have no idea what it entails to upgrade OTRS to a new version [00:28:47] Nope [00:28:54] I don't know anything about it [00:29:37] New patchset: MaxSem; "Solr replication" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26571 [00:29:38] I'm an otrs admin, and know the current version sucks and the new one has many improvements. There is progress being made on the upgrade tho - an RT ticket is somewhere I guess and moving slowly. [00:30:38] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26571 [00:30:52] aye, last time I heard (not long ago) we do finally have the OTRS creator on board to help with non-disclosure signed etc [00:30:58] slow but they're trying [00:31:21] New review: MaxSem; "Not ready yet."
[operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/26571 [00:56:24] New patchset: Dzahn; "planet - adding Arabic, French and Chinese translations." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26573 [00:57:24] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26573 [00:58:52] New review: Dzahn; "https://meta.wikimedia.org/wiki/Planet_Wikimedia/Names thanks Nemo_bis" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/26573 [00:58:53] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26573 [01:03:08] New patchset: Dzahn; "planet - and fix Spanish translation too" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26574 [01:04:06] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26574 [01:04:39] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26574 [01:21:19] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [01:21:19] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [01:42:11] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 236 seconds [01:42:29] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 253 seconds [01:46:50] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 29 seconds [01:51:47] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 11 seconds [02:27:47] PROBLEM - Apache HTTP on srv222 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:28:12] !log LocalisationUpdate completed (1.21wmf1) at Thu Oct 4 02:28:12 UTC 2012 [02:28:14] PROBLEM - Apache HTTP on srv224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:28:26] Logged the message, Master [02:28:32] PROBLEM - Apache HTTP on srv223 is 
CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:28:59] PROBLEM - LVS HTTP IPv4 on rendering.svc.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:29:35] PROBLEM - Apache HTTP on srv220 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:29:53] PROBLEM - Apache HTTP on srv219 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:30:38] PROBLEM - Apache HTTP on srv221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:31:23] RECOVERY - Apache HTTP on srv219 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.576 second response time [02:31:59] RECOVERY - LVS HTTP IPv4 on rendering.svc.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 60696 bytes in 7.724 seconds [02:32:08] RECOVERY - Apache HTTP on srv221 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.711 second response time [02:32:44] RECOVERY - Apache HTTP on srv220 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.041 second response time [02:32:44] RECOVERY - Apache HTTP on srv222 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.934 second response time [02:32:53] RECOVERY - Apache HTTP on srv224 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.701 second response time [02:33:11] RECOVERY - Apache HTTP on srv223 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.416 second response time [02:48:38] PROBLEM - MySQL Idle Transactions on db63 is CRITICAL: CRIT longest blocking idle transaction sleeps for 1493 seconds [02:49:24] !log LocalisationUpdate completed (1.20wmf12) at Thu Oct 4 02:49:23 UTC 2012 [02:49:35] Logged the message, Master [03:05:26] PROBLEM - Apache HTTP on srv194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:06:47] RECOVERY - Apache HTTP on srv194 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.354 second response time [03:14:35] RECOVERY - Puppet freshness on knsq19 is OK: puppet ran at Thu Oct 4 03:14:30 UTC 2012 [03:25:59] RECOVERY - MySQL Idle Transactions on db63 is OK: OK longest blocking idle transaction 
sleeps for 34 seconds [03:30:29] PROBLEM - MySQL Idle Transactions on db63 is CRITICAL: CRIT longest blocking idle transaction sleeps for 4006 seconds [03:32:08] RECOVERY - MySQL Idle Transactions on db63 is OK: OK longest blocking idle transaction sleeps for 0 seconds [05:17:00] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [05:18:30] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 9.019 second response time on port 8123 [05:21:39] PROBLEM - Lucene on search1016 is CRITICAL: Connection timed out [05:23:36] RECOVERY - Lucene on search1016 is OK: TCP OK - 0.027 second response time on port 8123 [05:24:39] !log restarted lucene search on search1016 again [05:24:50] Logged the message, Master [05:46:24] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [05:46:24] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [05:46:24] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [05:46:24] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [05:46:24] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [06:20:39] PROBLEM - Puppet freshness on cp1040 is CRITICAL: Puppet has not run in the last 10 hours [06:37:01] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [07:47:56] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [08:05:56] PROBLEM - Apache HTTP on srv194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:07:17] RECOVERY - Apache HTTP on srv194 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.250 second response time [08:37:34] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [09:59:15] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 
hours [10:36:31] PROBLEM - Host cp1044 is DOWN: PING CRITICAL - Packet loss = 100% [11:10:15] New patchset: Dereckson; "(bug 40759) Namespace configuration for fi.wikisource" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/26601 [11:15:22] PROBLEM - Apache HTTP on srv194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:16:52] RECOVERY - Apache HTTP on srv194 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.326 second response time [11:22:06] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [11:22:06] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [12:25:55] PROBLEM - Apache HTTP on srv194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:27:09] New patchset: Mark Bergsma; "Mount the originals volume on all application servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26607 [12:27:16] RECOVERY - Apache HTTP on srv194 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.277 second response time [12:28:08] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26607 [12:48:59] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26607 [12:52:04] New patchset: Cmcmahon; "disabling AFTv4 completely for beta labs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/26611 [13:15:59] Ryan_Lane: kibana is now rewritten in Ruby and there's no jRuby packaging for it :-) [13:17:46] reducing weight on srv194 [13:18:30] notpeter: everything okay with the new php packages? [13:18:44] paravoid: no complaints yet! [13:18:46] thank you! 
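The "Puppet freshness" alerts scattered through this log reduce to a file-age check: the monitor flags any host whose puppet agent has not completed a run in the last 10 hours. A minimal sketch in shell, assuming the agent touches a state file on each run (the path below is a common default on the agent, not necessarily what the actual Nagios plugin reads):

```shell
#!/bin/bash
# Hypothetical freshness check: compare the age of puppet's last-run state
# file against a 10-hour threshold. Path and output format are assumptions
# modeled on the alerts in this log, not the real monitoring plugin.
puppet_freshness() {
    local state_file=$1
    local threshold=$((10 * 3600))   # 10 hours, in seconds
    local now mtime age
    now=$(date +%s)
    # A missing/unreadable state file counts as "never ran" (mtime 0).
    mtime=$(stat -c %Y "$state_file" 2>/dev/null || echo 0)
    age=$((now - mtime))
    if [ "$age" -gt "$threshold" ]; then
        echo "CRITICAL: Puppet has not run in the last 10 hours"
        return 2
    fi
    echo "OK: puppet ran ${age}s ago"
}

puppet_freshness "${1:-/var/lib/puppet/state/state.yaml}"
```

The stream of per-host CRITICALs above (virt1001-virt1004, ocg3, cp1040, ...) is just this check failing host by host as each one crosses the 10-hour mark.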
[13:18:56] yeah, all of the apaches seem to be behaving nicely [13:19:05] Change merged: Matthias Mullie; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/26611 [13:19:18] thank you very much for building kick-ass packages :) [13:33:52] !log removing srv193 from bits pool for test upgrade to precise [13:33:59] er [13:34:03] Logged the message, notpeter [13:34:11] !log removing srv191 from bits pool for test upgrade to precise [13:34:23] Logged the message, notpeter [13:34:28] errrrr [13:34:37] Why is 193 in the bits pool? :/ [13:34:43] Oh [13:36:16] even corrected SAL :) [13:36:30] Reedy: the real answer is "because I'm only half-done with my coffee" [13:50:17] New patchset: Pyoungmeister; "removing srv191 from bits backends" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26618 [13:51:29] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26618 [14:07:01] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26618 [15:23:00] DOMAS [15:23:05] whatsup! [15:23:12] do you know what is cooler than a million users?!!!?!? [15:23:15] whatsdown! [15:23:26] p.defau.lt wasn't working yesterday! [15:23:32] data about a million users? :-P [15:24:12] thats cool too, I hear wikimedia is doing impression logging for everyone [15:24:27] ONE BILLION USERS [15:24:56] on p.defau.lt? 
[15:25:05] wow, that thing scales [15:26:33] New patchset: Reedy; "(bug 40736) Lift account creation limit" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/26460 [15:26:41] oh, you mean data on a billion users [15:26:43] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/26460 [15:26:57] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/26601 [15:27:31] I wonder how many of them are sock puppets [15:29:39] mark: it seems to work now [15:29:59] I want my money back for yesterday [15:29:59] mark: must have been some lithuanian network snafu, I heard about some ddoses in that part of the world :) [15:30:02] haha [15:30:02] I needed it [15:30:09] !log reedy synchronized wmf-config/ [15:30:12] you could've used ultra efficient one from Tim! [15:30:21] Logged the message, Master [15:30:25] i'm not aware of that one [15:38:26] LeslieCarr: can you find it in your mail? [15:38:41] i know i've seen it pass by in the past few days, but... [15:38:47] mark: ? [15:38:58] missed the "it" remark from above [15:38:58] doh [15:38:59] you have it [15:39:03] the legal company info [15:39:06] you forwarded it to me [15:39:08] ok [15:39:16] okay, i can resend [15:39:22] no i have it [15:39:29] i was just thinking it came from tony or so [15:39:30] resent [15:39:53] ah :) [15:42:48] A recent extract from the Commercial Trade Register or equivalent document proving the registration of the Member with the national authorities. [15:42:50] we need that [15:44:55] hrm [15:45:02] 501c3 documentation ? [15:45:31] http://upload.wikimedia.org/wikipedia/foundation/a/aa/501%28c%29%283%29_Letter.png [15:46:49] lesliecarr: asw2-d3-sdtpa 0/38 to csw1-sdtpa:12/1 is down? ok to connect? [15:47:02] when is wikipedia moving to 5.6? 
[15:47:04] mysql [15:47:04] lemme double check cmjohnson :) [15:47:11] I hear it has all the features that we had in our 4.0 patch finally [15:47:14] well, to be honest, not all [15:47:17] :-D [15:47:28] some statistics are still missing [15:47:34] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [15:47:34] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [15:47:34] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [15:47:34] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [15:47:34] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [15:48:49] domas: I think Asher was doing some MariaDB testing.. [15:48:58] cmjohnson1: ready to connect [15:49:02] It's still no MongoDB!! [15:49:04] cool [15:49:15] leslecarr: done [15:49:22] lesliecarr: ^ [15:50:36] 5.6 seems to be a solid release [15:51:09] When did we completely get off mysql4? ;) [15:52:10] cmjohnson1: yay [15:52:46] hrm [15:52:49] i see it as still down [15:53:43] roll it ? [15:57:40] lesliecarr: chk now [15:58:06] cmjohnson1: yay [15:58:12] it's alive! [15:58:43] :-] [15:59:32] New patchset: Mark Bergsma; "std.integer uses a 32 bit signed int, not large enough" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26634 [16:00:01] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:00:32] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26634 [16:00:36] ah :) very interesting mark :) [16:01:31] grrrr [16:02:23] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26634 [16:03:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.952 seconds [16:16:14] !log reedy synchronized wmf-config/CommonSettings.php [16:16:24] Logged the message, Master [16:21:37] PROBLEM - Puppet freshness on cp1040 is CRITICAL: Puppet has not run in the last 10 hours [16:23:28] New patchset: Andrew Bogott; "A few more security fixes for labs mediawiki:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26637 [16:24:29] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26637 [16:25:28] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26637 [16:29:34] PROBLEM - Host ms-be8 is DOWN: PING CRITICAL - Packet loss = 100% [16:29:45] paravoid: why doesn't that surprise me [16:29:58] why do ops people write so many things in ruby? [16:30:07] I think it's developers masquerading as ops [16:30:39] it's "devops" silly [16:31:05] <^demon> I'm a developer masquerading as ops. [16:31:08] <^demon> But I hate ruby ;-) [16:32:16] Q: Why is this Ruby instead of PHP now? [16:32:17] A: Closer integration with logstash, Ruby is shiny. Its mostly javascript anyway. If you want it in something else, it shouldn't be too hard to port. [16:32:27] the answer kinda makes sense [16:32:35] I'm sold [16:32:43] it's an interface to logstash and logstash is in Ruby [16:33:00] <^demon> That's plausible.
"Ruby is shiny" needs factchecking though [16:33:09] heh yes [16:34:11] the answer makes sense [16:34:16] that doesn't make me any happier ;) [16:36:55] I'm secretly hoping it'll get merged into logstash and be shipped together [16:37:04] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:37:23] I mean, the talk at puppetconf was presented by logstash's author I think and he was showing kibana screenshots [16:37:31] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [16:40:21] New patchset: Jgreen; "adding ,BannerControl to bannerImpressionudp2log filter so we'll catch 404's" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26639 [16:41:18] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26639 [16:41:31] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26639 [16:46:47] paravoid Ryan_Lane as I understand it, Ruby is popular for frameworks like puppet and rails because of native support for metaprogramming that is much more difficult or impossible in other languages. you can change the language itself to create some domain specific Ruby dialect to e.g. manage systems (puppet) or easily create web applications (rails). [16:48:00] chrismcmahon: yeah. 
it's doable in python, though too [16:48:21] paravoid: I also hope they get merged [16:50:25] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:50:52] PROBLEM - Varnish traffic logger on cp1026 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:51:10] PROBLEM - Varnish traffic logger on cp1036 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:51:10] PROBLEM - Varnish traffic logger on cp1030 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:51:10] PROBLEM - Varnish traffic logger on cp1028 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:51:28] PROBLEM - Varnish traffic logger on cp1022 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:51:28] PROBLEM - Varnish traffic logger on cp1021 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:51:28] PROBLEM - Varnish traffic logger on cp1029 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:51:37] PROBLEM - Varnish traffic logger on cp1033 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:52:40] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.884 seconds [16:53:34] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 3 processes with command name varnishncsa [16:54:01] RECOVERY - Varnish traffic logger on cp1026 is OK: PROCS OK: 3 processes with command name varnishncsa [16:54:19] RECOVERY - Varnish traffic logger on cp1036 is OK: PROCS OK: 3 processes with command name varnishncsa [16:54:19] RECOVERY - Varnish traffic logger on cp1030 is OK: PROCS OK: 3 processes with command name varnishncsa [16:54:19] RECOVERY - Varnish traffic logger on cp1028 is OK: PROCS OK: 3 processes with command name varnishncsa [16:54:37] RECOVERY - Varnish traffic logger on cp1022 is OK: PROCS OK: 3 processes with command 
name varnishncsa [16:54:37] RECOVERY - Varnish traffic logger on cp1021 is OK: PROCS OK: 3 processes with command name varnishncsa [16:54:37] RECOVERY - Varnish traffic logger on cp1029 is OK: PROCS OK: 3 processes with command name varnishncsa [16:55:06] RECOVERY - Varnish traffic logger on cp1033 is OK: PROCS OK: 3 processes with command name varnishncsa [16:57:21] New patchset: Matthias Mullie; "Make abusefilter emergency disable more sensible" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/25855 [17:03:29] apergos: AaronSchulz: can't we just move to Swift-only during the maintenance window, then pick up the deltas after we're done? [17:03:40] notpeter: ping [17:03:49] notpeter: can you please merge https://gerrit.wikimedia.org/r/26640 [17:04:27] leaving us with swift-only afterwards? [17:04:52] apergos: how long is the window exactly? [17:05:07] 3 hours, to give ourselves lots and lots of time (see the email) [17:05:09] sure [17:05:36] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26640 [17:05:48] * AaronSchulz was looking at the wiki paeg [17:05:49] *page [17:05:50] so I guess captcha isn't ready for ms7 to be gone yet; I'm still not sure of the status of itmeline and math [17:06:17] preilly: pushing live now [17:06:22] I'm still seeing some requests for wikipedia/xx/math/blah.png from ms7 (prolly cached pages) [17:07:11] notpeter: thanks! [17:08:18] although cp1044 is down.... [17:12:05] New patchset: Adminxor; "This will make every lab machine throw their logs to the logstash server for testing Please note that rsyslog will continue to log messages locally new file: files/rsyslog/z-logstash.conf modified: manifests/base.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26642 [17:13:06] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26642 [17:13:35] Change abandoned: Adminxor; "Abandoning this as there is an updated one: https://gerrit.wikimedia.org/r/26642" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26407 [17:19:06] PROBLEM - Varnish traffic logger on cp1033 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:19:33] PROBLEM - Varnish traffic logger on cp1027 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:19:42] PROBLEM - Varnish traffic logger on cp1043 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:19:42] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:19:51] PROBLEM - Varnish traffic logger on cp1031 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:19:51] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:19:51] PROBLEM - Varnish traffic logger on cp1029 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:19:51] PROBLEM - Varnish traffic logger on cp1021 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:19:52] apergos: I like how there is an nfs mount called "originals" that has math/timeline files too ;) [17:20:00] PROBLEM - Varnish traffic logger on cp1035 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:20:09] PROBLEM - Varnish traffic logger on cp1025 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:20:18] PROBLEM - Varnish traffic logger on cp1026 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:20:27] PROBLEM - Varnish traffic logger on cp1022 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:20:27] PROBLEM - Varnish traffic logger on cp1024 is CRITICAL: PROCS CRITICAL: 2 processes with command name 
varnishncsa [17:20:27] PROBLEM - Varnish traffic logger on cp1028 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:20:27] PROBLEM - Varnish traffic logger on cp1034 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:20:36] PROBLEM - Varnish traffic logger on cp1032 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:20:36] PROBLEM - Varnish traffic logger on cp1030 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:20:36] PROBLEM - Varnish traffic logger on cp1036 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:20:36] PROBLEM - Varnish traffic logger on cp1042 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:23:19] we're attempting to make as many copies of them as possible :-P [17:25:49] apergos - regarding the ms7 switch tomorrow [17:26:00] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:26:01] uh huh [17:26:12] u mentioned upload will be disabled ... how long? [17:26:46] if things go smoothly, it shouldn't be that long [17:26:56] need to communicate a window [17:27:07] We'll start this on Friday Oct 5 at 11am UTC, to conclude at [17:27:08] 2pm UTC or earlier. [17:27:13] the email that went out gave a time frame of 3 hours [17:27:16] 3 hours is long [17:27:22] apergos: can't we just use syncFileBackend? [17:27:31] we expect that to be much longer than we need, but better to be too long than too short [17:27:53] syncFileBackend to do what? [17:28:14] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23681 [17:28:16] this is just about the mount point, and config changes, we're not going to be moving over any more data during that time [17:29:55] seems like we've got (at least) three options: [17:29:59] what is the expected downtime? [17:30:02] 1/2 hour?
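The "PROCS CRITICAL: 2 processes" / "PROCS OK: 3 processes" lines above come from a process-count check on the varnishncsa logger. A minimal sketch of that kind of check, using pgrep as an assumed stand-in for the real Nagios check_procs plugin (the 3-process expectation is read off the recovery messages in this log):

```shell
#!/bin/bash
# Hypothetical process-count check: CRITICAL when fewer than the expected
# number of processes with the given command name are running. Modeled on
# the alert text in this log; not the actual check_procs implementation.
check_proc_count() {
    local name=$1 want=$2
    local count
    # pgrep -c prints the match count; it exits nonzero when there are no
    # matches, so swallow that and keep the printed "0".
    count=$(pgrep -c -x "$name" 2>/dev/null || true)
    count=${count:-0}
    if [ "$count" -lt "$want" ]; then
        echo "PROCS CRITICAL: $count processes with command name $name"
        return 2
    fi
    echo "PROCS OK: $count processes with command name $name"
}

check_proc_count "${1:-varnishncsa}" "${2:-3}"
```

The flapping in the log (CRITICAL at 2, RECOVERY at 3) suggests one of the expected varnishncsa instances was briefly dying and being restarted on each affected cp10xx host.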
[17:30:43] option a) go read-only, sync NetApp, switch backend, go read-write [17:31:21] option b) go Swift-only, sync NetApp, go MultiWrite to NetApp, clean up [17:32:00] option c) hybrid approach. go with option A, then if that's taking longer than anticipated, switch to the option B strategy [17:32:31] New patchset: DamianZaremba; "Update the SNMP traps." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25758 [17:32:45] see I don't think there is a need for 'sync NetApp' as a separate step in there ... is there? [17:32:50] in option a [17:33:03] oh? [17:33:31] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25758 [17:33:49] seems like it should happen [17:33:50] metrics meeting starting... I'll be tuning out here [17:33:50] except for things that aren't served by swift, we could do one last run of those, but they aren't much [17:35:02] extdist captcha but those have already gone over once, we'd just be catching anything that's changed [17:36:43] captcha files don't change [17:36:53] so we wouldn't need it for those [17:36:56] (not unless a sysadmin changes them) [17:37:38] maybe we can switch captcha first, and then math/timeline [17:37:39] we'll be a little off sync with some images and math but since swift will have those we can deal with it after writes are back on... can't we? [17:37:45] it feels simpler to do less at once [17:38:33] these are just config var changes, right?
[17:38:36] apergos: math/timeline files are based on content hashes and thus largely immutable (just new files being added) [17:38:48] I can do a copyFileBackend (diff of two listings) for those [17:39:04] that would sync them up, I already had to do that before due to some mistake [17:39:10] didn't take too long [17:39:46] apergos: for captcha it's just a var change [17:39:55] right [17:39:57] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.991 seconds [17:40:08] for math/timeline, basically a var change plus a copy script run to catch any file changes [17:40:51] RECOVERY - Varnish traffic logger on cp1036 is OK: PROCS OK: 3 processes with command name varnishncsa [17:40:51] RECOVERY - Varnish traffic logger on cp1030 is OK: PROCS OK: 3 processes with command name varnishncsa [17:40:51] RECOVERY - Varnish traffic logger on cp1032 is OK: PROCS OK: 3 processes with command name varnishncsa [17:40:52] and the copy script should run right afterwards? or can it wait a few hours (say for normal working hours in SF)?
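The "diff of two listings" idea behind the copy script works well for content-hash-named files like math/timeline, since existing names never change and only new names appear. A generic shell sketch of the approach (an illustration only, not MediaWiki's actual copyFileBackend.php): list both stores, sort, and emit whatever the destination is missing.

```shell
#!/bin/bash
# Given one file listing per store (one path per line), print the names
# present in the source listing but absent from the destination listing,
# i.e. the files a catch-up copy pass still needs to move. Works because
# math/timeline names are content hashes: no renames or in-place edits.
missing_files() {
    local src_list=$1 dst_list=$2
    # comm -23: lines only in the first (sorted) input.
    comm -23 <(sort "$src_list") <(sort "$dst_list")
}
```

Feeding the output to a copy loop then syncs the stores without touching the files both sides already have, which is why the run "didn't take too long".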
[17:41:27] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [17:41:27] RECOVERY - Varnish traffic logger on cp1043 is OK: PROCS OK: 3 processes with command name varnishncsa [17:41:36] RECOVERY - Varnish traffic logger on cp1031 is OK: PROCS OK: 3 processes with command name varnishncsa [17:41:36] RECOVERY - Varnish traffic logger on cp1029 is OK: PROCS OK: 3 processes with command name varnishncsa [17:41:37] reads come from swift, so users won't notice if the copy script takes some time [17:41:45] RECOVERY - Varnish traffic logger on cp1021 is OK: PROCS OK: 3 processes with command name varnishncsa [17:41:54] RECOVERY - Varnish traffic logger on cp1035 is OK: PROCS OK: 3 processes with command name varnishncsa [17:41:54] RECOVERY - Varnish traffic logger on cp1025 is OK: PROCS OK: 3 processes with command name varnishncsa [17:41:54] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 3 processes with command name varnishncsa [17:42:03] RECOVERY - Varnish traffic logger on cp1024 is OK: PROCS OK: 3 processes with command name varnishncsa [17:42:05] heh, actually, the only reason timeline writes to both is a wmf branch hack (to counter the "noPushQuickOps" hack in config)... [17:42:21] RECOVERY - Varnish traffic logger on cp1042 is OK: PROCS OK: 3 processes with command name varnishncsa [17:42:21] RECOVERY - Varnish traffic logger on cp1033 is OK: PROCS OK: 3 processes with command name varnishncsa [17:42:26] * AaronSchulz should probably deal with that sometime... [17:42:30] RECOVERY - Varnish traffic logger on cp1034 is OK: PROCS OK: 3 processes with command name varnishncsa [17:42:48] RECOVERY - Varnish traffic logger on cp1027 is OK: PROCS OK: 3 processes with command name varnishncsa [17:43:43] :-) [17:43:48] New patchset: DamianZaremba; "Update the SNMP traps." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25758 [17:44:50] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25758 [17:45:02] maybe I could have thumbs write to /dev/null and disable noPushQuickOps ;) [17:45:07] hahaha [17:45:08] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25758 [17:45:10] not tomorrow! [17:45:26] sooooo [17:45:46] looking at this list [17:45:49] http://wikitech.wikimedia.org/view/Media_server#Transitioning_from_one_nfs_mount_to_another_.28ie._ms7_to_netapp.29.2C_Oct_2012 [17:45:54] if I'm not missing some things: [17:46:49] it looks like we still need to leave ms7 up for a while yet for requests for things in old cached pages that still go there, those should die off eventually [17:47:54] there's a change in filebackend.php and in extdist/svn-invoke.conf, (I don't know if that second file is used), besides the config changes [17:48:39] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [17:49:08] one change to a puppet file I think, which affects fenari, for extdist, so we don't have to wait around for that, we can do a puppet run immediately [17:49:22] so it seems to me we might as well do the whole set of changes at once [17:49:32] instead of eking out config changes one at a time [17:51:36] how hard is it to do captcha? isn't that easy? [17:52:14] yes I think so, that's why we might as well do them together [17:52:34] * AaronSchulz doesn't understand that [17:52:49] you wanted to do the switchover in separate steps I thought [17:52:57] that seems like more work to me [17:53:43] here is where I wish my virtual self could walk over to your desk [17:56:36] RECOVERY - Varnish traffic logger on cp1028 is OK: PROCS OK: 3 processes with command name varnishncsa [17:56:36] RECOVERY - Varnish traffic logger on cp1022 is OK: PROCS OK: 3 processes with command name varnishncsa [18:03:57] hi guys. how do i get on operations-l? [18:04:05] i realize it's not listed.
[18:08:48] !log catrope synchronized php-1.21wmf1/resources/mediawiki/mediawiki.feedback.js 'Ib8746ece3a34b5e577dc08596b0ff7b1b96b5d73' [18:08:59] Logged the message, Master [18:09:29] !log catrope synchronized php-1.21wmf1/languages/messages/MessagesEn.php 'Ib8746ece3a34b5e577dc08596b0ff7b1b96b5d73' [18:09:40] Logged the message, Master [18:12:06] ty [18:14:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:28:06] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.877 seconds [18:29:00] RECOVERY - Host analytics1007 is UP: PING OK - Packet loss = 0%, RTA = 26.55 ms [18:30:12] RECOVERY - Varnish traffic logger on cp1026 is OK: PROCS OK: 3 processes with command name varnishncsa [18:38:37] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [19:02:46] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:04:52] PROBLEM - Host analytics1007 is DOWN: PING CRITICAL - Packet loss = 100% [19:07:25] RECOVERY - Host srv266 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [19:10:43] PROBLEM - Memcached on srv266 is CRITICAL: Connection refused [19:18:17] New review: CSteipp; "I think it looks good. Have there been instances of rules being automatically disabled, that should ..." [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/25855 [19:18:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.035 seconds [19:25:01] New review: Matthias Mullie; "The filters for ArticleFeedbackv5 have been known to auto-disable the past couple of months - only r..." 
[operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/25855 [19:30:13] PROBLEM - NTP on srv266 is CRITICAL: NTP CRITICAL: No response from NTP server [19:30:18] New review: CSteipp; "Makes sense to me" [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/25855 [19:42:04] mutante: https://toolserver.org/~legoktm/wlm/stats.php You're no 1! [19:43:16] Reedy: woohoo. i have just been told. did not expect that:) [19:43:38] i heard just 3% of people who downloaded the app actually uploaded images using it?! that seems low [19:47:19] AaronSchulz: (when you have a minute) anything special I need to know about making the config changes or the filebackend.php changes tomorrow? I was just gonna follow the standard procedure on wikitech i.e. http://wikitech.wikimedia.org/view/How_to_do_a_configuration_change#Change_wiki_configuration [19:47:45] apergos: so you are going with the "read-only" plan? [19:48:42] um [19:48:48] I guess so [19:49:52] Reedy will be around but I hope we won't have to ask him for anything [19:51:04] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:52:00] AaronSchulz [19:52:34] RECOVERY - Memcached on srv266 is OK: TCP OK - 0.001 second response time on port 11000 [19:53:08] apergos: anyway, there is nothing special about deployments to that file [19:53:48] ok, thanks [19:56:04] apergos: still, we could dump the filejournal positions, rsync again, switch mounts, and run the php sync script [19:56:20] though the read only thing could be quick if nothing blows up [19:56:24] rsync which?
[19:56:37] PROBLEM - Apache HTTP on srv266 is CRITICAL: Connection refused [19:56:37] rsync netapp to ms7 [19:56:45] *against ms7 [19:56:47] cause running an rsync across all of ms7 takes a long time (many hours) even when it's caught up [20:00:40] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [20:00:56] that's not necessarily a problem [20:00:58] we still have swift [20:01:08] it's just as backup, right [20:01:30] well what I mean is I wouldn't want to rsync (again) then switch mounts [20:01:43] you can rsync after switching mounts [20:02:52] yes, I was expecting we would do that [20:03:00] there is the php sync script also [20:05:28] apergos: hmm, yeah, just doing rsync after should work (and you don't need the php script) [20:05:50] well [20:05:56] though... [20:06:01] we'll rsync over stuff that [20:06:23] either way, we should minimize disabling uploads if we don't need it [20:06:25] the "don't touch this if it changed" flag of rsync would have to be on [20:06:26] might get deleted or updated a tiny bit of it by users during the hours it takes [20:06:40] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.036 seconds [20:07:29] yes, if we can get by without disabling, so much the better [20:12:55] RECOVERY - Apache HTTP on srv266 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.044 second response time [20:18:24] !log olivneh synchronized php-1.20wmf12/extensions/E3Experiments [20:18:30] apergos: so does that sound ok then?
[20:18:35] Logged the message, Master [20:19:17] tell me one more time what we'd be doing AaronSchulz [20:21:44] heh, so we'd 1) switch MW to use the netapp mount 2) run rsync --delete (ms7->netapp) again with the flags set so it won't override stuff with a higher timestamp than the source [20:23:09] * apergos checks something [20:24:44] the key is that MW can tolerate netapp being out of date since a) reads come from swift b) write ops first automatically resync if needed before they happen [20:25:08] yep, on the MW side it will be ok [20:25:15] the "resyncs" will show up in the backend logs [20:25:25] just thinking carefully about the rsync [20:25:32] for sure :) [20:25:40] the flags have to be right [20:25:45] oh yeah they do [20:26:23] -u, --update skip files that are newer on the receiver [20:27:03] * AaronSchulz goes off for a while [20:27:21] hmmm [20:28:29] user uploads copyrighted file tonight, ms7 has it, it didn't make it into the rsync. tomorrow it's deleted after the mount switch. we start the rsync.... [20:28:56] "if the source has a directory where the destination has a file, the transfer would occur regardless of the timestamps" (update documentation) [20:29:45] in our case the target will have nothing, the source will have a file [20:30:22] I guess it would get copied over, which we don't want [20:30:35] AaronSchulz: when you get back what do you think about that scenario ^^ [20:36:26] hey guys, there is an external contractor working with dario who wants me to install MongoDB on stat1 for his project [20:36:44] I don't mind doing this at all, but I was wondering if there would be objections from ops when I ask for a review [20:37:01] do you have an opinion on this, mark? [20:37:43] not really, if you support it it should be fine [20:38:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:39:56] ookey dokey, cool, thanks!
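The -u/--update semantics quoted above (and the --delete behavior) can be sanity-checked locally before touching ms7 or the netapp; a minimal sketch with throwaway directories standing in for the two mounts (assumes GNU rsync and GNU touch are available — the exact flag set used in production is not shown in the log):

```shell
# Two temp dirs stand in for the source (ms7) and receiver (netapp) mounts.
src=$(mktemp -d); dst=$(mktemp -d)
echo source-version > "$src/a"
echo receiver-version > "$dst/a"
touch -d '2030-01-01' "$dst/a"        # receiver's copy is newer than source's
echo stale > "$dst/only-on-receiver"  # absent from the source
rsync -au --delete "$src/" "$dst/"
cat "$dst/a"    # prints "receiver-version": -u skipped the newer file
ls "$dst"       # only-on-receiver is gone: --delete removed it
```

Note that -u does nothing for the deletion scenario raised at 20:28: a file still present on the source but already deleted on the receiver simply gets copied back, which is why the php sync script still has to run afterwards.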
[20:41:23] !log olivneh synchronized php-1.20wmf12/extensions/ClickTracking/modules/jquery.clickTracking.js [20:41:33] Logged the message, Master [20:46:10] Eloquence: are the videos loading for you now? [20:46:52] lemme give it a try [20:51:33] trying https://commons.wikimedia.org/wiki/File:Showcase_-_San_Francisco_Wikipedia_Hackathon_2012.ogv , not playing [20:51:37] let me try HTTP [20:52:08] no dice [20:52:40] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.174 seconds [20:53:26] mhhh, actually, it does play, just not after you press the OggHandler play button [20:53:35] after pressing the browser's native HTML5 player button, it starts [20:53:47] this is not normal behavior, but it looks like it's at least getting a stream [20:56:11] apergos: file removals are trickier, since there is no tombstone to mark them, only the php sync script would catch those [20:56:35] ok [20:57:00] so I think we will have to follow the rsync with your sync script (well you will get to follow it) [20:57:18] hmm [20:57:57] mark - on the other hand, i am having problems with some videos that were running fine earlier .. [20:58:13] hmm, even that behavior is not consistent. of course with ogg theora video it's always hard to tell where the problems are coming from :p [20:58:26] nothing is consistent about this problem [21:00:31] i wrote this little script to debug the problem: http://dpaste.org/aCYGD/ and so far it's giving me good results [21:00:36] wait, seems to be a browser thing for me .. [21:00:44] well, I am no longer seeing the long load delays + 503s that I was seeing before, so that's great. [21:00:48] woosters: if you have a straight video url which is not working for you, please let me know [21:02:02] chrome and firefox behave differently ..
so let me check my plugins [21:02:06] there are also some issues with missing symlinks for old deployment slots causing the player not to load correctly on some cached pages (this is a recurring issue, we really need git-deploy or a similar system so we don't constantly change the pointers to static resources) [21:03:08] in chrome, I cannot play https://upload.wikimedia.org/wikipedia/commons/c/c2/Customizing_Wikipedia_with_Javascript_-_User_Scripts_and_Gadgets_-_San_Francisco_Wikipedia_Hackathon_2012.ogv at all [21:03:31] hm, is loading instantly for me in ff [21:03:34] but it loads fine in FF, so that may be a browser-level / encoding issue [21:04:02] bleh, ^W [21:05:44] most of these videos are now cached too of course, so not all of it is improved due to streaming [21:06:00] heh [21:06:09] I think they consume a good portion of the caches ;) [21:06:21] it's a good stress test of the other new varnish code too, the persistence layer [21:06:34] varnish instantly has to come up with 1+ GB of space for these videos to stream [21:08:38] eloquence - yep, experiencing same behavior on chrome like u [21:09:21] mark, so - FF experience seems to work pretty reliably. ogghandler seems a bit flaky, but using the native player (direct access to file) seems to always work. other issues are likely browser/codec issues. [21:09:38] strange that it would matter vs squid [21:10:08] yeah, although - it may be that chrome had the same issues and we just didn't notice. [21:10:22] I hope so [21:10:26] don't want to rollback again ;) [21:11:15] it would be nice if we had a straightforward way to test use of squid to compare transport behavior / is that at all feasible? [21:11:36] changing /etc/hosts is easiest atm [21:11:56] hmm [21:12:10] we could setup an extra hostname for testing this I guess, will need config changes though [21:12:29] locally changing /etc/hosts would work? if so, can you give me the relevant settings? 
[21:12:56] yes, update /etc/hosts to the ip addresses of either upload-lb.pmtpa.wikimedia.org or upload-lb.eqiad.wikimedia.org [21:13:14] i have right now: [21:13:14] 2620:0:862:ed1a::b upload.wikimedia.org [21:13:15] 208.80.154.235 upload.wikimedia.org [21:13:21] (so eqiad) [21:13:41] tampa v4 ip is 208.80.152.211 [21:15:32] also european users are still on squid in esams, so if we notice a difference between european users vs the rest of the world... [21:23:07] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [21:23:07] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [21:26:22] mutante: hey, I just made a change to mw-config, can you walk me through the pushing process ? [21:26:35] i forgot about UTC with the account limit and people are like "i can't make an account!" [21:26:52] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:26:53] or anyone else who is good with that [21:27:17] apergos: when did the first netapp rsync start? [21:27:36] LeslieCarr: pushing to where? [21:27:43] apaches [21:27:49] LeslieCarr: sure, go to fenari [21:28:02] done, it's already updated in /home/w/common/wmf-config [21:28:04] sync-file wmf-config/FooBarFile.php 'I changed something' [21:28:07] several days ago [21:28:20] kewlio [21:28:43] New review: Ryan Lane; "Inline comments." [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/26642 [21:28:50] AaronSchulz [21:28:58] !log lcarr synchronized wmf-config/throttle.php 'I updated a throttle' [21:29:02] yay [21:29:03] thanks :) [21:29:09] Logged the message, Master [21:29:22] no training wheels! [21:29:37] :) [21:29:56] then you can git commit, set a commit summary, then do git push origin [21:30:20] mark, the notable difference between squid/varnish is that an OGV played with FF does not show the full video length and does not permit seeks if played via varnish. [21:30:35] hm.
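For reference, the override being described is just an /etc/hosts fragment (addresses copied from the log; comment one pair in or out to pin the hostname at eqiad/varnish or Tampa/squid):

```
# Pin upload.wikimedia.org to eqiad (varnish):
2620:0:862:ed1a::b  upload.wikimedia.org
208.80.154.235      upload.wikimedia.org
# ...or to pmtpa/Tampa (squid) instead:
# 208.80.152.211    upload.wikimedia.org
```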
[21:30:39] Reedy: really? i just merge in gerrit, then git pull on fenari, and sync and that's all [21:30:50] yeah [21:30:54] you can do it either way round [21:30:56] ok thanks, i'll debug that [21:31:09] Reedy: alright [21:31:22] weird, it's not working [21:31:25] anyone have an idea why ? [21:31:25] seeks probably means "range requests" [21:31:33] I don't think that works in varnish [21:31:36] still throttling account creation [21:31:41] apergos: is there a SAL entry? [21:31:46] we really need to get a special hostname for videos... [21:31:55] um [21:32:43] yeah. this is a pretty big deal as it applies to all videos [21:32:46] I see one on sept 29 [21:32:47] * AaronSchulz checks the emails [21:32:54] 10:39 mark: Started rsync of /export/upload (respecting /etc/rsync.includes) to /mnt (nas1-a:vol/originals/) on ms7 in a screen session [21:32:58] so no seeking makes for a much worse player experience [21:33:05] range requests probably work for _cached_ videos [21:33:19] and therefore also for videos that are not streamed (< 64 MB) [21:33:21] but yes, this sucks [21:33:39] i was gonna be at VUG this weekend [21:33:50] i'm now sad I canceled it due to being busy :) [21:34:01] i'll instruct faidon though [21:34:13] VUG = varnish user/developer group meeting [21:34:32] trying with an 18MB video at http://commons.wikimedia.org/wiki/File:Jimmy_Wales_answers_the_question_about_internet_censorship_in_China.ogv I get the same issue (no video length / no seeking) [21:35:22] * AaronSchulz does 14 days for good measure [21:35:48] (works fine with squid) [21:35:50] alright [21:35:55] apergos: I'm dumping file log positions to my home dir now :) [21:36:00] i'll check this tomorrow [21:36:04] cool, thanks. [21:36:06] if it's not gonna be a quick fix i'll do a rollback then [21:36:09] I'll close the old bug and open a new one [21:36:13] thanks [21:36:16] ah for the script?
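Pulling together the steps mutante and Reedy just described, the two equivalent orderings of a wmf-config push look roughly like this (a sketch only — sync-file and the paths are fenari-specific, not runnable elsewhere):

```
# On fenari, after editing /home/w/common/wmf-config/throttle.php:
sync-file wmf-config/throttle.php 'I updated a throttle'  # push to apaches + !log
git commit -a     # set a commit summary
git push origin   # then merge in gerrit
# ...or the other way round: merge in gerrit first,
# then git pull on fenari and sync-file as above.
```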
awesome [21:39:45] hrm [21:40:10] perhaps it's because videos are nearly always fetched from the backend varnish by the frontends [21:40:53] !log olivneh synchronized php-1.20wmf12/extensions/E3Experiments [21:41:04] Logged the message, Master [21:41:33] mark, some old discussions about range requests + streaming here : http://www.gossamer-threads.com/lists/varnish/dev/21653 / http://www.mail-archive.com/varnish-dev@varnish-cache.org/msg00737.html [21:41:42] not sure the latter patch(es) landed in the streaming branch ultimately [21:42:09] yes [21:42:16] i'll check it and discuss it with the devs [21:42:37] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.034 seconds [21:42:39] :) [21:43:53] if varnish supports range requests now at all, it will only be for ranges that have already come in while fetching the entire file [21:44:16] i've read the code, there's nothing in there that supports fetching partial objects [21:44:36] so that means that in some cases, clients need to wait for a range while the entire object is being fetched [21:44:38] not ideal at all [21:46:17] anyway... i'm going to bed [21:46:22] g'night mark :) [21:46:25] good night! [21:46:51] night [21:52:10] New patchset: Adminxor; "Did the clean up modified: manifests/base.pp deleted: files/rsyslog/z-logstash.conf" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26719 [21:53:06] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/26719 [22:06:44] !log powercycling downed cp1044 [22:06:55] Logged the message, Master [22:09:50] RECOVERY - Host cp1044 is UP: PING OK - Packet loss = 0%, RTA = 26.51 ms [22:11:27] New patchset: DamianZaremba; "First hash at starting to puppetize bots in labs." 
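A quick way to see what "range requests" means here is curl's -r flag, which sends a Range header: a server that supports it answers 206 Partial Content with just the requested bytes, while one that doesn't sends the whole object (which is why seeking stalls behind a full fetch). The flag's byte semantics can be checked offline against a file:// URL:

```shell
# Fetch only bytes 0-3 (inclusive) of a local file with a range request.
printf 'abcdefgh' > /tmp/range-demo.bin
curl -s -r 0-3 "file:///tmp/range-demo.bin"
# prints "abcd"; the equivalent HTTP request is what the player issues on
# seek — against a range-capable server the headers would show
# "206 Partial Content" plus a Content-Range line.
```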
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/26441 [22:12:32] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26441 [22:13:08] PROBLEM - Varnish HTTP mobile-frontend on cp1044 is CRITICAL: Connection refused [22:13:44] PROBLEM - Varnish traffic logger on cp1044 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa [22:16:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:25:21] AaronSchulz: feel free to change the list to reflect the stuff in yer email [22:25:25] I gotta get some sleep [22:25:30] woosters: AaronSchulz, apergos and others. here's what I'm about to put on Commons Village Pump: [22:25:31] thanks for looking at it [22:25:34] "Hi everyone, we have some emergency maintenance that will require a brief period where we stop uploads. We plan to do this starting 18:00 UTC, hopefully brief, but potentially lasting up to 3 hours (until 21:00 UTC) tomorrow. Our backup NFS server is nearly full, and we're having hardware issues with our new Swift servers, so rather than run without a real-time backup, we plan to use a newer NFS server (nas1) which has more capacity. Sorry f [22:25:37] you have to sleep sometime :) [22:25:50] it's true, I do! [22:25:58] sleep? what is this [22:26:01] "Sorry f" [22:26:06] (it was truncated) [22:26:10] orry for the inconvenience, and thank you for your patience!" [22:26:26] 18:00? [22:26:28] uh [22:26:29] what if the person is impatient? [22:26:32] 11:00 utc [22:26:37] oh? [22:26:39] what are we thanking them for in that case?
[22:26:39] to 2pm utc [22:27:00] yeah mark and I and reedy [22:27:23] apergos: ah, good thing I checked [22:27:28] yep :-) [22:27:42] * apergos wonders where the 18:00 utc came from [22:27:56] faulty timezone conversion on my part [22:28:00] robla - looks good [22:28:01] heh [22:28:28] it's unclear we will actually have to stop uploads [22:28:30] but we may [22:28:31] this is why it's good for you all to put stuff here instead of leaving it to me: :) http://wikitech.wikimedia.org/view/Deployments [22:29:17] yeah I looked at that but it's software deployments, I didn't know what we do about actual maintenance [22:29:29] that's why I renamed it today [22:29:34] hahaha [22:29:38] sneaky! [22:29:56] we've been putting hardware stuff on there for a while [22:30:03] I didn't know that [22:30:24] New patchset: Adminxor; "Did the clean up modified: manifests/base.pp deleted: files/rsyslog/z-logstash.conf" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26719 [22:31:10] ok do you guys need anything more from me? cause otherwise I really am gone to bed [22:31:17] all done, thanks! [22:31:25] thank you!
night [22:32:07] AaronSchulz: we're thanking them for their patience because we're going to pretend they're patient, even if they're not :) [22:32:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.031 seconds [22:33:14] RECOVERY - Varnish HTTP mobile-frontend on cp1044 is OK: HTTP OK HTTP/1.1 200 OK - 643 bytes in 0.054 seconds [22:33:59] RECOVERY - Varnish traffic logger on cp1044 is OK: PROCS OK: 3 processes with command name varnishncsa [22:43:41] Change abandoned: Adminxor; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26642 [23:05:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:18:06] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.089 seconds [23:20:10] New patchset: Ryan Lane; "Sending logs to logstash" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26719 [23:21:05] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/26719 [23:21:34] New patchset: Ryan Lane; "Sending logs to logstash" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26719 [23:22:29] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/26719 [23:22:38] New patchset: Ryan Lane; "Sending logs to logstash" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26719 [23:23:30] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." 
[operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/26719 [23:26:22] New patchset: Ryan Lane; "Sending logs to logstash" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26719 [23:27:16] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26719 [23:27:23] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26719 [23:36:45] New patchset: Dzahn; "re-enable shell for Brion :)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26725 [23:37:44] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26725 [23:40:40] New patchset: Dzahn; "re-enable shell for Brion :)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26725 [23:41:36] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26725 [23:41:56] New patchset: Dzahn; "re-enable shell for Brion :)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26725 [23:42:47] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26725 [23:53:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds