[00:05:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.806 seconds [00:40:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:46:39] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [00:46:39] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [00:52:53] New patchset: Andrew Bogott; "Rearrange LocalSettings yet again." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42502 [00:54:32] New review: Andrew Bogott; "Silke --" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/42502 [00:55:02] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.031 seconds [01:10:38] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 262 seconds [01:12:26] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 9 seconds [01:26:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:35:32] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 212 seconds [01:35:59] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 228 seconds [01:41:05] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.020 seconds [01:59:23] !log tstarling synchronized php-1.21wmf6/includes/DefaultSettings.php [01:59:34] Logged the message, Master [01:59:46] !log tstarling synchronized php-1.21wmf6/includes/Article.php [01:59:55] Logged the message, Master [02:00:11] !log tstarling synchronized php-1.21wmf6/includes/OutputPage.php [02:00:25] Logged the message, Master [02:00:34] !log tstarling synchronized php-1.21wmf6/includes/ImagePage.php [02:00:44] Logged the message, Master [02:01:04] !log tstarling synchronized php-1.21wmf7/includes/DefaultSettings.php [02:01:15] Logged the message, Master [02:01:38] !log tstarling synchronized php-1.21wmf7/includes/OutputPage.php [02:01:49] Logged the message, Master [02:01:56] !log tstarling synchronized php-1.21wmf7/includes/Article.php [02:02:06] Logged the message, Master [02:02:12] !log tstarling synchronized php-1.21wmf7/includes/ImagePage.php [02:02:22] Logged the message, Master [02:08:32] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [02:08:33] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [02:08:33] PROBLEM - Puppet freshness on sq48 is CRITICAL: Puppet has not run in the last 10 hours [02:08:33] PROBLEM - Puppet freshness on ms-fe1003 is CRITICAL: Puppet has not run in the last 10 hours [02:08:33] PROBLEM - Puppet freshness on ms-fe1004 is CRITICAL: Puppet has not run in the last 10 hours [02:08:33] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [02:14:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:17:17] New patchset: Tim Starling; "Enable $wgEnableCanonicalServerLink on uzwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42505 [02:20:30] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42505 [02:23:42] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused [02:25:30] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.005 second response time on port 11000 [02:26:41] 
RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.027 seconds [02:28:59] !log LocalisationUpdate completed (1.21wmf6) at Mon Jan 7 02:28:58 UTC 2013 [02:29:10] Logged the message, Master [02:31:03] !log tstarling synchronized wmf-config/InitialiseSettings.php [02:31:12] Logged the message, Master [02:31:38] PROBLEM - Puppet freshness on ms-be11 is CRITICAL: Puppet has not run in the last 10 hours [02:31:40] Are you going to manually purge the Squid cache or let it expire naturally? [02:34:19] New patchset: Ryan Lane; "Adding reactor support" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42506 [02:34:56] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds [02:35:32] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds [02:36:29] New patchset: Ryan Lane; "Adding reactor support" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42506 [02:46:28] Susan: manually purge [02:51:11] !log LocalisationUpdate completed (1.21wmf7) at Mon Jan 7 02:51:10 UTC 2013 [02:51:22] Logged the message, Master [02:52:38] PROBLEM - Puppet freshness on ms-be9 is CRITICAL: Puppet has not run in the last 10 hours [02:57:35] PROBLEM - Puppet freshness on ms-be12 is CRITICAL: Puppet has not run in the last 10 hours [03:17:14] RECOVERY - Puppet freshness on search30 is OK: puppet ran at Mon Jan 7 03:17:06 UTC 2013 [04:08:08] !log tstarling synchronized php-1.21wmf6/maintenance/purgeList.php [04:08:18] Logged the message, Master [04:31:31] PROBLEM - Varnish HTTP upload-backend on cp1034 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:31:31] PROBLEM - Varnish traffic logger on cp1034 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:31:49] PROBLEM - Varnish HTCP daemon on cp1034 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:32:17] PROBLEM - SSH on cp1034 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:32:34] PROBLEM - Varnish HTTP upload-frontend on cp1034 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:53:16] PROBLEM - NTP on cp1034 is CRITICAL: NTP CRITICAL: No response from NTP server [05:56:08] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [06:05:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:07:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.342 seconds [06:43:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:48:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.814 seconds [07:25:17] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:29:02] PROBLEM - Puppet freshness on silver is CRITICAL: Puppet has not run in the last 10 hours [07:29:03] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [07:37:44] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.179 seconds [07:46:14] New review: Lupo; "The idea is to add this here and once it's deployed to remove the corresponding default in UploadWiz..." 
[operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/39026 [07:51:05] PROBLEM - Puppet freshness on ssl1001 is CRITICAL: Puppet has not run in the last 10 hours [08:12:18] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:22:57] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.303 seconds [08:26:19] good morning [08:27:44] Morning [08:30:34] Reedy: have you switched to European timezone? :D [08:36:14] Not quite [08:36:26] For today at least ;) [08:38:04] New patchset: Hashar; "Revert "configure pep8 to lint erb templates"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42516 [08:44:34] !g Ib377b732930da687504c3a78cdf921143c5c52d1 [08:44:34] https://gerrit.wikimedia.org/r/#q,Ib377b732930da687504c3a78cdf921143c5c52d1,n,z [08:51:06] !log Jenkins: update all jobs. That bring whitespace checking for MediaWiki extensions. See {{gerrit|37803}} [08:51:20] Logged the message, Master [08:56:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:10:06] !log Jenkins: whitelisted Bryan Tong Minh in Zuul {{gerrit|39850}} [09:10:19] Logged the message, Master [09:10:39] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.032 seconds [09:17:51] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [09:33:19] New patchset: Nikerabbit; "Updated ttmserver Solr schema" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42519 [09:42:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:45:28] !log Jenkins: EventLogging now receives pep8 linting, {{gerrit|42517}} & {{gerrit|42518}} [09:45:37] Logged the message, Master [09:53:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.252 seconds [10:29:52] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:34:06] New patchset: Hashar; "Revert "Kill static-master"" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42526 [10:40:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.359 seconds [10:47:51] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [10:47:52] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [10:58:18] New patchset: Hashar; "Revert "Kill static-master"" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42530 [10:58:52] New patchset: Hashar; "Revert "Kill static-master"" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42526 [10:58:57] Change abandoned: Hashar; "Dupe of https://gerrit.wikimedia.org/r/#/c/42526/" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42530 [10:59:28] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42526 [11:02:01] New review: Hashar; "Fixed up https://bugzilla.wikimedia.org/show_bug.cgi?id=43692" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42526 [11:07:35] !log pulled on fenari {{gerrit|42526}} 'Revert "Kill static-master"' thought NOT synced. That simply bring back some symbolic links for 'beta' so that is harmful. 
[11:07:45] Logged the message, Master [11:08:14] New review: Hashar; "Pulled on fenari though not synced:" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42526 [11:14:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:20:33] PROBLEM - MySQL Slave Delay on db1007 is CRITICAL: CRIT replication delay 211 seconds [11:21:00] PROBLEM - MySQL Replication Heartbeat on db1007 is CRITICAL: CRIT replication delay 222 seconds [11:28:48] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.034 seconds [11:37:57] RECOVERY - MySQL Slave Delay on db1007 is OK: OK replication delay 0 seconds [11:38:25] RECOVERY - MySQL Replication Heartbeat on db1007 is OK: OK replication delay 0 seconds [12:00:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:09:45] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [12:09:46] PROBLEM - Puppet freshness on ms-fe1004 is CRITICAL: Puppet has not run in the last 10 hours [12:09:46] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [12:09:46] PROBLEM - Puppet freshness on ms-fe1003 is CRITICAL: Puppet has not run in the last 10 hours [12:09:46] PROBLEM - Puppet freshness on sq48 is CRITICAL: Puppet has not run in the last 10 hours [12:09:46] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [12:16:06] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.030 seconds [12:20:51] New patchset: Matthias Mullie; "AFTv5: skip rollbacker group on wikis without that" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42402 [12:22:33] New patchset: Matthias Mullie; "AFT test group permissions have been removed already; these lines no longer make sense" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42538 [12:32:26] PROBLEM - Puppet freshness on ms-be11 is CRITICAL: Puppet has not run in the last 10 hours [12:48:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:53:26] PROBLEM - Puppet freshness on ms-be9 is CRITICAL: Puppet has not run in the last 10 hours [12:58:33] PROBLEM - Puppet freshness on ms-be12 is CRITICAL: Puppet has not run in the last 10 hours [13:00:39] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.533 seconds [13:34:06] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:40:09] New review: Silke Meyer; "Yes, like this, I get what I need." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/42502 [13:44:44] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.715 seconds [14:10:03] !log reedy synchronized wmf-config/ [14:10:16] Logged the message, Master [14:20:12] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:22:27] PROBLEM - Puppet freshness on cp1034 is CRITICAL: Puppet has not run in the last 10 hours [14:29:23] hiii paravoid! [14:29:27] am I on duty this week or next? 
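For context on the squid purge exchange above (after $wgEnableCanonicalServerLink was enabled on uzwiki and purgeList.php was synced): manual purging is normally driven by MediaWiki's maintenance/purgeList.php, which reads URLs from stdin and sends purge requests to the caches. A minimal sketch, assuming the standard mwscript wrapper and an operator-prepared urls.txt:

    # Hypothetical example: purge a hand-picked list of uzwiki URLs from the caches.
    $ cat urls.txt | mwscript purgeList.php --wiki=uzwiki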
[14:31:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.517 seconds [14:32:53] ottomata: you are I think [14:32:58] this week [14:33:08] yeehaw [14:33:19] ok, so, that means, we change the room topic to have my name in it [14:33:22] and people will ask me questions [14:33:31] yes [14:33:31] and I will do my best to answer them, orrrr find someone who can [14:33:35] yes [14:33:40] okey dokey [14:33:57] do you have topic change powers? [14:35:15] you can too [14:37:31] oo, ok [14:39:06] hm, should I also be watching RT for new tickets? [14:41:46] New patchset: Hashar; "(bug 43339) deployment roles for beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42549 [14:53:11] ottomata: sit there mashing F5 [14:53:12] :D [14:56:18] haha, ok [15:05:34] ottomata: Hi there :-] Would you mind merging a tiny change for me please https://gerrit.wikimedia.org/r/#/c/42516/ ? That is a revert of a commit that made the python linter pep8 to lint our ERB templates. That does not work as expected so should be removed :-] [15:05:56] haha, certainly, PEP8 for ruby? [15:06:07] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:06:07] isn't pep just a python thing? [15:06:20] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42516 [15:06:40] ottomata: I thought it was smart enough to parse erb templates that hold python code :-] [15:06:51] thanks! [15:06:52] ah [15:06:53] yup! [15:06:56] merged on sockpuppet too [15:07:53] !! [15:10:23] New patchset: Ottomata; "Adding Orange Morocco Wikipedia Zero filter on oxygen." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42552 [15:12:34] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42552 [15:18:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.363 seconds [15:52:00] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:57:24] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [16:06:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.028 seconds [16:21:40] New patchset: Ottomata; "Fixing undefined variable error for Redis module when $redis_replication was not set." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42566 [16:24:16] New patchset: ArielGlenn; "quickie interwiki setup tool for folks with local copies of our dumps" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/42567 [16:27:28] "This will be fixed or classified as not an issue as soon as the author finds out if the multiple values for one key is a bug in the original file or a feature.", hehe [16:37:35] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/42567 [16:37:54] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:50:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.166 seconds [16:58:14] AaronSchulz: ping? [16:58:25] AaronSchulz: how temporary are the contents of "temp" containers? [16:58:45] AaronSchulz: or to ask exactly what I'm looking for: do we need to sync these across to eqiad? 
[17:00:27] AaronSchulz: I'm seeing something like 15% of our storage being temp containers, which I find kind of strange [17:21:37] sbernardin: are you in data center? [17:22:19] Will be back after lunch? [17:22:57] Cmjohnson1: or if you need something now ...I can head back [17:23:46] ms-be5 needs to come off the rack and a new 720xd racked and cfg'd asap [17:25:54] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:27:32] New patchset: Ottomata; "Adding stats user to analytics nodes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42572 [17:28:20] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42572 [17:30:25] PROBLEM - Puppet freshness on silver is CRITICAL: Puppet has not run in the last 10 hours [17:30:25] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [17:31:25] cmjohnson1: will get that done right away...will message you back when I'm done [17:31:52] okay..the ticket will be in your queue [17:36:42] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.735 seconds [17:39:15] PROBLEM - Apache HTTP on srv219 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:39:51] PROBLEM - LVS HTTP IPv4 on rendering.svc.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:39:52] PROBLEM - Apache HTTP on srv221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:39:52] PROBLEM - Apache HTTP on srv222 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:39:52] PROBLEM - Apache HTTP on srv224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:41:08] probably hung converts, shall I shoot em? [17:41:12] PROBLEM - Apache HTTP on srv223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:41:13] paravoid: they are there until they get cleared out by some script and I don't think it's on cron [17:41:30] PROBLEM - Apache HTTP on srv220 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:41:53] doing so [17:41:58] done [17:42:53] RECOVERY - Apache HTTP on srv223 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.057 second response time [17:42:59] ah just when I was about to sweat [17:43:09] RECOVERY - Apache HTTP on srv220 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.976 second response time [17:43:19] RECOVERY - Apache HTTP on srv222 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.213 second response time [17:43:19] yeah it's recovering [17:43:22] memory spike again [17:43:27] RECOVERY - Apache HTTP on srv224 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.701 second response time [17:43:29] I can't wait for j^'s patches [17:43:42] AaronSchulz: can we do that? 
[17:44:14] AaronSchulz: considering we're copying data from pmtpa to eqiad like crazy, getting rid of 15% of our storage wouldn't hurt us at all [17:44:26] AaronSchulz: it'd probably help with the c2100 replacements too [17:45:54] also as more and more uploads go throw UW it will consume more space [17:46:26] awwww crap [17:46:32] I know why imagescalers are freaked out [17:47:17] http://ganglia.wikimedia.org/latest/graph.php?h=ms-fe1.pmtpa.wmnet&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2&st=1357580821&g=network_report&z=medium&c=Swift%20pmtpa [17:47:55] that's a filled up gigabit [17:48:19] wth is wrong with varnish [17:48:42] RECOVERY - Apache HTTP on srv221 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.465 second response time [17:48:43] PROBLEM - Apache HTTP on srv220 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:48:43] RECOVERY - LVS HTTP IPv4 on rendering.svc.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 67674 bytes in 7.558 seconds [17:48:56] paravoid: wow... that's not awesome [17:49:20] I've noticed that 2h ago [17:49:25] the varnish issue [17:50:21] RECOVERY - Apache HTTP on srv220 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.518 second response time [17:51:05] hmm, I see quite a bit of traffic from imgscalers, that might be what brought swift to its knees [17:51:42] RECOVERY - Apache HTTP on srv219 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.058 second response time [17:52:27] PROBLEM - Puppet freshness on ssl1001 is CRITICAL: Puppet has not run in the last 10 hours [17:52:49] 112M tiffs [17:52:51] ffs [17:53:01] and increasing [17:57:14] Constitution_of_the_United_States,_page_2.tif [17:57:42] the page has thumbnails for 4 pages [17:57:45] of 130MB each [17:57:46] how nice [17:57:50] wooo! [17:57:54] but, this is good to know now [17:58:01] as video will only make things worse.... [17:58:14] is asher around? [17:58:20] he's not in the office [17:59:18] the gigabit cutoff is getting better, but we need to fix whatever it is that makes varnish not cache enough [17:59:46] maybe we have dont_cache_enough=1 set? [17:59:49] ;) [18:05:03] New review: Demon; "On second thought--since this class is only included on fenari, perhaps we could just delete all of ..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/41976 [18:08:38] <^demon> AaronSchulz: I think I'm done rewriting ExtDist now. It fetches the archive and sha1 names properly now, and validates if the branch/tag exists. [18:09:15] yeah, I'll get to it after looking through a few things [18:09:21] <^demon> k, thanks :) [18:09:43] <^demon> We might want to increase cache duration--that was my only concern since we have to make 2 http requests to get all of the data. [18:09:55] <^demon> (Granted, this is ExtDist, so it's not like a billion people use it) [18:10:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:13:39] ^demon: more people would be using it if only there was ever been a time when it worked [18:13:47] *had only been [18:13:59] <^demon> The new version works quite well. [18:14:04] <^demon> I think people will like it. [18:14:13] * Nemo_bis surprised [18:14:24] <^demon> I'm writing an e-mail to wikitech-l about it now. [18:14:59] ooh maybe we'll even be able to readd it to {{extension}} :D [18:15:11] ^demon: do we need squid changes? 
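The cleanup script alluded to above (for the swift temp containers that hold upload-stash files) is presumably MediaWiki's cleanupUploadStash.php, the same maintenance script that change 42600, "Periodically prune the upload stash rows/files", later puts on a schedule. A rough sketch of a manual run, assuming the mwscript wrapper; the foreachwiki sweep is an assumption about the available tooling:

    # Prune stale upload-stash rows and their temp files on a single wiki:
    $ mwscript cleanupUploadStash.php --wiki=commonswiki

    # Hypothetical sweep across all wikis, if a foreachwiki-style wrapper is available:
    $ foreachwiki cleanupUploadStash.php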
[18:15:25] good because I am tired of every couple days having to svn clean and chown over there [18:15:29] it was getting old [18:23:58] <^demon> paravoid: Oh man...apaches can't make external http requests can they? [18:24:20] they can't without setting a proxy, no [18:24:23] why? [18:24:33] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.022 seconds [18:24:38] <^demon> The extension makes some requests to the github api. [18:24:46] oh? [18:24:50] <^demon> The new version. [18:25:23] look at how the Flickr plugin does it [18:25:38] upload by url [18:27:17] and also be sure that the GitHub API's terms of service are trivial and if they're not check with legal? :) [18:27:48] <^demon> Well, luckily it's configurable, we can point it at any dumb service that can respond like github's api. [18:27:58] brb, lunchy time! [18:39:41] !log reedy synchronized php-1.21wmf6/extensions/EducationProgram [18:39:51] Logged the message, Master [18:40:42] !log reedy synchronized php-1.21wmf7/extensions/EducationProgram [18:40:52] Logged the message, Master [18:49:33] New review: Pyoungmeister; "I have no opinion on this, but will deploy when you are happy with it." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/42519 [18:56:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:07:13] ottomata: ping? [19:07:21] yeah, i know [19:07:27] hehe [19:07:31] i wish I was there! there's a big ol' analytics meeting right now with sue [19:07:32] that I have to go to [19:07:37] oh, okay [19:07:53] i tried to get them to make it at a different time, but there are like 15 people in this meeting, so hard to change [19:08:05] i will open up etherpad though [19:08:15] chat to me if I need to chime in on anything [19:08:49] great work on RT btw [19:08:57] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.356 seconds [19:09:23] binasher: I have an important varnish-related(?) issue for you when you have the time... [19:09:43] oh, great [19:10:08] heh [19:10:25] We interrupt this program with an important varnish-related announcement.. [19:12:26] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42566 [19:14:52] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: fishbowl, private and closed to wmf7 [19:15:02] Logged the message, Master [19:19:28] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [19:28:56] sbernardin: are you back yet? [19:29:27] ottomata: ori-l ./slot4/00015.stat1.wikimedia.org._a_eventlogging.0 [19:29:39] that means i see backups on tridge [19:29:44] mutante: woot. much much obliged. [19:29:55] yw [19:29:55] cool! [19:30:15] while you are at it, hook up a few 1PB disks so we can back up /a too [19:30:37] it's not about disk space [19:30:45] oh, its not? [19:30:46] it's about the size of virtual tapes [19:30:49] no, never been [19:30:51] oh [19:31:00] hook up a few 1PB tapes then? :p [19:31:17] configure amanda to let a backup span multiple virtual tapes [19:31:26] link is on ticket, heh [19:31:30] oh yeah that's right, i remembrer [19:31:50] or.. 
just add those directories separately i guess [19:31:52] hmmm, so if I made a bunch of /a/ backups, for each one we actually want to back up [19:31:54] yeahhhhh [19:31:57] like we just did with eventloggging [19:32:02] we just had the same idea :) [19:32:03] yep [19:32:13] cool, bumping that up on my todo list then [19:32:19] :) [19:35:43] notpeter, we've got problem: [19:35:45] maxsem@fenari:~$ curl 'http://solr1001:8983/solr/select/?q=*%3A*&version=2.2&start=0&rows=5&indent=on' [19:35:45] curl: (52) Empty reply from server [19:36:22] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: special, wikinews and wikivoyages to 1.21wmf7 [19:36:32] Logged the message, Master [19:37:15] MaxSem: I can try restarting it [19:37:22] oh, no, I got a response [19:37:28] it was just slow as shit [19:37:32] I get it sometimes too [19:37:48] the server looks absolutely not under load [19:38:04] sometimes, it returns quickly [19:38:07] sometimes, not [19:38:14] sometimes, it times out [19:42:10] paravoid: can you write up a full description of the upload varnish issue? [19:42:17] yes [19:42:17] sec [19:43:34] binasher: so, it looks like that after about Christmas the requests to swift have been quadrupled [19:43:39] MaxSem: ok, the box looks alright. but I odn't really know the internals of solr. what can I do to help? [19:44:15] maybem there's something in the logs? [19:44:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:44:47] I would've disabled updates, too - but there's another deployment window ATM [19:45:19] it's lock contention [19:45:27] I will copy the logs to your home dir on fenari [19:46:42] binasher: bandwidth has went up from ~300mbps per box to almost a gigabit [19:47:13] MaxSem: they should be there now [19:47:27] but yeah.... hella stack traces that are all lock times [19:47:31] paravoid: that's bandwidth on ms-fe hosts? 
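On the Amanda thread above: "just add those directories separately" means giving each /a subdirectory its own disklist entry, so that no single dump outgrows a virtual tape. A rough sketch; the config name, dumptype and directory names below are placeholders, not the real values:

    # Hypothetical DLEs, one per /a subdirectory on stat1:
    $ printf '%s\n' \
        'stat1.wikimedia.org /a/some-dir-1 comp-user-tar' \
        'stat1.wikimedia.org /a/some-dir-2 comp-user-tar' \
        >> /etc/amanda/wikimedia/disklist
    $ amcheck wikimedia   # sanity-check the config before the next scheduled run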
[19:47:33] *timeoutes [19:48:08] binasher: heh I'm good at multitasking but not _that_ good :) [19:48:12] notpeter, thanks [19:48:19] so, yeah, that's to ms-fe boxes [19:48:43] the time we had the imagescaler pages today we had surpassed the gigabit (straight line on the gigabit threshold) [19:49:01] I did a few GETs and objects seem to expire after < 10mins [19:49:22] we now have ~1.6kreq/s up from 300-400 in something like two weeks [19:49:36] notpeter, whee - looks like it's because wikis other than en: are now using it too [19:49:49] I was looking at the git log, saw your 40x TTL changes and re-reviewed them, couldn't see anything wrong [19:49:58] they seem to match date-wise though [19:50:27] binasher: look at ms-fe1.pmtpa.wmnet month view in ganglia [19:51:19] so I guess I'll have to either switch to updates via cron or introduce multicore earlier than I expected: https://gerrit.wikimedia.org/r/#/c/29827/ [19:51:34] multicore needs more investigation though [19:51:53] binasher: a random upload varnish in ganglia also shows different trends the past two weeks [19:52:40] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikibooks and wikquote to 1.21wmf7 [19:52:50] Logged the message, Master [19:53:20] binasher: 354307ddadb0e3f74b23c6979d99f23bb7359d2b could explain it, but I don't see anything wrong with it [19:53:47] (well, it should be < 400, rather than <= 400, but still, that's not it) [19:54:23] that came to mind as the the only thing i know of changing around then [19:54:57] New patchset: MaxSem; "Enable Solr distributed search" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42594 [19:55:06] yeah [19:55:09] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.983 seconds [19:55:19] there's also the SHM change these days but I can't see how's related either [19:55:45] MaxSem: sounds reasonable [19:56:32] paravoid: i'll dig in after this meeting [19:57:02] binasher: thanks [19:57:14] I haven't had much time to debug this, but I think I can find some time today as well [19:57:22] do you confirm the findings some far? [19:57:28] so far even [19:58:07] New patchset: MaxSem; "Disable GeoData jobs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42595 [19:59:04] paravoid: yeah.. but what does today mean for you, it's 21:00 for you, isn't it? [19:59:33] it's 22:00, but I'll work late too today, I need to catch up with ceph people too [20:02:27] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikisource, wikiversity and remaining wikimedia to 1.21wmf7 [20:02:37] Logged the message, Master [20:03:11] !log authdns-update to add tellurium.wikimedia.org [20:03:21] Logged the message, Master [20:03:49] New patchset: Hashar; "(bug 43339) deployment roles for beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42549 [20:04:42] New review: Hashar; "Patchset 2:" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/42549 [20:06:59] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wiktionary to 1.21wmf7 [20:07:08] Logged the message, Master [20:07:54] New patchset: Reedy; "1.21wmf7 phase 2 deployment" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42598 [20:08:46] New review: Ryan Lane; "Some problems here." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42549 [20:12:17] Reedy, can I squeeze in a quick configuration change? 
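A handy way to quantify the intermittent Solr slowness MaxSem reports above (empty replies, occasional timeouts, lock contention in the logs) is to repeat the same select query with curl's timing output and a hard timeout, so the stalls show up as measurable outliers. A small sketch using the query from the log with a 15-second cap:

    # Probe the select handler 20 times; print HTTP status and total time per request.
    $ for i in $(seq 1 20); do
        curl -s -o /dev/null -m 15 -w '%{http_code} %{time_total}s\n' \
          'http://solr1001:8983/solr/select/?q=*%3A*&version=2.2&start=0&rows=5&indent=on'
      done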
[20:12:30] Yeah, I think I'm done.. [20:12:48] thanks! [20:13:13] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42595 [20:15:05] Reedy, there's an undeployed wikiversions.dat change in master [20:15:13] paravoid: was looking at the backend instance on cp1027, it appeared not to cache anything beyond 10 seconds [20:15:14] No there isn't [20:15:23] It's deployed, it's committed [20:15:24] binasher: yeah [20:15:27] it's just not approved and updated back in [20:15:43] paravoid: restarted it with no config change, and now its behaving normally [20:15:50] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42598 [20:15:57] the object i was testing with has been cached for > 2min (since restarting) [20:15:57] 21:49 < paravoid> I did a few GETs and objects seem to expire after < 10mins [20:16:06] Reedy, git diff HEAD origin disagrees:) [20:16:16] It's not undeployed [20:16:17] secs vs. mins heh [20:16:20] paravoid: going directly to the backend instances, i got expire at 10 seconds [20:16:21] It's perfectly well deployed [20:16:22] aha [20:16:23] Anyway, fixed [20:16:40] frontends are behaving differently, possibly normally [20:16:40] New patchset: Hashar; "(bug 43339) deployment roles for beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42549 [20:17:07] binasher: yeah, I saw 10 *secs* too [20:17:12] sorry for the confusion [20:17:17] ah, ok [20:17:46] still seeing it on cp1024 [20:17:49] !log maxsem synchronized wmf-config/InitialiseSettings.php 'Disable GeoData jobs' [20:17:58] Logged the message, Master [20:18:31] binasher: cp1024 cache hit is ~21%, cp1027 is now ~80% [20:18:40] cp1026 doesn't have to prob, but its backed varnishd has only been running since jan 3 [20:18:48] Reedy, thanks - I'm done [20:19:32] cp1026 hitrate is 97%(!) [20:19:38] (all 10sec ones) [20:19:49] doing a vcl reload on cp1024 didn't help, now did a restart [20:20:08] and now cp1024 looks ok [20:20:10] yep, seeing the hitrate increase [20:20:12] hrm [20:20:34] you can't complain, I always come to you with somewhat peculiar issues, don't I [20:20:38] notpeter, now that I've disabled updates, it immediately resumed working [20:21:55] binasher: cp1026 was the one that I restarted when we were both in a meeting [20:22:03] paravoid: i wonder if this is a bug in the persistent backend [20:22:46] do you think the config reload of yours was a coincidence? [20:23:02] possibly not [20:23:05] it might have triggered this somehow [20:23:12] yeah [20:24:00] you should probably !log the restarts btw [20:24:26] seems to be a bug considering a vcl reload didn't fix the behavior, but maybe one that isn't easily encountered [20:24:35] nod [20:24:46] paravoid: want to just restart all of the upload backends? [20:26:00] shall we not debug this further? [20:26:49] New review: Umherirrender; "Change by Matthias Mullie looks good to me (but I will not vote on my own commit)" [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/42402 [20:28:55] MaxSem: gotcha [20:28:56] ok [20:28:56] i wish vcl.show showed the entire running vcl / followed includes [20:29:19] now I'm out of a meeting. where do things stand? do you need any help from me? still want to deploy today? 
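The "expires after 10 seconds" observations above come from fetching the same object twice from a backend varnish and watching the response headers: on a healthy instance the Age header keeps climbing between hits, on the broken ones it resets every ~10 seconds. A sketch of that check; the URL, the backend port and the presence of an X-Cache header are assumptions here:

    # Fetch the same thumbnail twice from one backend and compare Age / X-Cache:
    $ URL='http://cp1027.eqiad.wmnet:3128/wikipedia/commons/thumb/x/xx/Example.jpg/220px-Example.jpg'
    $ curl -s -o /dev/null -D - -H 'Host: upload.wikimedia.org' "$URL" | grep -iE '^(age|x-cache):'
    $ sleep 30
    $ curl -s -o /dev/null -D - -H 'Host: upload.wikimedia.org' "$URL" | grep -iE '^(age|x-cache):'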
[20:30:51] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:31:45] PROBLEM - MySQL Slave Delay on db1007 is CRITICAL: CRIT replication delay 201 seconds [20:32:39] PROBLEM - MySQL Replication Heartbeat on db1007 is CRITICAL: CRIT replication delay 209 seconds [20:32:41] notpeter, yes please - merge https://gerrit.wikimedia.org/r/#/c/40569/ and https://gerrit.wikimedia.org/r/#/c/42112/ then force a puppet run on solr1001 then on other hosts [20:34:27] RECOVERY - MySQL Replication Heartbeat on db1007 is OK: OK replication delay 0 seconds [20:35:21] RECOVERY - MySQL Slave Delay on db1007 is OK: OK replication delay 0 seconds [20:35:42] paravoid: on cp1023, it looks like *every* allocation request to SMP.main-sda3 is failing [20:36:25] paravoid: and shortlived 10.000000 [s] [20:36:49] can't save object -> transient storage with the shortlived ttl [20:39:44] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40569 [20:39:52] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42112 [20:40:32] binasher: can I merge your redis stuff? [20:41:01] yup [20:41:07] ok [20:42:23] MaxSem: ok, puppet is currently in its dead state. I'll make sure puppet runs on solr1001-1003 asap [20:42:33] New patchset: Aaron Schulz; "Periodically prune the upload stash rows/files." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42600 [20:43:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.037 seconds [20:44:03] RECOVERY - Puppet freshness on stat1 is OK: puppet ran at Mon Jan 7 20:43:33 UTC 2013 [20:44:28] maplebed: sup [20:44:31] yo [20:44:34] how are you? [20:44:43] good! came by with a question though... [20:44:49] sure [20:44:52] what's up with the 403 on http://ganglia.wikimedia.org/latest/ ? [20:45:05] maplebed, ZOMGXSS [20:45:10] lol [20:46:09] so if you're not with the cabal, you're not allowed to hack it [20:46:32] MaxSem: ok, puppet has run. should be live now [20:46:37] how shall we test? [20:46:45] PROBLEM - Host ms-be5 is DOWN: PING CRITICAL - Packet loss = 100% [20:47:30] notpeter, curl http://solr1001:8983/solr/admin/cores?action=STATUS [20:47:36] !log replacing ms-be5 with R720xd [20:47:46] Logged the message, Master [20:47:52] shows that the slave has as many docs as master does [20:48:24] notpeter, you've dep;oyed it on ly to solr100x? [20:48:36] yeah [20:48:51] I'm not actually sure that solr [1-3] is up... [20:48:52] ok, I'll need to tweak my MW config [20:48:52] lemme check [20:49:04] notpeter, only solr1 seems down [20:49:08] ah [20:49:09] ok [20:49:28] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [20:49:29] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [20:49:41] forcing puppet runs on 2 and 3 now [20:49:45] will see what's up with 1 [20:49:53] looks like it just never had puppet run on it... [20:50:30] yeah [20:50:31] lame [20:50:35] lemme fire it up now [20:51:35] binasher: sorry, was in a meeting [20:52:18] Solr's insta-replication: https://ganglia.wikimedia.org/latest/graph.php?h=solr1001.eqiad.wmnet&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2&st=1357591872&g=network_report&z=medium&c=Miscellaneous%20eqiad [20:52:55] pushin' mad bits, yo [20:53:37] MaxSem: woo! [20:55:36] MaxSem: ok, solr1 should also now actually be up..... [20:55:43] need anything else from me? [20:56:30] notpeter, not yet. 
I'll run a load test [20:56:38] ok, cool [21:00:55] New patchset: MaxSem; "Enable Solr distributed search" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42594 [21:01:09] PROBLEM - SSH on lvs6 is CRITICAL: Server answer: [21:01:38] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42594 [21:02:49] RECOVERY - SSH on lvs6 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [21:05:34] paravoid: i managed to crash the main varnishd worker process on cp1023 , the parent respawned it, and that also fixed the storage allocation failures + caching issue [21:07:12] at this point, i'd just go for restarting all of the backends unless you want to troubleshoot further [21:08:44] !log maxsem synchronized wmf-config/CommonSettings.php 'Multi-server solr' [21:08:55] Logged the message, Master [21:10:45] RECOVERY - Puppet freshness on silver is OK: puppet ran at Mon Jan 7 21:10:34 UTC 2013 [21:11:30] RECOVERY - Puppet freshness on zhen is OK: puppet ran at Mon Jan 7 21:11:18 UTC 2013 [21:14:12] New review: Nikerabbit; "It would be great if you could merge & deploy this before the i18n deployment window tomorrow." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/42519 [21:17:12] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:18:42] !log jenkins: splitting whitespace check out of mediawiki linting job. Will make whitespace a non voting change to avoid rejecting legitimate changes. [21:18:53] Logged the message, Master [21:23:37] hmm, why do Solr servers in pmtpa and eqiad have a different number of cores? [21:28:12] MaxSem: probably because they were purchased at different times ? [21:28:27] mhm, I thought they were identical [21:28:29] pmtpa is oool [21:28:31] d [21:28:46] oh nm, solr1 is pretty ne [21:28:47] new [21:29:28] maybe, HT is not enabled there? [21:29:30] lesliecarr: yep...looking at it now...if i recall they were ordered at the same time [21:29:45] maxsem: that is my thought as well [21:29:49] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.056 seconds [21:30:13] that's bizarre .... [21:30:33] maybe -- because on solr1 it has a mhz of 2400, but on solr1001 each core is showing 1200 [21:30:41] eek [21:31:00] cpu throttling for power saving? 
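What binasher describes on cp1023 above is the failure mode where the persistent storage silo (SMP.main-sda3) stops accepting allocations, so every object falls through to Transient storage and lives only for the "shortlived" parameter, 10 seconds by default, which matches the 10-second expiry seen earlier. A rough sketch of spotting it from the shell; exact counter names vary by Varnish version, so treat them as indicative:

    # Current value of the shortlived parameter (TTL threshold for Transient storage);
    # may need -T/-S arguments depending on how the instance was started:
    $ varnishadm param.show shortlived

    # Allocation failures on the persistent silo plus traffic piling into Transient
    # is the tell-tale combination:
    $ varnishstat -1 | grep -E 'SMP\.|Transient'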
[21:31:02] New patchset: MaxSem; "Reduce the load on Solr servers in pmtpa" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42673 [21:32:22] New patchset: MaxSem; "Reduce the load on Solr servers in pmtpa" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42673 [21:32:56] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42673 [21:34:22] !log maxsem synchronized wmf-config/CommonSettings.php 'Reduce the load on Solr servers in pmtpa' [21:34:33] Logged the message, Master [21:39:44] New patchset: MaxSem; "pmtpa is even slower" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42676 [21:40:54] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42676 [21:42:05] !log maxsem synchronized wmf-config/CommonSettings.php [21:42:15] Logged the message, Master [21:43:09] !log restarting backend varnish on cp1030 (previously restarted cp102[346]) [21:43:19] Logged the message, Master [21:46:20] so, turns out that solr1-3 is overall 4 times slower than solr1001-1003 [21:47:12] Nikerabbit: hey, I'm going to merge and push your solr schema change now [21:47:15] MaxSem: boo. [21:47:27] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42519 [21:47:39] niiice [21:49:20] notpeter, other than the HW problem, everything looks good so far, 80 concurrent requests generate a ~10% load on the cluster. thanks a lot for your help [21:50:08] defniitely! [21:50:11] there's also https://gerrit.wikimedia.org/r/#/c/40304/ which is related to Solr, but it's not urgent [21:50:51] MaxSem: sure, i'll just do that now [21:51:01] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40304 [21:51:10] done [21:51:15] !log restarted backend varnish on cp102[27]. cp103[0-6] [21:51:19] thanks [21:51:25] yep! no problem [21:51:26] Logged the message, Master [21:52:50] notpeter: cool, thanks [21:54:01] Nikerabbit: sure! no problem [21:56:35] New patchset: Cmjohnson; "Changing mac for ms-be5 |replaced h/w" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42679 [21:57:52] New review: Spage; "Our labs instance piramido doesn't seem to have any role_requires or config_lines from puppet so it'..." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/42502 [21:57:53] Change merged: Cmjohnson; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42679 [21:58:07] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42502 [21:58:08] New review: Hashar; "Please avoid changing unrelated lines, even if it is just a comment. That add some extra steps when ..." [operations/mediawiki-config] (master); V: 0 C: -1; - https://gerrit.wikimedia.org/r/42232 [22:00:12] cmjohnson1: Shall I merge your change onto sockpuppet since I'm about to merge mine? 
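The core-count and clock-speed puzzle above (24 vs 12 cores, 2400 vs 1200 MHz on otherwise identical Solr hardware) separates into two independent questions, hyper-threading and frequency scaling, both of which can be checked from the shell before anyone reboots into the BIOS:

    # Logical CPUs, threads per core (HT) and sockets at a glance:
    $ lscpu | grep -E '^(CPU\(s\)|Thread|Core|Socket)'

    # If "siblings" is twice "cpu cores", hyper-threading is enabled:
    $ grep -E 'siblings|cpu cores' /proc/cpuinfo | sort -u

    # 1200 MHz with an "ondemand" governor at idle is power saving, not a slower part:
    $ grep 'cpu MHz' /proc/cpuinfo | sort -u
    $ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor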
[22:00:28] sure [22:00:58] Hm, someone beat me to it [22:01:33] thought I had already merged it b4 [22:01:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:01:52] maybe just lag...didn't see your change though [22:01:53] Probably you merged yours & mine in the time between when I noticed and asked [22:11:30] PROBLEM - Puppet freshness on sq48 is CRITICAL: Puppet has not run in the last 10 hours [22:11:31] PROBLEM - Puppet freshness on ms-fe1004 is CRITICAL: Puppet has not run in the last 10 hours [22:11:31] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [22:11:31] PROBLEM - Puppet freshness on ms-fe1003 is CRITICAL: Puppet has not run in the last 10 hours [22:11:31] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [22:11:31] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [22:11:44] can't ssh into locke and in a meeting.. can someone take a look? [22:13:49] New patchset: Pyoungmeister; "testing: db61 and db62 to s2 for testing" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42683 [22:14:09] binasher: works for me [22:14:20] notpeter: what host are you on? [22:14:22] binasher, lcoke? [22:14:24] fenari [22:14:24] works for me too [22:15:44] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42683 [22:15:56] works for me now too.. ok! [22:16:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.050 seconds [22:17:34] cmjohnson1, shall I create a RT ticket about solr1-3? [22:18:27] PROBLEM - Host db62 is DOWN: PING CRITICAL - Packet loss = 100% [22:20:14] RECOVERY - Host db62 is UP: PING WARNING - Packet loss = 73%, RTA = 0.39 ms [22:21:13] maxsem: yes plz [22:22:28] New review: Dereckson; "I'm not sure it's here an issue." [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/42232 [22:27:35] PROBLEM - MySQL Replication Heartbeat on db62 is CRITICAL: Connection refused by host [22:28:11] PROBLEM - SSH on db62 is CRITICAL: Connection refused [22:28:11] PROBLEM - MySQL Slave Delay on db62 is CRITICAL: Connection refused by host [22:28:11] PROBLEM - Full LVS Snapshot on db62 is CRITICAL: Connection refused by host [22:28:12] New patchset: Aaron Schulz; "Set handler for wfShellWikiCmd hook." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42684 [22:28:48] PROBLEM - MySQL Slave Running on db62 is CRITICAL: Connection refused by host [22:28:56] PROBLEM - MySQL Recent Restart on db62 is CRITICAL: Connection refused by host [22:28:56] PROBLEM - MySQL disk space on db62 is CRITICAL: Connection refused by host [22:29:05] PROBLEM - MySQL Idle Transactions on db62 is CRITICAL: Connection refused by host [22:30:14] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42684 [22:30:31] maxsem: can i take a solr server down...i wanna check bios cfg? 
[22:30:47] cmjohnson1, yes [22:31:00] I've created http://rt.wikimedia.org/Ticket/Display.html?id=4282 BTW [22:31:21] RECOVERY - Host ms-be5 is UP: PING OK - Packet loss = 0%, RTA = 3.14 ms [22:31:25] cool...going to go w/solr....i did see it [22:32:52] !log repooling ssl1001 with proxy_http_version 1.1 enabled [22:33:01] Logged the message, Master [22:33:26] PROBLEM - Puppet freshness on ms-be11 is CRITICAL: Puppet has not run in the last 10 hours [22:35:14] PROBLEM - swift-object-replicator on ms-be5 is CRITICAL: Connection refused by host [22:35:15] PROBLEM - swift-container-replicator on ms-be5 is CRITICAL: Connection refused by host [22:35:15] PROBLEM - swift-account-reaper on ms-be5 is CRITICAL: Connection refused by host [22:35:15] !log shutting down solr3 to check bios setup [22:35:25] Logged the message, Master [22:35:41] PROBLEM - swift-container-server on ms-be5 is CRITICAL: Connection refused by host [22:35:42] PROBLEM - swift-object-server on ms-be5 is CRITICAL: Connection refused by host [22:35:42] PROBLEM - swift-object-updater on ms-be5 is CRITICAL: Connection refused by host [22:35:42] PROBLEM - swift-container-updater on ms-be5 is CRITICAL: Connection refused by host [22:35:42] PROBLEM - swift-account-replicator on ms-be5 is CRITICAL: Connection refused by host [22:35:59] PROBLEM - SSH on ms-be5 is CRITICAL: Connection refused [22:36:00] PROBLEM - swift-account-server on ms-be5 is CRITICAL: Connection refused by host [22:36:27] PROBLEM - swift-object-auditor on ms-be5 is CRITICAL: Connection refused by host [22:36:44] PROBLEM - swift-account-auditor on ms-be5 is CRITICAL: Connection refused by host [22:36:45] PROBLEM - swift-container-auditor on ms-be5 is CRITICAL: Connection refused by host [22:37:10] ummm [22:37:17] wtf is going on with swift? [22:37:30] apergos or paravoid you do that? ^ [22:37:36] !log aaron synchronized wmf-config/CommonSettings.php 'set wfWikiShellCmd hook.' [22:37:45] Logged the message, Master [22:37:56] PROBLEM - Host solr3 is DOWN: PING CRITICAL - Packet loss = 100% [22:38:04] !log aaron synchronized php-1.21wmf7/includes/api/ApiUpload.php 'deployed 542bfaef8f4ca61db04bb60cbd1e617c731db038' [22:38:13] Logged the message, Master [22:38:16] I think it's apergos [22:38:24] ms-be5 was scheduled to be replaced today [22:38:43] ah [22:38:43] ok [22:39:01] robh: that is me [22:39:04] or the server is sentient and decided to commit suicide before we kill it [22:39:11] heh [22:39:16] knowing swift, that's it [22:39:31] * Ryan_Lane hates storage [22:39:39] swift per se is not bad in this respect [22:39:42] c2100s on the other hand... [22:39:46] ms-be5 was officially allowed to be killed off today for the new and improved ms-be5 [22:39:49] indeed [22:40:37] New review: Aude; "@hashar you can blame me for both lines. don't know why I put the full bugzilla url and happy they ..." [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/42232 [22:42:28] binasher: oh you did all of them already [22:42:50] binasher: I was halfway into them until I realized it and saw your !log [22:43:05] binasher: thanks a lot [22:44:21] New patchset: Aaron Schulz; "Periodically prune the upload stash rows/files." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42600 [22:45:35] AaronSchulz: speaking of periodic pruning, what should we do about that -temp thing? [22:45:40] AaronSchulz: shall I open a bugzilla? [22:45:48] what thing? 
[22:46:00] cleaning up temp containers [22:46:09] you said there are stale files there because we're lacking a cronjob or something [22:46:28] those are for temp files [22:48:12] *that is [22:48:32] didn't you say before that we're not cleaning those up? [22:48:40] so presumably there are a lot of stale files in there? [22:48:53] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:49:20] PROBLEM - NTP on db62 is CRITICAL: NTP CRITICAL: No response from NTP server [22:49:49] I said the script was not run systematically, but it should clear much of them up [22:51:45] so there is a script already? [22:51:47] that we should run? [22:52:14] yeah, that one :) [22:53:44] r42600 you mean? [22:54:25] New patchset: Aude; "(bug 43630) set autoconfirm count to 10 @ fawiki, per consensus" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42232 [22:54:35] PROBLEM - Puppet freshness on ms-be9 is CRITICAL: Puppet has not run in the last 10 hours [22:56:49] New patchset: Aude; "Harmonize bugzilla links for $wgAutoConfirmCount" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42688 [22:56:50] PROBLEM - Host ms-be5 is DOWN: PING CRITICAL - Packet loss = 100% [22:57:20] !log powercycling cp1034, dead for 18h, evidence of sudden memory exhaustion [22:57:30] Logged the message, Master [22:58:02] RECOVERY - SSH on db62 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [22:58:17] so apparently we had swift running with essentially no caching for about 12 days [22:58:21] interesting [22:58:40] wow [22:58:41] really? [22:59:01] what kind of caching? paging? [22:59:16] yeah, varnish was broken [22:59:31] varnish bug it seems [22:59:33] PROBLEM - Puppet freshness on ms-be12 is CRITICAL: Puppet has not run in the last 10 hours [22:59:38] see backlog, we were discussing it with binasher [22:59:45] he restarted varnishd and that fixed it [23:00:35] paravoid: wasn't hitrate still around 20-40%? [23:00:35] RECOVERY - SSH on cp1034 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [23:00:45] oh right, it was 22% [23:00:51] so not quite no caching, but still [23:00:54] in 2-3 servers I saw [23:01:02] RECOVERY - Varnish HTCP daemon on cp1034 is OK: PROCS OK: 1 process with UID = 997 (varnishhtcpd), args varnishhtcpd worker [23:01:02] yeah, my bad [23:01:06] plus some more from frontend varnishes [23:01:08] that's also puzzling [23:01:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.035 seconds [23:01:29] RECOVERY - Varnish HTTP upload-backend on cp1034 is OK: HTTP OK HTTP/1.1 200 OK - 632 bytes in 0.054 seconds [23:01:37] frontends are always a miss for me [23:01:49] for more than a few seconds [23:01:51] i think some amount of stuff cached before the breakage survived being lru'd.. though lru seemed kinda broken [23:02:03] probably too small of a cache [23:02:41] RECOVERY - Host ms-be5 is UP: PING OK - Packet loss = 0%, RTA = 0.42 ms [23:06:31] binasher: any feelings on keepalive connection settings for nginx, now that we have http 1.1? [23:06:34] http://nginx.org/en/docs/http/ngx_http_upstream_module.html#keepalive [23:06:59] # of connections is per worker process. 
i think we have that number set to something stupidly high [23:07:21] 64 processes [23:07:24] yeah, it was bumped at some point [23:07:31] and is unreasonably high in my opinion as well [23:07:32] accidentally bumped [23:07:39] yes, agreed [23:07:53] I think one or two processes per core is reasonable [23:08:18] nod [23:08:26] actually, I'm going to change that now [23:08:51] +1 [23:09:11] it eats a shit-ton of memory in its current config [23:11:07] I'm going to set it to the processor count [23:13:17] MaxSem: So I am looking at the solr issue with Chris [23:13:25] and the issue of 24 versus 12 cores is due to hyperthreading [23:13:30] usually, we turn it off entirely. [23:13:40] so to match, we would go in and turn off ashburn hyperthreading [23:13:49] (unless solr can make use of it effectively?) [23:14:05] do most things not make proper use of hyperthreading? [23:14:09] no. [23:14:10] RobH, but live testing indicates that the Asburn servers are faster [23:14:14] which is why we tend to turn it off. [23:14:22] 4 times as fast, as the matter of fact [23:14:33] then perhaps it does use HT effectively? [23:14:45] we can make them match and see what it does [23:14:58] 1 worker per real core seems good for nginx [23:15:06] agreed [23:15:07] however, HT would explain only 2x difference [23:15:21] I'm setting it conditionally for the ssl cluster [23:15:25] New patchset: Ryan Lane; "Reduce nginx worker_process count for ssl cluster" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42690 [23:15:34] because mark had some reason to have it set otherwise for another system [23:16:02] MaxSem: So they are identical cpu [23:16:11] iirc i just placed the solr quote order twice [23:16:15] so they should match, comparing now [23:16:19] so, now ssl cluster systems will have 8 processes running. now to think how many keepalives per process.... [23:16:27] are any of the solr ashburn servers in service? [23:16:32] MaxSem: or can i pull one offline for comparison? [23:17:10] yea, identical cpu, identical memory [23:17:19] but i need to compare the bios on a known 'faster' server to solr3 [23:17:28] roughly 1000 backend connections in the ESTABLISHED state, currently, and 5,500 in any state [23:17:36] per system [23:17:54] cmjohnson1: this all looks right to me (solr3) but lets compare to one that MaxSem has the higher performance metrics on. [23:18:15] yep..that is the next logical step [23:19:09] its not the disk controller, both match, cpu matches, memory matches [23:19:11] RECOVERY - MySQL Slave Delay on db62 is OK: OK replication delay seconds [23:19:18] so yea, the next step is compare the bus speed settings in bios. [23:19:29] RECOVERY - MySQL Slave Running on db62 is OK: OK replication [23:19:33] lets see if MaxSem can let us take one in ashburn down. 
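For the nginx sizing discussion above (change 42690): the keepalive count in an upstream block is per worker process, so the total number of idle upstream connections scales as workers times keepalive, which is why dropping from 64 workers to one per physical core matters before picking a keepalive value. A quick sanity check on a box after the change, as a sketch:

    # Workers actually running vs. logical CPUs (halve the CPU count if HT is on):
    $ pgrep -c -f 'nginx: worker process'
    $ grep -c '^processor' /proc/cpuinfo

    # TCP connections grouped by state; compare the ESTABLISHED count with
    # workers x keepalive:
    $ netstat -tan | awk 'NR>2 {print $6}' | sort | uniq -c | sort -rn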
[23:19:47] RECOVERY - MySQL Replication Heartbeat on db62 is OK: OK replication delay seconds [23:19:48] RECOVERY - MySQL disk space on db62 is OK: DISK OK [23:19:48] RECOVERY - MySQL Idle Transactions on db62 is OK: OK longest blocking idle transaction sleeps for seconds [23:20:14] RECOVERY - MySQL Recent Restart on db62 is OK: OK seconds since restart [23:20:33] RECOVERY - Full LVS Snapshot on db62 is OK: OK no full LVM snapshot volumes [23:20:41] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42690 [23:21:38] updated ticket 4282 with details, we are blocked until solr1001/2/3 (just one of them) can come down for bios settting comparison [23:22:09] RobH, yes you can pull any server [23:22:12] RECOVERY - Puppet freshness on cp1034 is OK: puppet ran at Mon Jan 7 23:21:41 UTC 2013 [23:22:21] awesome, pulling down solr1003 [23:22:35] !log solr1003 coming offline for comparison work per rt 4282 [23:22:46] Logged the message, RobH [23:22:46] cmjohnson1: so wanna take back over and do the initial compare? [23:22:47] RECOVERY - Varnish HTTP upload-frontend on cp1034 is OK: HTTP OK HTTP/1.1 200 OK - 641 bytes in 0.061 seconds [23:23:14] RECOVERY - Puppet freshness on ssl1001 is OK: puppet ran at Mon Jan 7 23:23:00 UTC 2013 [23:23:15] RECOVERY - Varnish traffic logger on cp1034 is OK: PROCS OK: 3 processes with command name varnishncsa [23:23:16] cmjohnson1: I logged out of solr3 and solr1003. You should be able to do a normal clean reboot of 1003 and check out the bios [23:23:28] robh: yeah...let me take a look [23:23:30] all yours [23:26:44] !log aaron synchronized php-1.21wmf7/includes/api/ApiUpload.php 'deployed e2df801500f9472f812388006206e3e3261d58d3' [23:26:54] Logged the message, Master [23:27:38] New patchset: Pyoungmeister; "testing: swapping db61 and db62 to coredb" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42696 [23:27:53] PROBLEM - Host solr1003 is DOWN: PING CRITICAL - Packet loss = 100% [23:28:04] AaronSchulz: are more tests of chunked uploading welcome/needed? [23:28:04] maxsem: so you prefer HT? cuz the setup in ashburn has it enabled and tampa disabled (robh) [23:28:29] PROBLEM - NTP on ms-be5 is CRITICAL: NTP CRITICAL: No response from NTP server [23:28:42] cmjohnson1, yes - HT appears to improve Solr performance [23:28:56] Nemo_bis: meh, if you want, I just got done fixing some stuff [23:29:18] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42696 [23:29:22] maxsem: i will fix solr3 now and the other 2 later...i gotta run out [23:29:40] awesome [23:30:32] robh or any help needed ms-be5 is in continuous installer even though it is set to boot once [23:33:26] RECOVERY - Host solr3 is UP: PING WARNING - Packet loss = 28%, RTA = 0.24 ms [23:34:29] RECOVERY - Host solr1003 is UP: PING OK - Packet loss = 0%, RTA = 26.53 ms [23:34:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:37:20] RECOVERY - NTP on db62 is OK: NTP OK: Offset -0.01844382286 secs [23:39:53] PROBLEM - Host ms-be5 is DOWN: PING CRITICAL - Packet loss = 100% [23:45:45] RECOVERY - Host ms-be5 is UP: PING OK - Packet loss = 0%, RTA = 0.41 ms [23:50:43] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.025 seconds [23:53:10] New patchset: Pyoungmeister; "testing: borked my testing. 
[23:53:10] New patchset: Pyoungmeister; "testing: borked my testing. retesting db62" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42700
[23:53:32] PROBLEM - LVS HTTPS IPv4 on wikinews-lb.esams.wikimedia.org is CRITICAL: Connection refused
[23:53:50] PROBLEM - LVS HTTPS IPv6 on wikibooks-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection refused
[23:53:50] PROBLEM - LVS HTTPS IPv6 on wikisource-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection refused
[23:53:50] PROBLEM - HTTPS on ssl3002 is CRITICAL: Connection refused
[23:54:00] PROBLEM - LVS HTTP IPv6 on upload-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection refused
[23:54:00] PROBLEM - LVS HTTP IPv6 on wiktionary-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection refused
[23:54:09] PROBLEM - LVS HTTPS IPv6 on wikiquote-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection refused
[23:54:09] PROBLEM - LVS HTTPS IPv6 on foundation-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection refused
[23:54:09] PROBLEM - LVS HTTPS IPv6 on wiktionary-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection refused
[23:54:17] PROBLEM - LVS HTTPS IPv6 on upload-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection refused
[23:54:18] PROBLEM - LVS HTTPS IPv6 on wikipedia-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection refused
[23:54:21] Ryan_Lane:
[23:54:22] -_-
[23:54:27] PROBLEM - LVS HTTPS IPv6 on bits-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection refused
[23:54:35] !log starting nginx on ssl3002
[23:54:35] PROBLEM - LVS HTTPS IPv6 on mediawiki-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection refused
[23:54:35] PROBLEM - LVS HTTPS IPv6 on wikinews-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection refused
[23:54:44] Logged the message, Master
[23:54:53] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42700
[23:54:58] Ryan_Lane: you should probably apply your fix manually on all the hosts...
[23:55:07] or fix whatever causes it to die I guess :-)
[23:55:08] Ryan_Lane: this is your fault?
[23:55:11] RECOVERY - LVS HTTPS IPv4 on wikinews-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 77088 bytes in 0.776 seconds
[23:55:15] my phone is goin nuts.
[23:55:29] not his fault technically
[23:55:29] RECOVERY - LVS HTTPS IPv6 on wikibooks-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 67662 bytes in 0.803 seconds
[23:55:30] RECOVERY - LVS HTTPS IPv6 on wikisource-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 67802 bytes in 0.796 seconds
[23:55:30] * RobH places an order for tar and feathers on amazon.
[23:55:42] RECOVERY - HTTPS on ssl3002 is OK: OK - Certificate will expire on 08/22/2015 22:23.
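Along the lines of the "apply your fix manually on all the hosts" suggestion above, a hypothetical sweep of the esams ssl hosts; the host names other than ssl3002 are assumed for illustration, and the init-script path may differ:

    # Start nginx on any esams ssl host where it is not running, rather than
    # waiting for the next puppet run to do it. Host list is illustrative.
    for host in ssl3001 ssl3002 ssl3003 ssl3004; do
        ssh "$host" 'pgrep -x nginx >/dev/null || sudo /etc/init.d/nginx start'
    done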
[23:55:47] RECOVERY - LVS HTTP IPv6 on wiktionary-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 67802 bytes in 0.570 seconds
[23:55:48] RECOVERY - LVS HTTP IPv6 on upload-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 637 bytes in 0.231 seconds
[23:55:48] RECOVERY - LVS HTTPS IPv6 on wikiquote-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 67802 bytes in 0.803 seconds
[23:55:48] RECOVERY - LVS HTTPS IPv6 on foundation-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 67799 bytes in 0.796 seconds
[23:55:50] he was the one to trigger the bug in our recipes :)
[23:55:56] RECOVERY - LVS HTTPS IPv6 on upload-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 637 bytes in 0.463 seconds
[23:55:58] RECOVERY - LVS HTTPS IPv6 on wiktionary-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 67799 bytes in 0.811 seconds
[23:55:58] RECOVERY - LVS HTTPS IPv6 on wikipedia-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 67808 bytes in 0.917 seconds
[23:56:25] RECOVERY - LVS HTTPS IPv6 on bits-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 3921 bytes in 0.451 seconds
[23:56:25] RECOVERY - LVS HTTPS IPv6 on mediawiki-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 67803 bytes in 0.804 seconds
[23:56:25] RECOVERY - LVS HTTPS IPv6 on wikinews-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 67800 bytes in 0.805 seconds
[23:56:25] can we blame him anyways? ;)
[23:56:39] why didn't this happen in eqiad?
[23:56:48] they all worked properly after a puppet run ther
[23:56:49] *there
[23:57:44] PROBLEM - Host db62 is DOWN: PING CRITICAL - Packet loss = 100%
[23:58:47] and in pmtpa
[23:58:51] worked there as well
[23:59:34] New patchset: Pyoungmeister; "testing: also assinging to s2" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42701
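Since the eqiad and pmtpa hosts came back cleanly after a puppet run, the equivalent manual step on the affected esams hosts would be forcing an agent run rather than waiting for the schedule. Sketch only; on older agents the command is spelled puppetd --test instead:

    # Trigger an immediate puppet run so the host picks up the merged change.
    sudo puppet agent --test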