[00:05:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.806 seconds [00:40:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:46:39] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [00:46:39] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [00:52:53] New patchset: Andrew Bogott; "Rearrange LocalSettings yet again." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42502 [00:54:32] New review: Andrew Bogott; "Silke --" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/42502 [00:55:02] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.031 seconds [01:10:38] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 262 seconds [01:12:26] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 9 seconds [01:26:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:35:32] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 212 seconds [01:35:59] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 228 seconds [01:41:05] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.020 seconds [01:59:23] !log tstarling synchronized php-1.21wmf6/includes/DefaultSettings.php [01:59:34] Logged the message, Master [01:59:46] !log tstarling synchronized php-1.21wmf6/includes/Article.php [01:59:55] Logged the message, Master [02:00:11] !log tstarling synchronized php-1.21wmf6/includes/OutputPage.php [02:00:25] Logged the message, Master [02:00:34] !log tstarling synchronized php-1.21wmf6/includes/ImagePage.php [02:00:44] Logged the message, Master [02:01:04] !log tstarling synchronized php-1.21wmf7/includes/DefaultSettings.php [02:01:15] Logged the message, Master [02:01:38] !log tstarling synchronized php-1.21wmf7/includes/OutputPage.php [02:01:49] Logged the message, Master [02:01:56] !log tstarling synchronized php-1.21wmf7/includes/Article.php [02:02:06] Logged the message, Master [02:02:12] !log tstarling synchronized php-1.21wmf7/includes/ImagePage.php [02:02:22] Logged the message, Master [02:08:32] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [02:08:33] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [02:08:33] PROBLEM - Puppet freshness on sq48 is CRITICAL: Puppet has not run in the last 10 hours [02:08:33] PROBLEM - Puppet freshness on ms-fe1003 is CRITICAL: Puppet has not run in the last 10 hours [02:08:33] PROBLEM - Puppet freshness on ms-fe1004 is CRITICAL: Puppet has not run in the last 10 hours [02:08:33] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [02:14:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:17:17] New patchset: Tim Starling; "Enable $wgEnableCanonicalServerLink on uzwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42505 [02:20:30] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42505 [02:23:42] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused [02:25:30] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.005 second response time on port 11000 [02:26:41] 
RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.027 seconds [02:28:59] !log LocalisationUpdate completed (1.21wmf6) at Mon Jan 7 02:28:58 UTC 2013 [02:29:10] Logged the message, Master [02:31:03] !log tstarling synchronized wmf-config/InitialiseSettings.php [02:31:12] Logged the message, Master [02:31:38] PROBLEM - Puppet freshness on ms-be11 is CRITICAL: Puppet has not run in the last 10 hours [02:31:40] Are you going to manually purge the Squid cache or let it expire naturally? [02:34:19] New patchset: Ryan Lane; "Adding reactor support" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42506 [02:34:56] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds [02:35:32] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds [02:36:29] New patchset: Ryan Lane; "Adding reactor support" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42506 [02:46:28] Susan: manually purge [02:51:11] !log LocalisationUpdate completed (1.21wmf7) at Mon Jan 7 02:51:10 UTC 2013 [02:51:22] Logged the message, Master [02:52:38] PROBLEM - Puppet freshness on ms-be9 is CRITICAL: Puppet has not run in the last 10 hours [02:57:35] PROBLEM - Puppet freshness on ms-be12 is CRITICAL: Puppet has not run in the last 10 hours [03:17:14] RECOVERY - Puppet freshness on search30 is OK: puppet ran at Mon Jan 7 03:17:06 UTC 2013 [04:08:08] !log tstarling synchronized php-1.21wmf6/maintenance/purgeList.php [04:08:18] Logged the message, Master [04:31:31] PROBLEM - Varnish HTTP upload-backend on cp1034 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:31:31] PROBLEM - Varnish traffic logger on cp1034 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:31:49] PROBLEM - Varnish HTCP daemon on cp1034 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:32:17] PROBLEM - SSH on cp1034 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:32:34] PROBLEM - Varnish HTTP upload-frontend on cp1034 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:53:16] PROBLEM - NTP on cp1034 is CRITICAL: NTP CRITICAL: No response from NTP server [05:56:08] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [06:05:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:07:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.342 seconds [06:43:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:48:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.814 seconds [07:25:17] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:29:02] PROBLEM - Puppet freshness on silver is CRITICAL: Puppet has not run in the last 10 hours [07:29:03] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [07:37:44] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.179 seconds [07:46:14] New review: Lupo; "The idea is to add this here and once it's deployed to remove the corresponding default in UploadWiz..." 
[operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/39026 [07:51:05] PROBLEM - Puppet freshness on ssl1001 is CRITICAL: Puppet has not run in the last 10 hours [08:12:18] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:22:57] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.303 seconds [08:26:19] good morning [08:27:44] Morning [08:30:34] Reedy: have you switched to European timezone? :D [08:36:14] Not quite [08:36:26] For today at least ;) [08:38:04] New patchset: Hashar; "Revert "configure pep8 to lint erb templates"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42516 [08:44:34] !g Ib377b732930da687504c3a78cdf921143c5c52d1 [08:44:34] https://gerrit.wikimedia.org/r/#q,Ib377b732930da687504c3a78cdf921143c5c52d1,n,z [08:51:06] !log Jenkins: update all jobs. That bring whitespace checking for MediaWiki extensions. See {{gerrit|37803}} [08:51:20] Logged the message, Master [08:56:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:10:06] !log Jenkins: whitelisted Bryan Tong Minh in Zuul {{gerrit|39850}} [09:10:19] Logged the message, Master [09:10:39] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.032 seconds [09:17:51] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [09:33:19] New patchset: Nikerabbit; "Updated ttmserver Solr schema" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42519 [09:42:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:45:28] !log Jenkins: EventLogging now receives pep8 linting, {{gerrit|42517}} & {{gerrit|42518}} [09:45:37] Logged the message, Master [09:53:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.252 seconds [10:29:52] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:34:06] New patchset: Hashar; "Revert "Kill static-master"" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42526 [10:40:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.359 seconds [10:47:51] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [10:47:52] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [10:58:18] New patchset: Hashar; "Revert "Kill static-master"" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42530 [10:58:52] New patchset: Hashar; "Revert "Kill static-master"" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42526 [10:58:57] Change abandoned: Hashar; "Dupe of https://gerrit.wikimedia.org/r/#/c/42526/" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42530 [10:59:28] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42526 [11:02:01] New review: Hashar; "Fixed up https://bugzilla.wikimedia.org/show_bug.cgi?id=43692" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42526 [11:07:35] !log pulled on fenari {{gerrit|42526}} 'Revert "Kill static-master"' thought NOT synced. That simply bring back some symbolic links for 'beta' so that is harmful. 
[11:07:45] Logged the message, Master [11:08:14] New review: Hashar; "Pulled on fenari though not synced:" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42526 [11:14:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:20:33] PROBLEM - MySQL Slave Delay on db1007 is CRITICAL: CRIT replication delay 211 seconds [11:21:00] PROBLEM - MySQL Replication Heartbeat on db1007 is CRITICAL: CRIT replication delay 222 seconds [11:28:48] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.034 seconds [11:37:57] RECOVERY - MySQL Slave Delay on db1007 is OK: OK replication delay 0 seconds [11:38:25] RECOVERY - MySQL Replication Heartbeat on db1007 is OK: OK replication delay 0 seconds [12:00:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:09:45] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [12:09:46] PROBLEM - Puppet freshness on ms-fe1004 is CRITICAL: Puppet has not run in the last 10 hours [12:09:46] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [12:09:46] PROBLEM - Puppet freshness on ms-fe1003 is CRITICAL: Puppet has not run in the last 10 hours [12:09:46] PROBLEM - Puppet freshness on sq48 is CRITICAL: Puppet has not run in the last 10 hours [12:09:46] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [12:16:06] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.030 seconds [12:20:51] New patchset: Matthias Mullie; "AFTv5: skip rollbacker group on wikis without that" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42402 [12:22:33] New patchset: Matthias Mullie; "AFT test group permissions have been removed already; these lines no longer make sense" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42538 [12:32:26] PROBLEM - Puppet freshness on ms-be11 is CRITICAL: Puppet has not run in the last 10 hours [12:48:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:53:26] PROBLEM - Puppet freshness on ms-be9 is CRITICAL: Puppet has not run in the last 10 hours [12:58:33] PROBLEM - Puppet freshness on ms-be12 is CRITICAL: Puppet has not run in the last 10 hours [13:00:39] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.533 seconds [13:34:06] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:40:09] New review: Silke Meyer; "Yes, like this, I get what I need." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/42502 [13:44:44] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.715 seconds [14:10:03] !log reedy synchronized wmf-config/ [14:10:16] Logged the message, Master [14:20:12] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:22:27] PROBLEM - Puppet freshness on cp1034 is CRITICAL: Puppet has not run in the last 10 hours [14:29:23] hiii paravoid! [14:29:27] am I on duty this week or next? 
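For context on the squid purge exchange above (after $wgEnableCanonicalServerLink was enabled on uzwiki and purgeList.php was synced): manual purging is normally driven by MediaWiki's maintenance/purgeList.php, which reads URLs from stdin and sends purge requests to the caches. A minimal sketch, assuming the standard mwscript wrapper and an operator-prepared urls.txt:

    # Hypothetical example: purge a hand-picked list of uzwiki URLs from the caches.
    $ cat urls.txt | mwscript purgeList.php --wiki=uzwiki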
[14:31:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.517 seconds [14:32:53] ottomata: you are I think [14:32:58] this week [14:33:08] yeehaw [14:33:19] ok, so, that means, we change the room topic to have my name in it [14:33:22] and people will ask me questions [14:33:31] yes [14:33:31] and I will do my best to answer them, orrrr find someone who can [14:33:35] yes [14:33:40] okey dokey [14:33:57] do you have topic change powers? [14:35:15] you can too [14:37:31] oo, ok [14:39:06] hm, should I also be watching RT for new tickets? [14:41:46] New patchset: Hashar; "(bug 43339) deployment roles for beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42549 [14:53:11] ottomata: sit there mashing F5 [14:53:12] :D [14:56:18] haha, ok [15:05:34] ottomata: Hi there :-] Would you mind merging a tiny change for me please https://gerrit.wikimedia.org/r/#/c/42516/ ? That is a revert of a commit that made the python linter pep8 to lint our ERB templates. That does not work as expected so should be removed :-] [15:05:56] haha, certainly, PEP8 for ruby? [15:06:07] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:06:07] isn't pep just a python thing? [15:06:20] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42516 [15:06:40] ottomata: I thought it was smart enough to parse erb templates that hold python code :-] [15:06:51] thanks! [15:06:52] ah [15:06:53] yup! [15:06:56] merged on sockpuppet too [15:07:53] !! [15:10:23] New patchset: Ottomata; "Adding Orange Morocco Wikipedia Zero filter on oxygen." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42552 [15:12:34] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42552 [15:18:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.363 seconds [15:52:00] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:57:24] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [16:06:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.028 seconds [16:21:40] New patchset: Ottomata; "Fixing undefined variable error for Redis module when $redis_replication was not set." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42566 [16:24:16] New patchset: ArielGlenn; "quickie interwiki setup tool for folks with local copies of our dumps" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/42567 [16:27:28] "This will be fixed or classified as not an issue as soon as the author finds out if the multiple values for one key is a bug in the original file or a feature.", hehe [16:37:35] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/42567 [16:37:54] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:50:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.166 seconds [16:58:14] AaronSchulz: ping? [16:58:25] AaronSchulz: how temporary are the contents of "temp" containers? [16:58:45] AaronSchulz: or to ask exactly what I'm looking for: do we need to sync these across to eqiad? 
[17:00:27] AaronSchulz: I'm seeing something like 15% of our storage being temp containers, which I find kind of strange [17:21:37] sbernardin: are you in data center? [17:22:19] Will be back after lunch? [17:22:57] Cmjohnson1: or if you need something now ...I can head back [17:23:46] ms-be5 needs to come off the rack and a new 720xd racked and cfg'd asap [17:25:54] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:27:32] New patchset: Ottomata; "Adding stats user to analytics nodes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42572 [17:28:20] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42572 [17:30:25] PROBLEM - Puppet freshness on silver is CRITICAL: Puppet has not run in the last 10 hours [17:30:25] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [17:31:25] cmjohnson1: will get that done right away...will message you back when I'm done [17:31:52] okay..the ticket will be in your queue [17:36:42] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.735 seconds [17:39:15] PROBLEM - Apache HTTP on srv219 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:39:51] PROBLEM - LVS HTTP IPv4 on rendering.svc.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:39:52] PROBLEM - Apache HTTP on srv221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:39:52] PROBLEM - Apache HTTP on srv222 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:39:52] PROBLEM - Apache HTTP on srv224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:41:08] probably hung converts, shall I shoot em? [17:41:12] PROBLEM - Apache HTTP on srv223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:41:13] paravoid: they are there until they get cleared out by some script and I don't think it's on cron [17:41:30] PROBLEM - Apache HTTP on srv220 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:41:53] doing so [17:41:58] done [17:42:53] RECOVERY - Apache HTTP on srv223 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.057 second response time [17:42:59] ah just when I was about to sweat [17:43:09] RECOVERY - Apache HTTP on srv220 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.976 second response time [17:43:19] RECOVERY - Apache HTTP on srv222 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.213 second response time [17:43:19] yeah it's recovering [17:43:22] memory spike again [17:43:27] RECOVERY - Apache HTTP on srv224 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.701 second response time [17:43:29] I can't wait for j^'s patches [17:43:42] AaronSchulz: can we do that? 
[17:44:14] AaronSchulz: considering we're copying data from pmtpa to eqiad like crazy, getting rid of 15% of our storage wouldn't hurt us at all [17:44:26] AaronSchulz: it'd probably help with the c2100 replacements too [17:45:54] also as more and more uploads go throw UW it will consume more space [17:46:26] awwww crap [17:46:32] I know why imagescalers are freaked out [17:47:17] http://ganglia.wikimedia.org/latest/graph.php?h=ms-fe1.pmtpa.wmnet&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2&st=1357580821&g=network_report&z=medium&c=Swift%20pmtpa [17:47:55] that's a filled up gigabit [17:48:19] wth is wrong with varnish [17:48:42] RECOVERY - Apache HTTP on srv221 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.465 second response time [17:48:43] PROBLEM - Apache HTTP on srv220 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:48:43] RECOVERY - LVS HTTP IPv4 on rendering.svc.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 67674 bytes in 7.558 seconds [17:48:56] paravoid: wow... that's not awesome [17:49:20] I've noticed that 2h ago [17:49:25] the varnish issue [17:50:21] RECOVERY - Apache HTTP on srv220 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.518 second response time [17:51:05] hmm, I see quite a bit of traffic from imgscalers, that might be what brought swift to its knees [17:51:42] RECOVERY - Apache HTTP on srv219 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.058 second response time [17:52:27] PROBLEM - Puppet freshness on ssl1001 is CRITICAL: Puppet has not run in the last 10 hours [17:52:49] 112M tiffs [17:52:51] ffs [17:53:01] and increasing [17:57:14] Constitution_of_the_United_States,_page_2.tif [17:57:42] the page has thumbnails for 4 pages [17:57:45] of 130MB each [17:57:46] how nice [17:57:50] wooo! [17:57:54] but, this is good to know now [17:58:01] as video will only make things worse.... [17:58:14] is asher around? [17:58:20] he's not in the office [17:59:18] the gigabit cutoff is getting better, but we need to fix whatever it is that makes varnish not cache enough [17:59:46] maybe we have dont_cache_enough=1 set? [17:59:49] ;) [18:05:03] New review: Demon; "On second thought--since this class is only included on fenari, perhaps we could just delete all of ..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/41976 [18:08:38] <^demon> AaronSchulz: I think I'm done rewriting ExtDist now. It fetches the archive and sha1 names properly now, and validates if the branch/tag exists. [18:09:15] yeah, I'll get to it after looking through a few things [18:09:21] <^demon> k, thanks :) [18:09:43] <^demon> We might want to increase cache duration--that was my only concern since we have to make 2 http requests to get all of the data. [18:09:55] <^demon> (Granted, this is ExtDist, so it's not like a billion people use it) [18:10:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:13:39] ^demon: more people would be using it if only there was ever been a time when it worked [18:13:47] *had only been [18:13:59] <^demon> The new version works quite well. [18:14:04] <^demon> I think people will like it. [18:14:13] * Nemo_bis surprised [18:14:24] <^demon> I'm writing an e-mail to wikitech-l about it now. [18:14:59] ooh maybe we'll even be able to readd it to {{extension}} :D [18:15:11] ^demon: do we need squid changes? 
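The cleanup script alluded to above (for the swift temp containers that hold upload-stash files) is presumably MediaWiki's cleanupUploadStash.php, the same maintenance script that change 42600, "Periodically prune the upload stash rows/files", later puts on a schedule. A rough sketch of a manual run, assuming the mwscript wrapper; the foreachwiki sweep is an assumption about the available tooling:

    # Prune stale upload-stash rows and their temp files on a single wiki:
    $ mwscript cleanupUploadStash.php --wiki=commonswiki

    # Hypothetical sweep across all wikis, if a foreachwiki-style wrapper is available:
    $ foreachwiki cleanupUploadStash.php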
[18:15:25] good because I am tired of every couple days having to svn clean and chown over there [18:15:29] it was getting old [18:23:58] <^demon> paravoid: Oh man...apaches can't make external http requests can they? [18:24:20] they can't without setting a proxy, no [18:24:23] why? [18:24:33] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.022 seconds [18:24:38] <^demon> The extension makes some requests to the github api. [18:24:46] oh? [18:24:50] <^demon> The new version. [18:25:23] look at how the Flickr plugin does it [18:25:38] upload by url [18:27:17] and also be sure that the GitHub API's terms of service are trivial and if they're not check with legal? :) [18:27:48] <^demon> Well, luckily it's configurable, we can point it at any dumb service that can respond like github's api. [18:27:58] brb, lunchy time! [18:39:41] !log reedy synchronized php-1.21wmf6/extensions/EducationProgram [18:39:51] Logged the message, Master [18:40:42] !log reedy synchronized php-1.21wmf7/extensions/EducationProgram [18:40:52] Logged the message, Master [18:49:33] New review: Pyoungmeister; "I have no opinion on this, but will deploy when you are happy with it." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/42519 [18:56:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:07:13] ottomata: ping? [19:07:21] yeah, i know [19:07:27] hehe [19:07:31] i wish I was there! there's a big ol' analytics meeting right now with sue [19:07:32] that I have to go to [19:07:37] oh, okay [19:07:53] i tried to get them to make it at a different time, but there are like 15 people in this meeting, so hard to change [19:08:05] i will open up etherpad though [19:08:15] chat to me if I need to chime in on anything [19:08:49] great work on RT btw [19:08:57] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.356 seconds [19:09:23] binasher: I have an important varnish-related(?) issue for you when you have the time... [19:09:43] oh, great [19:10:08] heh [19:10:25] We interrupt this program with an important varnish-related announcement.. [19:12:26] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42566 [19:14:52] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: fishbowl, private and closed to wmf7 [19:15:02] Logged the message, Master [19:19:28] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [19:28:56] sbernardin: are you back yet? [19:29:27] ottomata: ori-l ./slot4/00015.stat1.wikimedia.org._a_eventlogging.0 [19:29:39] that means i see backups on tridge [19:29:44] mutante: woot. much much obliged. [19:29:55] yw [19:29:55] cool! [19:30:15] while you are at it, hook up a few 1PB disks so we can back up /a too [19:30:37] it's not about disk space [19:30:45] oh, its not? [19:30:46] it's about the size of virtual tapes [19:30:49] no, never been [19:30:51] oh [19:31:00] hook up a few 1PB tapes then? :p [19:31:17] configure amanda to let a backup span multiple virtual tapes [19:31:26] link is on ticket, heh [19:31:30] oh yeah that's right, i remembrer [19:31:50] or.. 
just add those directories separately i guess [19:31:52] hmmm, so if I made a bunch of /a/ backups, for each one we actually want to back up [19:31:54] yeahhhhh [19:31:57] like we just did with eventloggging [19:32:02] we just had the same idea :) [19:32:03] yep [19:32:13] cool, bumping that up on my todo list then [19:32:19] :) [19:35:43] notpeter, we've got problem: [19:35:45] maxsem@fenari:~$ curl 'http://solr1001:8983/solr/select/?q=*%3A*&version=2.2&start=0&rows=5&indent=on' [19:35:45] curl: (52) Empty reply from server [19:36:22] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: special, wikinews and wikivoyages to 1.21wmf7 [19:36:32] Logged the message, Master [19:37:15] MaxSem: I can try restarting it [19:37:22] oh, no, I got a response [19:37:28] it was just slow as shit [19:37:32] I get it sometimes too [19:37:48] the server looks absolutely not under load [19:38:04] sometimes, it returns quickly [19:38:07] sometimes, not [19:38:14] sometimes, it times out [19:42:10] paravoid: can you write up a full description of the upload varnish issue? [19:42:17] yes [19:42:17] sec [19:43:34] binasher: so, it looks like that after about Christmas the requests to swift have been quadrupled [19:43:39] MaxSem: ok, the box looks alright. but I odn't really know the internals of solr. what can I do to help? [19:44:15] maybem there's something in the logs? [19:44:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:44:47] I would've disabled updates, too - but there's another deployment window ATM [19:45:19] it's lock contention [19:45:27] I will copy the logs to your home dir on fenari [19:46:42] binasher: bandwidth has went up from ~300mbps per box to almost a gigabit [19:47:13] MaxSem: they should be there now [19:47:27] but yeah.... hella stack traces that are all lock times [19:47:31] paravoid: that's bandwidth on ms-fe hosts? 
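On the Amanda thread above: "just add those directories separately" means giving each /a subdirectory its own disklist entry, so that no single dump outgrows a virtual tape. A rough sketch; the config name, dumptype and directory names below are placeholders, not the real values:

    # Hypothetical DLEs, one per /a subdirectory on stat1:
    $ printf '%s\n' \
        'stat1.wikimedia.org /a/some-dir-1 comp-user-tar' \
        'stat1.wikimedia.org /a/some-dir-2 comp-user-tar' \
        >> /etc/amanda/wikimedia/disklist
    $ amcheck wikimedia   # sanity-check the config before the next scheduled run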
[19:47:33] *timeoutes [19:48:08] binasher: heh I'm good at multitasking but not _that_ good :) [19:48:12] notpeter, thanks [19:48:19] so, yeah, that's to ms-fe boxes [19:48:43] the time we had the imagescaler pages today we had surpassed the gigabit (straight line on the gigabit threshold) [19:49:01] I did a few GETs and objects seem to expire after < 10mins [19:49:22] we now have ~1.6kreq/s up from 300-400 in something like two weeks [19:49:36] notpeter, whee - looks like it's because wikis other than en: are now using it too [19:49:49] I was looking at the git log, saw your 40x TTL changes and re-reviewed them, couldn't see anything wrong [19:49:58] they seem to match date-wise though [19:50:27] binasher: look at ms-fe1.pmtpa.wmnet month view in ganglia [19:51:19] so I guess I'll have to either switch to updates via cron or introduce multicore earlier than I expected: https://gerrit.wikimedia.org/r/#/c/29827/ [19:51:34] multicore needs more investigation though [19:51:53] binasher: a random upload varnish in ganglia also shows different trends the past two weeks [19:52:40] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikibooks and wikquote to 1.21wmf7 [19:52:50] Logged the message, Master [19:53:20] binasher: 354307ddadb0e3f74b23c6979d99f23bb7359d2b could explain it, but I don't see anything wrong with it [19:53:47] (well, it should be < 400, rather than <= 400, but still, that's not it) [19:54:23] that came to mind as the the only thing i know of changing around then [19:54:57] New patchset: MaxSem; "Enable Solr distributed search" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42594 [19:55:06] yeah [19:55:09] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.983 seconds [19:55:19] there's also the SHM change these days but I can't see how's related either [19:55:45] MaxSem: sounds reasonable [19:56:32] paravoid: i'll dig in after this meeting [19:57:02] binasher: thanks [19:57:14] I haven't had much time to debug this, but I think I can find some time today as well [19:57:22] do you confirm the findings some far? [19:57:28] so far even [19:58:07] New patchset: MaxSem; "Disable GeoData jobs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42595 [19:59:04] paravoid: yeah.. but what does today mean for you, it's 21:00 for you, isn't it? [19:59:33] it's 22:00, but I'll work late too today, I need to catch up with ceph people too [20:02:27] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikisource, wikiversity and remaining wikimedia to 1.21wmf7 [20:02:37] Logged the message, Master [20:03:11] !log authdns-update to add tellurium.wikimedia.org [20:03:21] Logged the message, Master [20:03:49] New patchset: Hashar; "(bug 43339) deployment roles for beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42549 [20:04:42] New review: Hashar; "Patchset 2:" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/42549 [20:06:59] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wiktionary to 1.21wmf7 [20:07:08] Logged the message, Master [20:07:54] New patchset: Reedy; "1.21wmf7 phase 2 deployment" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42598 [20:08:46] New review: Ryan Lane; "Some problems here." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42549 [20:12:17] Reedy, can I squeeze in a quick configuration change? 
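A handy way to quantify the intermittent Solr slowness MaxSem reports above (empty replies, occasional timeouts, lock contention in the logs) is to repeat the same select query with curl's timing output and a hard timeout, so the stalls show up as measurable outliers. A small sketch using the query from the log with a 15-second cap:

    # Probe the select handler 20 times; print HTTP status and total time per request.
    $ for i in $(seq 1 20); do
        curl -s -o /dev/null -m 15 -w '%{http_code} %{time_total}s\n' \
          'http://solr1001:8983/solr/select/?q=*%3A*&version=2.2&start=0&rows=5&indent=on'
      done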
[20:12:30] Yeah, I think I'm done.. [20:12:48] thanks! [20:13:13] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42595 [20:15:05] Reedy, there's an undeployed wikiversions.dat change in master [20:15:13] paravoid: was looking at the backend instance on cp1027, it appeared not to cache anything beyond 10 seconds [20:15:14] No there isn't [20:15:23] It's deployed, it's committed [20:15:24] binasher: yeah [20:15:27] it's just not approved and updated back in [20:15:43] paravoid: restarted it with no config change, and now its behaving normally [20:15:50] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42598 [20:15:57] the object i was testing with has been cached for > 2min (since restarting) [20:15:57] 21:49 < paravoid> I did a few GETs and objects seem to expire after < 10mins [20:16:06] Reedy, git diff HEAD origin disagrees:) [20:16:16] It's not undeployed [20:16:17] secs vs. mins heh [20:16:20] paravoid: going directly to the backend instances, i got expire at 10 seconds [20:16:21] It's perfectly well deployed [20:16:22] aha [20:16:23] Anyway, fixed [20:16:40] frontends are behaving differently, possibly normally [20:16:40] New patchset: Hashar; "(bug 43339) deployment roles for beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42549 [20:17:07] binasher: yeah, I saw 10 *secs* too [20:17:12] sorry for the confusion [20:17:17] ah, ok [20:17:46] still seeing it on cp1024 [20:17:49] !log maxsem synchronized wmf-config/InitialiseSettings.php 'Disable GeoData jobs' [20:17:58] Logged the message, Master [20:18:31] binasher: cp1024 cache hit is ~21%, cp1027 is now ~80% [20:18:40] cp1026 doesn't have to prob, but its backed varnishd has only been running since jan 3 [20:18:48] Reedy, thanks - I'm done [20:19:32] cp1026 hitrate is 97%(!) [20:19:38] (all 10sec ones) [20:19:49] doing a vcl reload on cp1024 didn't help, now did a restart [20:20:08] and now cp1024 looks ok [20:20:10] yep, seeing the hitrate increase [20:20:12] hrm [20:20:34] you can't complain, I always come to you with somewhat peculiar issues, don't I [20:20:38] notpeter, now that I've disabled updates, it immediately resumed working [20:21:55] binasher: cp1026 was the one that I restarted when we were both in a meeting [20:22:03] paravoid: i wonder if this is a bug in the persistent backend [20:22:46] do you think the config reload of yours was a coincidence? [20:23:02] possibly not [20:23:05] it might have triggered this somehow [20:23:12] yeah [20:24:00] you should probably !log the restarts btw [20:24:26] seems to be a bug considering a vcl reload didn't fix the behavior, but maybe one that isn't easily encountered [20:24:35] nod [20:24:46] paravoid: want to just restart all of the upload backends? [20:26:00] shall we not debug this further? [20:26:49] New review: Umherirrender; "Change by Matthias Mullie looks good to me (but I will not vote on my own commit)" [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/42402 [20:28:55] MaxSem: gotcha [20:28:56] ok [20:28:56] i wish vcl.show showed the entire running vcl / followed includes [20:29:19] now I'm out of a meeting. where do things stand? do you need any help from me? still want to deploy today? 
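The "expires after 10 seconds" observations above come from fetching the same object twice from a backend varnish and watching the response headers: on a healthy instance the Age header keeps climbing between hits, on the broken ones it resets every ~10 seconds. A sketch of that check; the URL, the backend port and the presence of an X-Cache header are assumptions here:

    # Fetch the same thumbnail twice from one backend and compare Age / X-Cache:
    $ URL='http://cp1027.eqiad.wmnet:3128/wikipedia/commons/thumb/x/xx/Example.jpg/220px-Example.jpg'
    $ curl -s -o /dev/null -D - -H 'Host: upload.wikimedia.org' "$URL" | grep -iE '^(age|x-cache):'
    $ sleep 30
    $ curl -s -o /dev/null -D - -H 'Host: upload.wikimedia.org' "$URL" | grep -iE '^(age|x-cache):'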
[20:30:51] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:31:45] PROBLEM - MySQL Slave Delay on db1007 is CRITICAL: CRIT replication delay 201 seconds [20:32:39] PROBLEM - MySQL Replication Heartbeat on db1007 is CRITICAL: CRIT replication delay 209 seconds [20:32:41] notpeter, yes please - merge https://gerrit.wikimedia.org/r/#/c/40569/ and https://gerrit.wikimedia.org/r/#/c/42112/ then force a puppet run on solr1001 then on other hosts [20:34:27] RECOVERY - MySQL Replication Heartbeat on db1007 is OK: OK replication delay 0 seconds [20:35:21] RECOVERY - MySQL Slave Delay on db1007 is OK: OK replication delay 0 seconds [20:35:42] paravoid: on cp1023, it looks like *every* allocation request to SMP.main-sda3 is failing [20:36:25] paravoid: and shortlived 10.000000 [s] [20:36:49] can't save object -> transient storage with the shortlived ttl [20:39:44] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40569 [20:39:52] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42112 [20:40:32] binasher: can I merge your redis stuff? [20:41:01] yup [20:41:07] ok [20:42:23] MaxSem: ok, puppet is currently in its dead state. I'll make sure puppet runs on solr1001-1003 asap [20:42:33] New patchset: Aaron Schulz; "Periodically prune the upload stash rows/files." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42600 [20:43:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.037 seconds [20:44:03] RECOVERY - Puppet freshness on stat1 is OK: puppet ran at Mon Jan 7 20:43:33 UTC 2013 [20:44:28] maplebed: sup [20:44:31] yo [20:44:34] how are you? [20:44:43] good! came by with a question though... [20:44:49] sure [20:44:52] what's up with the 403 on http://ganglia.wikimedia.org/latest/ ? [20:45:05] maplebed, ZOMGXSS [20:45:10] lol [20:46:09] so if you're not with the cabal, you're not allowed to hack it [20:46:32] MaxSem: ok, puppet has run. should be live now [20:46:37] how shall we test? [20:46:45] PROBLEM - Host ms-be5 is DOWN: PING CRITICAL - Packet loss = 100% [20:47:30] notpeter, curl http://solr1001:8983/solr/admin/cores?action=STATUS [20:47:36] !log replacing ms-be5 with R720xd [20:47:46] Logged the message, Master [20:47:52] shows that the slave has as many docs as master does [20:48:24] notpeter, you've dep;oyed it on ly to solr100x? [20:48:36] yeah [20:48:51] I'm not actually sure that solr [1-3] is up... [20:48:52] ok, I'll need to tweak my MW config [20:48:52] lemme check [20:49:04] notpeter, only solr1 seems down [20:49:08] ah [20:49:09] ok [20:49:28] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [20:49:29] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [20:49:41] forcing puppet runs on 2 and 3 now [20:49:45] will see what's up with 1 [20:49:53] looks like it just never had puppet run on it... [20:50:30] yeah [20:50:31] lame [20:50:35] lemme fire it up now [20:51:35] binasher: sorry, was in a meeting [20:52:18] Solr's insta-replication: https://ganglia.wikimedia.org/latest/graph.php?h=solr1001.eqiad.wmnet&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2&st=1357591872&g=network_report&z=medium&c=Miscellaneous%20eqiad [20:52:55] pushin' mad bits, yo [20:53:37] MaxSem: woo! [20:55:36] MaxSem: ok, solr1 should also now actually be up..... [20:55:43] need anything else from me? [20:56:30] notpeter, not yet. 
I'll run a load test [20:56:38] ok, cool [21:00:55] New patchset: MaxSem; "Enable Solr distributed search" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42594 [21:01:09] PROBLEM - SSH on lvs6 is CRITICAL: Server answer: [21:01:38] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42594 [21:02:49] RECOVERY - SSH on lvs6 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [21:05:34] paravoid: i managed to crash the main varnishd worker process on cp1023 , the parent respawned it, and that also fixed the storage allocation failures + caching issue [21:07:12] at this point, i'd just go for restarting all of the backends unless you want to troubleshoot further [21:08:44] !log maxsem synchronized wmf-config/CommonSettings.php 'Multi-server solr' [21:08:55] Logged the message, Master [21:10:45] RECOVERY - Puppet freshness on silver is OK: puppet ran at Mon Jan 7 21:10:34 UTC 2013 [21:11:30] RECOVERY - Puppet freshness on zhen is OK: puppet ran at Mon Jan 7 21:11:18 UTC 2013 [21:14:12] New review: Nikerabbit; "It would be great if you could merge & deploy this before the i18n deployment window tomorrow." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/42519 [21:17:12] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:18:42] !log jenkins: splitting whitespace check out of mediawiki linting job. Will make whitespace a non voting change to avoid rejecting legitimate changes. [21:18:53] Logged the message, Master [21:23:37] hmm, why do Solr servers in pmtpa and eqiad have a different number of cores? [21:28:12] MaxSem: probably because they were purchased at different times ? [21:28:27] mhm, I thought they were identical [21:28:29] pmtpa is oool [21:28:31] d [21:28:46] oh nm, solr1 is pretty ne [21:28:47] new [21:29:28] maybe, HT is not enabled there? [21:29:30] lesliecarr: yep...looking at it now...if i recall they were ordered at the same time [21:29:45] maxsem: that is my thought as well [21:29:49] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.056 seconds [21:30:13] that's bizarre .... [21:30:33] maybe -- because on solr1 it has a mhz of 2400, but on solr1001 each core is showing 1200 [21:30:41] eek [21:31:00] cpu throttling for power saving? 
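What binasher describes on cp1023 above is the failure mode where the persistent storage silo (SMP.main-sda3) stops accepting allocations, so every object falls through to Transient storage and lives only for the "shortlived" parameter, 10 seconds by default, which matches the 10-second expiry seen earlier. A rough sketch of spotting it from the shell; exact counter names vary by Varnish version, so treat them as indicative:

    # Current value of the shortlived parameter (TTL threshold for Transient storage);
    # may need -T/-S arguments depending on how the instance was started:
    $ varnishadm param.show shortlived

    # Allocation failures on the persistent silo plus traffic piling into Transient
    # is the tell-tale combination:
    $ varnishstat -1 | grep -E 'SMP\.|Transient'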
[21:31:02] New patchset: MaxSem; "Reduce the load on Solr servers in pmtpa" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42673 [21:32:22] New patchset: MaxSem; "Reduce the load on Solr servers in pmtpa" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42673 [21:32:56] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42673 [21:34:22] !log maxsem synchronized wmf-config/CommonSettings.php 'Reduce the load on Solr servers in pmtpa' [21:34:33] Logged the message, Master [21:39:44] New patchset: MaxSem; "pmtpa is even slower" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42676 [21:40:54] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42676 [21:42:05] !log maxsem synchronized wmf-config/CommonSettings.php [21:42:15] Logged the message, Master [21:43:09] !log restarting backend varnish on cp1030 (previously restarted cp102[346]) [21:43:19] Logged the message, Master [21:46:20] so, turns out that solr1-3 is overall 4 times slower than solr1001-1003 [21:47:12] Nikerabbit: hey, I'm going to merge and push your solr schema change now [21:47:15] MaxSem: boo. [21:47:27] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42519 [21:47:39] niiice [21:49:20] notpeter, other than the HW problem, everything looks good so far, 80 concurrent requests generate a ~10% load on the cluster. thanks a lot for your help [21:50:08] defniitely! [21:50:11] there's also https://gerrit.wikimedia.org/r/#/c/40304/ which is related to Solr, but it's not urgent [21:50:51] MaxSem: sure, i'll just do that now [21:51:01] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40304 [21:51:10] done [21:51:15] !log restarted backend varnish on cp102[27]. cp103[0-6] [21:51:19] thanks [21:51:25] yep! no problem [21:51:26] Logged the message, Master [21:52:50] notpeter: cool, thanks [21:54:01] Nikerabbit: sure! no problem [21:56:35] New patchset: Cmjohnson; "Changing mac for ms-be5 |replaced h/w" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42679 [21:57:52] New review: Spage; "Our labs instance piramido doesn't seem to have any role_requires or config_lines from puppet so it'..." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/42502 [21:57:53] Change merged: Cmjohnson; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42679 [21:58:07] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42502 [21:58:08] New review: Hashar; "Please avoid changing unrelated lines, even if it is just a comment. That add some extra steps when ..." [operations/mediawiki-config] (master); V: 0 C: -1; - https://gerrit.wikimedia.org/r/42232 [22:00:12] cmjohnson1: Shall I merge your change onto sockpuppet since I'm about to merge mine? 
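The core-count and clock-speed puzzle above (24 vs 12 cores, 2400 vs 1200 MHz on otherwise identical Solr hardware) separates into two independent questions, hyper-threading and frequency scaling, both of which can be checked from the shell before anyone reboots into the BIOS:

    # Logical CPUs, threads per core (HT) and sockets at a glance:
    $ lscpu | grep -E '^(CPU\(s\)|Thread|Core|Socket)'

    # If "siblings" is twice "cpu cores", hyper-threading is enabled:
    $ grep -E 'siblings|cpu cores' /proc/cpuinfo | sort -u

    # 1200 MHz with an "ondemand" governor at idle is power saving, not a slower part:
    $ grep 'cpu MHz' /proc/cpuinfo | sort -u
    $ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor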
[22:00:28] sure [22:00:58] Hm, someone beat me to it [22:01:33] thought I had already merged it b4 [22:01:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:01:52] maybe just lag...didn't see your change though [22:01:53] Probably you merged yours & mine in the time between when I noticed and asked [22:11:30] PROBLEM - Puppet freshness on sq48 is CRITICAL: Puppet has not run in the last 10 hours [22:11:31] PROBLEM - Puppet freshness on ms-fe1004 is CRITICAL: Puppet has not run in the last 10 hours [22:11:31] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [22:11:31] PROBLEM - Puppet freshness on ms-fe1003 is CRITICAL: Puppet has not run in the last 10 hours [22:11:31] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [22:11:31] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [22:11:44] can't ssh into locke and in a meeting.. can someone take a look? [22:13:49] New patchset: Pyoungmeister; "testing: db61 and db62 to s2 for testing" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42683 [22:14:09] binasher: works for me [22:14:20] notpeter: what host are you on? [22:14:22] binasher, lcoke? [22:14:24] fenari [22:14:24] works for me too [22:15:44] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42683 [22:15:56] works for me now too.. ok! [22:16:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.050 seconds [22:17:34] cmjohnson1, shall I create a RT ticket about solr1-3? [22:18:27] PROBLEM - Host db62 is DOWN: PING CRITICAL - Packet loss = 100% [22:20:14] RECOVERY - Host db62 is UP: PING WARNING - Packet loss = 73%, RTA = 0.39 ms [22:21:13] maxsem: yes plz [22:22:28] New review: Dereckson; "I'm not sure it's here an issue." [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/42232 [22:27:35] PROBLEM - MySQL Replication Heartbeat on db62 is CRITICAL: Connection refused by host [22:28:11] PROBLEM - SSH on db62 is CRITICAL: Connection refused [22:28:11] PROBLEM - MySQL Slave Delay on db62 is CRITICAL: Connection refused by host [22:28:11] PROBLEM - Full LVS Snapshot on db62 is CRITICAL: Connection refused by host [22:28:12] New patchset: Aaron Schulz; "Set handler for wfShellWikiCmd hook." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42684 [22:28:48] PROBLEM - MySQL Slave Running on db62 is CRITICAL: Connection refused by host [22:28:56] PROBLEM - MySQL Recent Restart on db62 is CRITICAL: Connection refused by host [22:28:56] PROBLEM - MySQL disk space on db62 is CRITICAL: Connection refused by host [22:29:05] PROBLEM - MySQL Idle Transactions on db62 is CRITICAL: Connection refused by host [22:30:14] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42684 [22:30:31] maxsem: can i take a solr server down...i wanna check bios cfg? 
[22:30:47] cmjohnson1, yes [22:31:00] I've created http://rt.wikimedia.org/Ticket/Display.html?id=4282 BTW [22:31:21] RECOVERY - Host ms-be5 is UP: PING OK - Packet loss = 0%, RTA = 3.14 ms [22:31:25] cool...going to go w/solr....i did see it [22:32:52] !log repooling ssl1001 with proxy_http_version 1.1 enabled [22:33:01] Logged the message, Master [22:33:26] PROBLEM - Puppet freshness on ms-be11 is CRITICAL: Puppet has not run in the last 10 hours [22:35:14] PROBLEM - swift-object-replicator on ms-be5 is CRITICAL: Connection refused by host [22:35:15] PROBLEM - swift-container-replicator on ms-be5 is CRITICAL: Connection refused by host [22:35:15] PROBLEM - swift-account-reaper on ms-be5 is CRITICAL: Connection refused by host [22:35:15] !log shutting down solr3 to check bios setup [22:35:25] Logged the message, Master [22:35:41] PROBLEM - swift-container-server on ms-be5 is CRITICAL: Connection refused by host [22:35:42] PROBLEM - swift-object-server on ms-be5 is CRITICAL: Connection refused by host [22:35:42] PROBLEM - swift-object-updater on ms-be5 is CRITICAL: Connection refused by host [22:35:42] PROBLEM - swift-container-updater on ms-be5 is CRITICAL: Connection refused by host [22:35:42] PROBLEM - swift-account-replicator on ms-be5 is CRITICAL: Connection refused by host [22:35:59] PROBLEM - SSH on ms-be5 is CRITICAL: Connection refused [22:36:00] PROBLEM - swift-account-server on ms-be5 is CRITICAL: Connection refused by host [22:36:27] PROBLEM - swift-object-auditor on ms-be5 is CRITICAL: Connection refused by host [22:36:44] PROBLEM - swift-account-auditor on ms-be5 is CRITICAL: Connection refused by host [22:36:45] PROBLEM - swift-container-auditor on ms-be5 is CRITICAL: Connection refused by host [22:37:10] ummm [22:37:17] wtf is going on with swift? [22:37:30] apergos or paravoid you do that? ^ [22:37:36] !log aaron synchronized wmf-config/CommonSettings.php 'set wfWikiShellCmd hook.' [22:37:45] Logged the message, Master [22:37:56] PROBLEM - Host solr3 is DOWN: PING CRITICAL - Packet loss = 100% [22:38:04] !log aaron synchronized php-1.21wmf7/includes/api/ApiUpload.php 'deployed 542bfaef8f4ca61db04bb60cbd1e617c731db038' [22:38:13] Logged the message, Master [22:38:16] I think it's apergos [22:38:24] ms-be5 was scheduled to be replaced today [22:38:43] ah [22:38:43] ok [22:39:01] robh: that is me [22:39:04] or the server is sentient and decided to commit suicide before we kill it [22:39:11] heh [22:39:16] knowing swift, that's it [22:39:31] * Ryan_Lane hates storage [22:39:39] swift per se is not bad in this respect [22:39:42] c2100s on the other hand... [22:39:46] ms-be5 was officially allowed to be killed off today for the new and improved ms-be5 [22:39:49] indeed [22:40:37] New review: Aude; "@hashar you can blame me for both lines. don't know why I put the full bugzilla url and happy they ..." [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/42232 [22:42:28] binasher: oh you did all of them already [22:42:50] binasher: I was halfway into them until I realized it and saw your !log [22:43:05] binasher: thanks a lot [22:44:21] New patchset: Aaron Schulz; "Periodically prune the upload stash rows/files." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42600 [22:45:35] AaronSchulz: speaking of periodic pruning, what should we do about that -temp thing? [22:45:40] AaronSchulz: shall I open a bugzilla? [22:45:48] what thing? 
[22:46:00] cleaning up temp containers [22:46:09] you said there are stale files there because we're lacking a cronjob or something [22:46:28] those are for temp files [22:48:12] *that is [22:48:32] didn't you say before that we're not cleaning those up? [22:48:40] so presumably there are a lot of stale files in there? [22:48:53] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:49:20] PROBLEM - NTP on db62 is CRITICAL: NTP CRITICAL: No response from NTP server [22:49:49] I said the script was not run systematically, but it should clear much of them up [22:51:45] so there is a script already? [22:51:47] that we should run? [22:52:14] yeah, that one :) [22:53:44] r42600 you mean? [22:54:25] New patchset: Aude; "(bug 43630) set autoconfirm count to 10 @ fawiki, per consensus" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42232 [22:54:35] PROBLEM - Puppet freshness on ms-be9 is CRITICAL: Puppet has not run in the last 10 hours [22:56:49] New patchset: Aude; "Harmonize bugzilla links for $wgAutoConfirmCount" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42688 [22:56:50] PROBLEM - Host ms-be5 is DOWN: PING CRITICAL - Packet loss = 100% [22:57:20] !log powercycling cp1034, dead for 18h, evidence of sudden memory exhaustion [22:57:30] Logged the message, Master [22:58:02] RECOVERY - SSH on db62 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [22:58:17] so apparently we had swift running with essentially no caching for about 12 days [22:58:21] interesting [22:58:40] wow [22:58:41] really? [22:59:01] what kind of caching? paging? [22:59:16] yeah, varnish was broken [22:59:31] varnish bug it seems [22:59:33] PROBLEM - Puppet freshness on ms-be12 is CRITICAL: Puppet has not run in the last 10 hours [22:59:38] see backlog, we were discussing it with binasher [22:59:45] he restarted varnishd and that fixed it [23:00:35] paravoid: wasn't hitrate still around 20-40%? [23:00:35] RECOVERY - SSH on cp1034 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [23:00:45] oh right, it was 22% [23:00:51] so not quite no caching, but still [23:00:54] in 2-3 servers I saw [23:01:02] RECOVERY - Varnish HTCP daemon on cp1034 is OK: PROCS OK: 1 process with UID = 997 (varnishhtcpd), args varnishhtcpd worker [23:01:02] yeah, my bad [23:01:06] plus some more from frontend varnishes [23:01:08] that's also puzzling [23:01:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.035 seconds [23:01:29] RECOVERY - Varnish HTTP upload-backend on cp1034 is OK: HTTP OK HTTP/1.1 200 OK - 632 bytes in 0.054 seconds [23:01:37] frontends are always a miss for me [23:01:49] for more than a few seconds [23:01:51] i think some amount of stuff cached before the breakage survived being lru'd.. though lru seemed kinda broken [23:02:03] probably too small of a cache [23:02:41] RECOVERY - Host ms-be5 is UP: PING OK - Packet loss = 0%, RTA = 0.42 ms [23:06:31] binasher: any feelings on keepalive connection settings for nginx, now that we have http 1.1? [23:06:34] http://nginx.org/en/docs/http/ngx_http_upstream_module.html#keepalive [23:06:59] # of connections is per worker process. 
i think we have that number set to something stupidly high [23:07:21] 64 processes [23:07:24] yeah, it was bumped at some point [23:07:31] and is unreasonably high in my opinion as well [23:07:32] accidentally bumped [23:07:39] yes, agreed [23:07:53] I think one or two processes per core is reasonable [23:08:18] nod [23:08:26] actually, I'm going to change that now [23:08:51] +1 [23:09:11] it eats a shit-ton of memory in its current config [23:11:07] I'm going to set it to the processor count [23:13:17] MaxSem: So I am looking at the solr issue with Chris [23:13:25] and the issue of 24 versus 12 cores is due to hyperthreading [23:13:30] usually, we turn it off entirely. [23:13:40] so to match, we would go in and turn off ashburn hyperthreading [23:13:49] (unless solr can make use of it effectively?) [23:14:05] do most things not make proper use of hyperthreading? [23:14:09] no. [23:14:10] RobH, but live testing indicates that the Asburn servers are faster [23:14:14] which is why we tend to turn it off. [23:14:22] 4 times as fast, as the matter of fact [23:14:33] then perhaps it does use HT effectively? [23:14:45] we can make them match and see what it does [23:14:58] 1 worker per real core seems good for nginx [23:15:06] agreed [23:15:07] however, HT would explain only 2x difference [23:15:21] I'm setting it conditionally for the ssl cluster [23:15:25] New patchset: Ryan Lane; "Reduce nginx worker_process count for ssl cluster" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42690 [23:15:34] because mark had some reason to have it set otherwise for another system [23:16:02] MaxSem: So they are identical cpu [23:16:11] iirc i just placed the solr quote order twice [23:16:15] so they should match, comparing now [23:16:19] so, now ssl cluster systems will have 8 processes running. now to think how many keepalives per process.... [23:16:27] are any of the solr ashburn servers in service? [23:16:32] MaxSem: or can i pull one offline for comparison? [23:17:10] yea, identical cpu, identical memory [23:17:19] but i need to compare the bios on a known 'faster' server to solr3 [23:17:28] roughly 1000 backend connections in the ESTABLISHED state, currently, and 5,500 in any state [23:17:36] per system [23:17:54] cmjohnson1: this all looks right to me (solr3) but lets compare to one that MaxSem has the higher performance metrics on. [23:18:15] yep..that is the next logical step [23:19:09] its not the disk controller, both match, cpu matches, memory matches [23:19:11] RECOVERY - MySQL Slave Delay on db62 is OK: OK replication delay seconds [23:19:18] so yea, the next step is compare the bus speed settings in bios. [23:19:29] RECOVERY - MySQL Slave Running on db62 is OK: OK replication [23:19:33] lets see if MaxSem can let us take one in ashburn down. 
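For the nginx sizing discussion above (change 42690): the keepalive count in an upstream block is per worker process, so the total number of idle upstream connections scales as workers times keepalive, which is why dropping from 64 workers to one per physical core matters before picking a keepalive value. A quick sanity check on a box after the change, as a sketch:

    # Workers actually running vs. logical CPUs (halve the CPU count if HT is on):
    $ pgrep -c -f 'nginx: worker process'
    $ grep -c '^processor' /proc/cpuinfo

    # TCP connections grouped by state; compare the ESTABLISHED count with
    # workers x keepalive:
    $ netstat -tan | awk 'NR>2 {print $6}' | sort | uniq -c | sort -rn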
[23:19:47] RECOVERY - MySQL Replication Heartbeat on db62 is OK: OK replication delay seconds [23:19:48] RECOVERY - MySQL disk space on db62 is OK: DISK OK [23:19:48] RECOVERY - MySQL Idle Transactions on db62 is OK: OK longest blocking idle transaction sleeps for seconds [23:20:14] RECOVERY - MySQL Recent Restart on db62 is OK: OK seconds since restart [23:20:33] RECOVERY - Full LVS Snapshot on db62 is OK: OK no full LVM snapshot volumes [23:20:41] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42690 [23:21:38] updated ticket 4282 with details, we are blocked until solr1001/2/3 (just one of them) can come down for bios settting comparison [23:22:09] RobH, yes you can pull any server [23:22:12] RECOVERY - Puppet freshness on cp1034 is OK: puppet ran at Mon Jan 7 23:21:41 UTC 2013 [23:22:21] awesome, pulling down solr1003 [23:22:35] !log solr1003 coming offline for comparison work per rt 4282 [23:22:46] Logged the message, RobH [23:22:46] cmjohnson1: so wanna take back over and do the initial compare? [23:22:47] RECOVERY - Varnish HTTP upload-frontend on cp1034 is OK: HTTP OK HTTP/1.1 200 OK - 641 bytes in 0.061 seconds [23:23:14] RECOVERY - Puppet freshness on ssl1001 is OK: puppet ran at Mon Jan 7 23:23:00 UTC 2013 [23:23:15] RECOVERY - Varnish traffic logger on cp1034 is OK: PROCS OK: 3 processes with command name varnishncsa [23:23:16] cmjohnson1: I logged out of solr3 and solr1003. You should be able to do a normal clean reboot of 1003 and check out the bios [23:23:28] robh: yeah...let me take a look [23:23:30] all yours [23:26:44] !log aaron synchronized php-1.21wmf7/includes/api/ApiUpload.php 'deployed e2df801500f9472f812388006206e3e3261d58d3' [23:26:54] Logged the message, Master [23:27:38] New patchset: Pyoungmeister; "testing: swapping db61 and db62 to coredb" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42696 [23:27:53] PROBLEM - Host solr1003 is DOWN: PING CRITICAL - Packet loss = 100% [23:28:04] AaronSchulz: are more tests of chunked uploading welcome/needed? [23:28:04] maxsem: so you prefer HT? cuz the setup in ashburn has it enabled and tampa disabled (robh) [23:28:29] PROBLEM - NTP on ms-be5 is CRITICAL: NTP CRITICAL: No response from NTP server [23:28:42] cmjohnson1, yes - HT appears to improve Solr performance [23:28:56] Nemo_bis: meh, if you want, I just got done fixing some stuff [23:29:18] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42696 [23:29:22] maxsem: i will fix solr3 now and the other 2 later...i gotta run out [23:29:40] awesome [23:30:32] robh or any help needed ms-be5 is in continuous installer even though it is set to boot once [23:33:26] RECOVERY - Host solr3 is UP: PING WARNING - Packet loss = 28%, RTA = 0.24 ms [23:34:29] RECOVERY - Host solr1003 is UP: PING OK - Packet loss = 0%, RTA = 26.53 ms [23:34:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:37:20] RECOVERY - NTP on db62 is OK: NTP OK: Offset -0.01844382286 secs [23:39:53] PROBLEM - Host ms-be5 is DOWN: PING CRITICAL - Packet loss = 100% [23:45:45] RECOVERY - Host ms-be5 is UP: PING OK - Packet loss = 0%, RTA = 0.41 ms [23:50:43] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.025 seconds [23:53:10] New patchset: Pyoungmeister; "testing: borked my testing. 
[23:53:10] New patchset: Pyoungmeister; "testing: borked my testing. retesting db62" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42700
[23:53:32] PROBLEM - LVS HTTPS IPv4 on wikinews-lb.esams.wikimedia.org is CRITICAL: Connection refused
[23:53:50] PROBLEM - LVS HTTPS IPv6 on wikibooks-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection refused
[23:53:50] PROBLEM - LVS HTTPS IPv6 on wikisource-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection refused
[23:53:50] PROBLEM - HTTPS on ssl3002 is CRITICAL: Connection refused
[23:54:00] PROBLEM - LVS HTTP IPv6 on upload-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection refused
[23:54:00] PROBLEM - LVS HTTP IPv6 on wiktionary-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection refused
[23:54:09] PROBLEM - LVS HTTPS IPv6 on wikiquote-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection refused
[23:54:09] PROBLEM - LVS HTTPS IPv6 on foundation-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection refused
[23:54:09] PROBLEM - LVS HTTPS IPv6 on wiktionary-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection refused
[23:54:17] PROBLEM - LVS HTTPS IPv6 on upload-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection refused
[23:54:18] PROBLEM - LVS HTTPS IPv6 on wikipedia-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection refused
[23:54:21] Ryan_Lane:
[23:54:22] -_-
[23:54:27] PROBLEM - LVS HTTPS IPv6 on bits-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection refused
[23:54:35] !log starting nginx on ssl3002
[23:54:35] PROBLEM - LVS HTTPS IPv6 on mediawiki-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection refused
[23:54:35] PROBLEM - LVS HTTPS IPv6 on wikinews-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection refused
[23:54:44] Logged the message, Master
[23:54:53] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42700
[23:54:58] Ryan_Lane: you should probably apply your fix manually on all the hosts...
[23:55:07] or fix whatever causes it to die I guess :-)
[23:55:08] Ryan_Lane: this is your fault?
[23:55:11] RECOVERY - LVS HTTPS IPv4 on wikinews-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 77088 bytes in 0.776 seconds
[23:55:15] my phone is goin nuts.
[23:55:29] not his fault technically
[23:55:29] RECOVERY - LVS HTTPS IPv6 on wikibooks-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 67662 bytes in 0.803 seconds
[23:55:30] RECOVERY - LVS HTTPS IPv6 on wikisource-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 67802 bytes in 0.796 seconds
[23:55:30] * RobH places an order for tar and feathers on amazon.
[23:55:42] RECOVERY - HTTPS on ssl3002 is OK: OK - Certificate will expire on 08/22/2015 22:23.
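Along the lines of the "apply your fix manually on all the hosts" suggestion above, a hypothetical sweep of the esams ssl hosts; the host names other than ssl3002 are assumed for illustration, and the init-script path may differ:

    # Start nginx on any esams ssl host where it is not running, rather than
    # waiting for the next puppet run to do it. Host list is illustrative.
    for host in ssl3001 ssl3002 ssl3003 ssl3004; do
        ssh "$host" 'pgrep -x nginx >/dev/null || sudo /etc/init.d/nginx start'
    done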
[23:55:47] RECOVERY - LVS HTTP IPv6 on wiktionary-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 67802 bytes in 0.570 seconds
[23:55:48] RECOVERY - LVS HTTP IPv6 on upload-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 637 bytes in 0.231 seconds
[23:55:48] RECOVERY - LVS HTTPS IPv6 on wikiquote-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 67802 bytes in 0.803 seconds
[23:55:48] RECOVERY - LVS HTTPS IPv6 on foundation-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 67799 bytes in 0.796 seconds
[23:55:50] he was the one to trigger the bug in our recipes :)
[23:55:56] RECOVERY - LVS HTTPS IPv6 on upload-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 637 bytes in 0.463 seconds
[23:55:58] RECOVERY - LVS HTTPS IPv6 on wiktionary-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 67799 bytes in 0.811 seconds
[23:55:58] RECOVERY - LVS HTTPS IPv6 on wikipedia-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 67808 bytes in 0.917 seconds
[23:56:25] RECOVERY - LVS HTTPS IPv6 on bits-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 3921 bytes in 0.451 seconds
[23:56:25] RECOVERY - LVS HTTPS IPv6 on mediawiki-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 67803 bytes in 0.804 seconds
[23:56:25] RECOVERY - LVS HTTPS IPv6 on wikinews-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 67800 bytes in 0.805 seconds
[23:56:25] can we blame him anyways? ;)
[23:56:39] why didn't this happen in eqiad?
[23:56:48] they all worked properly after a puppet run ther
[23:56:49] *there
[23:57:44] PROBLEM - Host db62 is DOWN: PING CRITICAL - Packet loss = 100%
[23:58:47] and in pmtpa
[23:58:51] worked there as well
[23:59:34] New patchset: Pyoungmeister; "testing: also assinging to s2" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42701
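Since the eqiad and pmtpa hosts came back cleanly after a puppet run, the equivalent manual step on the affected esams hosts would be forcing an agent run rather than waiting for the schedule. Sketch only; on older agents the command is spelled puppetd --test instead:

    # Trigger an immediate puppet run so the host picks up the merged change.
    sudo puppet agent --test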