[00:03:37] Aaron|home: i tried playing with a copy of wikidatawiki.job on a slave and autocommit off, and got standard locking behavior, not deadlocks
[00:04:04] autocommit off?
[00:04:38] just so i could be slow about having multiple queries at the same time
[00:04:44] from different sessions
[00:08:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:10:31] I see
[00:12:04] there are a lot more JobQueueDB::claim queries than there are inserts
[00:17:15] Aaron|home: 85% of all write queries in the binlog for wikidatawiki are JobQueueDB::claim update queries
[00:18:04] 526803 in the last hour
[00:18:22] but there have only been 131 jobs
[00:19:11] might want to drop the order by job_random too
[00:21:15] hey. robla mentioned on #wikimedia-dev that EventLogging may be implicated in a job queue spike. i don't see _how_ these could be related, but let me know if i can help diagnose this.
[00:21:40] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.209 seconds
[00:24:56] ori-l: Yeah, based on how little php code it has, it seems very unlikely
[00:25:27] It was seemingly (re-)deployed at the same time as job queue runners went mostly on strike
[00:30:25] I can't find the graph where I was able to glean the exact time, but it was 00:30 UTC
[00:30:42] (+/- a minute or so)
[00:30:48] !log maxsem Finished syncing Wikimedia installation... :
[00:31:05] hallelujah
[00:31:05] Logged the message, Master
[00:31:54] MaxSem: scap liturgy :)
[00:34:38] binasher: https://gerrit.wikimedia.org/r/#/c/30927/
[00:35:53] * Aaron|home starts with some small stuff
[00:36:16] Aaron|home: getEmptinessCacheKey is already being set?
[00:36:32] yeah, I just thought it was getting used more for some reason
[00:36:46] pop() sets it when it doesn't find anything too
[00:42:33] !log kaldari synchronized php-1.21wmf2/extensions/CentralNotice 'Update CentralNotice on meta and en.wiki to fix banner override bug'
[00:42:46] Logged the message, Master
[00:45:03] binasher: where did you get 131 from? The inc/dec stats?
[00:47:42] Aaron|home: that's not split by wiki afaik, just from looking at inserts via the binlog
[00:48:49] RECOVERY - mysqld processes on db1013 is OK: PROCS OK: 1 process with command name mysqld
[00:51:58] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours
[00:53:24] New patchset: Jgreen; "configure db78 as fundraising database slave" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30930
[00:54:15] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30930
[00:54:49] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:57:37] !log awjrichards synchronized php-1.21wmf2/extensions/MobileFrontend/javascripts/actions/mf-edit.js 'touch file'
[00:57:51] Logged the message, Master
[00:58:08] !log reedy synchronized php-1.21wmf3/includes/job/JobQueueDB.php
[00:58:21] Logged the message, Master
[00:59:54] Can a Gerrit op abandon this one: https://gerrit.wikimedia.org/r/#/c/23541/4 ?
[01:00:12] It has been superseded by another change. The owner appears unresponsive.
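The claim pattern under discussion can be sketched as follows. This is a hypothetical reconstruction, not the actual MediaWiki JobQueueDB code: column names follow the `job` table, but the exact queries are assumptions. It illustrates how many racing runners over few jobs can produce far more claim UPDATEs than inserts, and where the `ORDER BY job_random` clause suggested for removal sits.

```python
import random
import sqlite3

# Hypothetical sketch of a job-claim pattern (not the real JobQueueDB.php):
# each job row carries a random job_random value, and a runner claims one
# unclaimed row starting from a random offset. With only 131 jobs and many
# runners, most attempts race over the same few rows, generating many more
# claim UPDATEs than job inserts.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE job (
    job_id INTEGER PRIMARY KEY,
    job_cmd TEXT,
    job_random INTEGER,
    job_token TEXT DEFAULT '')""")
for _ in range(131):
    db.execute("INSERT INTO job (job_cmd, job_random) VALUES (?, ?)",
               ("refreshLinks", random.randrange(2 ** 31)))

def claim(token):
    """Try to claim one job; return its job_id, or None on a miss/race."""
    r = random.randrange(2 ** 31)
    # The ORDER BY job_random here is the clause suggested for dropping above.
    row = db.execute(
        """SELECT job_id FROM job
           WHERE job_token = '' AND job_random >= ?
           ORDER BY job_random LIMIT 1""", (r,)).fetchone()
    if row is None:
        return None
    # Claim it; a concurrent runner may have won the race, hence the
    # job_token = '' re-check in the UPDATE.
    cur = db.execute(
        "UPDATE job SET job_token = ? WHERE job_id = ? AND job_token = ''",
        (token, row[0]))
    return row[0] if cur.rowcount else None

claimed = {claim(f"runner-{i}") for i in range(50)}
claimed.discard(None)
```

Under this sketch, each successful claim marks a distinct row, so the number of unclaimed rows drops by exactly the number of successful claims.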
[01:06:37] !log awjrichards synchronized php-1.21wmf3/extensions/MobileFrontend/javascripts/actions/mf-edit.js 'touch file'
[01:06:46] Logged the message, Master
[01:11:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.055 seconds
[01:23:02] !log reedy synchronized wmf-config/InitialiseSettings.php 'Disable TMH everywhere (and mwembed at the same time)'
[01:23:15] Logged the message, Master
[01:41:28] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 229 seconds
[01:43:28] !log reedy synchronized wmf-config/
[01:43:42] Logged the message, Master
[01:44:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:46:25] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 17 seconds
[01:57:40] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.110 seconds
[02:01:01] !log reedy synchronized wmf-config/InitialiseSettings.php 'Re-enable TMH'
[02:01:20] Logged the message, Master
[02:01:25] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 335 seconds
[02:05:07] !log reedy synchronized wmf-config/CommonSettings.php
[02:05:20] Logged the message, Master
[02:11:02] New review: Reedy; "Shouldn't this get an addition to changelog...?" [operations/debs/squid] (master) C: 0; - https://gerrit.wikimedia.org/r/18331
[02:28:50] !log LocalisationUpdate completed (1.21wmf2) at Wed Oct 31 02:28:50 UTC 2012
[02:29:07] Logged the message, Master
[02:32:19] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:38:55] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.049 seconds
[02:47:44] !log LocalisationUpdate completed (1.21wmf3) at Wed Oct 31 02:47:40 UTC 2012
[02:47:54] Logged the message, Master
[03:38:12] New review: Aaron Schulz; "Too bad default_allowed_headers is only in the scope of _init_. I guess it's OK, we know what header..." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/29768
[03:53:23] New patchset: Tim Starling; "Reduce scaler MaxClients to 18" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30939
[03:53:59] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30939
[03:57:28] New review: Tim Starling; "I have reduced the MaxClients on the scalers to account for this in Ia0e1fba6, so it should be possi..." [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/30773
[04:03:05] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 14 seconds
[05:57:38] PROBLEM - Lucene on search1015 is CRITICAL: Connection timed out
[05:58:58] RECOVERY - Lucene on search1015 is OK: TCP OK - 0.031 second response time on port 8123
[05:59:43] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours
[05:59:43] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[05:59:43] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours
[06:09:19] PROBLEM - Lucene on search1015 is CRITICAL: Connection timed out
[06:12:22] RECOVERY - Lucene on search1015 is OK: TCP OK - 0.031 second response time on port 8123
[06:25:19] PROBLEM - Lucene on search1015 is CRITICAL: Connection timed out
[06:26:56] PROBLEM - Lucene on search1003 is CRITICAL: Connection timed out
[06:29:58] RECOVERY - Lucene on search1015 is OK: TCP OK - 3.023 second response time on port 8123
[06:41:47] PROBLEM - Lucene on search1015 is CRITICAL: Connection timed out
[06:43:09] RECOVERY - Lucene on search1015 is OK: TCP OK - 0.033 second response time on port 8123
[06:53:20] PROBLEM - Lucene on search1015 is CRITICAL: Connection timed out
[06:57:14] PROBLEM - LVS Lucene on search-pool1.svc.eqiad.wmnet is CRITICAL: Connection timed out
[06:58:35] RECOVERY - LVS Lucene on search-pool1.svc.eqiad.wmnet is OK: TCP OK - 0.031 second response time on port 8123
[06:59:02] RECOVERY - Lucene on search1015 is OK: TCP OK - 0.031 second response time on port 8123
[07:01:00] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out
[07:02:20] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.031 second response time on port 8123
[07:02:38] PROBLEM - LVS Lucene on search-pool2.svc.eqiad.wmnet is CRITICAL: Connection timed out
[07:05:47] RECOVERY - Lucene on search1003 is OK: TCP OK - 9.034 second response time on port 8123
[07:07:17] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection refused
[07:07:17] RECOVERY - LVS Lucene on search-pool2.svc.eqiad.wmnet is OK: TCP OK - 0.031 second response time on port 8123
[07:08:56] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.031 second response time on port 8123
[07:09:21] !log restarted lucene search on search1015 and search 1003
[07:09:35] Logged the message, Master
[07:55:04] New patchset: Mark Bergsma; "Add pmtpa/eqiad Squid/Varnish servers as esams backends" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30949
[07:55:05] New patchset: Mark Bergsma; "Install cp3003 as Varnish upload" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30950
[07:55:43] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30949
[07:56:31] wooo
[07:56:37] New patchset: Mark Bergsma; "Install cp3003 as Varnish upload" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30950
[07:57:06] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30950
[07:59:13] !log depooling ms-fe3 to upgrade to precise/swift 1.7.4
[07:59:26] Logged the message, Master
[08:02:01] what did you do to raid1-varnish partman recipe?
[08:02:21] New patchset: ArielGlenn; "amazon ec2 instance added to dump mirrors rsync" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30952
[08:02:30] is it broken?
[08:02:36] seems so
[08:02:41] was it broken?
[08:02:41] dammit
[08:02:47] not that I know of
[08:02:55] I just wanted to unify it with the rest
[08:02:59] what's there to unify?
[08:03:02] it works pretty differently
[08:03:32] not really, most of the options are the same
[08:03:40] I also moved it to swap on raid
[08:03:52] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30952
[08:04:06] I can fix it, but I'm guessing you can't wait?
[08:04:14] i'll fix it
[08:04:20] and then will have to reinstall all boxes
[08:04:25] why?
[08:04:33] because they're likely partitioned wrong?
[08:04:42] what's the problem?
[08:04:47] PROBLEM - Host ms-fe3 is DOWN: PING CRITICAL - Packet loss = 100%
[08:04:47] i don't know
[08:04:52] i'll have to figure that out first don't I
[08:05:15] for one thing, the partition is not marked as swap anymore I think
[08:05:19] why do you assume it's wrong? :)
[08:05:24] because puppet runs are failing
[08:05:56] okay, logging in to cp3003 to check
[08:09:07] nevermind, i'll handle it
[08:09:21] nah, I broke it, I should deal with the fallout
[08:09:44] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours
[08:10:28] RECOVERY - Host ms-fe3 is UP: PING OK - Packet loss = 0%, RTA = 1.38 ms
[08:14:04] PROBLEM - Swift HTTP on ms-fe3 is CRITICAL: Connection refused
[08:14:24] PROBLEM - Memcached on ms-fe3 is CRITICAL: Connection refused
[08:14:52] PROBLEM - SSH on ms-fe3 is CRITICAL: Connection refused
[08:16:02] I can't see how it could be my change
[08:16:17] have we installed precise varnish boxes before?
[08:18:31] yes
[08:21:52] ok, let me reinstall cp3003 with the old template
[08:21:59] RECOVERY - SSH on ms-fe3 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[08:23:17] it seems there's no proper xfs filesystem on the partitions?
[08:23:23] yes
[08:23:31] but the recipe has that
[08:23:45] I'm guessing partman ignores it because there's no mountpoint
[08:23:48] but it used to be like that before too, I haven't changed that
[08:24:13] we had & still have "d-i partman-xfs/no_mount_point boolean false" which has been replaced as an option with precise
[08:24:15] PROBLEM - Varnish HTTP upload-backend on cp3003 is CRITICAL: Connection refused
[08:24:18] hmm
[08:24:19] (that's why I asked about precise)
[08:24:27] perhaps daniel fixed them up manually last time
[08:24:32] I'm not sure if *I* have done a precise install myself
[08:24:50] PROBLEM - Varnish HTTP upload-frontend on cp3003 is CRITICAL: Connection refused
[08:25:08] PROBLEM - Varnish traffic logger on cp3003 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa
[08:34:47] PROBLEM - NTP on ms-fe3 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:41:28] okay, d-i logs are useless as always, I'm really reinstalling the server with the old recipe now
[08:44:11] PROBLEM - Host cp3003 is DOWN: PING CRITICAL - Packet loss = 100%
[08:49:53] RECOVERY - Host cp3003 is UP: PING OK - Packet loss = 0%, RTA = 117.04 ms
[08:51:32] RECOVERY - NTP on ms-fe3 is OK: NTP OK: Offset -0.04059898853 secs
[08:52:05] New patchset: J; "Enable TMH transcoding on test2" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/30954
[08:53:48] PROBLEM - SSH on cp3003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:54:05] PROBLEM - Varnish HTCP daemon on cp3003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:55:17] RECOVERY - SSH on cp3003 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[08:58:53] RECOVERY - Swift HTTP on ms-fe3 is OK: HTTP OK HTTP/1.1 200 OK - 397 bytes in 0.511 seconds
[08:59:02] RECOVERY - Memcached on ms-fe3 is OK: TCP OK - 0.006 second response time on port 11211
[09:00:59] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours
[09:03:25] PROBLEM - Host cp3003 is DOWN: PING CRITICAL - Packet loss = 100%
[09:07:13] morning :-)
[09:09:05] PROBLEM - Swift HTTP on ms-fe3 is CRITICAL: Connection refused
[09:09:05] RECOVERY - Host cp3003 is UP: PING OK - Packet loss = 0%, RTA = 117.07 ms
[09:13:26] PROBLEM - SSH on cp3003 is CRITICAL: Connection refused
[09:14:48] sigh, I hate partman
[09:16:11] there's a bug that's bitten me before where if you have previously configured a partition as swap, d-i uses that as swap and then you can't reformat it, put it in an md etc.
[09:16:53] PROBLEM - Host cp3003 is DOWN: PING CRITICAL - Packet loss = 100%
[09:17:29] orilly
[09:22:30] RECOVERY - Swift HTTP on ms-fe3 is OK: HTTP OK HTTP/1.1 200 OK - 397 bytes in 0.018 seconds
[09:26:04] RECOVERY - SSH on cp3003 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[09:26:15] RECOVERY - Host cp3003 is UP: PING OK - Packet loss = 0%, RTA = 117.16 ms
[09:40:19] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours
[09:40:19] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[09:40:19] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[09:47:28] mark: my change was the one that broke it
[09:47:43] but I can't see how
[09:47:49] and I've been looking at bugs that make no sense
[09:50:15] PROBLEM - NTP on cp3003 is CRITICAL: NTP CRITICAL: No response from NTP server
[09:51:06] ok, another attempt
[10:47:16] ffs
[10:49:53] PROBLEM - SSH on cp3003 is CRITICAL: Connection refused
[10:53:20] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours
[10:54:30] hehe
[10:54:50] RECOVERY - SSH on cp3003 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[10:55:57] New patchset: Faidon; "partman: whitespace matters" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30957
[10:56:00] there...
[10:56:12] verified that this fixes it
[10:56:31] really
[10:56:40] but it still prompts, and that's because of the thing I mentioned before with the switch of no_mount_point
[10:56:55] oh well
[10:56:55] so, let me do another run to make sure this fixes that prompt too
[10:56:55] ok
[10:57:15] sorry for taking so long, I had a meet with Bangkok for WP Zero at the same time...
[10:58:08] that's fine, I was running errands ;)
[10:58:24] New review: Faidon; "Tested & works as intended" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/30957
[10:58:25] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30957
[11:00:23] PROBLEM - Host cp3003 is DOWN: PING CRITICAL - Packet loss = 100%
[11:01:30] so I guess I can fix the existing installs by manually doing an mkfs right
[11:01:34] there should be no other changes?
[11:02:28] correct
[11:02:43] ok
[11:02:43] method { format } was no-oped
[11:02:48] so it didn't format the xfs
[11:02:58] as long as the partitioning is fine ;)
[11:03:05] so, the only meaningful change of my commit
[11:03:11] was that swap is now on raid1
[11:03:19] that means less swap actually
[11:03:26] I wanted to ask you about that
[11:03:26] i've heard several times that swap on raid1 does not really work anyway
[11:03:33] failing when one disk fails
[11:03:35] I kept that... :)
[11:03:40] but i've not really verified it
[11:03:45] I've seen it work in the past
[11:03:56] i've had issues in the past when a disk failed too, but never really investigated
[11:03:58] so no opinion
[11:04:08] and I don't care about swap space size much
[11:04:13] should generally be small anyway
[11:04:28] it's good for a system to be able to offload some idle memory, but not more than a little bit
[11:04:41] you mean not 48GB of swap like the original swift template was?
[11:04:44] indeed.
[11:04:46] :P
[11:06:05] let me know when you're ready
[11:06:05] RECOVERY - Host cp3003 is UP: PING OK - Packet loss = 0%, RTA = 117.03 ms
[11:06:52] weird issue with the upload varnish servers
[11:06:59] they're not using more than ~ 32 GB per ssd
[11:07:55] New patchset: Faidon; "partman: get rid of the "no mount point" prompt" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30958
[11:08:28] weird
[11:08:59] so, I've tried to unify the templates into a few variants
[11:09:50] PROBLEM - SSH on cp3003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:09:53] the varnish template is really very similar to the rest of the raid1-* family
[11:10:18] New review: Faidon; "Tested & works as intended" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/30958
[11:10:18] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30958
[11:11:20] RECOVERY - SSH on cp3003 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[11:15:47] hm, what's the protocol for puppet at esams?
[11:15:53] what's the ca_server?
[11:16:12] nevermind, i'll handle that
[11:16:19] there's a proxy to stafford
[11:16:20] I'm sure I've provisioned something at esams before but I don't remember a thing
[11:16:21] but none to sockpuppet
[11:16:28] so I just ship over keys manually for now
[11:16:31] yeah I noticed that
[11:16:33] need to fix that some day
[11:16:35] oh
[11:18:21] haproxy?
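A sketch of the kind of recipe fragment at issue (hypothetical sizes and names, not the actual raid1-varnish file): partman's expert_recipe format is whitespace-sensitive, and a token that doesn't parse exactly is silently skipped rather than rejected, which is consistent with the "method { format } was no-oped" behavior and the "partman: whitespace matters" fix above.

```
# Hypothetical partman fragment; sizes and filesystem layout are made up.
# Tokens such as "method{ format }" must keep their exact spacing; a
# malformed token is silently ignored, so the partition is created but
# never actually formatted.
d-i partman-auto/expert_recipe string            \
    varnish ::                                   \
        1000 5000 -1 xfs                         \
            method{ format } format{ }           \
            use_filesystem{ } filesystem{ xfs }  \
        .
# Older (pre-precise) preseeds also carried this key, which was later
# dropped as an option -- the source of the "no mount point" prompt:
# d-i partman-xfs/no_mount_point boolean false
```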
[11:18:21] hahaha
[11:18:41] soon we'll have an mpls link
[11:19:56] there, running puppet
[11:20:58] I can attempt to fix haproxy so that it directs the puppetca stuff to sockpuppet
[11:21:12] do you think I should do that?
[11:23:00] sure
[11:23:06] if it's an easy fix
[11:23:12] yeah might be http path based right
[11:23:20] i haven't looked at it at all
[11:23:22] it is
[11:23:47] RECOVERY - Varnish HTCP daemon on cp3003 is OK: PROCS OK: 1 process with UID = 997 (varnishhtcpd), args varnishhtcpd worker
[11:24:02] ok, finish off with cp3003 and whatever else is high on your list now
[11:24:15] before I start messing with puppet at esams
[11:24:21] oh go ahead if you want
[11:26:22] <^demon> Morning. You guys have any objections to me restarting gerrit really quick this AM? Checked the deployment schedule and wanted to get it done before SF wakes up.
[11:26:41] go ahead
[11:28:38] <^demon> !log restarted gerrit services on manganese & formey to pick up replication changes
[11:28:51] Logged the message, Master
[11:29:02] oh crap
[11:29:21] <^demon> mark: Thanks, already back up.
[11:29:32] it's not that easy
[11:29:41] then nevermind
[11:29:42] it's not that big a deal
[11:29:43] it's path-based but it's SSL
[11:29:47] right
[11:29:50] with puppetmaster's key, which brewster's obviously doesn't have
[11:30:01] leave it, we're gonna get an mpls link soon
[11:30:09] could use ipv6 too
[11:30:09] okay
[11:30:12] yeah :-)
[11:30:13] i'm thinking of doing that for now for varnish
[11:30:37] hehehe
[11:30:37] <^demon> mark: If you've got a second, I need https://gerrit.wikimedia.org/r/#/c/30796/ merged for gerrit replication.
[11:30:55] ^demon: ah, you asked me yesterday and I forgot about it
[11:30:57] I'm sorry
[11:31:09] <^demon> It's ok. :)
[11:31:14] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30796
[11:31:44] it's live
[11:31:56] <^demon> Thanks!
[11:32:09] mark: wow, I was very confused
[11:32:22] I fetch && diff, saw the diff, ran merge and it said "already up-to-date"
[11:32:29] and I was like wtf
[11:32:36] i'm faster :P
[11:32:36] :P
[11:33:05] ok, I badly need a way to join lists in puppet
[11:33:19] is that in that std module?
[11:33:19] lemme check
[11:34:30] ah there's a flatten function
[11:34:30] that's good
[11:34:35] except it doesn't support a mixture of arrays and hashes
[11:34:46] still useful enough
[11:35:03] !log putting ms-fe3 back in the pool
[11:35:16] Logged the message, Master
[11:36:59] paravoid: hi :-) Will you be around this afternoon to review the Zuul puppet classes ?
[11:37:09] hashar: yes
[11:37:25] you asked me the same and I replied the same yesterday, didn't I? :)
[11:37:33] yeah :-]
[11:37:33] just making sure
[11:37:39] :-)
[11:37:41] ;-]
[11:37:52] since production might crash whenever it wants hehe
[11:41:36] New patchset: Mark Bergsma; "Actually define the backends, not just the directors" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30962
[11:41:57] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30962
[11:47:27] hmm
[11:47:30] how am I gonna get varnish to only use ipv6 now
[11:47:42] probably have to specify ipv6 addresses instead of hostnames
[11:47:58] that's a bit nasty
[11:47:58] but it's temporary
[11:49:29] how temporary, what's the ETA for the MPLS?
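The stdlib function mentioned above takes arrays only, so hashes have to go through `values()` first; that combination is what the later "Use flatten/values" change refers to. A minimal sketch with made-up variable names and hosts:

```puppet
# Sketch with hypothetical data; flatten() and values() come from the
# puppetlabs-stdlib module. flatten() collapses nested arrays but does
# not accept hashes directly, so a hash of backend lists is converted
# with values() first -- hence "flatten/values".
$squid_backends   = ['sq54', 'sq56']
$varnish_backends = { 'eqiad' => ['cp1021', 'cp1036'] }

$all_backends = flatten([$squid_backends, values($varnish_backends)])
# => ['sq54', 'sq56', 'cp1021', 'cp1036']
```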
[11:49:34] (just wondering)
[11:49:38] a few months max
[11:49:42] hmm
[11:49:53] i don't think I want to resolve hostnames inside a varnish erb template ;)
[11:49:54] puppet template
[11:57:12] PROBLEM - Puppet freshness on db51 is CRITICAL: Puppet has not run in the last 10 hours
[12:08:31] New patchset: Mark Bergsma; "Introduce $cluster_tier to distinguish between pmtpa/eqiad and the rest" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30964
[12:08:58] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30964
[12:12:27] New patchset: Mark Bergsma; "Revert "Introduce $cluster_tier to distinguish between pmtpa/eqiad and the rest"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30965
[12:12:42] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30965
[12:15:19] New patchset: Mark Bergsma; "Introduce $cluster_tier to distinguish between pmtpa/eqiad and the rest" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30966
[12:15:54] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30966
[12:17:45] RECOVERY - Varnish HTTP upload-backend on cp3003 is OK: HTTP OK HTTP/1.1 200 OK - 635 bytes in 0.234 seconds
[12:18:04] New patchset: Silke Meyer; "Added puppet files for Wikidata on labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30593
[12:20:35] New review: Silke Meyer; "Used spaces instead of tabs, removed trailing spaces." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/30593
[12:23:02] New patchset: Mark Bergsma; "Revert "Introduce $cluster_tier to distinguish between pmtpa/eqiad and the rest"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30969
[12:23:14] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30969
[12:27:34] New patchset: Mark Bergsma; "Introduce $cluster_tier to distinguish between pmtpa/eqiad and the rest" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30970
[12:28:14] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30970
[12:31:58] New patchset: Mark Bergsma; "Sort directors by name to avoid Puppet flapping" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30971
[12:32:26] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30971
[12:51:19] New patchset: Hashar; "zuul configuration for Wikimedia" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27611
[12:51:20] New patchset: Hashar; "import zuul module from OpenStack" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25235
[12:52:08] New review: Hashar; "PS7 removed some comments about packages. Comments no more apply since gallium got upgraded to Preci..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/25235
[12:53:09] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25235
[12:54:55] New review: Faidon; "Not terribly excited with zuulwikimedia, but can't think of anything better either." [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/27611
[12:54:56] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27611
[13:00:33] New patchset: Mark Bergsma; "Add Varnish backends for esams frontends as well as Squid" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30976
[13:01:08] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30976
[13:02:22] I merged the zuul changes
[13:02:38] thanks !
[13:06:48] New patchset: Mark Bergsma; "Fix backends parameter expression" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30977
[13:07:10] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30977
[13:14:57] RECOVERY - Varnish HTTP upload-frontend on cp3003 is OK: HTTP OK HTTP/1.1 200 OK - 641 bytes in 0.236 seconds
[13:15:24] RECOVERY - Varnish traffic logger on cp3003 is OK: PROCS OK: 3 processes with command name varnishncsa
[13:16:43] New patchset: Mark Bergsma; "Use flatten/values for eqiad frontend's backends as well" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30978
[13:16:46] so where are you going to direct upload varnish on esams?
[13:16:57] eqiad?
[13:17:11] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30978
[13:17:15] or too many hops that way?
[13:17:25] eqiad
[13:17:27] (as in http hops, not network hops)
[13:17:32] eqiad is not behind pmtpa
[13:17:49] er?
[13:17:57] eqiad talks to swift directly
[13:18:01] so what do you mean?
[13:18:43] I meant that that way we do esams (varnish) -> eqiad (varnish) -> pmtpa (swift)
[13:18:53] so we're dependent on all three of them working for Europe
[13:18:55] but I guess that's okay
[13:18:59] that's the same as squid now
[13:19:04] since we can easily switch if something happens
[13:19:08] actually we can't
[13:19:18] because i'll have to rely on ipv6 which only varnish has
[13:19:18] hm?
[13:19:35] varnish @ esams will talk to varnish @ eqiad over ipv6
[13:19:39] because varnish @ eqiad is internal
[13:20:04] the backend varnish you mean?
[13:20:05] yes
[13:20:20] esams frontend -> esams backend -> eqiad backend (v6)
[13:20:22] we could put it in front of the frontend varnish or squid though, couldn't we?
[13:20:38] EPARSE
[13:20:47] hahaha
[13:21:03] esams frontend -> esams backend -> pmtpa frontend?
[13:21:03] ah you mean esams frontend -> esams backend -> eqiad frontend -> eqiad backend
[13:21:07] or that
[13:21:09] yes, considered it
[13:21:11] yeah
[13:21:16] but that's yet another http hop
[13:21:18] also needs changes on the frontend
[13:21:34] so it recognizes it's coming from another varnish instance and treats it not as a normal client
[13:21:37] for header stripping etc
[13:21:44] nasty
[13:21:45] nod
[13:24:48] but we can still easily switch if something happens
[13:24:56] New patchset: Mark Bergsma; "Make the main storage backends use persistent storage" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30979
[13:24:57] by switching DNS to pmtpa :)
[13:25:17] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30979
[13:25:30] btw, is ulsfo going to get a private uplink with eqiad?
[13:25:43] s/with/to/
[13:26:11] yes
[13:26:12] !log depooling ms-fe4 to upgrade to precise/swift 1.7.4
[13:26:19] also mpls
[13:26:25] Logged the message, Master
[13:27:08] !log Restarted cp1021 with persistent storage
[13:27:10] and also transfer transit traffic through there too?
[13:27:13] Logged the message, Master
[13:27:18] limited
[13:27:25] that's nasty, lots of backhauling if we're not careful
[13:27:51] because Ashburn is a large IXP, right?
[13:28:08] more like, we announce our complete prefixes in ulsfo
[13:28:20] and a single hop through our link
[13:28:28] and some networks might only peer with us there and send traffic from NY to ulsfo and back to eqiad
[13:28:33] not what we want ;)
[13:28:48] heh
[13:29:03] * Starting HTTP accelerator [fail]
[13:29:03] sizeof(struct smp_ident) = 112 = 0x70
[13:29:03] sizeof(struct smp_sign) = 40 = 0x28
[13:29:03] sizeof(struct smp_segptr) = 32 = 0x20
[13:29:03] sizeof(struct smp_object) = 56 = 0x38
[13:29:04] CHK(0x7fbd969140b0 SILO 0x7fa4960c4000 ) = 1
[13:29:04] Warning SILO (/srv/sda3/varnish.persist) not reloaded (reason=1)
[13:29:05] CHK(0x7fbd969140b0 SILO 0x7fa4960c4000 SILO) = 5
[13:29:05] Missing errorhandling code in smp_append_sign(), storage_persistent_subr.c line 134:
[13:29:06] Condition((smp_chk_sign(ctx)) == 0) not true.
[13:29:06] errno = 17 (File exists)
[13:29:07] Aborted
[13:29:30] i wouldn't mind if the existing varnish instances would just pick up the persistent storage backend next time they (re)start
[13:29:31] but it actually breaks
[13:30:35] hmm that's easy actually
[13:30:41] rm -f will do the trick now
[13:30:57] varnish will keep open the fd until it exits i'm pretty sure, next time it won't exist
[13:41:48] PROBLEM - Swift HTTP on ms-fe4 is CRITICAL: Connection refused
[13:42:10] PROBLEM - Memcached on ms-fe4 is CRITICAL: Connection refused
[13:42:53] PROBLEM - SSH on ms-fe4 is CRITICAL: Connection refused
[13:44:45] X-Cache: HIT from sq54.wikimedia.org, cp1021 hit (1), cp1036 frontend hit (341), cp3003 miss (0), cp3003 frontend miss (0)
[13:44:46] the horror
[13:44:55] hahahaha
[13:45:11] it does work ;)
[13:45:14] you did it after all :)
[13:45:16] that was for /pybaltestfile.txt
[13:45:20] no
[13:45:26] that's just whatever I happened to have configured
[13:45:34] i expected it to not work but I guess there's no reason that it doesn't hehe
[13:45:51] I put upload-lb.eqiad as default backend, so that hits the frontends
[13:45:51] squid might be more complicated
[13:45:55] and that will hit pmtpa squids for anything ms7 like pybaltestfile.txt
[13:46:07] so that's why this works ;)
[13:46:11] s/ms7/netapp/
[13:46:13] ;)
[13:46:21] oh there's sq56 too
[13:46:21] yet another hop
[13:46:38] it's quite horrendous
[13:46:44] should put ssl in the loop too
[13:46:56] HTTP/1.1 200 OK
[13:46:56] Server: Sun-Java-System-Web-Server/7.0
[13:46:56] Content-Type: text/plain
[13:46:56] Last-Modified: Wed, 01 Oct 2008 19:12:16 GMT
[13:46:56] ETag: W/"56-48e3cb90"
[13:46:57] X-Cache-Lookup: HIT from sq54.wikimedia.org:3128
[13:46:57] X-Cache-Lookup: HIT from sq56.wikimedia.org:80
[13:46:58] X-Varnish: 235676127 235665105, 1256078885 1254816148, 1192361105, 534308509
[13:46:58] Via: 1.1 varnish, 1.1 varnish, 1.1 varnish, 1.1 varnish
[13:46:59] Content-Length: 86
[13:46:59] Accept-Ranges: bytes
[13:47:00] Date: Wed, 31 Oct 2012 13:44:02 GMT
[13:53:12] RECOVERY - SSH on ms-fe4 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[13:53:22] New patchset: Mark Bergsma; "Send misses to eqiad backends" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30980
[13:53:38] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30980
[14:00:10] New patchset: Mark Bergsma; "Figure out why vcl_config isn't working as expected" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30982
[14:00:11] ganglia says that upload caches in eqiad have -400G of free memory
[14:00:13] how interesting
[14:00:37] http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Upload%20caches%20eqiad&m=load_one&r=hour&s=by%20name&hc=4&mc=2&st=1351691969&g=mem_report&z=large&c=Upload%20caches%20eqiad
[14:00:47] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30982
[14:01:16] ganglia is on crack sometimes
[14:02:39] PROBLEM - NTP on ms-fe4 is CRITICAL: NTP CRITICAL: Offset unknown
[14:03:21] New patchset: Mark Bergsma; "Revert "Figure out why vcl_config isn't working as expected"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30983
[14:03:30] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30983
[14:04:18] RECOVERY - NTP on ms-fe4 is OK: NTP OK: Offset 0.05764937401 secs
[14:08:59] PROBLEM - Host ms-fe4 is DOWN: PING CRITICAL - Packet loss = 100%
[14:09:42] RECOVERY - Host ms-fe4 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms
[14:10:05] RECOVERY - Swift HTTP on ms-fe4 is OK: HTTP OK HTTP/1.1 200 OK - 397 bytes in 0.507 seconds
[14:10:09] RECOVERY - Memcached on ms-fe4 is OK: TCP OK - 0.001 second response time on port 11211
[14:10:31] New patchset: Hashar; "clones zuul configuration as root" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30985
[14:12:08] !log putting ms-fe4 back in the pool
[14:12:21] Logged the message, Master
[14:12:28] yay
[14:12:46] memory leak is hopefully history
[14:13:03] :)
[14:13:11] hm, increased req/s to swift
[14:13:18] did you change something wrt caching?
[14:13:31] i restarted a varnish box
[14:13:33] will restart more
[14:13:34] ah
[14:13:39] okay
[14:13:46] nice
[14:13:53] in our default VCL file:
[14:14:07] if (req.backend.healthy) {
[14:14:07] set req.grace = 5m;
[14:14:07] } else {
[14:14:07] set req.grace = 60m;
[14:14:07] }
[14:14:08] and then a little bit below that:
[14:14:13] /* Select the default backend(s) */
[14:14:29] if I change the backend, the first check probably is not very valid anymore right ;)
[14:14:39] New patchset: Hashar; "git::clone now log a message on failure" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30986
[14:15:42] New review: Faidon; "Yes!" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/30986
[14:15:42] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30986
[14:15:57] paravoid: which let me fix the cloning of the zuul configuration : https://gerrit.wikimedia.org/r/30985 ;-]
[14:16:10] let / led me to
[14:18:00] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30985
[14:18:14] arGghghghghh
[14:18:18] exec {} does not support umask :-]
[14:18:41] just clone under /var/lib/git and symlink
[14:18:43] like puppetmaster::self does
[14:20:14] New patchset: Mark Bergsma; "Set default backend as a configurable parameter, and before health check" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30987
[14:21:02] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30987
[14:21:18] New patchset: Hashar; "git::clone $group parameter was not used" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30988
[14:21:33] New patchset: Faidon; "swift: remove proxy restart
cronjob" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30989 [14:21:33] New patchset: Faidon; "swift: minor change to proxy config for 1.7.4" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30990 [14:21:52] paravoid: and finally https://gerrit.wikimedia.org/r/30988 makes git::clone{} honors the $group parameter. It was not being passed to the inner exec{} calls :/ [14:22:00] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:23:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.970 seconds [14:24:44] New patchset: Mark Bergsma; "Fix syntax error" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30991 [14:25:13] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30991 [14:26:28] !log temporarily depooling ms-fe4 to apply a temporary local hack [14:26:40] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30988 [14:26:41] Logged the message, Master [14:26:58] New patchset: Mark Bergsma; "If only Puppet was Python based" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30992 [14:27:06] https://github.com/openstack/swift/commit/40f46e245c2465ad1561f78ca1dcbc9272974ea7 fwiw [14:27:20] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30992 [14:28:06] so puppet makes integers strings, but booleans not? 
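mark's point a few lines up — the default VCL sets req.grace from req.backend.healthy *before* the default backend is selected, so once the default backend changes, the health that was checked belongs to the wrong backend — comes down to evaluation order. A toy Python model of the two orders (the function names and the "builtin" placeholder are invented for illustration; this is not real VCL semantics):

```python
def grace_for(healthy):
    # 5 minutes of grace when the backend is healthy, 60 when it is not
    return 300 if healthy else 3600

def handle_request(backends, default, grace_before_selection):
    """Model of vcl_recv: set req.grace either before or after the
    default backend is assigned. 'builtin' stands for whatever backend
    the request starts out pointing at."""
    if grace_before_selection:
        grace = grace_for(backends["builtin"])   # health of the wrong backend
        chosen = default
    else:
        chosen = default
        grace = grace_for(backends[chosen])      # health of the backend actually used
    return chosen, grace

backends = {"builtin": True, "eqiad": False}
# Health check first: 5m grace even though the chosen backend is down.
print(handle_request(backends, "eqiad", grace_before_selection=True))
# Backend selection first: correctly falls back to 60m grace.
print(handle_request(backends, "eqiad", grace_before_selection=False))
```

This ordering problem is what the "Set default backend as a configurable parameter, and before health check" patchset above addresses.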
[14:30:39] New patchset: Mark Bergsma; "Use vcl_config['default_backend'] to set the default backend" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30997 [14:31:55] New patchset: Mark Bergsma; "Use vcl_config['default_backend'] to set the default backend" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30997 [14:32:24] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30997 [14:33:11] nice [14:33:18] varnish has a prefer_ipv6 parameter for connecting to backends [14:33:20] I might just use that [14:45:10] !log putting ms-fe4 back in the pool [14:45:23] Logged the message, Master [14:45:29] frigging swift [14:47:12] hehe [14:47:32] sbernardin: robh is very knowledgeable and is always a good person to ask if you need help as well [14:48:00] sbernardin: you should get an email now from equinix for access [14:49:00] ok [14:49:54] <^demon> RobH is good people :) [14:50:01] robh has the disney plague [14:50:14] sudafed = able to function [14:51:29] the EQ portal normally has our two sites... [14:51:38] now its showing another one. [14:51:38] odd [14:56:33] mark: when do you expect to be done with upload caches? 
[14:56:42] I want to start trying 1.7.4 on backend servers too [14:57:09] and I'd like to detect anomalies by seeing the count of backend requests among other things [14:57:31] so I shouldn't overlap this with your work that empties caches :) [14:58:59] it's complicated [14:59:13] i don't think i'm gonna restart all eqiad upload caches now [14:59:21] i'll remove the old files and they'll just restart when they restart [14:59:42] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:59:42] but i'm also gonna put traffic on esams tomorrow or so [15:00:07] but that shouldn't affect swift hits [15:00:15] (much) [15:00:15] less [15:00:27] !log Restarted backend varnish instance on cp1022 [15:00:31] here's another bump [15:00:41] Logged the message, Master [15:03:10] ^demon: we waited to long [15:03:14] now its too cold to go sail. [15:03:54] <^demon> Too cold for anything now :( [15:03:56] (for me, i am a wuss and sailing when its cold sucks) [15:07:45] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 1.602 seconds [15:10:54] <^demon> RobH: Maybe we can try again in the spring :) [15:11:22] I shall hopefully live in SF by then ;] [15:11:34] so have to settle for sailing the bay area when ya visit [15:11:44] (and eventually relocate) [15:11:50] isnt everyone relocating? ;] [15:12:15] <^demon> In June. 8 long painful months. 
[15:22:21] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30989 [15:22:30] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30990 [15:33:32] New patchset: Hashar; "sync Zuul module with upstream and update our conf" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31009 [15:34:43] ^demon: feel free to relocate in France for a few months :-] [15:34:47] we have warm winters here [15:37:55] New patchset: MaxSem; "Support for multiple Solr cores" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29827 [15:39:44] New patchset: MaxSem; "Support for multiple Solr cores" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29827 [15:42:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:44:12] ^demon: you know of any hooks for rendering pages that don't exist? Or just something to change the tag? [15:44:55] <^demon> hexmode: No, I don't. Should be pretty easy to find where we issue the 404 and go from there. [15:45:39] <hexmode> ^demon: yeah, I've got that... [15:45:44] <^demon> I *believe* it's handled via Article.php for normal pages. But stuff may have moved around. [15:45:53] <hexmode> k, ty [15:46:08] <^demon> Ah, Article and ImagePage. [15:46:17] <^demon> Look for $wgSend404Code to get you a rough target. [15:46:40] <hexmode> oo! [15:47:18] <^demon|lunch> Oh, SpecialPageFactory.php too (for special pages) [15:47:22] <^demon|lunch> Those 3 places. 
[15:47:26] <^demon|lunch> Yay code duplication :) [15:47:56] <gerrit-wm> New patchset: Hashar; "sync Zuul module with upstream and update our conf" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31009 [15:49:28] <logmsgbot_> !log reedy synchronized php-1.21wmf3/extensions/UniversalLanguageSelector/ 'Update to master' [15:49:41] <morebots> Logged the message, Master [15:50:10] <logmsgbot_> !log reedy synchronized wmf-config/CommonSettings.php 'wgULSIMEEnabled = false' [15:50:23] <morebots> Logged the message, Master [15:52:56] <gerrit-wm> New patchset: Hashar; "sync Zuul module with upstream and update our conf" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31009 [15:53:47] <Reedy> Hmm [15:54:14] <Reedy> I've found at least one apache in rotation that is missing php5-memcached, last puppet run was 12 minutes ago.. [15:54:23] <mark> typical puppet idiocity [15:54:29] <mark> they define a prefix function, but not a suffix [15:55:20] <gerrit-wm> New review: Hashar; "I forgot to update the zuulwikimedia::instance parameterized class. It was not supporting the status..." [operations/puppet] (production); V: 1 C: 0; - https://gerrit.wikimedia.org/r/31009 [15:57:00] <gerrit-wm> New review: Hashar; "Faidon, this follow up the changes you have merged this wednesday. 
Should not cause any trouble :-]" [operations/puppet] (production); V: 1 C: 0; - https://gerrit.wikimedia.org/r/31009 [15:59:03] <nagios-wm> RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.023 seconds [16:00:24] <nagios-wm> PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours [16:00:24] <nagios-wm> PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [16:00:24] <nagios-wm> PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [16:01:04] <hashar> out to get daughter [16:01:13] <hashar> will be back at 9pm GMT+1 [16:12:33] <gerrit-wm> New review: jan; "Andrew Bogott is working on a change to mediawiki::singlenode so the class does not have to copy for..." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/30593 [16:25:33] <mark> !log Restarted backend varnish instance on cp1023 [16:25:46] <morebots> Logged the message, Master [16:30:34] <nagios-wm> PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:45:15] <nagios-wm> RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.018 seconds [16:48:51] <gerrit-wm> New patchset: Mark Bergsma; "Define pmtpa/eqiad backend/directors in upload-backend.inc instead" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31018 [16:55:47] <gerrit-wm> New patchset: Mark Bergsma; "Define pmtpa/eqiad backend/directors in upload-backend.inc instead" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31018 [17:01:30] <gerrit-wm> Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31018 [17:09:27] <gerrit-wm> New patchset: Mark Bergsma; "Some fixes, remove comments" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31020 [17:10:14] <gerrit-wm> Change merged: Mark Bergsma; 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/31020 [17:13:47] <gerrit-wm> New patchset: Mark Bergsma; "Remove erroneous if" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31021 [17:14:00] <gerrit-wm> Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31021 [17:16:38] <nagios-wm> PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:26:23] <LeslieCarr> mark: can you put the omni outliner in git ? [17:29:39] <nagios-wm> RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.580 seconds [17:30:15] <nagios-wm> PROBLEM - check_minfraud_primary on payments4 is CRITICAL: HTTP CRITICAL - No data received from host [17:35:21] <nagios-wm> RECOVERY - check_minfraud_primary on payments4 is OK: HTTP OK: HTTP/1.1 302 Found - 120 bytes in 0.230 second response time [17:39:46] <preilly> mutante: https://gerrit.wikimedia.org/r/31027 [17:39:47] <gerrit-wm> New patchset: Reedy; "wgOldChangeTagsIndex to false for wikidatawiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31028 [17:40:02] <gerrit-wm> Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31028 [17:40:27] <AaronSchulz> Reedy: wgOldChangeTagsIndex default should be flipped and the exceptions enumerated ;) [17:40:28] <gerrit-wm> Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31027 [17:40:51] <Reedy> Have fun :p [17:41:18] <logmsgbot_> !log reedy synchronized wmf-config/InitialiseSettings.php [17:41:31] <morebots> Logged the message, Master [17:41:56] <AaronSchulz> or we just clean up the schema [17:42:08] <Reedy> Yeah [17:42:10] <Reedy> I added an item to http://wikitech.wikimedia.org/view/Schema_updates before [17:43:39] <nagios-wm> PROBLEM - Host storage3 is DOWN: PING CRITICAL - Packet loss = 100% [17:47:11] <mutante> preilly: ok done. 
merged and ran puppet on cp1041-44. mobile-frontend]/Exec[load-new-vcl-file-frontend]: Triggered 'refresh'. acl carrier_orange_botswana { [17:48:12] <preilly> mutante: thank [17:48:13] <preilly> s [17:48:39] <mutante> yw [17:50:51] <gerrit-wm> New review: Dzahn; "redir wikimania to wikimania2013" [operations/apache-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/30421 [17:50:51] <gerrit-wm> Change merged: Dzahn; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/30421 [17:51:38] <logmsgbot_> !log preilly synchronized php-1.21wmf2/extensions/MobileFrontend 'update after deploy' [17:51:47] <morebots> Logged the message, Master [17:52:21] <logmsgbot_> !log preilly synchronized php-1.21wmf3/extensions/MobileFrontend 'update after deploy' [17:52:21] <morebots> Logged the message, Master [17:53:46] <logmsgbot_> dzahn is doing a graceful restart of all apaches [17:54:04] <logmsgbot_> !log dzahn gracefulled all apaches [17:54:17] <morebots> Logged the message, Master [17:55:12] <nagios-wm> RECOVERY - Host storage3 is UP: PING OK - Packet loss = 0%, RTA = 0.42 ms [17:56:13] <mutante> !log now redirecting wikimania.wm to wikimania2013.wm [17:56:26] <morebots> Logged the message, Master [17:58:48] <nagios-wm> PROBLEM - MySQL disk space on storage3 is CRITICAL: Connection refused by host [17:59:33] <nagios-wm> PROBLEM - SSH on storage3 is CRITICAL: Connection refused [18:00:46] <logmsgbot_> !log reedy synchronized php-1.21wmf2/ [18:01:00] <morebots> Logged the message, Master [18:01:57] <logmsgbot_> !log reedy synchronized php-1.21wmf3/ [18:02:11] <morebots> Logged the message, Master [18:04:21] <nagios-wm> PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:07:33] <logmsgbot_> !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: special, closed and private to 1.21wmf3 [18:07:46] <morebots> Logged the message, Master [18:08:02] <Reedy> 1 Catchable fatal error: Argument 1 passed 
to ContentHandler::getContentText() must implement interface Content, boolean given, called in /usr/local/apache/common-local/php-1.21wmf3/includes/Article.php on line 390 and defined in /usr/local/apache/common-local/php-1.21wmf3/includes/content/ContentHandler.php on line 88 [18:08:03] <Reedy> 1 Catchable fatal error: Argument 1 passed to ReaderFeedbackHooks::ratingToolboxLink() must be an instance of Skin, instance of VectorTemplate given in /usr/local/apache/common-local/php-1.21wmf3/exte [18:08:03] <Reedy> nsions/ReaderFeedback/ReaderFeedback.hooks.php on line 178 [18:08:46] <Reedy> bleh, wrong channel [18:10:30] <nagios-wm> PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [18:16:47] <Reedy> Anyone from ops around? [18:17:05] <Reedy> Can someone look at https://bugzilla.wikimedia.org/show_bug.cgi?id=41589 [18:17:15] <logmsgbot_> !log reedy synchronized php-1.21wmf3/extensions/ReaderFeedback [18:17:18] <Reedy> wm18 is missing php5-memcached [18:17:30] <morebots> Logged the message, Master [18:17:51] <nagios-wm> RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.151 seconds [18:18:57] <mutante> !log mw18 - dpkg --configure -a to fix interrupted dpkg, configures php-luasandbox, install php5-memcached [18:18:57] <mutante> Reedy: fixed [18:19:05] <Reedy> thanks [18:19:06] <mutante> dpkg was interrupted [18:19:11] <morebots> Logged the message, Master [18:19:11] <mutante> np [18:22:08] <Reedy> mutante: same for srv286 if you wouldn't mind [18:22:53] <mutante> hmm, this is different [18:23:05] <mutante> gets 404 for the package [18:23:31] <mutante> php5-memcached 2.0.1-6~wmf+lucid2 [18:24:27] <mutante> ah.. it did not update apt for a long time.. 
could not get lock [18:25:58] <mutante> !log srv286 - kill frozen apt-get update process, update sources, install package upgrades [18:26:10] <morebots> Logged the message, Master [18:27:06] <logmsgbot_> !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikiversity, wikiquote and wikibooks to 1.21wmf3 [18:27:19] <morebots> Logged the message, Master [18:29:21] <mutante> !log srv286 - installing php5-memcached and dist-upgrade [18:29:35] <morebots> Logged the message, Master [18:29:40] <mutante> Reedy: ii php5-memcached 2.1.0-2~wmf+lucid1 [18:30:03] <mutante> (this needed more updates :P) [18:30:27] <Reedy> heh [18:30:54] <logmsgbot_> !log reedy synchronized php-1.21wmf3/extensions/ReaderFeedback [18:31:01] <cmjohnson1> !log rebuilding software raid for storage3 and provisioning [18:31:08] <morebots> Logged the message, Master [18:31:21] <morebots> Logged the message, Master [18:37:23] <gerrit-wm> New patchset: Brion VIBBER; "Set $wgMFTrademarkSitename = true to restore previous TM/(R) in footer on live sites (default has changed to disable them)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31044 [18:37:39] <nagios-wm> PROBLEM - NTP on storage3 is CRITICAL: NTP CRITICAL: No response from NTP server [18:38:10] <logmsgbot_> !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wiktionary and wikinews to 1.21wmf3 [18:38:24] <morebots> Logged the message, Master [18:39:43] <logmsgbot_> !log preilly synchronized php-1.21wmf2/extensions/ZeroRatedMobileAccess 'update post deploy' [18:39:57] <morebots> Logged the message, Master [18:40:11] <logmsgbot_> !log preilly synchronized php-1.21wmf3/extensions/ZeroRatedMobileAccess 'update post deploy' [18:40:26] <morebots> Logged the message, Master [18:41:13] <logmsgbot_> !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Everything else non wikipedia or wikivoyage to 1.21wmf3 [18:41:26] <morebots> Logged the message, Master [18:41:50] 
<gerrit-wm> New patchset: Dereckson; "Removing settings for no more existant wikis." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28631 [18:42:41] <gerrit-wm> New review: Dereckson; "PS2: rebased" [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/28631 [18:44:33] <nagios-wm> PROBLEM - Host storage3 is DOWN: PING CRITICAL - Packet loss = 100% [18:45:27] <gerrit-wm> New patchset: Reedy; "Numerous other wikis over to 1.21wmf3" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31046 [18:45:41] <gerrit-wm> Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31046 [18:51:22] <gerrit-wm> New patchset: Mark Bergsma; "Can't use default_backend when defining it in the include file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31051 [18:51:50] <gerrit-wm> Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31051 [18:52:05] <gerrit-wm> New patchset: Mark Bergsma; "Revert "Can't use default_backend when defining it in the include file"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31053 [18:52:35] <gerrit-wm> Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31053 [18:53:43] <nagios-wm> PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:54:39] <logmsgbot_> !log preilly Started syncing Wikimedia installation... 
: update zero rated mobile access [18:54:53] <morebots> Logged the message, Master [18:55:00] <gerrit-wm> New patchset: Mark Bergsma; "Can't use default_backend when defining it in the include file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31054 [18:55:35] <gerrit-wm> Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31054 [19:02:15] <gerrit-wm> New patchset: Jgreen; "replace storage3 with db78" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31055 [19:02:24] <nagios-wm> PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [19:08:43] <nagios-wm> RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.025 seconds [19:10:45] <gerrit-wm> Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31055 [19:12:14] <nagios-wm> PROBLEM - Puppet freshness on ms-fe3 is CRITICAL: Puppet has not run in the last 10 hours [19:12:56] <gerrit-wm> New patchset: Mark Bergsma; "Apparently backend port numbers are strings" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31059 [19:13:25] <gerrit-wm> Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31059 [19:15:52] <gerrit-wm> New patchset: Jgreen; "fixed deprecated qw() syntax" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31061 [19:16:45] <gerrit-wm> Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31061 [19:34:04] <mark> !log Restarted backend varnish instance on cp1024 [19:34:17] <morebots> Logged the message, Master [19:35:43] <logmsgbot_> !log preilly Finished syncing Wikimedia installation... 
: update zero rated mobile access [19:35:53] <gerrit-wm> New patchset: Cmjohnson; "Removing decommissioned servers from the dhcpd file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31070 [19:35:56] <morebots> Logged the message, Master [19:36:21] <ori-l> (binasher: thanks!) [19:41:00] <nagios-wm> PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:41:27] <nagios-wm> PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [19:41:27] <nagios-wm> PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [19:41:27] <nagios-wm> PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [19:41:49] <binasher> ori-l: hopefully soon :) [19:48:28] <gerrit-wm> New patchset: Asher; "a very sketchy start to a redis module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31073 [19:52:05] <Krinkle> Reedy: Can I ask you to set a path in svn to read-only? [19:52:13] <Krinkle> trunk/tools/ToolserverI18N (it has been migrated to github) [19:52:58] <Reedy> Krinkle: I can't [19:53:08] <Krinkle> who can? [19:53:08] <Reedy> -rw-r--r-- 1 demon svn [19:53:11] <Krinkle> Okay [19:53:23] <Krinkle> Reedy: In general, or this directory? [19:53:36] <Krinkle> ^demon: ping [19:53:50] <Krinkle> actually, nevermind. I'll delete the directory instead. [19:54:00] <^demon> No, I'll mark it r/o. [19:54:04] <^demon> One second. [19:54:04] <Krinkle> No need to keep it in svn. I'll make sure usages are updated (only used on 2 servers) [19:54:19] <^demon> There's no harm in leaving it in SVN. [19:54:19] <Krinkle> ^demon: well, then I'll add a note first pointing to the new location. 
[19:54:21] <nagios-wm> RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.275 seconds [19:54:28] <Krinkle> an OBSOLETE file [19:55:30] <AaronSchulz> binasher:I've been watching dberrors.log and messing with https://gerrit.wikimedia.org/r/#/c/30937/ on my testwiki [19:55:36] <AaronSchulz> binasher: do you have a local testwiki set up? [19:55:58] <Krinkle> ^demon: done :) [19:56:02] <Krinkle> feel free to r/o [19:56:15] <^demon> Done [19:56:35] <AaronSchulz> binasher: maybe ^demon can merge that ;) [19:57:17] <gerrit-wm> New review: RobLa; "Per conversation on IRC yesterday, Jan says in his testing that $wgMaxShellMemory = 300000 was suffi..." [operations/mediawiki-config] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/30773 [19:58:39] <AaronSchulz> binasher: killing the ORDER BY kills the deadlocks pretty well, even with 20 threads [19:59:58] <AaronSchulz> binasher: without it deadlocks can happen occasionally if there are a large number of jobs and frequently when there are just a few [20:00:34] <binasher> yeah, i don't think the order by really did anything useful either [20:01:50] <^demon> AaronSchulz: I can't merge, waiting on jenkins. [20:02:07] <AaronSchulz> did jenkins slow down? [20:03:25] <^demon> I dunno, looks backed up. [20:03:31] <^demon> The build queue. [20:06:53] <robla> AaronSchulz: would you mind deploying https://gerrit.wikimedia.org/r/30773 ? 
[20:07:35] <gerrit-wm> Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/30773 [20:08:00] <robla> also (sorry) slightly bigger task: https://gerrit.wikimedia.org/r/#/c/30935/ [20:08:12] <robla> (backport and deploy) [20:08:16] <robla> AaronSchulz: ^ [20:09:03] <logmsgbot_> !log aaron synchronized wmf-config/CommonSettings.php 'Bug 41528 - need more memory for video thumbnails' [20:09:16] <morebots> Logged the message, Master [20:09:57] <gerrit-wm> New patchset: Dzahn; "update CA for the new star.wikipedia.org SSL cert" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31077 [20:10:55] <gerrit-wm> Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31077 [20:11:48] <robla> foo....thumbs still broken [20:29:00] <nagios-wm> PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:31:00] <gerrit-wm> New patchset: Asher; "a very sketchy start to a redis module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31073 [20:38:11] <cmjohnson1> can i get a +2 on this plz https://gerrit.wikimedia.org/r/#/c/31070/ [20:41:03] <logmsgbot_> !log aaron synchronized php-1.21wmf3/extensions/TimedMediaHandler 'deployed bf098d262dce290ac4b0cf70f74182c98aeca4d9' [20:41:16] <morebots> Logged the message, Master [20:42:08] <logmsgbot_> !log aaron synchronized php-1.21wmf2/extensions/TimedMediaHandler 'deployed cbe526b06ec1876c933f8a04c2526b137642a879' [20:42:22] <morebots> Logged the message, Master [20:43:05] <AaronSchulz> binasher: how is update without order by both less locking and replication safe? 
[20:44:09] <nagios-wm> RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.025 seconds [20:49:55] <AaronSchulz> binasher: http://dev.mysql.com/doc/refman/5.1/en/replication-features-limit.html [20:50:47] * AaronSchulz hugs row-based replication [20:50:51] <binasher> AaronSchulz: arg.. yeah [20:51:47] <binasher> first mariadb, then rbr [20:53:02] <binasher> correct jobq replication is of limited but some importance [20:53:16] <binasher> death to the mysql job queue [20:53:28] <binasher> another inefficiency is that with a low number of jobs in the queue, it can take a lot of update queries to successfully grab one [20:54:08] <binasher> i saw four after a single job was inserted, based on the random id and direction select [20:54:35] <nagios-wm> PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [20:54:57] <binasher> and those all have to go to the binlog even with 0 rows changed [20:55:18] <AaronSchulz> ;) [20:56:19] <Reedy> mutante: Did you restart apache on 286 afterwards? If not, could you do so please? [20:57:24] <binasher> AaronSchulz: the update query could be run with sql_log_bin = 0, but replicate the delete after job completion [20:57:48] <gerrit-wm> New patchset: Hoo man; "Removed unused wikidata docroots" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31119 [20:58:38] <binasher> assuming job workers also read from the master [20:59:17] <AaronSchulz> slaves are used to cache isEmpty(), that's it [21:00:04] * AaronSchulz tries something [21:01:38] <Reedy> AaronSchulz: Amazing! [21:02:21] <AaronSchulz> Reedy: hm? [21:02:42] <Reedy> You tried "something" ;) [21:02:57] <Reedy> Can someone restart/graceful the apache on srv286 please? 
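binasher's observation that a nearly empty queue can burn several claim UPDATEs per job ("i saw four after a single job was inserted, based on the random id and direction select") can be sketched as follows. The probe function is a simplification invented here, not MediaWiki's actual JobQueueDB code; it only models claiming the first job at or past a random job_random threshold in one direction:

```python
def claim_probe(job_randoms, threshold, ascending):
    """One claim attempt: return the first unclaimed job at or past the
    random threshold in the chosen direction, or None if the probe misses."""
    candidates = sorted(job_randoms, reverse=not ascending)
    for jr in candidates:
        if (ascending and jr >= threshold) or (not ascending and jr <= threshold):
            return jr
    return None  # this attempt spent an UPDATE without claiming anything

# With a single job at job_random=0.25, probing upward from 0.7 misses,
# so another query (and another binlog entry) is needed.
print(claim_probe([0.25], 0.7, ascending=True))   # None -> wasted query
print(claim_probe([0.25], 0.7, ascending=False))  # 0.25 -> claimed
```

Each miss still hits the binlog, which is binasher's point about those updates replicating even with 0 rows changed.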
[21:07:03] <binasher> Reedy: sure [21:07:11] <Reedy> Thanks [21:07:22] <binasher> done [21:07:46] <Reedy> There were a couple of stray apaches without the pecl memcached installed [21:08:05] <binasher> ah yeah, i saw mutante fixed a few where apt was broken [21:08:42] <Reedy> I noticed the error came up in the logs again, but it was installed, so presumed it was just missing a restart [21:09:17] <nagios-wm> PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [21:10:34] <nagios-wm> RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [21:12:35] <logmsgbot_> !log reedy synchronized php-1.21wmf3/includes/site/SiteObject.php [21:12:49] <morebots> Logged the message, Master [21:14:04] <binasher> bbiab, lunch [21:14:32] <nagios-wm> PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [21:16:47] <nagios-wm> PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:28:14] <Nemo_bis> Reedy: could you update https://wikitech.wikimedia.org/view/Locations ? [21:28:31] <Reedy> With what? [21:28:43] <Nemo_bis> or create [[Logs]] with a couple lines on where they are and an ls or something [21:29:07] <Nemo_bis> Reedy: with what it used to contain if you have no imagination? :p [21:29:24] <Nemo_bis> or just strike what you've never heard of [21:29:27] <Nemo_bis> dunno, it's so old [21:31:40] <nagios-wm> RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.027 seconds [21:34:14] <nagios-wm> RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.052 second response time [21:35:32] <nagios-wm> PROBLEM - Puppet freshness on brewster is CRITICAL: Puppet has not run in the last 10 hours [21:47:08] <j^> AaronSchulz: should test2 run latest TMH already? http://test2.wikipedia.org/wiki/Special:Version still lists the old version of TMH. 
its up to date on test.w.o [21:50:10] <robla> j^: hrm, that's odd [21:51:50] <logmsgbot_> !log reedy synchronized php-1.21wmf3/maintenance/nextJobDB.php [21:52:03] <morebots> Logged the message, Master [21:57:08] <nagios-wm> PROBLEM - Memcached on virt0 is CRITICAL: Connection refused [21:58:31] <nagios-wm> PROBLEM - Puppet freshness on db51 is CRITICAL: Puppet has not run in the last 10 hours [22:00:45] <AaronSchulz> gwicke: looking at ganglia? :) [22:01:36] <gwicke> AaronSchulz: hm? [22:02:24] <gwicke> we are expanding to use most of labs' CPU resources for rt testing, so ganglia is useful as a quick status check.. [22:03:37] <gerrit-wm> Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31073 [22:03:57] <gwicke> if anybody notices too many API requests from labs ips, it is probably us [22:04:39] <nagios-wm> PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:06:17] <binasher> Reedy: AaronSchulz: was timedmediahandler disabled last night and found not the cause of the jobqueue breakage? looks like refreshLinks2 and htmlCacheUpdate are still broken [22:06:31] <Reedy> binasher: yup, we know. See -tech ;) [22:18:00] <nagios-wm> RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.746 seconds [22:18:28] <nagios-wm> RECOVERY - Memcached on virt0 is OK: TCP OK - 0.001 second response time on port 11000 [22:25:51] <AaronSchulz> binasher: https://gerrit.wikimedia.org/r/#/c/31129/ doesn't deadlock with 32 runners on my box [22:33:49] <binasher> AaronSchulz: how many jobs have you had in the queue? 
[22:34:19] <AaronSchulz> 261
[22:34:25] * AaronSchulz will try with less
[22:37:22] <logmsgbot_> !log reedy synchronized php-1.21wmf3/maintenance/nextJobDB.php
[22:37:36] <morebots> Logged the message, Master
[22:43:47] <AaronSchulz> binasher: works fine with 10 too
[22:45:55] <hashar> hello dear ops :-]  The gallium server has more CPU wait on I/O since we upgraded it from Lucid to Precise. Does it sound familiar to anyone?
[22:46:28] <binasher> AaronSchulz: i'm not sure how i feel about the reduceContention switch
[22:48:55] <robla> AaronSchulz: j^: did you guys sort out the TMH issue that j^ identified above?
[22:49:39] <robla> (may need to call off the deployment tomorrow, which would really suck)
[22:51:17] <AaronSchulz> RoanKattouw: are those Special:Version git links always up to date?
[22:51:31] <AaronSchulz> binasher: what would you do?
[22:51:37] <RoanKattouw> No, not always, because sync-dir excludes .git
[22:51:47] <AaronSchulz> yep, so there
[22:51:50] <RoanKattouw> It could be fixed if someone cared enough
[22:51:54] <nagios-wm> PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:52:00] <RoanKattouw> Or by "just" moving to git-deploy
[22:52:07] <AaronSchulz> "just"
[22:52:08] <robla> ok....won't worry about that then
[22:52:48] * robla double checks that the IE fixes are in the right branch
[22:55:24] <robla> sweet...looks good
[22:59:16] <binasher> AaronSchulz: why make the behavior selectable vs always doing the select + update by pk?
[23:01:06] <AaronSchulz> binasher: yeah, maybe I'll just comment out the old code
[23:01:49] <binasher> the old code minus the order by could still be ok if you disable binlogging for just that query
[23:02:22] <AaronSchulz> how do I do that?
[23:03:19] <AaronSchulz> I kind of like having both code paths since I prefer the update/limit way, though the select/update works if needed
[23:06:29] <AaronSchulz> binasher: SET sql_log_bin = 0;?
[23:07:11] <binasher> yeah
[23:07:21] <AaronSchulz> does wikiuser even have SUPER?
[23:07:25] <binasher> well
[23:08:08] <gerrit-wm> New patchset: Ori.livneh; "Don't exclude .git when syncing to prod" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31146
[23:08:16] <ori-l> ^^ RoanKattouw
[23:08:24] <nagios-wm> RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.032 seconds
[23:08:30] <binasher> AaronSchulz: i don't think super is needed to set it on a session basis, only as a global
[23:08:50] <AaronSchulz> Disables or enables binary logging for the current session (sql_log_bin is a session variable) if the client has the SUPER privilege.
[23:09:00] <RoanKattouw> ori-l: It should at least exclude .git/objects though, otherwise the syncs will be twice as big
[23:09:09] <RoanKattouw> Or whatever the name of the dir with all the bloat is
[23:09:22] <gerrit-wm> New patchset: Reedy; "$wgWBSettings['useChangesTable'] = false;" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31147
[23:09:46] <gerrit-wm> Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31147
[23:10:21] <logmsgbot_> !log reedy synchronized wmf-config/CommonSettings.php
[23:10:28] <morebots> Logged the message, Master
[23:11:18] <ori-l> RoanKattouw: ah, i see. i gotta run now but i'll figure it out exactly later and update the patch.
[23:11:31] <binasher> AaronSchulz: i was going to say that wikiadmin doesn't have super, but it does have a GRANT ALL PRIVILEGES ON `%wik%`.* TO 'wikiadmin'@'10.64.0.0/255.255.252.0';
[23:11:38] <binasher> wikiuser, nope
[23:12:01] <binasher> stick with update and select by pk
[23:12:19] <AaronSchulz> you mean select and update by pk? :)
[23:12:51] <^demon|brb> RoanKattouw: I think we only need .git/HEAD, iirc.
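[The claim strategy binasher and AaronSchulz settle on above (select one candidate row first, then claim it by primary key, instead of a contended `UPDATE ... ORDER BY job_random LIMIT 1`) can be sketched as follows. This is a minimal illustration using sqlite3 and a hypothetical simplified schema, not the actual JobQueueDB code; the zero-rowcount check is what makes the claim safe when another runner grabs the same row first.]

```python
import sqlite3

# Toy job table: names and schema are hypothetical, for illustration only.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE job (job_id INTEGER PRIMARY KEY, '
             'job_random INTEGER, claimed INTEGER DEFAULT 0)')
conn.executemany('INSERT INTO job (job_random) VALUES (?)',
                 [(i * 37 % 100,) for i in range(10)])

def claim(conn):
    # Step 1: read one unclaimed candidate. The read is cheap and does not
    # touch rows it won't claim, unlike an UPDATE with ORDER BY + LIMIT.
    row = conn.execute(
        'SELECT job_id FROM job WHERE claimed = 0 '
        'ORDER BY job_random LIMIT 1').fetchone()
    if row is None:
        return None  # queue empty
    # Step 2: claim it by primary key. rowcount == 0 means another runner
    # claimed it between our SELECT and UPDATE; the caller would retry.
    cur = conn.execute(
        'UPDATE job SET claimed = 1 WHERE job_id = ? AND claimed = 0',
        (row[0],))
    return row[0] if cur.rowcount else None
```

[Repeated calls drain the queue one job at a time and return `None` once every row is claimed.]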
[23:13:03] <^demon|brb> Nevermind, we need that and refs/heads/*
[23:13:51] <RoanKattouw> Yeh
[23:14:00] <^demon|brb> Oh, and GitInfo also wants config.
[23:14:04] <RoanKattouw> Plus .git sans objects is small anyway
[23:16:29] <^demon|brb> Eh, not necessarily. For me on core, .git is 372M, .git/objects is 192M
[23:16:50] <^demon|brb> .git/modules/ (um, wtf is that?) is taking up 179M
[23:17:12] <^demon|brb> Ahhh, it's for submodules, duh
[23:17:38] <^demon|brb> Hahah, .git/modules is the .git dirs of all the submodules.
[23:18:08] <RoanKattouw> Oh ouch
[23:18:27] <RoanKattouw> OK so you'd need to exclude .git/objects and .git/modules/*/objects or something?
[23:19:11] <^demon|brb> Something like that. Although for extensions we'll just use the extension's .git directory. Easier to just explicitly include HEAD, refs/* and config, rather than accidentally syncing extra crap.
[23:20:07] <^demon|brb> Oh, .git dirs for extensions are just flat files? Maybe we do need the directory :\
[23:20:35] <^demon|brb> Blargh, yeah we will need them.
[23:20:38] <^demon|brb> $ cat extensions/ProofreadPage/.git
[23:20:38] <^demon|brb> gitdir: ../../.git/modules/extensions/ProofreadPage
[23:21:08] <^demon|brb> So yeah, your solution is right.
[23:23:37] <RoanKattouw> Well, almost
[23:23:44] <RoanKattouw> .git/objects and .git/modules/**/objects
[23:23:51] <logmsgbot_> !log reedy synchronized php-1.21wmf3/extensions/Wikibase
[23:23:55] <morebots> Logged the message, Master
[23:23:57] <RoanKattouw> Because names of submodules can contain slashes
[23:29:42] <gerrit-wm> New patchset: Jérémie Roquet; "Add $wgForceUIMsgAsContentMsg array for wikidata" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31153
[23:32:44] <gerrit-wm> Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31153
[23:33:12] <gerrit-wm> Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28631
[23:33:40] <gerrit-wm> Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/30588
[23:34:46] <gerrit-wm> New review: Reedy; "reedy@fenari:/home/wikipedia/common$ mwscript namespaceDupes.php nowiki --fix" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/30588
[23:35:20] <gerrit-wm> New review: Demon; "This isn't quite what we want. We don't want to include all of .git, or we'll be syncing roughly twi..." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/31146
[23:35:26] <gerrit-wm> New review: Reedy; "Reverting for now..."
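[RoanKattouw's correction above (`.git/modules/**/objects`, not `*/objects`) hinges on `**` crossing `/` boundaries while a single `*` does not, which matters because submodule names such as `extensions/ProofreadPage` contain slashes. A toy matcher, assuming rsync-like wildcard semantics (this is not rsync's real filter engine, and the helper name is invented), makes the difference concrete:]

```python
import re

def to_regex(pattern):
    # Translate an rsync-style exclude pattern to a regex:
    # '**' matches anything including '/', a single '*' stops at '/'.
    out, i = [], 0
    while i < len(pattern):
        if pattern.startswith('**', i):
            out.append('.*')
            i += 2
        elif pattern[i] == '*':
            out.append('[^/]*')
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile(''.join(out) + '$')

# A submodule whose name contains a slash, as in the ProofreadPage example:
deep = '.git/modules/extensions/ProofreadPage/objects'
assert to_regex('.git/modules/**/objects').match(deep)        # '**' reaches it
assert not to_regex('.git/modules/*/objects').match(deep)     # '*' does not
assert to_regex('.git/modules/*/objects').match('.git/modules/Foo/objects')
```

[So an exclude list of `.git/objects` plus `.git/modules/**/objects` keeps the pack/object bloat out of the sync while still shipping `HEAD`, `refs/*`, `config`, and the flat `.git` gitdir files that extensions use.]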
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/30588
[23:35:36] <gerrit-wm> Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31154
[23:36:16] <gerrit-wm> Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23927
[23:37:17] <gerrit-wm> Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31119
[23:38:08] <gerrit-wm> Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/30585
[23:39:09] <logmsgbot_> !log reedy synchronized wmf-config/
[23:39:18] <morebots> Logged the message, Master
[23:40:03] <nagios-wm> PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:40:09] <gerrit-wm> Change merged: awjrichards; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31044
[23:43:05] <Reedy> mutante (or someone else!) can you graceful/restart apache on mw18 please?
[23:46:20] <binasher> Reedy: will do
[23:47:33] <binasher> done
[23:53:15] <nagios-wm> RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.420 seconds