[00:00:29] RobH: looking [00:02:10] in the old slot format config file, we had to pull down memcached servers out asap [00:02:23] so just wondering if the new memcached and config format also needs this, or if it's smarter =] [00:02:35] (i'd rather know before we have a crashed mc server, heh ;) [00:02:45] so nothing is wrong right now? [00:02:51] on the doc? [00:02:56] oh, on cluster [00:02:57] on the site :) [00:02:58] nothing is wrong now. [00:03:01] ok [00:03:09] there will still be a flood of errors [00:03:19] so I'd imagine you'd want to remove servers from the list [00:03:22] I am just updating the wikitech docs so when we have a broken one we know [00:03:26] cool, sounds good to me [00:03:31] the only difference is that you don't have to put in a replacement [00:03:47] since the hashing is consistent, a lot of keys will still map to the same servers (though not all) [00:05:13] I suppose changing the servers in the list more than once in a short time could cause keys to map back to a server they used to map to before...which could cause consistency problems [00:05:27] a short time is ? [00:05:37] 1/5/10 minutes? [00:05:53] adding to docs so folks know what to do [00:06:35] depends how long items are cached [00:06:39] we should probably audit the MW code [00:09:10] RobH: I can think of some things that cache for up to a day that expect some consistency [00:09:17] eww [00:09:51] i think the mctest.php only tests tampa memcached. [00:10:09] yep... [00:10:25] so our memcache testing script is checking tampa memcached, and i think we are running memcached out of eqiad aren't we? 
[00:10:35] say a file is cached as "not existing" on mc1, mc1 is pulled, and mc7 is used, someone uploads A and it is cached as "existing", then someone adds mc1 back, and the old "not existing" key comes back [00:10:35] PROBLEM - Puppet freshness on amssq41 is CRITICAL: Puppet has not run in the last 10 hours [00:10:55] well, if you restarted mc1 that won't happen [00:10:56] New patchset: Jdlrobson; "Enable Watchlist schema in config file (EventLogging)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/48060 [00:11:19] but you could imagine, maybe pulling mc2 would cause the key to move from mc1 => mc7 and back when mc2 is re-added [00:11:42] in which case the file would be falsely cached as not existing and thumbnails would give 404s for no apparent reason [00:11:44] and this wasn't an issue before since servers had slot IDs? [00:11:55] and since a server wouldn't go back to its old slot id, no problem then? [00:12:05] it wasn't because we used modulo hash and always swapped in a spare [00:12:10] yea [00:12:13] so keys were never getting remapped around [00:12:37] i put in docs that it can cause an issue, and why, and to be careful [00:12:40] the advantage of consistent hashing is that if *you don't have a spare* you can pull a server without causing almost everything to map to new servers [00:12:48] but i don't think there is much else one can do then when swapping if a server is down. [00:13:12] AaronSchulz: if an mc host is actually pulled by ops, it's likely to come back after a reboot [00:13:27] binasher: see my above comment :) [00:13:30] i just put in the docs on wikitech to reboot before pulling [00:13:43] and only pull from the config if the server cannot be resurrected asap [00:13:45] er, reboot after pulling? 
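A minimal sketch of the remapping difference discussed above, assuming a toy hash ring rather than the actual libketama algorithm. The server names `mc1`..`mc8`, the `file:N` keys, and the helpers `modulo_server`, `HashRing`, and `moved_fraction` are all hypothetical; the sketch only illustrates why pulling a server without a spare remaps most keys under modulo-slot hashing but only a small share on a consistent-hash ring.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a 32-bit integer (MD5-based; an illustrative choice)."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:4], "big")


def modulo_server(key: str, servers: list) -> str:
    """Slot-style mapping: hash the key into one of len(servers) slots."""
    return servers[_hash(key) % len(servers)]


class HashRing:
    """Toy consistent-hash ring with several virtual points per server."""

    def __init__(self, servers, points_per_server=100):
        # Each server contributes many points so load spreads evenly.
        self._ring = sorted(
            (_hash(f"{server}-{i}"), server)
            for server in servers
            for i in range(points_per_server)
        )
        self._hashes = [h for h, _ in self._ring]

    def server_for(self, key: str) -> str:
        # First point clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect(self._hashes, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


def moved_fraction(before, after, keys) -> float:
    """Share of keys whose server differs between two mapping functions."""
    return sum(1 for k in keys if before(k) != after(k)) / len(keys)


servers = [f"mc{i}" for i in range(1, 9)]
without_mc2 = [s for s in servers if s != "mc2"]  # mc2 pulled, no spare
keys = [f"file:{i}" for i in range(10000)]

full_ring = HashRing(servers)
degraded_ring = HashRing(without_mc2)

modulo_moved = moved_fraction(
    lambda k: modulo_server(k, servers),
    lambda k: modulo_server(k, without_mc2),
    keys,
)
ring_moved = moved_fraction(full_ring.server_for, degraded_ring.server_for, keys)

print(f"modulo: {modulo_moved:.0%} of keys moved")
print(f"ring:   {ring_moved:.0%} of keys moved")
```

In this toy ring the only keys that move are the ones that lived on the pulled server, and they map back to it once it is re-added. That is the case where a server that was not restarted could hand back stale entries; whether a real client behaves exactly this way depends on its hashing implementation.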
[00:13:51] binasher: the server pulled won't be the only one with keys remapped [00:14:07] ie: don't take server out of mc-site.php until it's rebooted and isn't going to come back easily [00:14:28] if it's coming back right away, less overhead and errors to simply fix [00:14:33] it's better to let the log spam for a few minutes than pulling a server [00:14:37] AaronSchulz: ahh, your pulling mc2 causing mc1 keys to move to mc7 comment [00:14:39] on old setup one would always swap out with an up spare first, then troubleshoot [00:14:44] unless that server is used for some crazy hot key (like slave lag cache) [00:14:51] you'd probably know that when you see it :) [00:14:55] binasher: this is all discussion, all mc servers are fine now =] [00:14:57] yeah, pulling servers should be done in ways that support the consistent hashing [00:15:10] i'm trying to update the memcache wikitech page since it was horribly outdate [00:15:11] d [00:15:15] PROBLEM - MySQL Slave Delay on db1007 is CRITICAL: CRIT replication delay 187 seconds [00:15:21] Also, the mctest.php script seems to only test tampa mc servers. [00:15:59] PROBLEM - MySQL Replication Heartbeat on db1007 is CRITICAL: CRIT replication delay 211 seconds [00:16:34] binasher: maybe it would be nice to have a feature to list a server as down, so that keys that map to it would map to the "next" server on the hash ring [00:16:38] AaronSchulz: we use http://www.last.fm/user/RJ/journal/2007/04/10/rz_libketama_-_a_consistent_hashing_algo_for_memcache_clients by way of libmemcached [00:17:15] New patchset: FastLizard4; "Allow override of the MySQL server bind address through the mysql_server_bind_address puppet variable. 
The default will still be 127.0.0.1 (end result: Instance administrators may now set their MySQL server's bind address in their NovaInstance configurat" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48585 [00:17:24] binasher: right [00:17:26] removing a server should not remap many keys, though its possible that a small percentage will be [00:17:57] that's what I was saying about, most stuff will map to the same place [00:18:04] s/about/above [00:18:25] which is different than in the slot system [00:18:48] but slot system style, we can also still avoid changing the number of server slots when a box is down for maintenance [00:19:43] or as you said, put up with the log spam [00:20:03] sorry, i jumped in on this without having read all of the back scroll [00:20:53] no worries [00:21:08] so who do i bug to fix mctest.php? [00:21:09] ;] [00:21:10] RobH: actually when I said "module hash" I meant "modulo slot", but you know :) [00:21:24] wasn't that fixed? [00:21:37] it outputs tampa mc servers [00:21:47] shouldnt it be outputting whatever the active cluster is? (in this case, eqiad?) [00:21:55] RobH: Running from fenari? [00:21:58] yep [00:21:59] try it on bastion1001? [00:22:03] ^ [00:22:14] bleh. [00:22:52] different error [00:22:55] you guys try this? ;] [00:23:04] What's the error there? [00:23:05] Could not open input file: /srv/deployment/mediawiki/common/multiversion/MWScript.php [00:23:11] lols [00:23:24] heh [00:23:31] I guess bast1001 isn't a deployment target [00:23:55] so, then is the spence memcached check running against tampa servers? [00:24:03] makes it kind of pointless then ;] [00:24:04] it wouldn't suprise me [00:24:33] New review: Andrew Bogott; "As discussed on IRC, this change is good but should be done in the labs role class instead." 
[operations/puppet] (production); V: 0 C: -2; - https://gerrit.wikimedia.org/r/48585 [00:25:25] RobH: run /usr/bin/sync-common on bast1001 as root [00:25:25] heh, so copper is running a memcache check on nagios [00:25:37] so someplace does have mc in eqiad checking [00:25:39] that's good. [00:25:41] Then someone needs to add it to the dsh group [00:26:22] well, the check is on nagios but it's its own memcache instance, nm. [00:26:36] i still think we may not be checking mc service in eqiad [00:27:09] trwikibooks: Memcached error for key "trwikibooks:revisiontext:textid:33143" on server "10.64.0.183:11211": ITEM TOO BIG [00:27:10] heh [00:27:27] so we list the ports in the mc-pmtpa.php [00:27:31] but no ports in mc-eqiad.php [00:27:57] still same ports, just odd they don't match configurations. [00:28:18] !log aaron synchronized php-1.21wmf9/includes/db/Database.php 'deployed d8705542627f006a7ec9f81a9fb488fcc9a367bd' [00:28:19] Logged the message, Master [00:28:21] (why the hell do we run it on non standard port anyhow?) [00:28:29] AaronSchulz: the max slab size in memcached is no longer limited to 1mb objects, it's user configurable [00:28:30] legacy? [00:28:37] Reedy: that's all i can see. [00:28:44] i hate that reason. [00:28:48] Kids these days. [00:28:52] RobH: we don't. 
[00:28:59] we use the standard port [00:29:04] i thought the tampa port was non standard [00:29:16] i guess it was fixed a long time ago and i never noticed =P [00:29:53] AaronSchulz: i do believe that the 1m limit is hardcoded in libmemcached though, so changing our running memcached wouldn't immediately filter down [00:29:57] binasher: maybe we should do some logging to get histogram from a sample of attempted sets for size [00:30:01] we do build and package libmemcached ourselves [00:30:05] oh [00:30:10] but i'm not sure if this would even be worth doing [00:30:14] PROBLEM - MySQL Replication Heartbeat on db1007 is CRITICAL: CRIT replication delay 208 seconds [00:30:18] * RobH is still bummed he cannot deploy new apaches until chris runs some cables [00:30:26] seems pretty occasional that we hit it the case [00:30:32] not super often [00:30:50] PROBLEM - MySQL Slave Delay on db1007 is CRITICAL: CRIT replication delay 217 seconds [00:31:25] all depends on the latency cost of never caching those few keys vs. 
the ram cost in having memory allocated to huge-object slabs that would very rarely be used [00:32:30] and we'd still need to generate a key size histogram to determine the new max slab size [00:32:32] one could even automatically shard keys and use getmulti ;) [00:32:51] some updates applied: http://wikitech.wikimedia.org/view/Memcached [00:33:24] AaronSchulz: twemproxy has a cool feature that lets you set an arbitrary shard key for a set of keys [00:33:25] andrewbogott: http://arstechnica.com/security/2013/02/at-facebook-zero-day-exploits-backdoor-code-bring-war-games-drill-to-life/ [00:33:34] binasher: I kind of wish the hashing was based on names the client gives the hosts [00:33:35] AaronSchulz: so that when you getmulti, they will all be on the same server [00:33:49] that way one could swap out a server and the distribution would not change [00:35:15] New patchset: Reedy; "Remove memcached ports from pmtpa config" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/48590 [00:35:38] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [00:36:11] AaronSchulz: splitting very large keys into multiples and reassembling after a getmulti might not be that crazy… though doing so with that twemproxy feature would be better [00:36:31] getting them on the same server is nice [00:36:43] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/48590 [00:38:17] AaronSchulz: i might start testing twemproxy on a small scale, possibly on a production apache or two [00:39:05] RECOVERY - MySQL Replication Heartbeat on db1007 is OK: OK replication delay 28 seconds [00:39:21] TimStarling: I've made some more jobqueue commits [00:39:32] RECOVERY - MySQL Slave Delay on db1007 is OK: OK replication delay 11 seconds [00:39:40] ok [00:40:46] binasher: btw, how are those redis servers comming? 
:) [00:41:21] depends on if we burn db class hosts on them or not… which would kinda be a waste [00:41:43] RobH: what's up with the continuing saga of the new ssd having high perf misc servers? [00:42:05] the swift legacy? [00:43:13] LocalFile::recordUpload2: Transaction already in progress (from SiteStatsUpdate::doUpdate), performing implicit commit! [00:43:15] * AaronSchulz hrms [00:48:16] wikitech is full of horribly outdated and misleading cruft. [00:48:33] RobH: do you just learn that? [00:48:59] i thought wikitech was getting killed off [00:49:04] at last years berlin hackathon [00:49:18] its going to be merged with labsconsole wiki [00:49:33] so I am attempting to clean up the major pages we actually use, and get rid of some cruft pre-merge [00:49:48] though most will have to wait post-merge when we have a mass of volunteers who can edit it. [00:50:05] plus a lot of it is just wrong. [00:57:54] <^demon> Ok, let's do this! [00:58:03] meh [00:58:07] let's push it off another month [00:58:14] ;) [00:58:18] <^demon> That's not even funny at this point :p [00:58:21] ready when you are [00:58:38] realistically, you're really doing all of the work ;) [00:58:45] New patchset: Andrew Bogott; "Rework the RT manifests so it can be installed in Labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47026 [00:59:46] <^demon> Ryan_Lane: Ok, can you merge https://gerrit.wikimedia.org/r/#/c/48574/ now and update sockpuppet? 
[01:00:03] yep [01:00:11] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48574 [01:00:40] ^demon: done [01:01:25] <^demon> !log stopped puppet and gerrit on manganese and formey [01:01:27] Logged the message, Master [01:06:47] AaronSchulz: hey, https://bugzilla.wikimedia.org/show_bug.cgi?id=42133 [01:07:03] Reedy just opened an RT about creating containers for new wikis [01:07:09] so this is the next time I was referring to :) [01:07:17] heh [01:07:25] I got distracted at the end of the process... [01:09:47] paravoid: do you have dinner plans tonight? [01:10:13] uhm, kind of [01:10:23] paravoid: hmm [01:10:26] tomorrow? [01:10:32] paravoid: PM [01:13:24] <^demon> Ryan_Lane: https://gerrit.wikimedia.org/r/#/c/35562/ and https://gerrit.wikimedia.org/r/#/c/34516/, please? [01:13:51] ooh, style changes [01:14:08] New review: Ryan Lane; "Patch Set 2: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/35562 [01:14:09] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/35562 [01:14:29] New review: Ryan Lane; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/34516 [01:14:39] New review: Ryan Lane; "Patch Set 1: Verified+2" [operations/puppet] (production); V: 2 - https://gerrit.wikimedia.org/r/34516 [01:14:48] path conflict [01:14:54] needs rebase [01:14:58] on 2nd change [01:16:52] <^demon> Rebased. [01:17:05] <^demon> (Hmm, no notif? Will investigate in a bit) [01:17:22] is it down? [01:17:47] it's not working too well for me [01:17:58] <^demon> wfm? [01:18:07] when I click review it just sits there [01:18:13] saying "Working ..." 
[01:18:42] maybe it's just working hard [01:18:48] ;) [01:19:01] yeah, this is broken [01:19:07] I tried firefox and chrome [01:19:21] Ditto [01:19:49] GET https://gerrit.wikimedia.org/r/accounts/self/avatar?s=100 404 (Not Found) 7EC8574480641920105EFDDCBDBC26F2.cache.html:22797 [01:19:49] GET https://gerrit.wikimedia.org/r/accounts/self/avatar?s=26 404 (Not Found) 7EC8574480641920105EFDDCBDBC26F2.cache.html:22797 [01:19:49] Uncaught TypeError: Cannot read property 'length' of undefined 7EC8574480641920105EFDDCBDBC26F2.cache.html:11326 [01:20:13] The last error specifically from trying to review [01:22:02] <^demon> The 404s are expected, knowing how that feature works.. The last error is the problem. [01:23:07] <^demon> Hmm, was able to review https://gerrit.wikimedia.org/r/#/c/48487/ [01:23:54] another 405 [01:23:56] *404 [01:23:57] GET https://gerrit.wikimedia.org/r/projects/mediawiki%2Fcore/dashboards/default?inherited 404 (Not Found) [01:24:07] oh [01:24:14] did that change get applied? [01:24:18] and did apache restart? [01:24:24] <^demon> Yes. [01:24:28] :( [01:24:37] was hoping it would be an easy fix. heh [01:24:45] I can apparently review https://gerrit.wikimedia.org/r/#/c/48591/1 [01:25:01] <^demon> Maybe stuff stuck in your cache? Weird, but maybe. [01:25:52] nope [01:25:55] ^demon: Possibly related, I'm trying to submit a patch and 'The remote end hung up unexpectedly' [01:26:02] Working better in an incognito window [01:26:04] I reset safari and tried [01:26:24] <^demon> andrewbogott_afk: Gerrit has restarted several times during the process. [01:26:46] ^demon: he's having the issue right now [01:27:41] I can't fetch either, it seems [01:31:18] <^demon> I was just able to push a new patch to https://gerrit.wikimedia.org/r/#/c/28352/ [01:31:37] <^demon> And just fetched from a couple of repos. [01:31:54] <^demon> And just saw a review from Tyler. 
[01:32:11] some pages definitely aren't working, though [01:32:18] I was able to review and merge 1/2 of the changes [01:32:30] the other one I cannot [01:32:32] <^demon> Let's flush all caches. [01:32:38] ok [01:32:45] need me to do so, or can you? [01:32:52] <^demon> Just did. [01:32:55] ok [01:33:40] still not working [01:33:58] I'm resetting safari on each attempt as well [01:34:40] <^demon> Ahh, error for me in safari too. [01:34:43] <^demon> (was working in chrome) [01:35:08] it doesn't work for me in any browser [01:35:16] specifically this change: https://gerrit.wikimedia.org/r/#/c/34516 [01:36:21] <^demon> The heck is up with that change. [01:36:28] <^demon> Try some unrelated change? [01:37:02] https://gerrit.wikimedia.org/r/#/c/22698/ [01:37:19] that also has the problem [01:38:19] lots of changes have this problem [01:38:27] https://gerrit.wikimedia.org/r/#/c/43148/ [01:39:29] https://gerrit.wikimedia.org/r/#/c/47535/ [01:39:34] <^demon> The heck. I've not hit this anywhere in our testing. [01:44:18] ^demon: so, what to do? [01:44:26] <^demon> I'm looking still. [01:44:40] ok [01:46:42] <^demon> A-ha! [01:46:46] <^demon> There's a fix in master we need. [01:46:47] <^demon> https://gerrit-review.googlesource.com/#/c/42170/ [01:46:51] <^demon> Re-building now. [01:47:27] <^demon> Well, didn't know we needed this. But it'll fix the problem. [01:47:46] <^demon> (and it just went in 30m ago, so wouldn't have seen it) [01:48:34] heh [01:48:59] <^demon> So, we will be deploying HEAD after all :p [01:55:22] <^demon> Ryan_Lane: Working now. [01:55:31] eh? [01:55:36] oh [01:55:39] <^demon> It's working now. [01:55:40] <^demon> :) [01:55:49] you deployed a newer version without a package update? [01:56:00] mind updating the package? [01:56:07] <^demon> Yes, was going to. [01:56:08] <^demon> Once I unbroke gerrit. [01:56:10] I'll build it and push it into the repo [01:56:11] heh [01:56:11] ok [01:56:24] Oh, that's right. It's February 11. 
[01:56:32] <^demon> I was afraid of putting in a patch for operations/debs/gerrit that we couldn't review :p [01:59:31] heh [01:59:31] right [02:01:35] New review: Demon; "Patch Set 1:" [operations/debs/gerrit] (master) - https://gerrit.wikimedia.org/r/48594 [02:01:43] <^demon> Ok, updated package. [02:04:39] <^demon> Ryan_Lane: ^ [02:05:21] New review: Ryan Lane; "Patch Set 1: Code-Review+2" [operations/debs/gerrit] (master) C: 2; - https://gerrit.wikimedia.org/r/48594 [02:05:28] New review: Ryan Lane; "Patch Set 1: Verified+2" [operations/debs/gerrit] (master); V: 2 - https://gerrit.wikimedia.org/r/48594 [02:05:29] Change merged: Ryan Lane; [operations/debs/gerrit] (master) - https://gerrit.wikimedia.org/r/48594 [02:08:08] So… this is thought to work now? Or are other fixes still in the works? [02:08:29] <^demon> The review problem should be fixed now. [02:08:44] ^demon: I thought they were going to do something about the "needs verified" error kicking you out of the review screen [02:08:49] it's so freaking annoying [02:08:59] <^demon> I don't remember that. [02:09:11] <^demon> Nothing upstream I know of ever addressing that. [02:09:15] :( [02:09:39] <^demon> Can you review https://gerrit.wikimedia.org/r/#/c/34516/ and its child now? [02:09:49] <^demon> (both a please and "does it work" :) [02:10:07] Ryan_Lane: Would you have time today or tomorrow for me to pick your brain about how to deploy a 30-megabyte node_modules directory via git-deploy (but presumably not via git itself)? [02:10:32] RoanKattouw: sure. after the gerrit stuff is done I can [02:10:36] OK [02:10:43] Yeah deal with that first obviously :) [02:11:04] <^demon> Wrapping it up now, should be mostly up. 
[02:11:08] New review: Ryan Lane; "Patch Set 2: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/34516 [02:11:09] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34516 [02:11:19] <^demon> (Will need to restart one last time for 2 config changes to go live) [02:11:41] merged all the way through [02:13:21] <^demon> Changes live. [02:15:12] <^demon> Heh, glad we didn't put gitblit live as default. It's having problems on prod. [02:15:24] <^demon> Will sort that tomorrow. [02:15:29] ok. you should upgrade the package [02:15:37] I pushed it into the repo [02:17:53] <^demon> Done. [02:17:53] cool [02:18:59] <^demon> Man this package sucks. [02:19:08] yep [02:19:17] I think there's a native one in debian now [02:19:23] we should look at switching to that [02:20:10] <^demon> Yeah, worth looking at. [02:20:19] gerrit is down [02:20:29] I was just coming to ask if that was intentional. [02:20:42] https://gerrit.wikimedia.org/r/#/c/48583 isn't loading for me. I get "Service Temporarily Unavailable." [02:21:01] ^demon: ^^ [02:21:19] <^demon> I had to run puppet one last time because the package does stupid things. [02:21:23] ah ok [02:24:29] <^demon> Ok, package in place, and cleaned up after it with puppet. [02:24:33] <^demon> Everything's back up. [02:25:05] New review: Andrew Bogott; "Patch Set 2: Verified+2 Code-Review+2" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/48595 [02:25:07] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48595 [02:28:54] !log LocalisationUpdate completed (1.21wmf9) at Tue Feb 12 02:28:53 UTC 2013 [02:28:58] Logged the message, Master [02:33:11] <^demon> Hmm, replication plugin isn't loading. Will debug that now. [02:33:29] <^demon> Luckily plugin deployment doesn't require gerrit restart. 
[02:52:43] !log LocalisationUpdate completed (1.21wmf8) at Tue Feb 12 02:52:42 UTC 2013 [02:52:45] Logged the message, Master [03:16:07] <^demon> Ryan_Lane: I'm beat. I've got a couple of loose ends to tie up in the morning, but I think we're mostly in the clear. [03:16:10] <^demon> Night. [03:16:40] ^demon|away: night! [03:50:38] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [03:57:23] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [04:15:30] New review: Asher; "Patch Set 2: Verified+2 Code-Review+2" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/48129 [04:15:32] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48129 [04:18:32] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [04:18:32] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [04:18:32] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [04:19:35] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours [04:20:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:22:08] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.300 seconds [04:55:08] RECOVERY - MySQL disk space on neon is OK: DISK OK [05:26:47] RECOVERY - Puppet freshness on srv241 is OK: puppet ran at Tue Feb 12 05:26:37 UTC 2013 [05:34:55] New patchset: FastLizard4; "Allow setting of MySQL bind address in NovaInstance config" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48585 [05:52:35] PROBLEM - Puppet freshness on ms-fe1002 is CRITICAL: Puppet has not run in the last 10 hours [06:38:38] PROBLEM - Puppet freshness on mw37 is CRITICAL: Puppet has not run in the last 10 hours [06:43:37] PROBLEM - MySQL Replication Heartbeat on db32 is 
CRITICAL: CRIT replication delay 187 seconds [06:44:04] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 197 seconds [06:45:25] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds [06:45:43] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds [07:00:13] New review: FastLizard4; "Patch Set 1: Code-Review-1" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/48585 [07:44:49] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 184 seconds [07:46:37] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds [07:57:07] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [08:09:07] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 189 seconds [08:09:52] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 194 seconds [08:14:32] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 198 seconds [08:18:16] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [08:18:43] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [08:29:04] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [09:34:20] paravoid, around? 
[09:36:43] RECOVERY - Puppet freshness on amssq41 is OK: puppet ran at Tue Feb 12 09:36:29 UTC 2013 [09:41:04] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours [09:42:34] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 198 seconds [09:42:43] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 201 seconds [09:49:37] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 1 seconds [09:49:46] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [10:24:43] RECOVERY - Puppet freshness on spence is OK: puppet ran at Tue Feb 12 10:24:32 UTC 2013 [10:33:24] RECOVERY - Solr on vanadium is OK: HTTP OK HTTP/1.1 200 OK - 6435 bytes in 0.081 seconds [11:25:01] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 202 seconds [11:25:27] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 215 seconds [11:33:01] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [11:33:28] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [12:14:40] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 190 seconds [12:14:49] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 195 seconds [12:38:58] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [12:40:28] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [13:00:57] New review: Silke Meyer; "Patch Set 3:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48474 [13:38:17] would it be hard to move branch "production" in operations/puppet to be "master", now that we don't use "test" anymore and every other repo has "master"? 
[13:40:42] gerrit-wm: ping [13:41:51] New review: Dzahn; "Patch Set 1: Verified+2 Code-Review+2" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/48620 [13:41:53] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48620 [13:43:55] merging a change to nginx.conf.erb logformat that was sitting on sockpuppet [13:51:25] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [13:51:51] singer, heh, for real..just removed old stuff from there [13:52:19] ah, of course:) fixing [13:58:16] New review: Dzahn; "Patch Set 1: Verified+2 Code-Review+2" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/48623 [13:58:17] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48623 [13:58:39] gerrit-wm: you don't report new patch sets anymore, but you do report the reviews and merges [14:00:07] RECOVERY - Puppet freshness on singer is OK: puppet ran at Tue Feb 12 13:59:43 UTC 2013 [14:04:22] hashar: re: changing font size in gerrit.. it is also just hitting Ctrl++ [14:04:47] <^demon> mutante: new patch set hook is messed up, on my todo list this morning. [14:04:48] C++ ? [14:05:08] hashar: Ctrl and the + key [14:05:19] oh sorry [14:05:24] ^demon: ah, gotcha. thanks [14:05:31] mutante: well it used to be hardcoded to 8pt, now it is 9pt [14:05:35] but I guess we can make it dynamic [14:05:44] that would be another hack though, to fix the font-family: monospace; [14:06:21] i don't care much, just expecting people to then say it is too large now.:) [14:06:44] hehe [14:06:55] I will let them figure out another patch so :-] [14:07:03] ok:) [14:07:05] for now 8pt is too small for my eyes. 
[14:07:23] just saying, how large 8pt is, is under the control of the user anyways..on their computer [14:07:30] I can detect small fonts when I am not able to read the text without my glasses [14:07:38] yeah sure :-] [14:07:44] I use it multiple times [14:07:44] <^demon> hashar: Upstreaming our skin improvements makes people happy :) [14:08:08] Ctrl + + is actually the first thing I do on Wikipedia since the font is a bit too small there [14:08:18] ^demon: if that plays well sure [14:08:41] maybe the design team feels like doing gerrit CSS too,heh:) [14:08:56] <^demon> No, just give them to me in the forms of generic CSS to fix, and I'll upstream them. [14:09:39] btw.. video is online http://video.fosdem.org/2013/lightningtalks/How_to_hack_on_Wikipedia.webm [14:09:48] from Quim's talk at Fosdem [14:10:09] and all the others http://video.fosdem.org/2013/ [14:10:17] someone put it on commons..or i will [14:10:49] ^demon: reminds me I can't contribute to Gerrit cause of their end user agreement license [14:15:50] http://mirror.be.gbxs.net/video.fosdem.org//2013/maintracks/Janson/The_Keeper_of_Secrets.webm [14:19:28] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [14:19:29] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [14:19:29] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [14:19:48] hrmmm, is mutante not in SF? (i.e. why is he awake?) [14:19:59] no, i am in Germany [14:20:31] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours [14:20:43] good :) (but i thought you moved?) [14:20:50] i did, i am just visiting [14:21:07] i was in Europe for FOSDEM anyways [14:21:17] going back in time for S.F. 
hackathon [14:25:38] RECOVERY - Puppet freshness on professor is OK: puppet ran at Tue Feb 12 14:25:13 UTC 2013 [14:25:56] !log ran puppet on professor [14:25:56] Logged the message, Master [14:32:46] morning [14:34:09] morning! [14:40:34] mutante: I got an easy change for you :-D That adds a wikimedia package 'php-luasandbox' ensuring it is latest https://gerrit.wikimedia.org/r/48127 [14:41:02] mutante, you around? looking at 4513 - I'm pretty sure we have a reject posts from non members setting [14:41:41] hashar: you sure you want automatic updates ? [14:41:48] mutante: yup :-] [14:41:55] I can add a comment there though [14:42:07] Thehelpfulone: if you wanna reply to it, that would be nice:) [14:42:15] sure [14:42:15] oh i did [14:42:25] i just forwarded it to RT to be reminded to check for that sometime [14:43:20] New review: Dzahn; "Patch Set 2: Verified+2 Code-Review+2" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/48127 [14:43:21] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48127 [14:45:14] \O/ [14:45:36] mutante: just merge on sock puppet, I got it installed on the server already [14:45:54] hashar: already done. and watching gallium [14:46:01] hashar: unrelated "err: /Stage[main]/Misc::Docs::Puppet/Git::Clone[puppetsource]/Exec[git_pull_puppetsource]/returns: change from notrun to 0 failed: git pull --quiet returned 1 instead of one of [0]" [14:46:05] bah [14:46:21] I need to phase out that puppet doc stuff [14:46:28] should be made by Jenkins instead of via puppet [14:47:07] as long as we keep doc.wm :) [14:47:24] because i already referred people to it when being asked for puppet docs:) [14:52:05] mark: netapp tech will be here soon to replace main board on nas1001a [15:02:02] mutante: yeah doc.wikimedia.org is on the Jenkins host :-] [15:03:37] hashar: ..it reminds us that we should convert everything into modules.. 
then they would all show up in the upper left corner [15:03:48] I would do the module conversion [15:04:02] but then the patches would lay in Gerrit until they bit rot :-] [15:04:23] I did a few with Faidon already though [15:04:31] we even passed puppet-lint on the new manifests to have them clean [15:04:38] cool [15:05:23] like the huuuuge manifests/misc/contint.pp manifests has been split in smaller modules [15:05:26] one per software [15:05:37] example for jenkins https://gerrit.wikimedia.org/r/#/c/47664/5 [15:07:07] ah, i should do this with planet some time [15:13:29] cut/paste, rename a few class, run puppet-lint, fix, done :-] [15:21:40] mutante, resolved, although those brackets were me yesterday ;) [15:23:20] * aude wonders if someone can help us with refreshing wikibase localisation messages in wmf9? [15:23:49] no idea if only specific extensions can be refresehd without running entire rebuild localisation [15:24:12] Not really [15:24:19] oh really? [15:24:21] The files are just updated from scratch [15:24:37] It's quite a bit quicker now with Tim majorly improving the scap scripts [15:24:47] my preferences on itwiki say wbc-rc-show-wikidata-pref [15:24:50] Though.. [15:24:58] Doesn't LU only update changed messages anyway? [15:24:58] the message was renamed to wikibase-rc-show-wikidata-pref [15:25:08] no idea how it works [15:25:14] I think it does [15:25:26] I'll run it nowish [15:25:36] ok, thanks [15:25:59] it works fine on my test wiki but don't have to manually rebuild my localisation [15:26:06] heh [15:27:25] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [15:29:24] Running updates for 1.21wmf9 (on aawikibooks) [15:29:24] 0 MediaWiki messages are updated [15:30:05] * Reedy lets it continue anyway [15:30:48] hmm.... [15:30:54] aawikibooks has no wikibase [15:37:06] Thehelpfulone: re.. saw it now.. 
thank you:)) [15:37:10] it uses the extension-list in wmf-config [15:37:19] no problem :) [15:44:34] Ooh, ^demon getting Internal Server Error when using the reviewer autocomplete [15:44:55] <^demon> Yes. [15:45:07] <^demon> Pushed a fix upstream for it, will deploy at some point today [15:45:14] <^demon> (Some people hit it, some people don't, not all the time) [15:47:12] New patchset: Dereckson; "(bug 44032) Deploy Universal Language Selector to oldwikisource" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47732 [15:47:40] New review: Dereckson; "Patch Set 4: Code-Review+1" [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/47732 [15:53:34] PROBLEM - Puppet freshness on ms-fe1002 is CRITICAL: Puppet has not run in the last 10 hours [15:54:12] !log LocalisationUpdate completed (1.21wmf9) at Tue Feb 12 15:54:11 UTC 2013 [15:54:13] Logged the message, Master [15:58:22] RECOVERY - MySQL disk space on neon is OK: DISK OK [15:58:45] !log reedy synchronized php-1.21wmf9/includes/site/ [15:58:46] Logged the message, Master [15:58:49] PROBLEM - Apache HTTP on mw1151 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:58:55] ok, time to check preferences [16:00:17] !log mw1161 and mw1165 are both missing mediawiki deployment files [16:00:18] Logged the message, Master [16:00:30] New review: Demon; "Patch Set 1: Code-Review+1" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/48622 [16:00:38] New review: Demon; "Patch Set 1: Code-Review+1" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/48624 [16:01:17] <^demon> Someone mind merging those two changes ^ [16:01:21] please :-] [16:01:23] <^demon> Minor UI fixes, won't require a gerrit reboot. 
[16:01:33] !log reedy synchronized php-1.21wmf9/extensions/Wikibase [16:01:34] Logged the message, Master [16:01:40] PROBLEM - Apache HTTP on mw1149 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:01:42] good good [16:02:16] RECOVERY - Apache HTTP on mw1151 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.089 second response time [16:04:47] There's been a hike in processes, but a drop in network [16:04:47] Reedy: seems unrelated to wikibase but on itwiki i get a js error [16:04:47] Failed to load resource: the server responded with a status of 503 (Service Unavailable) http://bits.wikimedia.org/it.wikipedia.org/load.php?debug=false&lang=it&mod…port.style%7Cmw.PopUpMediaTransform&skin=vector&version=20130212T155337Z&* [16:04:47] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&s=by+name&c=Bits%2520application%2520servers%2520eqiad&tab=m&vn= [16:05:06] my enhanced changes and preferences don't do the js magic that they should [16:05:07] RECOVERY - Apache HTTP on mw1149 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.669 second response time [16:05:19] Ah, that's a bits apache [16:05:26] hmm [16:05:36] ah, nevermind! [16:05:39] * aude had disabled my js [16:05:42] duh [16:05:43] apergos: mark, paravoid, Ryan_Lane ^ Can someone have a look at the eqiad bits apaches please? [16:06:12] big spike in processes, drop in network [16:06:15] but i still see the error [16:06:15] And numerous 503 repors [16:06:28] :( [16:06:42] looking on enwiki which has no wikibase [16:06:50] Getting 503s as well [16:06:55] same thing [16:07:02] is the vector extension not updated properly [16:07:03] ? [16:07:15] or don't know what [16:09:35] ping Jeff_Green, mutante too.. 
[16:09:50] hey [16:10:31] PROBLEM - Apache HTTP on mw1152 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:10:40] PROBLEM - Apache HTTP on mw1149 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:10:45] ^ moar bits apaches [16:11:23] Jeff_Green: Major bump in processes on the eqiad bits apaches, but a drop in network traffic [16:11:29] Users reporting 503s [16:11:43] looking at mw1152 now [16:11:45] everything is broken, js and css wise [16:12:17] did we intentionally stop apache logging? nothing logged on mw1152since 12/23 [16:13:04] PROBLEM - Apache HTTP on mw1150 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:13:13] PROBLEM - Apache HTTP on mw1151 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:13:14] /etc/init.d/apache2: 55: [: nice: unexpected operator [16:13:56] !log restarted apache on mw1152 [16:13:58] Logged the message, Master [16:15:33] a lot of these but I don't know if they are normal [16:15:35] File does not exist: /usr/local/apache/common/docroot/bits/w [16:15:46] apergos: where? [16:15:46] RECOVERY - Apache HTTP on mw1152 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.556 second response time [16:15:54] and /usr/local/apache/common/docroot/bits/static-1.21wmf1 [16:16:04] mw1149 [16:17:43] we're logging somewhere? logs on 1152 are empty [16:17:44] I see them from earlier on so presumable [16:17:50] they aren't a symptom [16:18:06] /var/log/apache2.log [16:18:19] RECOVERY - Apache HTTP on mw1150 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.713 second response time [16:18:32] !log LocalisationUpdate completed (1.21wmf8) at Tue Feb 12 16:18:31 UTC 2013 [16:18:33] Logged the message, Master [16:18:45] ah. 
/me wonders why we moved it out of the /var/log/apache2/ [16:18:51] no idea [16:19:44] n real idea what's wrong over there either [16:20:30] i'm not sure what things normally look like but try an strace on an apache proc [16:20:42] writev(9, [{"HTTP/1.1 200 OK\r\nDate: Tue, 12 F"..., 449}, {"\37\213\10\0\0\0\0\0\0\3}\220\313j\3030\20E\367\375\nc\f\226\2U\36&\261k\323U"..., 301}], 2) = 750 [16:20:42] read(9, 0x7f54a6b81048, 8000) = -1 EAGAIN (Resource temporarily unavailable) [16:20:43] times({tms_utime=1047, tms_stime=147, tms_cutime=0, tms_cstime=0}) = 2969318869 [16:21:19] RECOVERY - Apache HTTP on mw1149 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.580 second response time [16:21:33] and it's back, did you do something? [16:21:38] no [16:21:49] only the apache restart on 1152 several minutes ago [16:21:54] meh [16:22:04] RECOVERY - Apache HTTP on mw1151 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.160 second response time [16:26:52] PROBLEM - Apache HTTP on mw1152 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:27:01] PROBLEM - Apache HTTP on mw1149 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:27:13] New patchset: Demon; "Further hook tweaks for gerrit" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48636 [16:27:42] hmm mutante Mailman is having some real issues saving config changes in chrome, the page takes forever to save and sometimes times out - if there anything you can check in logs? 
[16:27:46] PROBLEM - Apache HTTP on mw1151 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:28:31] RECOVERY - Apache HTTP on mw1152 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.142 second response time [16:28:41] RECOVERY - Apache HTTP on mw1149 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.620 second response time [16:29:25] PROBLEM - Apache HTTP on mw1150 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:29:25] RECOVERY - Apache HTTP on mw1151 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.409 second response time [16:32:36] ok here's a dumb q [16:33:13] why are squids 67 through 70 talking to mw1149? [16:34:20] PROBLEM - Apache HTTP on mw1152 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:35:05] RECOVERY - Apache HTTP on mw1150 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.908 second response time [16:36:22] Thehelpfulone: you've checked ganglia? [16:37:02] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [16:37:17] jeremyb_, nope, which source am I looking for? [16:37:38] PROBLEM - Apache HTTP on mw1149 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:38:37] try mchenry [16:38:38] i think [16:39:17] RECOVERY - Apache HTTP on mw1149 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.772 second response time [16:39:18] oh mailman's on sodium [16:39:35] PROBLEM - Puppet freshness on mw37 is CRITICAL: Puppet has not run in the last 10 hours [16:39:54] ganglia's sloowww [16:39:58] i can't be too much right now... 30-40 sec ping times :( [16:39:58] on GSM edge [16:39:58] in a basement [16:42:26] PROBLEM - Apache HTTP on mw1150 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:42:27] heh [16:44:08] does ganglia work for you aude? 
[16:44:14] RECOVERY - Apache HTTP on mw1150 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.787 second response time [16:44:16] it's taking forever to load for me [16:44:17] http://ganglia.wikimedia.org/latest/?c=Miscellaneous%20eqiad&h=sodium.wikimedia.org&m=cpu_report&r=hour&s=descending&hc=4&mc=2 [16:44:22] uh, link? [16:44:28] got it [16:44:33] you could look at ganglia about ganglia [16:44:34] Thehelpfulone: Loads fine for me. [16:44:48] with all the graphs James_F? [16:44:52] looks good [16:44:56] well i can see it [16:44:57] :-P [16:44:59] Thehelpfulone: Yes. [16:45:20] :( [16:45:43] can you see anything the could explain a Error 324 (net::ERR_EMPTY_RESPONSE): The server closed the connection without sending any data. in chrome? [16:46:52] * aude views everything via proxy in the US, although think it doesn't matter for ganglia [16:48:35] RECOVERY - Apache HTTP on mw1152 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.460 second response time [16:48:41] Thehelpfulone: maybe your problem with ganglia is the same as your problem with mailman? 
[16:49:46] hmm, ganglia's decided to work now [16:49:55] the maillman change works fine in IE [16:49:58] but not in Chrome [16:52:31] mailman* [16:54:08] PROBLEM - Apache HTTP on mw1152 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:55:47] RECOVERY - Apache HTTP on mw1152 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.013 second response time [16:57:23] New patchset: FastLizard4; "Allow setting of MySQL bind address in NovaInstance config" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48585 [16:58:21] PROBLEM - Apache HTTP on mw1151 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:58:38] RECOVERY - MySQL disk space on neon is OK: DISK OK [16:59:59] RECOVERY - Apache HTTP on mw1151 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.596 second response time [17:10:56] PROBLEM - Apache HTTP on mw1151 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:17:50] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 211 seconds [17:18:08] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 218 seconds [17:22:02] PROBLEM - Apache HTTP on mw1150 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:22:38] PROBLEM - Apache HTTP on mw1149 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:22:47] PROBLEM - Apache HTTP on mw1152 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:24:17] RECOVERY - Apache HTTP on mw1149 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.773 second response time [17:24:26] RECOVERY - Apache HTTP on mw1152 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.446 second response time [17:28:47] RECOVERY - Apache HTTP on mw1151 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.395 second response time [17:28:52] things coming back? 
[17:28:54] * aude hopes [17:28:58] they flap [17:29:00] we'll see [17:29:03] New patchset: Mark Bergsma; "Temp fix duplicate backend test_wikipedia" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48642 [17:29:59] PROBLEM - Apache HTTP on mw1152 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:30:07] !log catrope synchronized php-1.21wmf9/extensions/Wikibase 'Rolling back Wikibase' [17:30:09] Logged the message, Master [17:30:44] RECOVERY - Apache HTTP on mw1150 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.940 second response time [17:30:48] New review: Faidon; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/48642 [17:30:49] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48642 [17:34:20] PROBLEM - Apache HTTP on mw1151 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:35:14] PROBLEM - Apache HTTP on mw1149 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:35:23] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [17:35:59] RECOVERY - Apache HTTP on mw1151 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.405 second response time [17:36:44] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [17:36:46] New patchset: Mark Bergsma; "Temp block mobile.startup requests (feel free to revert later)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48645 [17:36:53] RECOVERY - Apache HTTP on mw1152 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.731 second response time [17:38:50] PROBLEM - Varnish HTTP bits on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:40:06] New review: Faidon; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/48645 [17:40:07] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48645 [17:41:41] PROBLEM - Apache HTTP 
on mw1150 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:41:51] New patchset: Demon; "Further hook tweaks for gerrit" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48636 [17:42:27] PROBLEM - Apache HTTP on mw1152 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:42:35] things seem better on itwiki now [17:45:17] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 181 seconds [17:45:35] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 185 seconds [17:45:35] PROBLEM - SSH on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:46:56] RECOVERY - Apache HTTP on mw1150 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.514 second response time [17:49:30] RECOVERY - Apache HTTP on mw1152 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.107 second response time [17:49:30] RECOVERY - Apache HTTP on mw1149 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.119 second response time [17:50:40] !log Forced mw1149-1150 to healthy in bits.eqiad varnish [17:50:41] Logged the message, Master [17:50:58] !log Blocking modules=mobile.startup requests with 503 on bits varnish (earlier) [17:51:00] Logged the message, Master [17:52:01] hey yaalllsss [17:52:18] can someone explain the puppet $realm variable to me? I understand what it is for, but not how it gets set [17:52:30] paravoid maybe knows? [17:52:43] it's not exactly a good time right now [17:52:46] hokay [17:52:47] np [17:52:53] we're in the middle of an outage :) [17:53:03] ottomata: i can help you [17:53:06] hahah TYPICAL! [17:53:06] heheh [17:53:11] yessuh! [17:53:34] !log powercycling niobium, SSH & console unresponsive [17:53:36] Logged the message, Master [17:53:37] yeah so, Jeff_Green, I'm trying to puppetize some things real nice [17:53:51] yup. 
i actually have to review where realm is set [17:53:56] i've got a kraken role class, and right now i'm testing it out on my local vm [17:54:01] but assumedly I could test it out in labs as well [17:54:09] right [17:54:10] I've seen case statements where configs are set conditioanlly based on $realm [17:54:15] yes [17:54:22] and labs has some interesting exceptions [17:54:32] for now, i'd really just like the realm not to be production on my local vm [17:54:40] i can use a default case for non-production [17:54:51] ottomata, see realm.pp [17:54:57] case $::realm { [17:54:58] "production": { [17:54:58] … } [17:54:58] default { [17:54:58] … everything set to localhost [17:54:58] } [17:55:01] yeah I saw that [17:55:11] but it sets $realm globally to 'production' if it isn't already set [17:55:37] and that is the only place in the manifests I could find that $realm was initiallized [17:55:38] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [17:55:49] so i'm guessing the labs instances have something special to include before realm.pp gets included [17:55:53] right, I think you should be able to set $realm in site.pp [17:55:56] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [17:55:57] so that $realm is already set to 'labs' by the time it gets there [17:56:15] RECOVERY - SSH on niobium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [17:56:27] Jeff_Green, like in a node, or globally? 
[17:56:36] node, but I may be wrong [17:56:41] RECOVERY - Varnish HTTP bits on niobium is OK: HTTP OK HTTP/1.1 200 OK - 633 bytes in 0.053 seconds [17:56:56] sec, spelunking [17:57:19] in a node, you'd have to fq it with $:: [17:57:20] but [17:57:24] $::realm = 'blabla' [17:57:37] Could not parse for environment production: Cannot assign to variables in other namespaces [17:57:39] in a node [17:57:40] why does gerrit randomly keep failing [17:57:41] and doing it globally [17:58:06] Cannot reassign variable realm [17:58:26] just says "working..." [17:58:38] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [18:03:27] ottomata: I don't see where realm is being set either [18:03:51] aside from that one place in realm.pp [18:03:55] right? [18:03:58] afaik it's either labs|production, so we'd be looking for where it's overridden to labs [18:04:06] yes agreed [18:04:51] hm [18:06:20] does labs use the same site.pp? I can't remember [18:06:35] !log Reverted Wikibase update earlier, now reapplying [18:06:36] Logged the message, Mr. Obvious [18:07:04] hmm, you know I doubt it, since there aren't any labs entries in there [18:08:00] on labs, you can apply puppet classes in instance settings [18:08:24] what's that mean, MaxSem? [18:08:34] $::projectgroup [18:08:34] ? [18:08:38] alternatively, use puppetmaster::self and edit site.pp to your liking [18:09:14] !log catrope synchronized php-1.21wmf9/extensions/Wikibase 'Un-rollback Wikibase' [18:09:15] Logged the message, Master [18:09:43] ottomata, e.g. 
https://labsconsole.wikimedia.org/w/index.php?title=Special:NovaInstance&action=configure&instanceid=3fa564df-2c63-4dcd-b9f9-9c677d7b99b4&project=mobile&region=pmtpa [18:09:49] ohohoh, yes ok [18:09:55] hm [18:09:56] k [18:10:50] http://i50.tinypic.com/34obxw0.png [18:10:56] !log aaron synchronized php-1.21wmf9/extensions/UploadWizard 'deployed 4548f9817d3d8025f4c19e279e9fb1f05b178087' [18:10:57] Logged the message, Master [18:12:17] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [18:14:03] New patchset: ArielGlenn; "torblock script needs --force or it fails after cache expires" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48647 [18:14:58] ah the new interface [18:15:13] New review: ArielGlenn; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/48647 [18:15:24] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48647 [18:18:08] PROBLEM - Host mw1085 is DOWN: PING CRITICAL - Packet loss = 100% [18:19:20] RECOVERY - Host mw1085 is UP: PING OK - Packet loss = 0%, RTA = 26.51 ms [18:20:50] !log nas1001-a swapping mainboard [18:20:51] Logged the message, Master [18:25:21] !log reedy synchronized php-1.21wmf9/extensions/Wikibase [18:25:22] Logged the message, Master [18:30:35] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [18:34:17] New patchset: Faidon; "Revert "Temp block mobile.startup requests"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48652 [18:34:39] New review: Faidon; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/48652 [18:34:59] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48652 [18:36:51] !log reedy synchronized php-1.21wmf9/extensions/Wikibase [18:36:52] Logged the message, Master [18:41:48] !log reedy synchronized php-1.21wmf9/extensions/Wikibase [18:41:49] 
Logged the message, Master [18:42:51] RECOVERY - MySQL disk space on neon is OK: DISK OK [18:44:01] New patchset: Demon; "Update gerrit for latest build" [operations/debs/gerrit] (master) - https://gerrit.wikimedia.org/r/48655 [18:44:39] New review: Demon; "Patch Set 1:" [operations/debs/gerrit] (master) - https://gerrit.wikimedia.org/r/48655 [18:47:28] New review: Siebrand; "Patch Set 1:" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/48444 [18:49:34] !log reedy synchronized php-1.21wmf9/extensions/Wikibase [18:49:36] Logged the message, Master [18:56:01] New review: Ottomata; "Patch Set 6:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48041 [18:56:02] New review: Andrew Bogott; "Patch Set 3:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48474 [19:00:16] PROBLEM - Apache HTTP on mw1152 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:00:24] notpeter: Since you are the on-duty monkey this week, could you take a look at https://gerrit.wikimedia.org/r/#/c/47103/ ? 
[19:00:27] FYI, deploying WikibaseClient to enwiki in a minute or 2 [19:00:51] PROBLEM - Apache HTTP on mw1150 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:00:52] New review: Silke Meyer; "Patch Set 3:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48474 [19:01:00] heads up, the mobile team is working on reducing the load.php fragmentation - will go live today [19:01:20] !log reedy synchronized wmf-config/InitialiseSettings.php 'Enable WikibaseClient on enwiki' [19:01:21] Ahm, guys [19:01:21] Logged the message, Master [19:01:25] yaya [19:01:27] It looks like the bits apaches are falling over again [19:01:28] PROBLEM - Apache HTTP on mw1149 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:02:01] Although their load avg is still pretty low [19:02:12] PROBLEM - Apache HTTP on mw1151 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:02:30] RECOVERY - Apache HTTP on mw1150 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.732 second response time [19:03:06] RECOVERY - Apache HTTP on mw1149 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.643 second response time [19:03:38] Proc count is going up but load is staying below 10 [19:03:40] ugh [19:03:45] For some reason Ganglia believes 1149 is down [19:03:59] Hmm 1149's Ganglia collection is flapping, it's up though [19:04:16] And now it's 1151 [19:04:17] A fix for that AFTv5 fatal really needs deploying [19:04:22] Oh Ganglia, why do you suck so much [19:07:27] New review: Andrew Bogott; "Patch Set 3: Code-Review+1" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/48585 [19:08:19] bits is having issues on en.wikipedia.org. 
[19:08:26] gj Susan [19:09:34] Request: POST http://en.wikipedia.org/w/index.php?title=New_York_City&action=submit, from 69.164.222.250 via cp1006.eqiad.wmnet (squid/2.7.STABLE9) to 10.64.0.138 (10.64.0.138) [19:09:38] Error: ERR_READ_TIMEOUT, errno [No Error] at Tue, 12 Feb 2013 19:08:53 GMT [19:09:41] can't save a page [19:10:01] OK WTF [19:10:07] Reedy: Can we revert Wikibase maybe? [19:10:11] All sorts of things are exploding [19:10:12] Revert what? [19:10:15] on enwiki? [19:10:17] Ya [19:10:20] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&s=by+name&c=Bits+application+servers+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 for instance [19:10:20] yeah [19:10:25] And aude's report above [19:10:51] i can't imagine why, when it works on itwiki, etc. [19:11:00] !log reedy synchronized wmf-config/InitialiseSettings.php 'Disable WikibaseClient on enwiki' [19:11:01] Logged the message, Master [19:11:04] but we certainly don't want there to be issues [19:11:15] We reverted Reedy's Wikibase update earlier [19:11:21] Because bits started going crazy right around the time he deployed it [19:11:28] aude, on smaller wikis the decrease in performance can be less noticeable [19:11:32] But we also blocked some mobile requests, so we're not sure which of those fixed it [19:11:33] yeah [19:12:01] RoanKattouw, mobile requests were like that for some time so it's not us [19:12:11] Yeah [19:12:42] RECOVERY - Apache HTTP on mw1152 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.386 second response time [19:12:53] we'll deploy a fix that will reduce the load though [19:13:28] PROBLEM - Apache HTTP on mw1150 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:13:34] http://it.wikipedia.org/wiki/New_York is good, saved quick and has some of the linsk from wikidata [19:15:27] it's still painfully slow to load pages [19:15:35] * aude waiting to load http://en.wikipedia.org/wiki/New_York_City [19:16:03] ok, there it is [19:16:51] * aude 
attempting save again [19:16:54] it's never been fast, though to save a big page like that one [19:17:40] Request: POST http://en.wikipedia.org/w/index.php?title=New_York_City&action=submit, from 69.164.222.250 via cp1007.eqiad.wmnet (squid/2.7.STABLE9) to 10.64.0.138 (10.64.0.138) [19:17:44] Error: ERR_READ_TIMEOUT, errno [No Error] at Tue, 12 Feb 2013 19:17:27 GMT [19:17:47] again [19:18:24] not good [19:19:27] PROBLEM - Apache HTTP on mw1149 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:20:03] RECOVERY - Apache HTTP on mw1151 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.901 second response time [19:20:41] * aude able to save http://en.wikipedia.org/wiki/Computable_topology [19:20:45] smaller page [19:22:54] RECOVERY - Apache HTTP on mw1149 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.074 second response time [19:23:39] PROBLEM - Apache HTTP on mw1152 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:25:28] Reedy: aude: no more attempts at enwiki today. Let's make sure we're over the current outage, and then do our best to understand what the problem was before trying again [19:25:35] sure [19:25:46] anyway, people will blame us, even if it's entirely unrelated to wikidata [19:26:05] let's make sure things are settled and ok [19:26:13] I'll blame Domas. [19:27:06] RECOVERY - Apache HTTP on mw1152 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.220 second response time [19:30:29] Is someone looking at bits? [19:32:17] * aude can't load old revision of [[en:New York City]] [19:32:23] or it's taking forever........ [19:32:55] mark: still around? [19:33:00] who was looking at bits earlier? 
[19:33:53] Most of ops who were around [19:33:58] i saw that mark blocked modules=mobile.startup bits requests earlier [19:34:06] looking at one of the bits apaches [19:34:16] 75% of all requests hitting it are for mobile.startup [19:35:10] * aude trying old revision again [19:35:27] it worked but quite slow [19:36:27] look at 327 modules=mobile.startup..&version=$VERSION requests hitting a single bits apache in 20 sec, there are 158 different $VERSION numbers [19:36:51] so, that's pretty broken [19:36:55] anyone from mobile around? [19:37:05] binasher, we'll deploy a fix today [19:37:19] MaxSem: today as in within the next 10 minutes [19:37:20] ? [19:37:48] no, we were targetting at today's deployment window [19:38:06] RECOVERY - Apache HTTP on mw1150 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.110 second response time [19:38:15] PROBLEM - Apache HTTP on mw1152 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:38:16] we are planning to remove mobile.device.* from those URLs, that should reduce the fragmentation [19:38:27] we could potentially hotfix but we need to do some testing first [19:38:28] ok but this is impacting site availability as the nagios messages right now show [19:38:49] any quick alternatives to restoring mark's prior block? [19:38:51] PROBLEM - Apache HTTP on mw1151 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:39:07] and why are there so many different version numbers? [19:39:37] binasher, scap copying files at different times to different servers? 
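[editor's note] The 19:36:27 observation (327 mobile.startup requests in 20 seconds carrying 158 distinct version values) is the kind of cache fragmentation that can be measured from a list of request URLs. What follows is only a rough sketch of that count, not the tooling anyone in the channel used: the function name `version_fragmentation`, the `bits.example.org` host, and the sample URLs are all made up for illustration, and the module match is a crude substring test on the `modules` parameter.

```python
from collections import Counter
from urllib.parse import parse_qs, urlsplit


def version_fragmentation(urls, module="mobile.startup"):
    """Count requests for `module` and tally their distinct `version` values.

    Hypothetical helper: `urls` is any iterable of load.php-style request
    URLs. Returns (total_requests, Counter mapping version -> hits).
    """
    versions = Counter()
    total = 0
    for url in urls:
        query = parse_qs(urlsplit(url).query)
        modules = query.get("modules", [""])[0]
        # Crude substring match; real module lists may be packed differently.
        if module not in modules:
            continue
        total += 1
        versions[query.get("version", ["(none)"])[0]] += 1
    return total, versions


if __name__ == "__main__":
    # Invented sample data, only to show the call shape.
    sample = [
        "http://bits.example.org/load.php?modules=mobile.startup&version=20130212T155337Z",
        "http://bits.example.org/load.php?modules=mobile.startup&version=20130212T155338Z",
        "http://bits.example.org/load.php?modules=mobile.startup&version=20130212T155337Z",
        "http://bits.example.org/load.php?modules=site&version=20130212T155337Z",
    ]
    total, versions = version_fragmentation(sample)
    print(total, "requests,", len(versions), "distinct versions")
```

A high ratio of distinct versions to requests would suggest that most requests miss the cache and fall through to the backend apaches, which matches the symptom described in the log.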
[19:39:54] RECOVERY - Apache HTTP on mw1152 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.151 second response time [19:40:12] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 183 seconds [19:41:33] PROBLEM - Apache HTTP on mw1149 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:41:51] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 243 seconds [19:42:18] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours [19:43:39] PROBLEM - Apache HTTP on mw1150 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:44:39] binasher we're going to try and push two changesets which should help [19:45:00] RECOVERY - Apache HTTP on mw1149 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.423 second response time [19:45:06] ... did somebody break bits? [19:45:33] Coren: it's been broken the past hours, on and off [19:45:42] ops is working on it [19:45:55] RECOVERY - Apache HTTP on mw1151 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.792 second response time [19:47:06] PROBLEM - Apache HTTP on mw1152 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:47:12] Hi Coren. [19:47:22] * Coren waves. 
[19:50:33] PROBLEM - Apache HTTP on mw1149 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:51:27] PROBLEM - Apache HTTP on mw1151 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:52:07] MaxSem: please deploy the fix ASAP [19:52:37] mobile is breaking bits for everyone and I don't think you should wait an hour for that [19:52:52] if you do decide to wait, I'll have to block mobile for now to restore the rest of the site [19:53:59] holy fuck, gerrit doesn't merge [19:54:04] RoanKattouw_away: sure, I'll bring it up to ops more widely [19:54:45] New patchset: Pyoungmeister; "allowing all of our external IPs to hit home rsyncd" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48667 [19:55:18] WFM [19:55:48] RECOVERY - Apache HTTP on mw1149 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.413 second response time [19:57:30] MaxSem: please ack :) [19:57:42] paravoid: he's pushing code now [19:57:45] RECOVERY - Apache HTTP on mw1152 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.331 second response time [19:57:47] !log maxsem synchronized php-1.21wmf9/extensions/MobileFrontend [19:57:49] Logged the message, Master [19:57:51] ok, thanks [19:57:55] ^_^ [19:58:51] me starts fatalmonitor and grabs some popcorn [19:59:05] !log maxsem synchronized php-1.21wmf8/extensions/MobileFrontend [19:59:06] Logged the message, Master [19:59:11] done [19:59:24] paravoid, I see more hosts failing than usually [20:00:01] notpeter, another thing for RT once all this bits stuff is over, can you remove the "access request" tag now that it's been deprecated in favour of the access-requests queue? 
https://rt.wikimedia.org/Ticket/Display.html?id=4514 was created today with the tag (and I did the same thing when I created my access request ticket) [20:00:05] mw27, srv266, srv278, mw1041, mw1161, mw1165 [20:00:29] anybody who needs the hangout link for the analytics security review meeting (it's also in the invite): https://plus.google.com/hangouts/_/20382b2df1da85847705f77c1b5dc90101d8f3f3 [20:01:37] binasher, paravoid: MF was updated, any changes in caches? [20:02:34] etherpad: http://etherpad.wmflabs.org/pad/p/KrakenSecurityReview [20:02:46] Thehelpfulone: where do tags even exist in a ticket? [20:02:51] I've never used this feature, tbh [20:03:02] MaxSem: doesn't look it [20:03:09] PROBLEM - Apache HTTP on mw1149 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:03:10] does mobile varnish need flushing? [20:03:23] when you create a ticket https://rt.wikimedia.org/Ticket/Create.html?Queue=15 [20:03:29] it shows up there as something to select [20:03:48] oh! [20:03:48] and shows up under "custom fields" under Bugzilla ticket on the actual ticket [20:04:02] you mean remove the tag all together [20:04:03] sure [20:04:03] binasher, probably - we've switched to full RL, but it needs another flush to require less flushes in the future:] [20:04:08] yeah please [20:04:26] although notpeter before you do, if there's any tagged as access request that are not in the access request queue [20:04:29] it might be an idea to move them [20:04:53] I literally have never been able to to figure out how to search by tags [20:05:00] or use them in any meaningful way [20:05:04] so I'm not too worried [20:05:06] PROBLEM - Apache HTTP on mw1152 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:05:12] !log flushed mobile varnish cache :( [20:05:13] Logged the message, Master [20:05:15] RECOVERY - Apache HTTP on mw1150 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.956 second response time [20:05:58] New review: Reedy; "Patch Set 1: Verified+2 Code-Review+2" 
[operations/mediawiki-config] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/48668 [20:05:58] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/48668 [20:06:10] MaxSem: bits apaches are still getting inundated with mostly mobile.startup requests [20:06:39] !log reedy synchronized wmf-config/CommonSettings.php [20:06:40] Logged the message, Master [20:06:54] binasher, now they should be less fragmented as they have one parameter less [20:07:19] is bits still 503'ing mobile.startup requests/ [20:09:21] RECOVERY - Apache HTTP on mw1149 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.737 second response time [20:10:16] RECOVERY - Apache HTTP on mw1151 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.196 second response time [20:10:33] that looks promising [20:11:00] I'm preparing another fix [20:11:10] MaxSem: FIX ALL TEH WIKIS [20:11:19] Thehelpfulone: DISABLING A CUSTOM FIELD: You can't delete a custom field, only disable it. [20:11:27] from http://requesttracker.wikia.com/wiki/ManualAdministration#DISABLING_A_CUSTOM_FIELD [20:11:34] sooooo, it would seem that I can't delete that tag ;) [20:11:45] i showed jro some of the requests still hitting bits apache, and apparently they appear to be the old style reqs still [20:11:50] hmm, one sec just checking something [20:11:52] despite the mobile varnish flush [20:14:27] PROBLEM - Host mw31 is DOWN: PING CRITICAL - Packet loss = 100% [20:14:54] RECOVERY - Apache HTTP on mw1152 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.118 second response time [20:16:15] RECOVERY - Host mw31 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms [20:19:50] !log maxsem synchronized php-1.21wmf9/extensions/MobileFrontend [20:19:51] New review: Pyoungmeister; "Patch Set 1:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47103 [20:19:51] Logged the message, Master [20:20:36] PROBLEM - Apache HTTP on mw31 is CRITICAL: Connection refused [20:21:38] !log maxsem 
synchronized php-1.21wmf8/extensions/MobileFrontend [20:21:40] Logged the message, Master [20:22:03] ok, looks better with the wmf8 deploy [20:22:08] going to flush mobile varnish again [20:22:11] New review: Pyoungmeister; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/48622 [20:22:19] !log flushed mobile varnish [20:22:19] Logged the message, Master [20:22:25] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48622 [20:22:28] [20:22:36] phewww [20:22:43] New review: Pyoungmeister; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/48624 [20:22:51] MaxSem: ok, things are looking a lot better [20:22:55] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48624 [20:23:21] notpeter, so 32 tickets in ops-requests use any tag, I'm guessing most of them are access request [20:23:46] Thehelpfulone: you get RT access? [20:23:49] New review: Ryan Lane; "Patch Set 2: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/48636 [20:24:00] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48636 [20:24:12] hashar, yeah I've had it for about a month now [20:24:14] ugh, another update to the package? [20:24:18] MaxSem: i'm seeing a 99.7% hitrate on bits for reqs with an m domain referrer now [20:24:22] Thehelpfulone: I guess that is going to be helpful :-] [20:24:22] err [20:24:31] indeed ;) [20:24:53] Thehelpfulone: ok. 
I looked around and couldn't find a way to remove that specific tag [20:24:56] and the docs [20:24:58] suck [20:25:00] !log reedy synchronized php-1.21wmf9/resources/mediawiki [20:25:01] I'm going to make a tic [20:25:02] Logged the message, Master [20:25:11] and assign to daniel in the hopes that he knows [20:25:18] heh [20:25:28] binasher, yeah - some of these were scheduled for deployment today, some were hacked right now [20:25:28] notpeter, on that link you gave me, does To disable queue-specific custom fields: [20:25:28] not work? [20:25:45] MaxSem: thanks for the right now hacking [20:25:45] New review: Ryan Lane; "Patch Set 1: Verified+2 Code-Review+2" [operations/debs/gerrit] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/48655 [20:25:45] Change merged: Ryan Lane; [operations/debs/gerrit] (master) - https://gerrit.wikimedia.org/r/48655 [20:26:01] nope. that would let you remove the "tags" field completely from a queue [20:26:16] and when I try to edit the tags object [20:26:16] it doesn't show me specific fields [20:26:19] er, options [20:27:24] notpeter, not sure if you've seen my messages before, but I've written a solution for Solr monitoring [20:28:28] notpeter, I think it might be a global custom field instead of a queue specific one (yeah it's weird) Configuration > Global > CustomFields maybe? [20:29:31] Thehelpfulone: still don't see what I'm searching for there :( [20:29:39] MaxSem: I saw the patchset, I'll take a look now [20:30:01] yeah RT's odd, I'll leave what I've figured out on the ticket you just created [20:30:11] Thehelpfulone: cool! 
thanks [20:32:53] RECOVERY - Apache HTTP on mw31 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.066 second response time [20:35:30] New review: Pyoungmeister; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/48667 [20:35:45] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48667 [20:36:47] sure no problem [20:37:23] New review: Pyoungmeister; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/48732 [20:37:37] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48732 [20:55:57] binasher: we're trying to debug why login is not working in MF stable - i think we're getting token mismatches, perhaps because the session cookie is not getting passed back to the apaches [20:56:32] New patchset: Hashar; "Add a --verbose parameter to mw-update-l10n" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46907 [20:56:33] binasher: presumably due to: [20:56:34] set req.http.X-Orig-Cookie = req.http.Cookie; [20:56:34] if(req.http.Cookie ~ "disable" || req.http.Cookie ~ "optin") { [20:56:34] /* Do nothing; these are the cookies we pass. [20:56:34] * this is a hack in the absence of X-V-O support [20:56:35] */ [20:56:36] } else { [20:56:37] unset req.http.Cookie; [20:56:38] } [20:56:50] in mobile-frontend.inc-vcl.erb [20:57:04] New review: Hashar; "Patch Set 3: Code-Review+1" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/46907 [20:57:07] do we need to add || req.http.Cookie ~ "session" to that? will that cause problems/ [20:57:09] awjr: is the session cookie being set for all users? or only for users who actually login? [20:57:29] binasher: well the session cookie gets set when you go to a form for csrf protection [20:57:35] so that would mean... 
[20:57:43] that would mean that the cookie gets set if a user hits the settings page [20:57:47] login form [20:57:50] or account creation form [20:58:05] (even if the user does not log in) [20:59:22] adding session to that would result in everyone with that cookie bypassing caching for all requests [20:59:32] :| [20:59:35] can login use a different cookie? [21:00:04] binasher: that might require us to deviate from using core login functionality [21:00:41] can we only bypass the cache for the session cookie for specific URLs? [21:01:02] such as? [21:01:11] Special:UserLogin [21:01:14] awjr, Special:Userlogin is localiseable [21:01:20] oh right [21:01:34] so users would be logged in on the login page but effectively not everywhere else? [21:02:26] binasher: a user will get a session cookie even if they're not logged in if they go to a form that has csrf protection [21:03:20] as you said above [21:03:31] but if they go to these forms, they will likely enable something that bypasses cache [21:04:03] New patchset: Parent5446; "(bug 39380) Enabling secure login (HTTPS)." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21322 [21:04:16] i guess we didn't run into this earlier because the settings page doesn't actually do a token check :| [21:06:12] MaxSem: we could potentially have the login/account creation forms set temporary cookies that would cause the user to bypass caching [21:06:21] awjr, I see testwiki_session=eb9ffef9faab58a796a55e5587183f2f after going to Special:MobileOptions [21:06:35] MaxSem: yes it's there, but it's not verified after form submission [21:06:42] ok, so on the regular site, foowiki_session bypasses caching [21:07:01] binasher: even if the user is not logged in? 
[21:07:25] yes but i don't think it gets set when users aren't logged in [21:07:39] the wikiLoggedOut cookie bypasses caching as well [21:07:40] binasher: it should when a user hits a form (eg login) [21:08:17] does just going to the mobile sidebar set the session cookie? [21:08:23] no [21:08:27] ok [21:08:52] maybe just allowing session thru is ok [21:09:02] yeah, enwiki_session will get set for non-logged in users as soon as they hit the login page [21:09:09] (on desktop site) [21:09:30] just keep in mind that setting it will bypass caching, so avoid setting it unless necessary [21:09:34] and… esi please :) [21:10:28] i think that will let us fix a lot of this cache bypassing stupidity, at least for the bulk of most pages [21:10:35] agreed [21:10:57] that has been totally on the back burner but maybe we can get that more prioritized [21:11:02] New patchset: Hashar; "gerrit: make submenus to be above logo" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48737 [21:11:42] ^demon: https://gerrit.wikimedia.org/r/48737 should fix the unclickable submenus due to the logo overriding them. [21:12:11] awjr: it's something that would make a big impact and pave the way for some great improvements to the non-mobile site as well [21:12:17] aye [21:12:40] binasher: should this do the trick? [21:12:40] - if(req.http.Cookie ~ "disable" || req.http.Cookie ~ "optin") { [21:12:40] + if(req.http.Cookie ~ "disable" || req.http.Cookie ~ "optin" || req.http.Cookie ~ "session") { [21:12:54] yup [21:13:43] New review: Pyoungmeister; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/48737 [21:13:51] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48737 [21:13:59] <^demon> hashar: lgtm. 
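Note on the cookie rule discussed above: a minimal Python sketch of what the VCL condition does once "session" is added to it. This is not the deployed VCL. The function name `filter_cookie` and the tuple return shape are hypothetical, chosen only for illustration. The three patterns come from the diff at 21:12:40, and the unanchored match mirrors how `~` behaves in VCL.

```python
import re

# Patterns the mobile Varnish rule lets through to the backend.
# "session" is the one added in the diff above; the other two were already there.
PASS_PATTERNS = ("disable", "optin", "session")


def filter_cookie(cookie_header):
    """Return (cookie forwarded to the backend, saved original cookie).

    Hypothetical helper mirroring the VCL logic: the original header is
    always saved (the X-Orig-Cookie step), and the Cookie header is kept
    only if it matches one of the pass patterns anywhere in the string.
    """
    original = cookie_header  # stands in for req.http.X-Orig-Cookie
    if cookie_header is not None and any(
        re.search(pattern, cookie_header) for pattern in PASS_PATTERNS
    ):
        return cookie_header, original  # cookie passed through unchanged
    return None, original  # cookie stripped before the cache lookup


forwarded, saved = filter_cookie("testwiki_session=eb9ffef9")
print(forwarded)  # the session cookie now reaches the apaches
```

As binasher points out in the discussion, any request whose cookie is passed through this way also bypasses caching, so the match is deliberately kept to a short list of cookie name fragments.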
[21:14:04] New patchset: awjrichards; "Send *_session cookies to backend" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48738 [21:14:16] binasher: ^ [21:14:23] ^demon: I left a message to notpeter so he can merge it :-] [21:14:51] awjr, lol - I had to hit Ctrl+C when I saw your commit [21:15:46] New review: Asher; "Patch Set 1: Verified+2 Code-Review+2" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/48738 [21:15:54] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48738 [21:17:12] awjr: merged thru to the puppet master, will be out in 30 min or so [21:17:19] thanks binasher [21:18:21] robh: mw1201-1220 needed to be added to site.pp if you want to add w/your current cfg. already provisioned [21:23:16] awjr, is session cookie sufficient to keep users logged in? [21:23:24] @_@ [21:23:49] should be MaxSem [21:23:56] just like 'optin' is enough in beta [21:24:37] paravoid: netapp main board swap is finished [21:29:28] So, question. There are some wikibugs patches that haven't been reviewed because there's no way to deploy them, or at least, that's what I've been told. [21:29:46] Is there something pending that an ops member can review, so I can submit changes for wikibugs again? [21:30:25] <^demon> marktraceur: We've been talking about getting rid of wikibugs entirely since wm-bot also does BZ notifs. [21:30:39] Ah. [21:30:56] ^demon: In that case, when will that happen and where can I submit a patch for wm-bot? [21:31:24] <^demon> Haven't decided that much yet. Talk to Petr about contributing, he maintains wm-bot on labs :) [21:31:49] Hrm. [21:32:15] Seems like the sort of thing we should have in Gerrit, if it's going to be a big part of our botting experience on IRC. [21:33:04] <^demon> It might be in a repo somewhere. [21:33:11] <^demon> I just can't remember all the repos. [21:33:19] <^demon> There's...too many of them [21:34:22] Heh. 
[21:37:51] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [21:39:42] ah [21:39:47] I should commit my lame wikibugs rewrite [21:39:51] code named fayot [21:40:47] marktraceur: also petan wrote another bot that consume RSS feeds [21:41:14] <^demon> mw-bot already consumes bz rss feeds, which is what I'd rather replace wikibugs with. [21:41:17] <^demon> Less bots is a good thing :) [21:41:21] hi. Are there issues on Meta right now? [21:41:35] ^demon: where is da mw-bot code ? [21:42:19] ah wm-bot maybe [21:42:24] that is the one Petan wrote yeah [21:42:43] <^demon> abartov: What sort of issues? Was just able to make an edit, everything looks fine. [21:44:33] bed time at last [21:44:36] see you tomorrow [21:44:42] RoanKattouw: lots of '"10.64.0.189:11211": SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY' [21:44:46] mc errors [21:44:52] Right [21:45:00] So definitely a memcached failure then [21:45:02] dewiki:gadgets-definition:6 [21:45:29] frwiki:resourceloader:filter:minify-css:7:f7e5882711b035029b30d43fb8bcdf98 [21:45:31] lots of keys [21:45:48] Could you try and correlate their times with Wikibase deployment times? Specifically 19:01-19:11 and 15:50-17:30 today [21:45:58] Yeah I'd expect a lot of resourceloader:filter stuff [21:49:39] 174 2013-02-12 15:57 [21:49:41] 1460 2013-02-12 16:53 [21:49:46] 7602 2013-02-12 17:02 [21:49:49] some spikes [21:50:18] RoanKattouw_away: wtf happened at 17:02? 
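Note on the correlation RoanKattouw asks for above: a small Python sketch of bucketing error timestamps per minute and checking each minute against the deployment windows. The function names are hypothetical, and the sketch assumes the error timestamps have already been parsed out of the memcached error log, whose format is not shown in this channel. It also assumes the windows quoted as "today" fall on 2013-02-12, the date shown in the counts.

```python
from collections import Counter
from datetime import datetime

MINUTE_FORMAT = "%Y-%m-%d %H:%M"


def errors_per_minute(timestamps):
    """Count errors per minute from an iterable of datetime objects."""
    return Counter(ts.strftime(MINUTE_FORMAT) for ts in timestamps)


def in_any_window(minute, windows):
    """True if a 'YYYY-MM-DD HH:MM' minute falls inside any (start, end) window."""
    moment = datetime.strptime(minute, MINUTE_FORMAT)
    return any(start <= moment <= end for start, end in windows)


# Deployment windows quoted in the channel (date assumed, see note above).
windows = [
    (datetime(2013, 2, 12, 15, 50), datetime(2013, 2, 12, 17, 30)),
    (datetime(2013, 2, 12, 19, 1), datetime(2013, 2, 12, 19, 11)),
]

# The three spike minutes pasted in the channel.
for minute in ("2013-02-12 15:57", "2013-02-12 16:53", "2013-02-12 17:02"):
    print(minute, in_any_window(minute, windows))
```

A minute falling inside a window only shows that the spike and the deployment overlap in time. It does not show that the deployment caused the errors.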
[21:50:47] the rest of the minutes have 1-3 errors mostly [21:51:43] New review: MaxSem; "Patch Set 4: Verified+2 Code-Review+2" [operations/mediawiki-config] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/48060 [21:51:45] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/48060 [21:57:32] mlitn: lots of "Fatal error: Cannot use object of type Status as array at /usr/local/apache/common-local/php-1.21wmf9/extensions/ArticleFeedbackv5/api/ApiArticleFeedbackv5.php on line 363" [21:58:02] memcached would make sense as an issue [21:59:36] Solarium_Client_HttpException at /usr/local/apache/common-local/php-1.21wmf9/extensions/Solarium/lib/Result.php, line 98: Solr HTTP error: Java heap space java.lang.OutOfMemoryError: Java heap space (500) [21:59:38] * AaronSchulz chuckles [22:00:42] <^demon> AaronSchulz: That's already reported, I think they've got a fix in master for it. [22:01:03] PROBLEM - MySQL Idle Transactions on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:01:03] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:02:36] RECOVERY - MySQL Idle Transactions on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds [22:02:36] RECOVERY - MySQL Replication Heartbeat on db1047 is OK: OK replication delay 597.499160 seconds [22:04:31] Exception from line 352 of /usr/local/apache/common-local/php-1.21wmf9/includes/cache/MessageCache.php: Could not acquire 'commonswiki:messages:fr:status' lock. 
[22:04:49] !log creating metrics user on all s1-7 masters [22:04:51] Logged the message, notpeter [22:05:01] I wonder if a lot of processes pile up on that outer lock() before they throw an exception on the non-blocking inner lock [22:05:19] they wait for 10 seconds in that state [22:05:45] I suppose it could increase TIME_WAIT state connections on the mc servers...not sure to what extent [22:06:19] AaronSchulz: nah, lets roll out twemproxy [22:07:08] nah what? I was just describing a problem [22:07:12] heh [22:07:24] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 304 seconds [22:07:26] binasher: how many of these conns did you see? [22:07:31] or whoever was looking at mc [22:08:23] * AaronSchulz lols at Ryans "hate" rant :) [22:08:54] RECOVERY - MySQL disk space on neon is OK: DISK OK [22:10:51] RECOVERY - MySQL Slave Delay on db1047 is OK: OK replication delay 2 seconds [22:11:06] binasher: anyway, I guess local twemproxy might help in this case [22:11:08] AaronSchulz: For the AFTv5 error there's already a fix in gerrit, not sure if it was merged or not [22:12:03] AaronSchulz: i misread you saying "I suppose it" as "I suppose i" [22:12:04] binasher: btw, Yao was interested in talking to you if wanted to (the person who did that presentation) [22:12:23] AaronSchulz: yah, i had lunch with her [22:12:35] \o/ [22:12:37] AaronSchulz, thanks for the Solr heads up [22:13:01] Or so I thought.. [22:13:10] can someone restart jetty on solr2 and solr3 please? [22:13:46] MaxSem: doing so now [22:14:20] check_solr would've noticed it:) [22:14:35] heh [22:14:55] yeah, i still can't figure out why that freaking check always exits 0 even when it shouldn't... [22:15:01] it doesn't when run from the cli.... 
[22:15:39] notpeter, since I wrote a completely separate script, it shouldn't have these problems [22:15:48] yep [22:21:31] preilly: minor change https://gerrit.wikimedia.org/r/48748 [22:21:49] going to be working on the tests for sartoris soon [22:21:57] * MaxSem scaps [22:27:22] so are we pretty sure memcached was/is the issue we've had [22:27:25] ? [22:27:53] well, it had some role [22:28:00] hmmm, ok [22:28:22] hard to say, I wasn't one of the main people looking at this though [22:28:37] * aude nods [22:28:56] it makes sense that wikibase would not exactly help the situation, since we use memcached for some stuff [22:29:06] + resource loader uses it [22:29:26] and message caching, which RL depends on at times [22:29:31] yep [22:29:35] among most of everything else [22:29:41] * aude nods [22:38:33] PROBLEM - Puppet freshness on cp3022 is CRITICAL: Puppet has not run in the last 10 hours [22:38:49] !log maxsem Started syncing Wikimedia installation... : [22:38:50] Logged the message, Master [22:40:00] New patchset: Pyoungmeister; "db1047 => innodb_file_per_table = true" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48750 [22:41:57] New review: Pyoungmeister; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/48750 [22:42:05] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48750 [22:44:06] RECOVERY - Puppet freshness on db1047 is OK: puppet ran at Tue Feb 12 22:43:47 UTC 2013 [22:44:37] rfaulkner: Change has been successfully merged into the git repository. [22:44:50] preilly. thanks [22:45:32] rfaulkner: np and thank you for doing it [22:46:12] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [22:46:59] !log disabling innodb-flush-log-at-trx-commit=2 on db1047 w/asher's blessing. will schedule downtime to apply [22:47:00] Logged the message, notpeter [22:58:47] !log maxsem Finished syncing Wikimedia installation... 
: [22:58:49] Logged the message, Master [23:14:10] New review: Andrew Bogott; "Patch Set 3: Verified+2 Code-Review+2" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/48585 [23:14:19] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48585 [23:17:24] RECOVERY - MySQL disk space on neon is OK: DISK OK [23:36:28] ori-l: ping [23:44:18] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [23:44:36] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 1 seconds [23:54:07] py is doing a graceful restart of all apaches [23:54:49] !log awjrichards synchronized php-1.21wmf9/extensions/MobileFrontend/javascripts/specials/donateimage.js 'touch file' [23:54:49] that's a lie, for the record [23:54:50] Logged the message, Master [23:55:49] preilly: what do you think about using https://nose.readthedocs.org/en/latest/ for our unit tests on sartoris? [23:56:03] rather than the unittest module