[00:05:30] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [00:07:40] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Jan 31 00:07:37 UTC 2013 [00:08:31] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [00:08:40] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Jan 31 00:08:38 UTC 2013 [00:09:30] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [00:09:41] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Jan 31 00:09:36 UTC 2013 [00:10:30] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [00:39:52] New review: Tim Starling; "If they don't break the site, then why not run them every week? Why only once every 6 months?" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/33713 [00:46:50] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Jan 31 00:46:44 UTC 2013 [00:47:30] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [00:48:47] New review: Tim Starling; "Let me also say: the reason they broke whatever slave server they ran on was because special pages l..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/33713 [00:49:37] New review: Reedy; "Certainly running them more regularly on slaves in tampa that aren't in rotation would be fine and s..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/33713 [01:02:58] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours [01:14:34] mark: how much varnish config work is needed for the thumbnail hashing stuff, I gather it's not terribly high? Is it worth working on the MW side now? [01:17:17] what do you mean? [01:17:50] mark's hashing idea isn't hard to implement, I think he already proof-of-concepted it [01:18:14] but varnish won't scale for the tons of thumb sizes that we currently have [01:19:14] how many files actually have a bunch of thumbs? [01:19:55] who knows? [01:19:55] lots of them I'd say [01:20:05] it's only that case that matters [01:20:35] Container: wikipedia-commons-local-thumb.00 Objects: 642534 [01:20:40] Container: wikipedia-commons-local-public.00 Objects: 69107 [01:20:45] very very rough [01:21:15] what is that a measure of? [01:21:25] avg thumbs per original? [01:21:34] avg thumb count that is [01:21:44] close to 10:1 [01:22:12] says nothing about distribution though [01:22:23] but even 10:1 is a lot I'd say [01:22:31] so on average the hash-chain has 10 items [01:22:46] might work, not optimal though [01:22:52] nod [01:23:05] it's really the bad cases that might suck [01:23:16] indeed [01:23:18] like those pages with thumbs 0-999ox [01:23:21] *999px [01:23:48] *hash-chain would have 10 items [01:24:24] anyway, there are serious problems with current css using fixed-size thumbs though [01:24:35] how come? [01:24:49] did you see Timo's email? [01:25:10] I buy that problem more than the "client scaling sucks" one [01:26:34] aha [01:27:42] you also saw Tim's suggestion about versioned urls? [01:27:57] btw, mark pointed me to this the other day [01:27:58] http://p.defau.lt/?QpJXBqzvRcQtMaf_LuWsAg [01:28:08] 1-minute varnish sample he took [01:32:27] I don't see what versioning has to do with this problem [01:32:39] disable purging completely? [01:32:49] which problem? 
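An aside on the thumbs-per-original estimate quoted above: the "close to 10:1" figure comes from the two container counts for the single Commons ".00" shard (642,534 thumbnail objects vs. 69,107 originals). A minimal sketch of that arithmetic, with the counts hard-coded from the log rather than fetched live:

```python
# Rough thumbs-per-original estimate from the Commons ".00" shard quoted above.
# The counts are copied from the channel log, not queried from Swift.
thumb_objects = 642534      # wikipedia-commons-local-thumb.00
original_objects = 69107    # wikipedia-commons-local-public.00

ratio = float(thumb_objects) / original_objects
print("average thumbnails per original: %.1f" % ratio)   # ~9.3, i.e. "close to 10:1"
```

As noted in the discussion, this is an average only and says nothing about the distribution; the pages with thumbs at every size from 0 to 999px are the bad cases for a hash-chain of thumb sizes.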
[01:32:49] well, from a varnish pov it helps [01:33:06] but fixed sizes -if at all possible- would help in the media storage backend as well [01:33:14] there is the problem of purging being unreliable garbage now, which is compounded by items in cache and not swift [01:37:22] paravoid: oh, btw how about that swift patch? :) [01:37:49] heh, sorry about that [01:38:06] I've been swamped [01:38:11] how soon do you need that? [01:38:21] I'm really not looking forward into building our own swift packages [01:38:22] well, it's not urgent, it would be nice though [01:38:48] it seems like it will be a while before we are on ceph though [01:38:58] yeah, I've hit another wall lately... [01:39:14] apergos: around? [01:39:44] btw, I'll be there in less than two weeks [01:40:26] i think we could go to fixed sizes for images not included by css. only allow fixed sizes for new uploads and in new articles and hire thousands of cheap laborers via mechanical turk to fix the layout of every existing article [01:41:03] binasher: can you cr https://gerrit.wikimedia.org/r/#/c/46824/1 ? [01:41:19] haha [01:41:21] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [01:41:21] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [01:41:21] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours [01:41:22] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [01:41:52] binasher: that's what you get for making useful comments when you're supposed to be packing! [01:42:36] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46824 [01:42:45] bah! [01:42:55] AaronSchulz: that looks sensible, so i merged it [01:43:09] binasher: can you restart the runners? 
[01:43:18] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours [01:43:36] it should be fine, but I want to confirm that the hack is not needed anymore [01:44:03] i actually need to log off for a bit [01:44:20] * AaronSchulz looks at paravoid [01:44:21] i'll be back in an hour or so [01:44:21] I'm also hitting my bed [01:44:26] heh [01:44:38] I can do a simple restart [01:44:42] but I won't stay to babysit it [01:44:57] you tell me [01:45:06] go ahead, it won't take long to tell [01:45:14] and tim is around [01:48:30] done [01:54:56] paravoid: seems fine [01:55:03] thanks [01:56:50] paravoid: I'm actually a little surprised at how well the queue has been doing the last few days [01:57:51] eqiad boxes seem to be the same [01:59:36] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 181 seconds [01:59:37] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 185 seconds [02:01:00] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 213 seconds [02:01:01] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 213 seconds [02:08:03] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 223 seconds [02:08:21] https://wikimediafoundation.org/w/index.php?title=Peering&diff=87455&oldid=82659 [02:10:00] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 9 seconds [02:10:33] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 1 seconds [02:10:42] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [02:11:22] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours [02:11:23] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [02:11:39] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 1 seconds [02:12:01] Susan: hah! [02:12:08] what's the current percentage? [02:12:32] I believe the cost of keeping the sites up and running is about $2.5M/year. [02:12:40] And the total budget is about $35M? [02:12:46] So whatever that percent is. [02:15:06] are you amortizing ulsfo, eqiad, etc.? [02:17:09] ugh. echo's css is specific to vector, it seems [02:17:20] yeah, it's on my list to file [02:17:28] at least doesn't work with monobook [02:17:38] it's under the body? (z-level) [02:17:49] I'm trying to make it work with this: https://github.com/OSAS/strapping-mediawiki [02:17:52] i guess it's your strapping ? [02:17:53] right [02:18:07] but monobook at least should be tested :) [02:18:27] There's some bug with Echo + Monobook where I can't see any of the notifications. [02:18:32] They go under some element, I think. [02:18:41] Is that the bug you're hitting? It's kind of aggravating. [02:19:01] no. I can see notifications [02:19:10] but it's improperly positioned [02:19:15] s/body/content container/ [02:19:22] oh, that's not so bad [02:19:42] well, the position is hardcoded for vector [02:19:50] I can't seem to make my skin override it [02:20:21] specificity? [02:20:29] or !important [02:20:34] I tried !important [02:20:48] try specificity [02:21:02] this is nova-precise2? 
[02:28:19] !log LocalisationUpdate completed (1.21wmf8) at Thu Jan 31 02:28:18 UTC 2013 [02:28:21] Logged the message, Master [02:29:48] !log LocalisationUpdate completed (1.21wmf7) at Thu Jan 31 02:29:47 UTC 2013 [02:29:49] Logged the message, Master [02:34:22] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 191 seconds [02:34:27] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 190 seconds [02:34:33] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 193 seconds [02:36:15] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds [02:36:22] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds [02:36:33] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 3 seconds [03:25:24] PROBLEM - Puppet freshness on dataset1001 is CRITICAL: Puppet has not run in the last 10 hours [03:35:31] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 185 seconds [03:35:37] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 185 seconds [03:35:40] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 191 seconds [03:37:24] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds [03:37:31] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds [03:37:40] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds [04:07:45] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Jan 31 04:07:43 UTC 2013 [04:08:45] PROBLEM - Puppet freshness on dataset1001 is CRITICAL: Puppet has not run in the last 10 hours [04:08:45] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [04:08:55] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Jan 31 04:08:44 UTC 2013 [04:09:44] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [04:20:20] * jeremyb wonders what happened to mwalker's client [04:22:15] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Jan 31 04:22:11 UTC 2013 [04:22:44] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [04:59:46] New review: Tim Starling; "If you just want more output when you run it manually, why not add a --verbose option?" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/42970 [05:35:37] New patchset: Tim Starling; "Add a --verbose parameter to mw-update-l10n" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46907 [05:38:40] New review: Tim Starling; "I mean like Ic6db1d8a. I also took the liberty of suppressing non-error output from mergeMessageFile..." [operations/puppet] (production); V: 0 C: -2; - https://gerrit.wikimedia.org/r/42970 [05:40:04] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 182 seconds [05:40:04] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 185 seconds [05:40:21] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 183 seconds [05:41:03] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 11 seconds [05:42:03] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds [05:42:10] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds [06:02:24] AaronSchulz: at 3:30 a.m. 
I was definitely not around :-D [06:05:33] PROBLEM - MySQL Replication Heartbeat on db36 is CRITICAL: CRIT replication delay 192 seconds [06:06:33] RECOVERY - MySQL Replication Heartbeat on db36 is OK: OK replication delay 0 seconds [06:09:07] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [06:28:18] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Jan 31 06:28:15 UTC 2013 [06:29:07] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [06:36:08] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [06:36:58] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on port 8123 [06:41:58] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 197 seconds [06:42:08] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 196 seconds [06:42:27] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 190 seconds [06:42:28] RECOVERY - swift-account-reaper on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [06:42:54] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 199 seconds [06:46:08] PROBLEM - swift-account-reaper on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [06:46:58] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds [06:47:08] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds [06:47:09] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [06:47:42] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds [06:47:59] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on port 8123 [06:48:09] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds [07:15:18] PROBLEM - LVS Lucene on search-pool2.svc.eqiad.wmnet is CRITICAL: Connection timed out [07:17:08] RECOVERY - LVS Lucene on search-pool2.svc.eqiad.wmnet is OK: TCP OK - 0.001 second response time on port 8123 [07:18:21] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [07:33:42] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 182 seconds [07:34:02] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 195 seconds [07:34:02] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 195 seconds [07:34:22] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 209 seconds [07:35:41] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds [07:35:41] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds [07:35:59] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds [07:36:22] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds [07:59:33] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [08:00:33] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 1.000 second response time on port 8123 [08:07:42] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Jan 31 08:07:33 UTC 2013 [08:08:23] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [08:20:22] PROBLEM - Puppet 
freshness on stat1001 is CRITICAL: Puppet has not run in the last 10 hours [08:20:53] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Jan 31 08:20:44 UTC 2013 [08:21:22] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [08:24:50] PROBLEM - Puppet freshness on stat1001 is CRITICAL: Puppet has not run in the last 10 hours [08:34:32] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 181 seconds [08:34:35] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 183 seconds [08:36:14] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds [08:36:32] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds [09:02:54] New patchset: Hashar; "(bug 44041) adapt role::cache::mobile for beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44709 [09:26:56] PROBLEM - Host mw1085 is DOWN: PING CRITICAL - Packet loss = 100% [09:27:35] PROBLEM - Host mw1085 is DOWN: PING CRITICAL - Packet loss = 100% [09:28:17] RECOVERY - Host mw1085 is UP: PING OK - Packet loss = 0%, RTA = 26.57 ms [09:28:35] RECOVERY - Host mw1085 is UP: PING OK - Packet loss = 0%, RTA = 0.59 ms [09:36:25] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 184 seconds [09:37:13] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 209 seconds [09:37:25] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 217 seconds [09:39:25] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds [09:39:26] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds [09:40:40] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 1 seconds [10:23:35] New patchset: Reedy; "Remove strategyappswiki from wikiversions.dat" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46914 [10:24:13] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46914 [10:24:52] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Remove strategyappswiki [10:24:53] Logged the message, Master [10:36:06] PROBLEM - Puppet freshness on virt1000 is CRITICAL: Puppet has not run in the last 10 hours [10:37:09] PROBLEM - Puppet freshness on virt1000 is CRITICAL: Puppet has not run in the last 10 hours [10:38:12] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 187 seconds [10:38:35] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 199 seconds [10:38:45] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 200 seconds [10:38:48] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 199 seconds [10:39:45] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds [10:40:00] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds [10:40:35] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds [10:40:36] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds [11:02:09] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [11:03:28] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours [11:42:43] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [11:42:43] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 
10 hours [11:42:43] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [11:42:43] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours [11:42:43] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [11:42:44] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours [11:43:09] New patchset: Reedy; "Use overriding to muchly simplify wgNamespacesWithSubpages" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46826 [11:43:14] New patchset: Reedy; "Use overriding to muchly simplify wgNamespacesToBeSearchedDefault" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46873 [11:44:46] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours [12:11:11] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [12:12:13] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours [12:12:36] !log reedy synchronized php-1.21wmf8/extensions/Wikibase [12:12:37] Logged the message, Master [12:44:23] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [12:47:13] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 3.008 second response time on port 8123 [13:10:21] New patchset: Hashar; "contint: install bzr package on gallium" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46931 [13:11:10] New review: Hashar; "Already deployed manually on gallium. Be bold and merge on sight :-]" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/46931 [13:16:39] New review: Demon; "Don't merge, don't need after all." 
[operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/46931 [13:19:21] !log gallium: manually installed "bzr" [13:19:22] Logged the message, Master [13:19:28] !log gallium manually removed bzr: apt-get remove bzr python-keyring python-httplib2 python-launchpadlib python-zope.interface python-oauth python-bzrlib bzr python-simplejson python-configobj python-lazr.uri python-lazr.restfulclient python-wadllib [13:19:29] Logged the message, Master [13:21:22] New patchset: Hashar; "contint: install mercurial package on gallium" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46931 [13:21:45] !log gallium: installed mercurial manually (puppet change is {{gerrit|46931}} PS2) [13:21:46] Logged the message, Master [13:21:58] New review: Hashar; "Mercurial installed" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/46931 [13:22:45] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 183 seconds [13:23:15] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 193 seconds [13:23:40] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 200 seconds [13:23:58] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 208 seconds [13:26:40] PROBLEM - Puppet freshness on dataset1001 is CRITICAL: Puppet has not run in the last 10 hours [13:29:34] PROBLEM - SSH on pdf3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:30:24] RECOVERY - SSH on pdf3 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0) [13:51:11] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds [13:51:19] New patchset: Faidon; "Switch Ceph to the stable train repository" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46935 [13:51:38] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds [13:52:10] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46935 [13:54:45] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds [13:54:45] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds [14:09:43] PROBLEM - Puppet freshness on dataset1001 is CRITICAL: Puppet has not run in the last 10 hours [14:18:34] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [14:28:34] New patchset: Mark Bergsma; "Implement If-Cached request header feature" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46937 [14:29:56] New patchset: Mark Bergsma; "Implement If-Cached request header feature" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46937 [14:32:58] New patchset: Mark Bergsma; "Increase connection pool size" [operations/software] (master) - https://gerrit.wikimedia.org/r/46938 [14:32:58] New patchset: Mark Bergsma; "Fix socket.timeout exception in send_object" [operations/software] (master) - https://gerrit.wikimedia.org/r/46939 [14:32:59] New patchset: Mark Bergsma; "Sync deletes on the source to the destination" [operations/software] (master) - https://gerrit.wikimedia.org/r/44422 [14:33:34] Change merged: Mark Bergsma; [operations/software] (master) - https://gerrit.wikimedia.org/r/44422 [14:34:02] Change merged: Mark Bergsma; [operations/software] (master) - https://gerrit.wikimedia.org/r/46938 [14:34:13] Change merged: Mark Bergsma; [operations/software] (master) - https://gerrit.wikimedia.org/r/46939 [14:37:54] PROBLEM - Host mw1085 is DOWN: PING CRITICAL - 
Packet loss = 100% [14:38:09] is there a database of our hardware somewhere? [14:38:47] New patchset: Ottomata; "Now using tab as field delimiter in webrequest frontend cache logs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46942 [14:39:44] RECOVERY - Host mw1085 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms [14:39:58] New review: Ottomata; "I have tested this change on the log1.pmtpa.wmflabs instance." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/46942 [14:41:37] MaxSem: not really [14:41:41] what are you looking for? [14:42:08] a volunteer ask for configuration of our OSM servers [14:42:20] I know they're 720, but nothing else [14:43:57] 720? i don't think so [14:45:45] they're dell R610s, R620s, R410s [14:46:48] what's their exact configuration? [14:47:03] differs, which ones do you want to know? [14:47:20] databases, apaches, caches [14:47:48] mark, i'm thinking about RT 4433 (ACLs for analytics cluster) [14:48:28] and as far as I know, the analytics cluster does not need to initiate any connections to anything outside of itself (except maybe to brewster or outside internet to dl / apt things) [14:48:29] mark, he wants to know about all of these [14:49:29] is there a way to ACL it such incoming / established connections are allowed, but it analytics can't initiate anything to other VLANs? [14:50:16] yes [14:50:30] MaxSem: can you ask robh later? [14:50:42] sure [14:51:00] ok, cool, i'll update the RT with that request then. There are a few machines it would be handy to be able to initiate connections to (stat1, oxygen. etc.) but not necessary [14:51:04] I'll list those there [14:52:49] ottomata: specify which protocol, tcp port, destination then [14:53:03] for those exceptions? [14:53:14] for analytics to initiate? [14:53:30] MaxSem: why does he care? [14:53:45] we have no reason to not provide the information [14:54:02] he's interested in helping out with OSM [14:54:03] but I don't think it matters for anyone [14:54:21] and? [14:54:21] well, SSD vs HDD matters [14:54:31] RAM size matters [14:54:41] total storage size matters [14:55:04] how's this volunteer is going to help? [14:58:08] !log taking down db1047 for upgrade to precise [14:58:09] Logged the message, notpeter [14:59:18] !log reedy synchronized php-1.21wmf8/extensions/Wikibase [14:59:19] Logged the message, Master [14:59:21] ottomata: yes [14:59:40] k, tahnks [15:03:31] New patchset: Pyoungmeister; "switching db1047 to coredb reasearchdb role class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46948 [15:13:26] PROBLEM - Full LVS Snapshot on db1047 is CRITICAL: Connection refused by host [15:13:35] PROBLEM - MySQL disk space on db1047 is CRITICAL: Connection refused by host [15:13:39] LVS snapshot? [15:13:45] PROBLEM - mysqld processes on db1047 is CRITICAL: Connection refused by host [15:13:45] PROBLEM - MySQL Idle Transactions on db1047 is CRITICAL: Connection refused by host [15:14:05] paravoid: I'm upgarding that host [15:14:11] no I mean [15:14:17] that should be LVM, not LVS, no? [15:14:25] bahahaha [15:14:25] RECOVERY - Full LVS Snapshot on db1047 is OK: OK no full LVM snapshot volumes [15:14:29] indeed, sir. 
indeed [15:14:33] okay :) [15:14:35] RECOVERY - MySQL disk space on db1047 is OK: DISK OK [15:14:45] RECOVERY - MySQL Idle Transactions on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds [15:14:46] RECOVERY - mysqld processes on db1047 is OK: PROCS OK: 1 process with command name mysqld [15:15:11] using do-release-upgrade is weird.... [15:15:15] yes [15:15:54] VOLS="$(lvs | awk '$1 != "LV" && $6 > 90 {print $5 "=" $6 "%"}')" [15:15:58] hm, it might actually be lvs [15:16:02] I wonder what lvs is in that context [15:16:31] oh no [15:16:34] lvs is lvscan [15:16:44] er, no, but close [15:16:49] anyway [15:18:55] PROBLEM - Host db1047 is DOWN: PING CRITICAL - Packet loss = 100% [15:19:38] PROBLEM - Host db1047 is DOWN: PING CRITICAL - Packet loss = 100% [15:19:56] New patchset: Mark Bergsma; "First naive attempt at fetching objects from Varnish" [operations/software] (master) - https://gerrit.wikimedia.org/r/46950 [15:20:51] paravoid, so are you going to Copenhagen? [15:21:05] haven't I already replied? [15:21:06] RECOVERY - Host db1047 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [15:21:17] RECOVERY - Host db1047 is UP: PING OK - Packet loss = 0%, RTA = 26.53 ms [15:21:25] like three times now? :) [15:21:45] ah, my bad:) [15:21:59] bzzzt, stack overflow [15:23:09] but if this happening (I'm still waiting to hear a confirmation) [15:23:20] we should book tickets asap [15:23:41] airfares tend to go up [15:23:43] Don't be daft [15:23:45] WMF don't do that [15:23:45] PROBLEM - mysqld processes on db1047 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [15:24:20] heh [15:24:24] * notpeter sighs [15:24:31] Though, ~5 weeks is probably about right [15:24:31] Reedy speaks truth [15:24:36] they do [15:24:41] it was me who waited too long this time [15:24:46] haha [15:24:47] * ^demon headdesks [15:25:07] <^demon> Someone have a bit of rope I can hang myself with? [15:25:08] I've still nothing booked for the tech meet in SF at end of feb [15:25:11] i booked last night [15:25:29] PROBLEM - mysqld processes on db1047 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [15:26:04] ^demon, wassup? [15:26:32] <^demon> https://groups.google.com/d/topic/repo-discuss/Xs5NDXBvCFw/discussion :\ [15:26:39] MAN-IAD-SFO, SFO-FRA-MAN [15:26:40] Screw that [15:26:55] the other thing about booking late is not finding good flights [15:27:07] so for CPH I want to get a direct flight and there's only one via SAS [15:28:22] * notpeter hands ^demon a yahoo group [15:28:56] <^demon> notpeter: When you play in someone else's sandbox, you've got to use their toys ;-) [15:29:00] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46948 [15:29:02] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46937 [15:29:46] mark: I shall merge yours [15:30:09] ouch [15:30:11] we were merging together [15:30:20] did you merge? [15:30:35] yep [15:30:51] sweet! 
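For reference, the "Full LVS Snapshot" check being joked about above is the quoted one-liner `lvs | awk '$1 != "LV" && $6 > 90 {print $5 "=" $6 "%"}'`, i.e. flag LVM snapshot volumes whose copy-on-write space is over 90% used. Below is a rough Python equivalent of the same idea; it assumes the older default `lvs` column layout that the one-liner relies on (5th field Origin, 6th field Snap%), which is an assumption about that era's LVM output rather than something shown in the log.

```python
# Sketch of the "Full LVS Snapshot" check: report snapshot volumes whose
# copy-on-write space is more than 90% used, mirroring the quoted awk one-liner.
# Assumes lvs prints Origin in column 5 and Snap% in column 6 (old default layout),
# and that the script runs with enough privileges to call lvs.
import subprocess

def full_snapshots(threshold=90.0):
    out = subprocess.check_output(['lvs']).decode()
    full = []
    for line in out.splitlines():
        fields = line.split()
        if len(fields) < 6 or fields[0] == 'LV':
            continue  # header line, or a volume with no snapshot usage column
        try:
            used = float(fields[5])
        except ValueError:
            continue
        if used > threshold:
            full.append('%s=%s%%' % (fields[4], fields[5]))
    return full

if __name__ == '__main__':
    vols = full_snapshots()
    print('CRITICAL: ' + ' '.join(vols) if vols else 'OK no full LVM snapshot volumes')
```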
[15:30:53] not on stafford [15:31:07] will fix [15:31:32] done [15:31:41] cool [15:31:42] tanks [15:33:25] RECOVERY - MySQL Replication Heartbeat on db1047 is OK: OK replication delay seconds [15:33:35] RECOVERY - Puppet freshness on db1047 is OK: puppet ran at Thu Jan 31 15:33:07 UTC 2013 [15:34:56] RECOVERY - MySQL Replication Heartbeat on db1047 is OK: OK replication delay seconds [15:35:36] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [15:37:26] RECOVERY - swift-account-reaper on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [15:37:26] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on port 8123 [15:37:36] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 195 seconds [15:37:56] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 201 seconds [15:37:57] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 201 seconds [15:38:05] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 204 seconds [15:40:26] PROBLEM - swift-account-reaper on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [15:40:38] New patchset: Mark Bergsma; "Implement If-Cached Etag matching for upload frontends as well" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46951 [15:41:18] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46951 [15:47:51] paravoid: hi...ready? [15:49:08] yes [15:49:15] mark: can I? [15:49:34] cmjohnson1: I want to do it one box at a time if you don't mind [15:50:03] i agree...so swap card, reinstall than do another? [15:50:27] yeah go a head [15:50:34] no [15:50:35] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [15:50:36] no need for a reinstall [15:50:44] h710p/h710 are compatible [15:50:52] make sure swift doesn't auto-out them during the swap [15:50:53] er [15:50:54] ceph [15:50:55] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [15:50:58] it will auto-out them [15:51:04] change that? [15:51:05] but it should in them again when they boot up [15:51:17] no need for that, needless copying around [15:51:25] I'd like to simulate a machine failure [15:51:29] under normal ops [15:51:31] ok [15:51:38] now that we have sane raid controllers [15:51:44] maybe it's not too bad [15:51:45] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 1 seconds [15:51:53] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [15:51:58] I'd like to make sure we don't have the whol cluster locked up because of a machine swap [15:52:02] yes [15:52:59] cmjohnson1: so, no reinstall needed, but I think the controller might need an "import foreign setup" from its menu [15:53:06] most likely [15:53:22] so, whenever you're ready [15:53:27] go ahead and shutdown 1005 when you are ready [15:53:44] powering off [15:54:35] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [15:55:26] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on port 8123 [15:55:27] ha! 
[15:55:32] 4 stale pgs [15:55:41] their OSD pair was on the same box [15:55:46] how come [15:55:52] 48,57 for 2 of them, 58,51 for the others [15:55:55] PROBLEM - Host ms-be1005 is DOWN: PING CRITICAL - Packet loss = 100% [15:55:56] PROBLEM - Host ms-be1005 is DOWN: PING CRITICAL - Packet loss = 100% [15:56:18] swiftrepl is still going fine [15:56:59] oh [15:57:00] I know [15:57:18] I've changed "rule data" to be rack-based [15:57:22] but not the other rules [15:57:33] and that's 2.* pgs, i.e. pool 2 [15:57:58] rbd [15:58:05] RECOVERY - mysqld processes on db1047 is OK: PROCS OK: 1 process with command name mysqld [15:58:06] aha [15:58:19] good to know though [15:58:35] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [15:58:56] RECOVERY - mysqld processes on db1047 is OK: PROCS OK: 1 process with command name mysqld [15:59:25] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on port 8123 [16:01:05] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 2492 seconds [16:01:09] New patchset: Mark Bergsma; "Ensure If-Cached requests don't pollute the frontend caches" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46952 [16:01:49] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46952 [16:02:56] paravoid: ms-be1005 came back ok...no foreign cfg problem [16:03:10] even better [16:03:23] RECOVERY - Host ms-be1005 is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms [16:03:32] give me a min before you move on [16:03:39] let me know [16:03:44] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 2465 seconds [16:03:45] RECOVERY - Host ms-be1005 is UP: PING OK - Packet loss = 0%, RTA = 26.50 ms [16:03:48] it's recovering [16:04:18] swiftrepl is getting tons of 500s now :( [16:04:24] yeah [16:04:29] stuck at peering [16:04:45] 2013-01-31 16:04:36.519846 mon.0 [INF] pgmap v2211464: 16952 pgs: 34 active, 13106 active+clean, 929 active+recovery_wait, 2 stale+active+clean, 1642 peering, 1203 active+degraded, 9 active+clean+scrubbing, 27 active+recovering; 25504 GB data, 53498 GB used, 185 TB / 238 TB avail; 5022732/140246486 degraded (3.581%) [16:05:02] it's going, but slow [16:07:09] 0MB/s traffic on ms-fe1001 [16:07:14] that is problematic [16:07:16] yes [16:07:43] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Jan 31 16:07:33 UTC 2013 [16:08:34] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [16:08:34] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Jan 31 16:08:32 UTC 2013 [16:09:12] still peering [16:09:13] so strange [16:09:20] yes [16:09:25] it's not network limited now either [16:09:26] no [16:09:33] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [16:12:08] cmjohnson1: sorry for the delay, please bear with us :) [16:12:46] paravoid: i am following along...lmk ..thx [16:20:04] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Jan 31 16:19:58 UTC 2013 [16:20:33] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [16:22:55] cmjohnson1: next! 
:) [16:22:59] yes [16:23:16] awesome...paravoid plz shut it down so I know you are ready...thx [16:23:29] mark: you only won because I had to type "ceph osd set noout" first [16:23:35] no [16:23:40] I waited a bit for you first [16:23:42] but you were just too slow [16:25:11] PROBLEM - Host ms-be1006 is DOWN: PING CRITICAL - Packet loss = 100% [16:25:54] PROBLEM - Host ms-be1006 is DOWN: PING CRITICAL - Packet loss = 100% [16:27:03] RECOVERY - MySQL Slave Delay on db1047 is OK: OK replication delay 16 seconds [16:27:16] RECOVERY - MySQL Slave Delay on db1047 is OK: OK replication delay 4 seconds [16:31:03] RECOVERY - Host ms-be1006 is UP: PING OK - Packet loss = 0%, RTA = 0.48 ms [16:31:10] RECOVERY - Host ms-be1006 is UP: PING OK - Packet loss = 0%, RTA = 26.83 ms [16:31:29] paravoid ^ [16:31:37] yep [16:31:40] it's recovering [16:32:24] mark: ha, noout didn't help [16:32:27] not much [16:35:06] come on [16:35:07] peer [16:35:33] I can set nodown too [16:35:43] not sure what that will do, we'll see [16:36:02] it still will need to peer [16:36:04] so it won't help [16:37:01] so it was slightly faster this time [16:37:17] nothing to write home about [16:40:22] it's a bit concerning that recovery takes as long as the outage while the box wasn't out [16:40:28] so it only needs to recover changed objects [16:40:37] like 11k of them [16:41:05] note that it's recovery, not backfill [16:41:23] that's why [16:41:44] if we had three replicas it'd probably be faster too [16:41:59] I don't mind that [16:42:01] I do mind the outage [16:42:08] yes [16:42:15] no requests during the long peering process is a big problem [16:42:31] it's strange that peering is slow btw [16:42:59] hm, let me upgrade the other two monitors [16:43:04] and radosgw [16:44:37] mark: recovery threads is in its default value, "1" [16:44:51] up [16:44:57] up? [16:45:03] no problems during recovery [16:45:07] it's peering that does it [16:45:13] yes [16:45:29] so recovery could be made faster (but doesn't need to be) [16:45:29] that's why I said before that I don't mind recovery [16:45:34] agreed [16:47:16] New patchset: Ottomata; "Removing apache_site 000_default* definitions in webserver::apache::site define." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46954 [16:47:30] paravoid: ready for the last one? [16:47:40] not yet [16:47:46] waiting for recovery [16:47:48] but close [16:47:48] almost [16:48:09] mark: don't give the go ahead, I want to restart mon/radosgw with 0.56.2 first [16:48:30] you better be done then ;p [16:48:53] waiting for the dist-upgrade on ms-fe1002 [16:48:54] it's slllow [16:49:01] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46954 [16:50:40] RECOVERY - Puppet freshness on stat1001 is OK: puppet ran at Thu Jan 31 16:50:14 UTC 2013 [16:52:00] restarting radosgw [16:54:14] ok, both gw and mons done [16:54:27] 12 pgs to recover and we should be ready to go [16:55:02] cmjohnson1: next! 
:) [16:57:27] cmjohnso_: powering of ms-be1007 [16:57:34] thx [16:58:53] PROBLEM - Host ms-be1007 is DOWN: PING CRITICAL - Packet loss = 100% [16:59:56] New patchset: Mark Bergsma; "First naive attempt at fetching objects from Varnish" [operations/software] (master) - https://gerrit.wikimedia.org/r/46950 [16:59:58] PROBLEM - Host ms-be1007 is DOWN: PING CRITICAL - Packet loss = 100% [17:05:49] RECOVERY - Host ms-be1007 is UP: PING OK - Packet loss = 0%, RTA = 26.52 ms [17:06:33] RECOVERY - Host ms-be1007 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [17:06:53] let's see [17:07:00] damn [17:08:43] sigh [17:09:20] it always gets stuck for a little while [17:10:02] how crazy is it that it doesn't flinch when the whole box gets powered off [17:10:10] but it's a full scale outage when it's back on? [17:10:18] exactly [17:11:26] I think this is the issue we were having last week [17:11:36] when this is done I'm going to stop all of them [17:11:46] aaand we're back in business [17:11:58] why do you think so? [17:12:10] peering shouldn't take thsi long [17:12:31] does inktank say so? [17:12:34] yes [17:12:36] ok [17:12:53] i've only ever seen it take long [17:15:10] i'm going to play with my script [17:15:13] nevermind me though [17:17:08] can I stop swiftrepl? [17:17:43] cmjohnson1: thanks! [17:17:54] paravoid...yw [17:19:03] mark: can I stop swiftrepl? [17:19:56] I'll do it [17:20:31] so what happened last time is [17:20:36] set noup [17:20:41] stopped all the osds, started them again [17:20:48] then all of them booted up immediately [17:20:54] except 2-3 of them in random boxes [17:20:58] plus *all* of ms-be1001's [17:21:08] which were at 100% for several hours [17:21:10] weird [17:21:13] yes [17:22:45] go ahead [17:23:53] 5-7 rebooted immediately [17:24:10] 9-10-11 took something like a minute [17:24:13] 12 a bit more [17:24:46] uhm [17:24:49] they didn't peer? [17:24:49] wth? [17:25:04] all at the same epoch or whatever? [17:25:30] maybe [17:25:45] lemme know when I can restart [17:25:48] I will [17:28:58] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [17:29:47] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on port 8123 [17:29:52] mark: gotta minute? [17:54:11] mark: uhm [17:54:18] mark: ms-be1012 was quite slower than the rest [17:54:25] then I ran a df [17:54:31] what. the. hell. [18:01:01] paravoid: do you have a second? [18:01:25] kind of [18:02:56] just a question: this is going to go out soon: https://gerrit.wikimedia.org/r/#/c/46942/ it involves a change to the formatting of the nginx log formatting that is used for udp2log stuffs. based on your work earlier with ryan's nginx patch, do you think this is safe? [18:03:06] I am assuming yes, but, easier to ask first :) [18:03:55] I think it is [18:03:58] but stage it first [18:04:05] udp2log isn't exactly in a great shape, so you never know [18:04:10] I don't expect any trouble at all [18:04:13] but won't hur [18:04:16] *hurt [18:04:33] applying the patch into one of the boxes, restarting, then look if nginx is segfaulting, logging correctly etc. [18:04:39] mark: ready to restart swiftrepl again [18:04:57] paravoid: yep! sounds good [18:04:58] thanks! [18:07:03] ah! 
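The controller swaps walked through above lean on Ceph's noout flag ("ceph osd set noout", mentioned around 16:23) so that the OSDs on a powered-down box are not marked out and rebalanced while the hardware is being worked on, and on noup when all OSDs are stopped and restarted together. The small wrapper below is only an illustration of that pattern, not the procedure actually used; the ceph subcommands themselves are standard CLI.

```python
# Illustrative sketch of the maintenance pattern used for the ms-be100x swaps:
# set "noout" so a powered-off box is not auto-rebalanced, do the hardware work,
# then clear the flag and wait for the cluster to report healthy again.
# `ceph osd set/unset noout` and `ceph health` are standard Ceph commands;
# the work() callable is a hypothetical placeholder for the manual steps.
import subprocess
import time

def ceph(*args):
    return subprocess.check_output(('ceph',) + args).decode().strip()

def maintenance_window(work):
    ceph('osd', 'set', 'noout')       # don't mark stopped OSDs out during the swap
    try:
        work()                        # e.g. power off the box, swap the controller, power on
    finally:
        ceph('osd', 'unset', 'noout')
    while 'HEALTH_OK' not in ceph('health'):
        time.sleep(30)                # wait out peering/recovery before the next box
```

As the log shows, the expensive part is not the outage itself but the peering phase once the box comes back, which this wrapper simply waits out.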
[18:07:07] I know why ms-be1012 has so much data [18:07:08] damn [18:08:46] i've tested the tab change on nginx on a labs instance [18:08:48] and it seems to work ok [18:08:51] ok [18:08:54] go ahead then [18:08:57] New patchset: Mattflaschen; "Add enableTooltip gate, default false and en, test, test2 true." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46962 [18:09:33] New review: Mattflaschen; "Do not submit until E3 deployment window." [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/46962 [18:09:52] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [18:10:50] ottomata1: yeah, I'm not tooooo worried. but I did manage to take down https by merging a seemingly reasonable and innocuous logging change to nginx :) [18:11:13] haha, nice [18:11:52] ottomata1: what of this all are you going to do and what to you want me to do? [18:12:29] i think I can do everything except for the squid change, and making sure that puppet runs ok on varnish and nginx [18:12:40] I actually have to amend that patchset with some udp2log filter changes [18:12:41] and [18:12:48] we need to get the new udp-filter out first [18:12:54] we're about to get that [18:13:17] ok, cool [18:13:42] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on port 8123 [18:13:44] New patchset: Mattflaschen; "Add enableTooltip gate, default false and en, test, test2 true." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46962 [18:13:56] Until recently, you could go to https://noc.wikimedia.org/conf/highlight.php?file=db.php to see which slave databases belonged to which wiki. That page no longer has the information though. Where do you find that out now? [18:14:33] sorry, I meant master databases, not slave dbs [18:15:32] Now db.php just says "$secretSitePassword = 'jgmeidj28gms';". lol [18:16:37] kaldari: there's now db-eqiad.php and db-pmtpa.php [18:16:59] ah, thanks! [18:17:03] that works [18:17:34] I'll update the documentation on wikitech.wiki [18:18:54] i still want to change that to password123 ;) [18:19:16] <^demon> I don't think it'll stop people from e-mailing us thinking they found something secret :p [18:20:16] ^demon, as if you don't want to laugh at those emails :P [18:20:53] <^demon> I just copy+paste the same response I always send. [18:20:54] who knows something about VUMI which is running on silver and zhen? [18:21:02] PROBLEM - Puppet freshness on stat1001 is CRITICAL: Puppet has not run in the last 10 hours [18:21:03] <^demon> Something about "Congrats about finding our little easter egg!" [18:29:06] that is not true! puppet just ran on stat1001 an hour ago [18:36:10] New patchset: Ottomata; "Now using tab as field delimiter in webrequest frontend cache logs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46942 [18:37:18] mark: you should 'ceph osd tree' :) [18:40:43] New patchset: Ottomata; "Now using tab as field delimiter in webrequest frontend cache logs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46942 [18:46:55] looks nice [18:48:34] yeah! [18:49:15] let's see how long will it take to sync [18:50:20] mark, just to confirm, the sqstat script is no longer used to supply page view metrics to ganglia, so we can disable it? [18:50:53] drdee: i didn't even know it ever was, so wouldn't know [18:51:16] ok, who might know? 
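On staging the Gerrit 46942 delimiter change discussed above ("applying the patch into one of the boxes, restarting, then look if nginx is segfaulting, logging correctly etc."): one cheap check is to confirm that every emitted line still splits into the expected number of tab-separated fields. The field count below is a made-up placeholder, since the real webrequest format is not spelled out in this log.

```python
# Toy sanity check for a log delimiter change: read lines from a staged host's
# log stream and verify each one splits into the expected number of tab-separated
# fields. EXPECTED_FIELDS is hypothetical; substitute the real field count.
import sys

EXPECTED_FIELDS = 14  # placeholder, not the actual webrequest field count

def check(stream):
    bad = 0
    for n, line in enumerate(stream, 1):
        if len(line.rstrip('\n').split('\t')) != EXPECTED_FIELDS:
            bad += 1
            print('line %d has an unexpected field count' % n)
    return bad

if __name__ == '__main__':
    sys.exit(1 if check(sys.stdin) else 0)
```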
[18:51:19] just did ceph osd tell \* injectargs '--osd-recovery-max-active 20' [18:51:43] drdee: i don't know either, sorry :) [18:51:45] no idea who installed it [18:51:57] ok, thx [18:52:51] mark: so, ms-be1012 was the only one up in rack A2, so it has three times the data the other ones have [18:52:59] paravoid: does ceph -s show any problems on boxes? [18:53:17] * Aaron|home always has "pgs stuck unclean" types of message [18:53:25] mark: this means that it'll both be the most busy in syncing to the new boxes, as well as having most radosgw writes go there [18:53:36] Aaron|home: no [18:53:44] Aaron|home: this is a bug, I've had many of those [18:53:47] * Aaron|home wonders why the osdmap still things there are 3 osds [18:53:53] *thinks [18:54:07] health HEALTH_WARN 15751 pgs backfill; 617 pgs backfilling; 16353 pgs degraded; 1 pgs recovering; 1 pgs recovery_wait; 16370 pgs stuck unclean; recovery 69210298/211761342 degraded (32.683%) [18:54:11] monmap e13: 3 mons at {ms-be1003=10.64.0.175:6789/0,ms-fe1001=10.64.0.167:6789/0,ms-fe1002=10.64.0.168:6789/0}, election epoch 23810, quorum 0,1,2 ms-be1003,ms-fe1001,ms-fe1002 [18:54:15] osdmap e130500: 144 osds: 144 up, 144 in [18:54:17] pgmap v2216582: 16760 pgs: 388 active+clean, 10 active+remapped+wait_backfill, 4185 active+degraded+wait_backfill, 1 active+recovery_wait, 5 active+remapped+backfilling, 411 active+degraded+backfilling, 11556 active+degraded+remapped+wait_backfill, 201 active+degraded+remapped+backfilling, 2 active+clean+scrubbing+deep, 1 active+recovering; 25524 GB data, 54297 GB used, 208 TB / 261 TB avail; 69210298/211761342 degraded (32.683%) [18:54:23] mdsmap e1: 0/0/1 up [18:54:26] that's because I added more boxes now and increased replica count to 3 [18:55:28] * Aaron|home wishes there was more helpful info [18:56:01] it's just a personal install, but still, it would be nice to know how to make those go away or what the problem is [18:56:12] upgrade to 0.56.2 for starters [18:56:20] I'm on bobtail already [18:57:22] it's probably because I went from 2->3->2 osds [18:57:35] though that was ages ago [18:59:02] ceph can be like that :-) [19:02:13] mark: swiftrepl? [19:03:16] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [19:05:28] paravoid: i'm testing it in varnish mode [19:05:34] ah [19:06:59] paravoid: what's this for? [19:07:14] what? [19:07:21] "varnish mode"? [19:07:44] mark implemented a mode to sync thumbs using varnish from source, if it's in cache [19:08:17] https://ganglia.wikimedia.org/latest/?c=Ceph%20eqiad&h=ms-be1008.eqiad.wmnet&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2 [19:08:23] filling up the gigabit, how nice [19:08:44] there is still no deletion handling though? [19:09:23] what do you mean? [19:09:26] in swiftrepl? [19:09:27] no [19:09:32] there is [19:09:38] but it's not a journal [19:09:40] well, we're not running it like that now though [19:09:43] yeah [19:10:16] how does it work? [19:10:37] swiftrepl or varnish mode? [19:11:07] the deletion handling in swiftrepl [19:12:26] like rsync -d [19:12:34] ahh [19:12:56] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [19:14:46] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on port 8123 [19:15:51] ok it seems to work but hit rate is really low [19:16:09] let me get stats on that... [19:26:04] hmm apparently hit rate just sucks in the beginning [19:26:41] oh? 
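The "like rsync -d" answer above is a compact description of swiftrepl's deletion handling: it is based on comparing container listings rather than on a journal of changes. A generic sketch of that idea (not swiftrepl's actual code) is to diff the two listings and delete whatever exists only on the destination:

```python
# Generic sketch of "rsync -d"-style deletion handling for object containers:
# list source and destination, then delete destination objects that no longer
# exist on the source. The three callables are hypothetical stand-ins for
# whatever client library is in use; this is not swiftrepl itself.
def sync_deletes(list_src_names, list_dst_names, delete_dst_object):
    src = set(list_src_names())
    dst = set(list_dst_names())
    extra = dst - src
    for name in sorted(extra):
        delete_dst_object(name)
    return len(extra)
```

The trade-off implied in the log is that a listing diff catches deletions only on a full pass, whereas a journal would catch them as they happen.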
[19:26:54] and I shouldn't run it on hume [19:26:57] :) [19:28:04] hehehe [19:28:05] I guess not :) [19:28:19] why not on ms-fe1002? [19:28:27] can do [19:28:30] it's around 13-14% [19:28:35] so not sure that's worth it [19:28:39] wow [19:28:46] for enwiki thumbs [19:28:48] not sure about commons [19:28:50] lemme run that too [19:30:25] 0.9% down in 30' [19:30:54] ~18h for all in if the current rate persists (unlikely I think) [19:33:54] notpeter, ottomata: did you guys deploy that? [19:34:16] do any of you know how to do the squid change? [19:34:27] mark, asher, ryan and myself are all going to be away the next days [19:34:54] paravoid: yes, I do [19:35:00] oh [19:35:07] ah, great [19:35:07] we have not deployed yet [19:35:07] but yes [19:35:08] the docs are correct [19:35:17] and clear, even :) [19:35:17] great [19:35:28] http://wikitech.wikimedia.org/view/Squids [19:35:44] I knew us four knew how to do that [19:35:48] wasn't sure for anyone else [19:35:52] yeah, I have done so before [19:35:56] okay [19:36:01] (sorry 1on1) [19:36:11] I don't like to, as it's by far the easiest way to break the site.... [19:36:20] nonsense [19:36:23] but, totally can [19:36:23] DNS is a much better way [19:36:27] oh, true [19:36:29] and easier [19:36:34] no linting at all [19:36:34] but that's ryan's thing [19:36:46] with squid at least we have some rudimentary syntax checking [19:36:51] heh, true [19:36:59] :-) [19:37:04] databases are a much easier way [19:37:13] haha [19:37:45] so when is that changed rolled out? [19:37:46] today or tomorrow? [19:37:55] well, I meant "break the site" not "cause a catastrophic data loss and need to flee the country/internet" [19:38:01] hahahaha [19:38:17] <^demon> s/country\/internet/world/ [19:38:18] paravoid: well, the deployment window was about an hour ago ;) [19:38:25] paravoid: we are trying today [19:38:34] ^demon: basically, yeah :) [19:38:45] okay [19:38:49] I'll be around for a bit longer [19:39:04] I haven't packed yet [19:39:17] sad I won't be joining you :( [19:39:31] me too [19:39:59] moving to ms-fe1002 [19:40:36] does ganglia monitor loopback traffic? [19:41:57] don't think so [19:42:13] maybe have it connect to ms-fe1001 then? [19:43:06] there's still incoming traffic [19:43:17] true [19:43:29] so what we're seeing now is before I implement connection pooling [19:43:52] 6.8M/s? [19:43:54] New patchset: Mark Bergsma; "First naive attempt at fetching objects from Varnish" [operations/software] (master) - https://gerrit.wikimedia.org/r/46950 [19:43:55] New review: Spage; "LGTM but I've never made a config change." 
[operations/mediawiki-config] (master); V: 0 C: 1; - https://gerrit.wikimedia.org/r/46962 [19:44:02] seems so [19:44:34] but let it run for a while until it's actually copying instead of scanning [19:48:06] I suppose I could also just let varnish fetch the object from swift for me and not cache it [19:48:45] getting 500s now [19:49:02] nothing strange on the ceph side so far [19:50:20] Change merged: Kaldari; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46962 [19:53:25] https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=network_report&s=by+name&c=Ceph+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [19:53:28] wow [19:53:35] easily filled up gigabits [19:53:54] amazing [19:54:14] told you, we need more ;) [19:54:17] yep [19:54:25] you did [19:54:39] I doubt we can fill 2xGbE per box [19:54:51] but even a 30-50% increase would be nice [19:55:06] this of course isn't a very normal circumstance [19:55:30] no [20:00:08] New patchset: Mattflaschen; "Add GuidedTour, with gating." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46973 [20:05:12] 2% in an hour [20:05:13] cool! [20:05:22] 30.436% to go [20:05:44] I can't wait to do a rados bench on that [20:07:53] RECOVERY - Puppet freshness on stat1001 is OK: puppet ran at Thu Jan 31 20:07:42 UTC 2013 [20:07:53] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Jan 31 20:07:44 UTC 2013 [20:08:13] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [20:08:22] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Jan 31 20:08:20 UTC 2013 [20:09:13] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [20:09:22] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Jan 31 20:09:16 UTC 2013 [20:10:12] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [20:10:46] ottomata: why is there a js.orig.otto in ganglia-web? [20:12:45] one sec, 1on1 meeting still [20:12:53] k [20:15:47] here's a message i sent to leslie and peter [20:15:52] Yoooo Leslie, [20:15:52] Ganglia got all funky today.  It was 404ing on a bunch of js/ and css/ requests.  I'm not really sure what's going on, but to make it work again I copied /srv/org/wikimedia/ganglia-web-3.5.1/{js,css} to /srv/org/wikimedia/ganglia-web-3.5.4/.  The 3.5.4 directory didn't have a css/ directory, and I renamed the existingjs/ directory to js.orig.otto. [20:15:52] Things look ok, except for there is a single CSS file still 404ing: css/jquery.flot.events.css [20:15:52] Just thought you should know.  There is probably a better way to fix this. [20:15:53] Thanks! 
[20:16:00] 12/20/2012 [20:16:10] this was after a recent ganglia upgrade i think [20:16:22] at the same time I was trying to get analytics nodes into ganglia [20:16:49] Leslie's response: [20:16:49] Thanks - that's probably how I would fix it ;) [20:16:55] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 195 seconds [20:16:56] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 195 seconds [20:17:57] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 220 seconds [20:17:57] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 220 seconds [20:17:58] and from SAL 12/20: [20:18:00] 20:16 ottomata: copied js/ and css/ dirs from nickel:/srv/org/wikimedia/ganglia-web-3.5.1 to nickel:/srv/org/wikimedia/ganglia-web-3.5.4 [20:18:02] ^ paravoid [20:18:47] yep [20:20:30] New review: Spage; "The ordering is random, but so is the rest of the file :o)" [operations/mediawiki-config] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/46973 [20:20:41] New patchset: Mark Bergsma; "Quick and Really Dirty connection pooling for Varnish connections" [operations/software] (master) - https://gerrit.wikimedia.org/r/46976 [20:20:56] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46973 [20:22:52] cloudfiles idiots use a FIFO queue for connection pooling [20:23:05] whereas a LIFO queue makes a whole lot more sense [20:26:24] Just want to note I'm working on an E3 deployment. We have permission to use E2's window too. [20:28:07] mark: let me guess, everything times out instead of some stuff or deadlocks? [20:28:58] well there's no point in waiting with a connection that is idle [20:29:10] it makes a lot of sense to use the most recent one isn't it, highst probability it hasn't timed out [20:30:25] this is really becoming a filthy "get the job done" script hehe [20:30:43] where I had to rewrite or override half of the cloudfiles code i'm using [20:31:06] New patchset: Faidon; "Switch Ganglia to a 3.5.4 + security patches root" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46984 [20:31:21] mark: where is the queue code? [20:31:23] paravoid: so if you don't want it to use varnish, don't pass "-v" at the end [20:31:33] ok [20:31:34] thanks :) [20:31:42] gerrit is sssslllow today [20:31:51] Aaron|home: the varnish queue code is in varnish_object_stream_prepare [20:32:10] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46984 [20:32:20] mark: I mean the connection pool? [20:32:25] Aaron|home: or do you mean cloudfiles? bottom of connection.py [20:32:55] * Aaron|home looks at https://github.com/rackspace/python-cloudfiles/blob/master/cloudfiles/connection.py [20:33:26] should change that into 'LifoQueue' [20:33:55] there was something else wrong with it too, why I descended from it: [20:33:56] class WorkingConnectionPool(cloudfiles.connection.ConnectionPool): [20:34:08] couldn't login on non-rackspace cluster or something silly like that [20:35:09] varnish has only 8-9% hit rate for commons it seems [20:35:27] enwiki is double that [20:36:35] PROBLEM - Puppet freshness on virt1000 is CRITICAL: Puppet has not run in the last 10 hours [20:36:56] ironic [20:38:39] PROBLEM - Puppet freshness on virt1000 is CRITICAL: Puppet has not run in the last 10 hours [20:39:15] mark: so, wait [20:39:33] varnish has 8-9% of the content and yet it has 99% hit rates [20:39:52] should we even copy thumbs that are not in varnish? 
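On the FIFO-versus-LIFO point made earlier in this stretch (cloudfiles' ConnectionPool sits on top of Queue.Queue, and the suggestion is to "change that into 'LifoQueue'"): handing out the most recently returned connection first means long-idle connections sink to the bottom of the pool and quietly time out, instead of being dealt out round-robin. A standalone Python 2 sketch of that idea follows; it deliberately does not subclass the cloudfiles classes, and the `factory` argument is a hypothetical connection constructor.

```python
# Standalone LIFO connection pool sketch (Python 2, Queue.LifoQueue) illustrating
# the point made in the channel: the most recently used connection is the one
# least likely to have timed out, so hand that one out first.
# `factory` is a hypothetical callable that opens a new connection.
import Queue

class LifoConnectionPool(object):
    def __init__(self, factory, poolsize=10):
        self._factory = factory
        self._pool = Queue.LifoQueue(poolsize)

    def get(self):
        try:
            return self._pool.get(block=False)   # newest connection first
        except Queue.Empty:
            return self._factory()               # pool empty: open a fresh one

    def put(self, conn):
        try:
            self._pool.put(conn, block=False)
        except Queue.Full:
            pass                                 # surplus connections are simply dropped
```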
[20:40:02] if they're not used I mean... [20:40:03] for purging [20:40:17] what do you mean? [20:40:38] could be in other (squid) caches [20:40:40] but I agree [20:40:43] we should not copy all thumbs [20:40:59] this 1% that varnish doesn't have can be regenerated via imagescalers [20:41:04] maybe [20:42:13] so 85-90% of 50% (thumbs) of our storage is basically not used [20:42:31] okay, that's a bit of an overestimation [20:42:34] but still [20:42:54] yeah no surprise [20:43:34] ok [20:43:36] i've had enough [20:43:43] i'll be back briefly tomorrow [20:44:40] seriously, I think this whole waiting maybe be for nothing [20:44:55] we should seriously consider copying just what's in varnish [20:45:08] you have the tools now ;) [20:45:16] needs just a few tiny changes in swiftrepl [20:45:23] and do the rest on demand, either via imagescalers or via fetching from swift [20:46:02] I'm flying in like 9h :) [20:46:09] but, tuesday [20:46:12] sure [20:46:45] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 181 seconds [20:46:49] New patchset: Mark Bergsma; "Accept -v to enable fetching from Varnish" [operations/software] (master) - https://gerrit.wikimedia.org/r/47023 [20:46:55] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 186 seconds [20:46:56] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 185 seconds [20:47:18] ok [20:47:24] have a nice flight [20:47:40] !log deploying patched ganglia and removing .htaccess. Ganglia is public again! [20:47:42] Logged the message, Master [20:48:15] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 208 seconds [20:49:27] yay [20:49:38] yay indeed [20:50:02] paravoid: expiring object support would help, though when heavily used stuff falls out of varnish there would be minor cache stampedes [20:50:24] Aaron|home: you mean in general? swift's expiry support? [20:50:30] yes [20:50:33] swifts/cephs [20:50:36] yeah, we've discussed this before [20:50:45] I didn't find anything in radosgw about that [20:50:49] I don't think it's supported [20:50:58] but this is different though [20:52:04] mark's If-Cached support lets him basically fetch objects that are in Varnish cache but bail out if they're not [20:52:29] (and fail if they're stale according to ETag too) [20:52:48] http://ceph.com/docs/master/radosgw/swift/ [20:52:53] what I was saying is that we can just sync just *those* objects in ceph [20:52:53] not supported indeed [20:53:17] paravoid: I was talking about the fact that there is a crapload of unused stuff [20:53:25] not swiftrepl [20:53:30] okay [20:54:57] yeah I'm not terribly excited with expiring objects that way [20:55:17] j^: how hard would it be to make UW only set async=true if the file is > x bytes? [20:55:28] it's not any different than randomly purging e.g. 10% of our objects every now and then [20:56:24] well, not too much different [20:56:28] random is a lot faster ;) [20:56:47] paravoid: maybe a script could pick random files periodically and purge those not in varnish :p [20:56:58] yep [20:57:00] that could also work [21:01:17] !log deployed new version (0.3.22) of udp-filter on emery, locke and oxygen. This version accepts —field-delimiter flag. [21:02:50] hmm [21:02:50] no loggy? 
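A rough sketch of the "copy only what Varnish actually has" idea discussed above: probe the cache with the If-Cached-style conditional fetch and replicate only objects that come back from cache, leaving the rest to be regenerated on demand. The header name, the frontend URL and the status convention here are assumptions for illustration, not the real swiftrepl or VCL interface.

    import requests

    VARNISH_FRONTEND = 'http://upload-frontend.example.wmnet'  # hypothetical endpoint

    def is_in_varnish(path):
        # Ask the frontend whether it has the object cached; anything other
        # than a 200 is treated as a cache miss here.
        resp = requests.head(VARNISH_FRONTEND + path,
                             headers={'If-Cached': 'true'},
                             timeout=5)
        return resp.status_code == 200

    def maybe_replicate(path, copy_object):
        # copy_object stands in for whatever swiftrepl-like routine does the copy.
        if is_in_varnish(path):
            copy_object(path)
        # otherwise skip: the thumb can be rescaled on demand by the imagescalers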
[21:03:58] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours [21:04:18] PROBLEM - Varnish traffic logger on cp1044 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [21:05:31] New patchset: Andrew Bogott; "Rework the RT manifests so it can be installed in Labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47026 [21:07:09] PROBLEM - Varnish traffic logger on cp1044 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [21:07:36] <^demon> ottomata: Worked as of 20m ago :\ [21:08:10] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46942 [21:11:16] paravoid: can I close https://rt.wikimedia.org/Ticket/Display.html?id=4137 now that https://gerrit.wikimedia.org/r/#/c/46984/ got in? [21:12:51] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 26 seconds [21:12:58] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 3 seconds [21:12:59] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 1 seconds [21:13:27] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [21:13:33] andre__: just did, thanks for the pointer! [21:13:49] andre__: are you coming to fosdem? [21:13:57] paravoid, cool. I'll close the Bugzilla ticket then. [21:14:00] paravoid, yes, tomorrow [21:14:03] RECOVERY - Varnish traffic logger on cp1044 is OK: PROCS OK: 3 processes with command name varnishncsa [21:14:07] cool [21:14:11] see you there then :) [21:14:18] RECOVERY - Varnish traffic logger on cp1044 is OK: PROCS OK: 3 processes with command name varnishncsa [21:14:40] paravoid, great! You can find me outside tomorrow late evening, having a smoke, probably ;) [21:14:58] we'll both be otherwise busy I guess [21:15:04] but we'll find some time :) [21:17:16] !log deployed version 0.3.22 of udp-filter on emery, locke and oxygen.  This version accepts —field-delimiter flag. [21:17:45] grr [21:18:01] remove the -- [21:18:35] oh hm [21:18:49] !log deployed version 0.3.22 of udp-filter on emery, locke and oxygen.  This version accepts -F - -field-delimiter flag. [21:19:15] !log deployed version 0.3.22 of udp-filter on emery, locke and oxygen.  This version accepts the field-delimiter flag. [21:28:48] heya paravoid, you still around? [21:29:04] i'm trying to deploy squid confs for the first time, (with peter's help) and i'm getting somethign weird [21:36:30] PROBLEM - Backend Squid HTTP on sq63 is CRITICAL: Connection refused [21:36:49] PROBLEM - Backend Squid HTTP on amssq58 is CRITICAL: Connection refused [21:37:29] PROBLEM - Backend Squid HTTP on sq75 is CRITICAL: Connection refused [21:37:44] !log deloyed tab separator log format changes to squids, merged the corresponding puppet changes to varnish and nginx [21:37:45] Logged the message, Master [21:37:49] PROBLEM - Backend Squid HTTP on amssq34 is CRITICAL: Connection refused [21:37:57] uh oh, notpeter [21:38:01] revert [21:38:01] are those squid alerts my fault? [21:38:05] probably [21:38:06] ok [21:38:23] PROBLEM - Backend Squid HTTP on amssq58 is CRITICAL: Connection refused [21:38:59] PROBLEM - Varnish traffic logger on cp1036 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [21:39:13] ok, reverting... [21:40:19] ah! varnish too? 
[21:40:20] PROBLEM - Backend Squid HTTP on amssq34 is CRITICAL: Connection refused [21:40:20] PROBLEM - Backend Squid HTTP on sq63 is CRITICAL: Connection refused [21:40:29] PROBLEM - Backend Squid HTTP on sq75 is CRITICAL: Connection refused [21:40:32] no, don't worry about the varnish thing. that's just hte logger [21:40:35] !log reverted tab separartor change to squid [21:40:35] Logged the message, Master [21:40:51] i took out the tabs and made it just dash instead of X-CS [21:41:12] can we get a google hangout? [21:41:20] yup [21:41:59] PROBLEM - Varnish traffic logger on cp1036 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [21:42:30] PROBLEM - Backend Squid HTTP on sq57 is CRITICAL: Connection refused [21:42:30] PROBLEM - Backend Squid HTTP on amssq51 is CRITICAL: Connection refused [21:42:30] PROBLEM - Backend Squid HTTP on amssq39 is CRITICAL: Connection refused [21:42:49] PROBLEM - Backend Squid HTTP on amssq46 is CRITICAL: Connection refused [21:42:59] PROBLEM - Backend Squid HTTP on amssq42 is CRITICAL: Connection refused [21:43:17] ahhhhhhhhh [21:43:19] PROBLEM - Backend Squid HTTP on cp1012 is CRITICAL: Connection refused [21:43:20] PROBLEM - Backend Squid HTTP on cp1013 is CRITICAL: Connection refused [21:43:20] PROBLEM - Backend Squid HTTP on cp1009 is CRITICAL: Connection refused [21:43:24] notpeter: https://plus.google.com/hangouts/_/2da993a9acec7936399e9d78d13bf7ec0c0afdbc [21:44:05] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [21:44:05] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [21:44:05] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours [21:44:05] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [21:44:05] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours [21:44:06] PROBLEM - Backend Squid HTTP on amssq51 is CRITICAL: Connection refused [21:44:14] PROBLEM - Backend Squid HTTP on amssq46 is CRITICAL: Connection refused [21:44:23] PROBLEM - Backend Squid HTTP on cp1013 is CRITICAL: Connection refused [21:45:18] RECOVERY - Backend Squid HTTP on amssq58 is OK: HTTP OK HTTP/1.0 200 OK - 660 bytes in 0.236 seconds [21:45:35] PROBLEM - Backend Squid HTTP on sq57 is CRITICAL: Connection refused [21:45:36] PROBLEM - Backend Squid HTTP on cp1009 is CRITICAL: Connection refused [21:45:36] PROBLEM - Backend Squid HTTP on cp1012 is CRITICAL: Connection refused [21:45:36] RECOVERY - Backend Squid HTTP on amssq34 is OK: HTTP OK HTTP/1.0 200 OK - 1414 bytes in 0.526 seconds [21:45:44] PROBLEM - Backend Squid HTTP on amssq42 is CRITICAL: Connection refused [21:45:49] RECOVERY - Backend Squid HTTP on amssq58 is OK: HTTP OK: Status line output matched 200 - 660 bytes in 0.184 second response time [21:45:50] RECOVERY - Backend Squid HTTP on amssq34 is OK: HTTP OK: HTTP/1.0 200 OK - 1423 bytes in 0.185 second response time [21:46:02] PROBLEM - Backend Squid HTTP on amssq39 is CRITICAL: Connection refused [21:46:02] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours [21:46:29] RECOVERY - Backend Squid HTTP on amssq51 is OK: HTTP OK: Status line output matched 200 - 660 bytes in 0.182 second response time [21:47:29] RECOVERY - Backend Squid HTTP on sq57 is OK: HTTP OK: Status line output matched 200 - 495 bytes in 0.054 second response time [21:47:30] RECOVERY - Backend Squid HTTP on amssq39 is OK: HTTP OK: 
HTTP/1.0 200 OK - 1414 bytes in 0.475 second response time [21:47:32] RECOVERY - Backend Squid HTTP on amssq42 is OK: HTTP OK HTTP/1.0 200 OK - 1414 bytes in 0.528 seconds [21:47:41] RECOVERY - Backend Squid HTTP on amssq51 is OK: HTTP OK HTTP/1.0 200 OK - 660 bytes in 0.238 seconds [21:47:49] RECOVERY - Backend Squid HTTP on amssq46 is OK: HTTP OK: HTTP/1.0 200 OK - 1414 bytes in 0.470 second response time [21:47:51] RECOVERY - Backend Squid HTTP on amssq39 is OK: HTTP OK HTTP/1.0 200 OK - 1423 bytes in 0.237 seconds [21:47:51] RECOVERY - Backend Squid HTTP on amssq46 is OK: HTTP OK HTTP/1.0 200 OK - 1422 bytes in 0.238 seconds [21:47:59] RECOVERY - Backend Squid HTTP on amssq42 is OK: HTTP OK: HTTP/1.0 200 OK - 1423 bytes in 0.185 second response time [21:49:02] RECOVERY - Backend Squid HTTP on sq57 is OK: HTTP OK HTTP/1.0 200 OK - 495 bytes in 0.004 seconds [21:50:21] wooo! [21:50:24] no pages :) [21:50:53] lmk if there's anything the rest of the team can do to help [21:51:07] we're totes down to jump in, even if there's just grunt work [21:51:19] nope, it's recovering on its own [21:51:22] and never actually got bad [21:51:23] awesome [21:51:33] always happy to hear "there *is* no work" [21:51:35] ;) [21:56:14] RECOVERY - Backend Squid HTTP on sq75 is OK: HTTP OK HTTP/1.0 200 OK - 1249 bytes in 0.087 seconds [21:56:29] RECOVERY - Backend Squid HTTP on sq75 is OK: HTTP OK: HTTP/1.0 200 OK - 1258 bytes in 0.054 second response time [21:56:30] RECOVERY - Backend Squid HTTP on sq63 is OK: HTTP OK: HTTP/1.0 200 OK - 1249 bytes in 0.107 second response time [21:56:38] hey [21:56:43] what's going on? [21:56:45] * paravoid is packing [21:56:55] things are ok now, from what I understand... [21:57:05] squid ./deploy all will bork if there is a syntax error [21:57:14] but still push the change to puppet volatile [21:57:27] which causes puppet to do some stuff on the squids before you are ready [21:57:36] fix the script please [21:57:38] yeah.... [21:57:42] needs better ordering [21:57:44] aaand, notpeter, something about something cache rebuilding something? 
[21:57:53] RECOVERY - Backend Squid HTTP on sq63 is OK: HTTP OK HTTP/1.0 200 OK - 1258 bytes in 0.004 seconds [21:57:53] RECOVERY - Backend Squid HTTP on cp1012 is OK: HTTP OK HTTP/1.0 200 OK - 1249 bytes in 0.054 seconds [21:58:00] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 195 seconds [21:58:11] RECOVERY - Backend Squid HTTP on cp1009 is OK: HTTP OK HTTP/1.0 200 OK - 1249 bytes in 0.055 seconds [21:58:19] RECOVERY - Backend Squid HTTP on cp1013 is OK: HTTP OK: HTTP/1.0 200 OK - 1249 bytes in 0.002 second response time [21:58:20] RECOVERY - Backend Squid HTTP on cp1012 is OK: HTTP OK: HTTP/1.0 200 OK - 1258 bytes in 0.001 second response time [21:58:20] RECOVERY - Backend Squid HTTP on cp1009 is OK: HTTP OK: HTTP/1.0 200 OK - 1257 bytes in 0.002 second response time [21:58:21] RECOVERY - Backend Squid HTTP on cp1013 is OK: HTTP OK HTTP/1.0 200 OK - 1257 bytes in 0.053 seconds [21:58:22] so, the squid service is subscribed to the file in the volatile repo [21:58:30] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 204 seconds [21:58:30] which seems like a poor choice, as well [21:58:47] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 210 seconds [21:58:48] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 210 seconds [21:58:48] because it increases risk of squid restarting when you don't want it to [21:58:54] New patchset: Mattflaschen; "Add GuidedTour to extension-list" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47028 [22:00:37] we ready to try again? [22:00:41] yep! [22:02:57] New review: Spage; "Never done it before, but this looks like the rest of the lines." [operations/mediawiki-config] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/47028 [22:03:13] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47028 [22:04:01] New patchset: Jgreen; "switching fundraising dumps to --single-transaction" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47029 [22:04:02] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 15.6643542857 (gt 8.0) [22:04:34] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [22:04:58] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47029 [22:08:50] RECOVERY - Varnish traffic logger on cp1036 is OK: PROCS OK: 3 processes with command name varnishncsa [22:08:53] RECOVERY - Varnish traffic logger on cp1036 is OK: PROCS OK: 3 processes with command name varnishncsa [22:12:23] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours [22:17:01] New patchset: Pyoungmeister; "remove squid subscribe to config file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47030 [22:18:34] PROBLEM - Varnish traffic logger on cp1043 is CRITICAL: PROCS CRITICAL: 1 process with command name varnishncsa [22:18:37] New patchset: Ottomata; "Saving new tab separated output into sampled-1000.tab.log" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47031 [22:18:59] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47031 [22:19:56] Just started the scap for E3's deploy. 
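On the deploy-script ordering problem described above (a syntax error should stop the push to the puppet volatile repo, not be discovered after puppet has already shipped the file to the caches), a sketch of the safer order could look like the following. squid -k parse is squid's own config syntax check; push_to_volatile is a hypothetical stand-in for whatever the real ./deploy script does at that step.

    import subprocess
    import sys

    def config_is_valid(conf_path):
        # "squid -k parse -f <file>" exits non-zero on syntax errors.
        return subprocess.run(['squid', '-k', 'parse', '-f', conf_path]).returncode == 0

    def push_to_volatile(conf_path):
        # Stand-in for syncing the generated config into the volatile repo.
        print('would push %s to the volatile repo' % conf_path)

    def deploy(conf_path):
        if not config_is_valid(conf_path):
            sys.exit('refusing to deploy: %s has syntax errors' % conf_path)
        push_to_volatile(conf_path)
        # Restarting or reloading squid would then be an explicit, separate
        # step rather than a puppet subscribe on the config file.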
[22:20:31] !log deployed tab delimiter log format change on squids, varnishes and nginxes [22:20:32] Logged the message, Master [22:20:50] !restarted udp2log instances with new filters that use tab as separator [22:21:44] PROBLEM - Varnish traffic logger on cp1043 is CRITICAL: PROCS CRITICAL: 1 process with command name varnishncsa [22:24:28] hm... [22:24:34] RECOVERY - Varnish traffic logger on cp1043 is OK: PROCS OK: 3 processes with command name varnishncsa [22:24:35] restarted varnishncsa there [22:24:40] nice that's right! behave! [22:25:17] * Damianz gives ottomata a cookie [22:25:31] thanks, thought it was just me and the machines... [22:25:54] RECOVERY - Varnish traffic logger on cp1043 is OK: PROCS OK: 3 processes with command name varnishncsa [22:26:35] New patchset: Andrew Bogott; "Assume that irc messages are utf8." [operations/debs/adminbot] (master) - https://gerrit.wikimedia.org/r/47032 [22:26:53] !log mflaschen Started syncing Wikimedia installation... : Deploy E3Experiments, GettingStarted, GuidedTour (new), MoodBar, and WikiEditor [22:26:54] Logged the message, Master [22:27:31] are the bots talking to each other? [22:27:53] <^demon> logmsgbot says stuff to morebots. [22:28:02] hahah, lovr it [22:28:04] love it [22:28:37] I did a scap (from fenari), and I'm getting: [22:28:40] "mflaschen@mw28's password:" [22:28:44] <^demon> Bahhhh. [22:29:04] <^demon> superm401: Did you forward your agent to fenari when ssh'ing? [22:29:04] superm401 that can happen. Let me try [22:29:57] I can ssh without prompting. ^demon, superm401 also had problems fetching from gerrit on fenari. [22:30:21] <^demon> Sounds like you might've not forwarded your agent to fenari. [22:30:46] superm401 can you e.g. ssh to mw27 from fenari (in another terminal, obviously) ? [22:31:32] ^demon, no, I'm connected as "ssh -A fenari.wikimedia.org -v -v -v" [22:31:42] <^demon> :\ [22:31:46] spagewmf is right that I had problems earlier with the fetch. [22:32:01] Should I be specifying the username explicitly. [22:32:20] My local username, prod ssh username (mflaschen), and Gerrit are all different (mattflaschen), unfortunately. [22:32:32] <^demon> Ah, the gerrit username is going to be annoying then. [22:32:40] Does scap use that? [22:32:44] <^demon> No. [22:32:49] <^demon> But it'll make the fetching difficult. [22:33:08] <^demon> s/difficult/have to manually specify the remote/ [22:33:13] Yeah [22:33:27] Alright, I'll kill the scap and let spagewmf do it this time. [22:33:58] spagewmf, killed. Go for it. [22:34:32] OK. [22:35:24] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused [22:35:43] New review: Asher; "It would be better to continue subscribing to the conf but ensure the resulting action on file chang..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/47030 [22:35:57] alright I'm running the scap job. [22:36:51] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused [22:37:32] !log spage Started syncing Wikimedia installation... : take 2: Deploy E3Experiments, GettingStarted, GuidedTour (new), MoodBar, and WikiEditor [22:37:33] Logged the message, Master [22:37:34] PROBLEM - LDAP on virt0 is CRITICAL: Connection refused [22:37:53] PROBLEM - LDAPS on virt0 is CRITICAL: Connection refused [22:37:54] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 2.75354467626 [22:39:15] PROBLEM - LDAP on virt0 is CRITICAL: Connection refused [22:39:19] scap working for me, a few timeouts (srv266, srv278, mw1041). 
[22:40:00] PROBLEM - LDAPS on virt0 is CRITICAL: Connection refused [22:40:21] scap seems much faster! Did someone change it? [22:44:13] !log spage Finished syncing Wikimedia installation... : take 2: Deploy E3Experiments, GettingStarted, GuidedTour (new), MoodBar, and WikiEditor [22:44:14] Logged the message, Master [22:48:32] spagewmf: I made it network-aware [22:49:42] Fun times, labsconsole is down: http://paste.marktraceur.info/23 [22:50:16] is that a rare thing? [22:50:26] looks like it did about 3.2 GBps that time [22:50:33] It is for me... [22:50:34] well, Gbps rather [22:50:59] ganglia shows a spike of ~400 mbytes/s [22:51:31] New patchset: Cmjohnson; "adding db1051-1060 to dhcpd" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47034 [22:51:32] timstarling, Well thanks! scap finished in 10 minutes. \o/ [22:51:55] Holy eff. [23:01:15] New patchset: Ottomata; "Reenabling sqstat." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47035 [23:01:27] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47035 [23:03:16] Change merged: Cmjohnson; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47034 [23:03:29] TimStarling: speaking of spikes, this is the DB who run the worst of those special pages updates http://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&c=MySQL+pmtpa&h=db60.pmtpa.wmnet&tab=m&vn=&mc=2&z=medium&metric_group=ALLGROUPS [23:03:39] is that bad? [23:04:33] on a server that is not used for anything else, it is fine [23:10:05] does 5-10 % wait CPU mean heavily busy disk or is it too unreliable a piece of data [23:10:06] hm, 10-20 even [23:10:07] wait CPU is not really a kind of CPU, it's a complex thing to interpret [23:10:07] http://ganglia.wikimedia.org/latest/graph.php?h=db60.pmtpa.wmnet&m=cpu_report&r=week&s=by%20name&hc=4&mc=2&st=1359673416&g=cpu_report&z=medium&c=MySQL%20pmtpa [23:10:08] you see how when there was a user CPU spike, the top of the wait CPU seemed to stay level, rather than rising up as if it were stacked on top of the user CPU? [23:10:08] that's because there is I/O activity in that entire ~15% regardless of what the CPU is doing [23:10:09] I see [23:10:09] is that because the CPU was waiting for data and starts working only after a delay... or what? [23:10:42] wait CPU is any time when the CPU is idle and there is an I/O operation pending [23:10:43] PROBLEM - SSH on virt0 is CRITICAL: Connection refused [23:10:53] PROBLEM - Puppetmaster HTTPS on virt0 is CRITICAL: Connection refused [23:11:00] New patchset: Stefan.petrea; "Separtor change spaces => tabs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47036 [23:11:24] if the system had a single thread doing some I/O task, then that would be a reasonable metric of the time that thread spent waiting for the disk [23:12:02] http://etherpad.wmflabs.org/ is down. 
[23:12:03] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47036 [23:12:20] but mysql has lots of threads doing lots of different things [23:13:00] PROBLEM - Puppetmaster HTTPS on virt0 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:13:09] PROBLEM - SSH on virt0 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:13:10] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [23:13:20] PROBLEM - Host virt0 is DOWN: PING CRITICAL - Packet loss = 100% [23:14:03] disk utilisation (as seen with iostat -xd) is a related metric, it doesn't suffer from the stacking problem [23:14:16] !log rebooting virt0 in a fit of optimism and/or desperation [23:14:17] Logged the message, Master [23:14:19] I made a ganglia plugin to measure it, it's a pity it's not installed anywhere anymore [23:14:40] RECOVERY - SSH on virt0 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [23:14:47] I was just wondering why I didn't see it [23:14:49] RECOVERY - SSH on virt0 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [23:14:50] RECOVERY - Host virt0 is UP: PING OK - Packet loss = 0%, RTA = 26.59 ms [23:14:57] I thought it was on some other group only [23:14:58] but even that is a pretty poor measure of the actual I/O capacity of a server since throughput can continue to increase after 100% utilisation is reached [23:15:10] RECOVERY - LDAP on virt0 is OK: TCP OK - 0.027 second response time on port 389 [23:15:11] RECOVERY - LDAPS on virt0 is OK: TCP OK - 0.027 second response time on port 636 [23:15:15] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.004 second response time on port 11000 [23:15:21] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.027 second response time on port 11000 [23:15:30] RECOVERY - Puppetmaster HTTPS on virt0 is OK: HTTP OK: Status line output matched 400 - 336 bytes in 1.045 second response time [23:15:33] RECOVERY - LDAP on virt0 is OK: TCP OK - 0.010 second response time on port 389 [23:16:18] RECOVERY - LDAPS on virt0 is OK: TCP OK - 0.018 second response time on port 636 [23:16:18] RECOVERY - Puppetmaster HTTPS on virt0 is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.058 seconds [23:17:30] because a DB server has multiple spindles, so it can handle having more than one I/O request being active at a time [23:20:00] PROBLEM - Varnish traffic logger on cp1044 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [23:21:33] PROBLEM - Varnish traffic logger on cp1044 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [23:27:26] PROBLEM - Puppet freshness on dataset1001 is CRITICAL: Puppet has not run in the last 10 hours [23:27:53] RECOVERY - Varnish traffic logger on cp1044 is OK: PROCS OK: 3 processes with command name varnishncsa [23:28:00] RECOVERY - Varnish traffic logger on cp1044 is OK: PROCS OK: 3 processes with command name varnishncsa [23:33:20] PROBLEM - Auth DNS on labs-ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [23:38:51] RECOVERY - Auth DNS on labs-ns1.wikimedia.org is OK: DNS OK: 6.261 seconds response time. nagiostest.beta.wmflabs.org returns 208.80.153.219 [23:50:46] PROBLEM - Auth DNS on labs-ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [23:50:47] RECOVERY - Auth DNS on labs-ns1.wikimedia.org is OK: DNS OK: 2.715 seconds response time. nagiostest.beta.wmflabs.org returns
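The wait-CPU and iostat discussion above boils down to two small calculations: iowait is time the CPU sat idle while an I/O request was outstanding (read from /proc/stat), and disk utilisation, the iostat -xd %util column that the old ganglia plugin measured, is the share of an interval during which the device had at least one request in flight (read from /proc/diskstats). This is an illustrative sketch of those counters, not the original plugin; the device name and sampling intervals are examples.

    import time

    def cpu_deltas(interval=5):
        # Aggregate "cpu" line of /proc/stat: user nice system idle iowait irq softirq ...
        def snapshot():
            with open('/proc/stat') as f:
                return [int(v) for v in f.readline().split()[1:]]
        before = snapshot()
        time.sleep(interval)
        after = snapshot()
        return [a - b for a, b in zip(after, before)]

    def iowait_percent(interval=5):
        deltas = cpu_deltas(interval)
        return 100.0 * deltas[4] / sum(deltas)      # index 4 = iowait jiffies

    def disk_utilisation(device='sda', interval=10):
        # parts[12] of a /proc/diskstats line is "time spent doing I/Os" in ms.
        def io_ticks_ms():
            with open('/proc/diskstats') as f:
                for line in f:
                    parts = line.split()
                    if parts[2] == device:
                        return int(parts[12])
            raise ValueError('device %s not found' % device)
        before = io_ticks_ms()
        time.sleep(interval)
        after = io_ticks_ms()
        return 100.0 * (after - before) / (interval * 1000.0)

As noted in the conversation, %util saturates at 100% even though throughput on a multi-spindle array can keep rising past that point, because the counter only records whether any request was outstanding, not how many.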