[00:05:30] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [00:07:40] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Jan 31 00:07:37 UTC 2013 [00:08:31] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [00:08:40] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Jan 31 00:08:38 UTC 2013 [00:09:30] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [00:09:41] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Jan 31 00:09:36 UTC 2013 [00:10:30] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [00:39:52] New review: Tim Starling; "If they don't break the site, then why not run them every week? Why only once every 6 months?" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/33713 [00:46:50] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Jan 31 00:46:44 UTC 2013 [00:47:30] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [00:48:47] New review: Tim Starling; "Let me also say: the reason they broke whatever slave server they ran on was because special pages l..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/33713 [00:49:37] New review: Reedy; "Certainly running them more regularly on slaves in tampa that aren't in rotation would be fine and s..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/33713 [01:02:58] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours [01:14:34] mark: how much varnish config work is needed for the thumbnail hashing stuff, I gather it's not terribly high? Is it worth working on the MW side now? [01:17:17] what do you mean? [01:17:50] mark's hashing idea isn't hard to implement, I think he already proof-of-concepted it [01:18:14] but varnish won't scale for the tons of thumb sizes that we currently have [01:19:14] how many files actually have a bunch of thumbs? [01:19:55] who knows? [01:19:55] lots of them I'd say [01:20:05] it's only that case that matters [01:20:35] Container: wikipedia-commons-local-thumb.00 Objects: 642534 [01:20:40] Container: wikipedia-commons-local-public.00 Objects: 69107 [01:20:45] very very rough [01:21:15] what is that a measure of? [01:21:25] avg thumbs per original? [01:21:34] avg thumb count that is [01:21:44] close to 10:1 [01:22:12] says nothing about distribution though [01:22:23] but even 10:1 is a lot I'd say [01:22:31] so on average the hash-chain has 10 items [01:22:46] might work, not optimal though [01:22:52] nod [01:23:05] it's really the bad cases that might suck [01:23:16] indeed [01:23:18] like those pages with thumbs 0-999ox [01:23:21] *999px [01:23:48] *hash-chain would have 10 items [01:24:24] anyway, there are serious problems with current css using fixed-size thumbs though [01:24:35] how come? [01:24:49] did you see Timo's email? [01:25:10] I buy that problem more than the "client scaling sucks" one [01:26:34] aha [01:27:42] you also saw Tim's suggestion about versioned urls? [01:27:57] btw, mark pointed me to this the other day [01:27:58] http://p.defau.lt/?QpJXBqzvRcQtMaf_LuWsAg [01:28:08] 1-minute varnish sample he took [01:32:27] I don't see what versioning has to do with this problem [01:32:39] disable purging completely? [01:32:49] which problem? 
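An aside on the thumbs-per-original estimate quoted above: the "close to 10:1" figure comes from the two container counts for the single Commons ".00" shard (642,534 thumbnail objects vs. 69,107 originals). A minimal sketch of that arithmetic, with the counts hard-coded from the log rather than fetched live:

```python
# Rough thumbs-per-original estimate from the Commons ".00" shard quoted above.
# The counts are copied from the channel log, not queried from Swift.
thumb_objects = 642534      # wikipedia-commons-local-thumb.00
original_objects = 69107    # wikipedia-commons-local-public.00

ratio = float(thumb_objects) / original_objects
print("average thumbnails per original: %.1f" % ratio)   # ~9.3, i.e. "close to 10:1"
```

As noted in the discussion, this is an average only and says nothing about the distribution; the pages with thumbs at every size from 0 to 999px are the bad cases for a hash-chain of thumb sizes.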
[01:32:49] well, from a varnish pov it helps [01:33:06] but fixed sizes -if at all possible- would help in the media storage backend as well [01:33:14] there is the problem of purging being unreliable garbage now, which is compounded by items in cache and not swift [01:37:22] paravoid: oh, btw how about that swift patch? :) [01:37:49] heh, sorry about that [01:38:06] I've been swamped [01:38:11] how soon do you need that? [01:38:21] I'm really not looking forward into building our own swift packages [01:38:22] well, it's not urgent, it would be nice though [01:38:48] it seems like it will be a while before we are on ceph though [01:38:58] yeah, I've hit another wall lately... [01:39:14] apergos: around? [01:39:44] btw, I'll be there in less than two weeks [01:40:26] i think we could go to fixed sizes for images not included by css. only allow fixed sizes for new uploads and in new articles and hire thousands of cheap laborers via mechanical turk to fix the layout of every existing article [01:41:03] binasher: can you cr https://gerrit.wikimedia.org/r/#/c/46824/1 ? [01:41:19] haha [01:41:21] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [01:41:21] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [01:41:21] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours [01:41:22] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [01:41:52] binasher: that's what you get for making useful comments when you're supposed to be packing! [01:42:36] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46824 [01:42:45] bah! [01:42:55] AaronSchulz: that looks sensible, so i merged it [01:43:09] binasher: can you restart the runners? 
[01:43:18] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours [01:43:36] it should be fine, but I want to confirm that the hack is not needed anymore [01:44:03] i actually need to log off for a bit [01:44:20] * AaronSchulz looks at paravoid [01:44:21] i'll be back in an hour or so [01:44:21] I'm also hitting my bed [01:44:26] heh [01:44:38] I can do a simple restart [01:44:42] but I won't stay to babysit it [01:44:57] you tell me [01:45:06] go ahead, it won't take long to tell [01:45:14] and tim is around [01:48:30] done [01:54:56] paravoid: seems fine [01:55:03] thanks [01:56:50] paravoid: I'm actually a little surprised at how well the queue has been doing the last few days [01:57:51] eqiad boxes seem to be the same [01:59:36] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 181 seconds [01:59:37] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 185 seconds [02:01:00] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 213 seconds [02:01:01] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 213 seconds [02:08:03] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 223 seconds [02:08:21] https://wikimediafoundation.org/w/index.php?title=Peering&diff=87455&oldid=82659 [02:10:00] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 9 seconds [02:10:33] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 1 seconds [02:10:42] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [02:11:22] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours [02:11:23] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [02:11:39] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 1 seconds [02:12:01] Susan: hah! [02:12:08] what's the current percentage? [02:12:32] I believe the cost of keeping the sites up and running is about $2.5M/year. [02:12:40] And the total budget is about $35M? [02:12:46] So whatever that percent is. [02:15:06] are you amortizing ulsfo, eqiad, etc.? [02:17:09] ugh. echo's css is specific to vector, it seems [02:17:20] yeah, it's on my list to file [02:17:28] at least doesn't work with monobook [02:17:38] it's under the body? (z-level) [02:17:49] I'm trying to make it work with this: https://github.com/OSAS/strapping-mediawiki [02:17:52] i guess it's your strapping ? [02:17:53] right [02:18:07] but monobook at least should be tested :) [02:18:27] There's some bug with Echo + Monobook where I can't see any of the notifications. [02:18:32] They go under some element, I think. [02:18:41] Is that the bug you're hitting? It's kind of aggravating. [02:19:01] no. I can see notifications [02:19:10] but it's improperly positioned [02:19:15] s/body/content container/ [02:19:22] oh, that's not so bad [02:19:42] well, the position is hardcoded for vector [02:19:50] I can't seem to make my skin override it [02:20:21] specificity? [02:20:29] or !important [02:20:34] I tried !important [02:20:48] try specificity [02:21:02] this is nova-precise2? 
[02:28:19] !log LocalisationUpdate completed (1.21wmf8) at Thu Jan 31 02:28:18 UTC 2013 [02:28:21] Logged the message, Master [02:29:48] !log LocalisationUpdate completed (1.21wmf7) at Thu Jan 31 02:29:47 UTC 2013 [02:29:49] Logged the message, Master [02:34:22] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 191 seconds [02:34:27] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 190 seconds [02:34:33] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 193 seconds [02:36:15] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds [02:36:22] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds [02:36:33] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 3 seconds [03:25:24] PROBLEM - Puppet freshness on dataset1001 is CRITICAL: Puppet has not run in the last 10 hours [03:35:31] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 185 seconds [03:35:37] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 185 seconds [03:35:40] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 191 seconds [03:37:24] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds [03:37:31] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds [03:37:40] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds [04:07:45] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Jan 31 04:07:43 UTC 2013 [04:08:45] PROBLEM - Puppet freshness on dataset1001 is CRITICAL: Puppet has not run in the last 10 hours [04:08:45] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [04:08:55] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Jan 31 04:08:44 UTC 2013 [04:09:44] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [04:20:20] * jeremyb wonders what happened to mwalker's client [04:22:15] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Jan 31 04:22:11 UTC 2013 [04:22:44] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [04:59:46] New review: Tim Starling; "If you just want more output when you run it manually, why not add a --verbose option?" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/42970 [05:35:37] New patchset: Tim Starling; "Add a --verbose parameter to mw-update-l10n" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46907 [05:38:40] New review: Tim Starling; "I mean like Ic6db1d8a. I also took the liberty of suppressing non-error output from mergeMessageFile..." [operations/puppet] (production); V: 0 C: -2; - https://gerrit.wikimedia.org/r/42970 [05:40:04] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 182 seconds [05:40:04] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 185 seconds [05:40:21] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 183 seconds [05:41:03] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 11 seconds [05:42:03] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds [05:42:10] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds [06:02:24] AaronSchulz: at 3:30 a.m. 
I was definitely not around :-D [06:05:33] PROBLEM - MySQL Replication Heartbeat on db36 is CRITICAL: CRIT replication delay 192 seconds [06:06:33] RECOVERY - MySQL Replication Heartbeat on db36 is OK: OK replication delay 0 seconds [06:09:07] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [06:28:18] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Jan 31 06:28:15 UTC 2013 [06:29:07] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [06:36:08] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [06:36:58] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on port 8123 [06:41:58] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 197 seconds [06:42:08] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 196 seconds [06:42:27] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 190 seconds [06:42:28] RECOVERY - swift-account-reaper on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [06:42:54] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 199 seconds [06:46:08] PROBLEM - swift-account-reaper on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [06:46:58] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds [06:47:08] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds [06:47:09] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [06:47:42] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds [06:47:59] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on port 8123 [06:48:09] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds [07:15:18] PROBLEM - LVS Lucene on search-pool2.svc.eqiad.wmnet is CRITICAL: Connection timed out [07:17:08] RECOVERY - LVS Lucene on search-pool2.svc.eqiad.wmnet is OK: TCP OK - 0.001 second response time on port 8123 [07:18:21] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [07:33:42] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 182 seconds [07:34:02] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 195 seconds [07:34:02] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 195 seconds [07:34:22] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 209 seconds [07:35:41] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds [07:35:41] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds [07:35:59] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds [07:36:22] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds [07:59:33] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [08:00:33] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 1.000 second response time on port 8123 [08:07:42] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Jan 31 08:07:33 UTC 2013 [08:08:23] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [08:20:22] PROBLEM - Puppet 
freshness on stat1001 is CRITICAL: Puppet has not run in the last 10 hours [08:20:53] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Jan 31 08:20:44 UTC 2013 [08:21:22] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [08:24:50] PROBLEM - Puppet freshness on stat1001 is CRITICAL: Puppet has not run in the last 10 hours [08:34:32] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 181 seconds [08:34:35] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 183 seconds [08:36:14] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds [08:36:32] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds [09:02:54] New patchset: Hashar; "(bug 44041) adapt role::cache::mobile for beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44709 [09:26:56] PROBLEM - Host mw1085 is DOWN: PING CRITICAL - Packet loss = 100% [09:27:35] PROBLEM - Host mw1085 is DOWN: PING CRITICAL - Packet loss = 100% [09:28:17] RECOVERY - Host mw1085 is UP: PING OK - Packet loss = 0%, RTA = 26.57 ms [09:28:35] RECOVERY - Host mw1085 is UP: PING OK - Packet loss = 0%, RTA = 0.59 ms [09:36:25] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 184 seconds [09:37:13] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 209 seconds [09:37:25] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 217 seconds [09:39:25] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds [09:39:26] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds [09:40:40] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 1 seconds [10:23:35] New patchset: Reedy; "Remove strategyappswiki from wikiversions.dat" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46914 [10:24:13] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46914 [10:24:52] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Remove strategyappswiki [10:24:53] Logged the message, Master [10:36:06] PROBLEM - Puppet freshness on virt1000 is CRITICAL: Puppet has not run in the last 10 hours [10:37:09] PROBLEM - Puppet freshness on virt1000 is CRITICAL: Puppet has not run in the last 10 hours [10:38:12] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 187 seconds [10:38:35] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 199 seconds [10:38:45] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 200 seconds [10:38:48] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 199 seconds [10:39:45] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds [10:40:00] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds [10:40:35] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds [10:40:36] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds [11:02:09] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [11:03:28] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours [11:42:43] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [11:42:43] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 
10 hours [11:42:43] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [11:42:43] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours [11:42:43] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [11:42:44] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours [11:43:09] New patchset: Reedy; "Use overriding to muchly simplify wgNamespacesWithSubpages" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46826 [11:43:14] New patchset: Reedy; "Use overriding to muchly simplify wgNamespacesToBeSearchedDefault" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46873 [11:44:46] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours [12:11:11] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [12:12:13] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours [12:12:36] !log reedy synchronized php-1.21wmf8/extensions/Wikibase [12:12:37] Logged the message, Master [12:44:23] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [12:47:13] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 3.008 second response time on port 8123 [13:10:21] New patchset: Hashar; "contint: install bzr package on gallium" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46931 [13:11:10] New review: Hashar; "Already deployed manually on gallium. Be bold and merge on sight :-]" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/46931 [13:16:39] New review: Demon; "Don't merge, don't need after all." 
[operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/46931 [13:19:21] !log gallium: manually installed "bzr" [13:19:22] Logged the message, Master [13:19:28] !log gallium manually removed bzr: apt-get remove bzr python-keyring python-httplib2 python-launchpadlib python-zope.interface python-oauth python-bzrlib bzr python-simplejson python-configobj python-lazr.uri python-lazr.restfulclient python-wadllib [13:19:29] Logged the message, Master [13:21:22] New patchset: Hashar; "contint: install mercurial package on gallium" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46931 [13:21:45] !log gallium: installed mercurial manually (puppet change is {{gerrit|46931}} PS2) [13:21:46] Logged the message, Master [13:21:58] New review: Hashar; "Mercurial installed" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/46931 [13:22:45] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 183 seconds [13:23:15] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 193 seconds [13:23:40] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 200 seconds [13:23:58] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 208 seconds [13:26:40] PROBLEM - Puppet freshness on dataset1001 is CRITICAL: Puppet has not run in the last 10 hours [13:29:34] PROBLEM - SSH on pdf3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:30:24] RECOVERY - SSH on pdf3 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0) [13:51:11] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds [13:51:19] New patchset: Faidon; "Switch Ceph to the stable train repository" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46935 [13:51:38] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds [13:52:10] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46935 [13:54:45] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds [13:54:45] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds [14:09:43] PROBLEM - Puppet freshness on dataset1001 is CRITICAL: Puppet has not run in the last 10 hours [14:18:34] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [14:28:34] New patchset: Mark Bergsma; "Implement If-Cached request header feature" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46937 [14:29:56] New patchset: Mark Bergsma; "Implement If-Cached request header feature" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46937 [14:32:58] New patchset: Mark Bergsma; "Increase connection pool size" [operations/software] (master) - https://gerrit.wikimedia.org/r/46938 [14:32:58] New patchset: Mark Bergsma; "Fix socket.timeout exception in send_object" [operations/software] (master) - https://gerrit.wikimedia.org/r/46939 [14:32:59] New patchset: Mark Bergsma; "Sync deletes on the source to the destination" [operations/software] (master) - https://gerrit.wikimedia.org/r/44422 [14:33:34] Change merged: Mark Bergsma; [operations/software] (master) - https://gerrit.wikimedia.org/r/44422 [14:34:02] Change merged: Mark Bergsma; [operations/software] (master) - https://gerrit.wikimedia.org/r/46938 [14:34:13] Change merged: Mark Bergsma; [operations/software] (master) - https://gerrit.wikimedia.org/r/46939 [14:37:54] PROBLEM - Host mw1085 is DOWN: PING CRITICAL - 
Packet loss = 100% [14:38:09] is there a database of our hardware somewhere? [14:38:47] New patchset: Ottomata; "Now using tab as field delimiter in webrequest frontend cache logs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46942 [14:39:44] RECOVERY - Host mw1085 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms [14:39:58] New review: Ottomata; "I have tested this change on the log1.pmtpa.wmflabs instance." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/46942 [14:41:37] MaxSem: not really [14:41:41] what are you looking for? [14:42:08] a volunteer ask for configuration of our OSM servers [14:42:20] I know they're 720, but nothing else [14:43:57] 720? i don't think so [14:45:45] they're dell R610s, R620s, R410s [14:46:48] what's their exact configuration? [14:47:03] differs, which ones do you want to know? [14:47:20] databases, apaches, caches [14:47:48] mark, i'm thinking about RT 4433 (ACLs for analytics cluster) [14:48:28] and as far as I know, the analytics cluster does not need to initiate any connections to anything outside of itself (except maybe to brewster or outside internet to dl / apt things) [14:48:29] mark, he wants to know about all of these [14:49:29] is there a way to ACL it such incoming / established connections are allowed, but it analytics can't initiate anything to other VLANs? [14:50:16] yes [14:50:30] MaxSem: can you ask robh later? [14:50:42] sure [14:51:00] ok, cool, i'll update the RT with that request then. There are a few machines it would be handy to be able to initiate connections to (stat1, oxygen. etc.) but not necessary [14:51:04] I'll list those there [14:52:49] ottomata: specify which protocol, tcp port, destination then [14:53:03] for those exceptions? [14:53:14] for analytics to initiate? [14:53:30] MaxSem: why does he care? [14:53:45] we have no reason to not provide the information [14:54:02] he's interested in helping out with OSM [14:54:03] but I don't think it matters for anyone [14:54:21] and? [14:54:21] well, SSD vs HDD matters [14:54:31] RAM size matters [14:54:41] total storage size matters [14:55:04] how's this volunteer is going to help? [14:58:08] !log taking down db1047 for upgrade to precise [14:58:09] Logged the message, notpeter [14:59:18] !log reedy synchronized php-1.21wmf8/extensions/Wikibase [14:59:19] Logged the message, Master [14:59:21] ottomata: yes [14:59:40] k, tahnks [15:03:31] New patchset: Pyoungmeister; "switching db1047 to coredb reasearchdb role class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46948 [15:13:26] PROBLEM - Full LVS Snapshot on db1047 is CRITICAL: Connection refused by host [15:13:35] PROBLEM - MySQL disk space on db1047 is CRITICAL: Connection refused by host [15:13:39] LVS snapshot? [15:13:45] PROBLEM - mysqld processes on db1047 is CRITICAL: Connection refused by host [15:13:45] PROBLEM - MySQL Idle Transactions on db1047 is CRITICAL: Connection refused by host [15:14:05] paravoid: I'm upgarding that host [15:14:11] no I mean [15:14:17] that should be LVM, not LVS, no? [15:14:25] bahahaha [15:14:25] RECOVERY - Full LVS Snapshot on db1047 is OK: OK no full LVM snapshot volumes [15:14:29] indeed, sir. 
indeed [15:14:33] okay :) [15:14:35] RECOVERY - MySQL disk space on db1047 is OK: DISK OK [15:14:45] RECOVERY - MySQL Idle Transactions on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds [15:14:46] RECOVERY - mysqld processes on db1047 is OK: PROCS OK: 1 process with command name mysqld [15:15:11] using do-release-upgrade is weird.... [15:15:15] yes [15:15:54] VOLS="$(lvs | awk '$1 != "LV" && $6 > 90 {print $5 "=" $6 "%"}')" [15:15:58] hm, it might actually be lvs [15:16:02] I wonder what lvs is in that context [15:16:31] oh no [15:16:34] lvs is lvscan [15:16:44] er, no, but close [15:16:49] anyway [15:18:55] PROBLEM - Host db1047 is DOWN: PING CRITICAL - Packet loss = 100% [15:19:38] PROBLEM - Host db1047 is DOWN: PING CRITICAL - Packet loss = 100% [15:19:56] New patchset: Mark Bergsma; "First naive attempt at fetching objects from Varnish" [operations/software] (master) - https://gerrit.wikimedia.org/r/46950 [15:20:51] paravoid, so are you going to Copenhagen? [15:21:05] haven't I already replied? [15:21:06] RECOVERY - Host db1047 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [15:21:17] RECOVERY - Host db1047 is UP: PING OK - Packet loss = 0%, RTA = 26.53 ms [15:21:25] like three times now? :) [15:21:45] ah, my bad:) [15:21:59] bzzzt, stack overflow [15:23:09] but if this happening (I'm still waiting to hear a confirmation) [15:23:20] we should book tickets asap [15:23:41] airfares tend to go up [15:23:43] Don't be daft [15:23:45] WMF don't do that [15:23:45] PROBLEM - mysqld processes on db1047 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [15:24:20] heh [15:24:24] * notpeter sighs [15:24:31] Though, ~5 weeks is probably about right [15:24:31] Reedy speaks truth [15:24:36] they do [15:24:41] it was me who waited too long this time [15:24:46] haha [15:24:47] * ^demon headdesks [15:25:07] <^demon> Someone have a bit of rope I can hang myself with? [15:25:08] I've still nothing booked for the tech meet in SF at end of feb [15:25:11] i booked last night [15:25:29] PROBLEM - mysqld processes on db1047 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [15:26:04] ^demon, wassup? [15:26:32] <^demon> https://groups.google.com/d/topic/repo-discuss/Xs5NDXBvCFw/discussion :\ [15:26:39] MAN-IAD-SFO, SFO-FRA-MAN [15:26:40] Screw that [15:26:55] the other thing about booking late is not finding good flights [15:27:07] so for CPH I want to get a direct flight and there's only one via SAS [15:28:22] * notpeter hands ^demon a yahoo group [15:28:56] <^demon> notpeter: When you play in someone else's sandbox, you've got to use their toys ;-) [15:29:00] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46948 [15:29:02] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46937 [15:29:46] mark: I shall merge yours [15:30:09] ouch [15:30:11] we were merging together [15:30:20] did you merge? [15:30:35] yep [15:30:51] sweet! 
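For reference, the "Full LVS Snapshot" check being joked about above is the quoted one-liner `lvs | awk '$1 != "LV" && $6 > 90 {print $5 "=" $6 "%"}'`, i.e. flag LVM snapshot volumes whose copy-on-write space is over 90% used. Below is a rough Python equivalent of the same idea; it assumes the older default `lvs` column layout that the one-liner relies on (5th field Origin, 6th field Snap%), which is an assumption about that era's LVM output rather than something shown in the log.

```python
# Sketch of the "Full LVS Snapshot" check: report snapshot volumes whose
# copy-on-write space is more than 90% used, mirroring the quoted awk one-liner.
# Assumes lvs prints Origin in column 5 and Snap% in column 6 (old default layout),
# and that the script runs with enough privileges to call lvs.
import subprocess

def full_snapshots(threshold=90.0):
    out = subprocess.check_output(['lvs']).decode()
    full = []
    for line in out.splitlines():
        fields = line.split()
        if len(fields) < 6 or fields[0] == 'LV':
            continue  # header line, or a volume with no snapshot usage column
        try:
            used = float(fields[5])
        except ValueError:
            continue
        if used > threshold:
            full.append('%s=%s%%' % (fields[4], fields[5]))
    return full

if __name__ == '__main__':
    vols = full_snapshots()
    print('CRITICAL: ' + ' '.join(vols) if vols else 'OK no full LVM snapshot volumes')
```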
[15:30:53] not on stafford [15:31:07] will fix [15:31:32] done [15:31:41] cool [15:31:42] tanks [15:33:25] RECOVERY - MySQL Replication Heartbeat on db1047 is OK: OK replication delay seconds [15:33:35] RECOVERY - Puppet freshness on db1047 is OK: puppet ran at Thu Jan 31 15:33:07 UTC 2013 [15:34:56] RECOVERY - MySQL Replication Heartbeat on db1047 is OK: OK replication delay seconds [15:35:36] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [15:37:26] RECOVERY - swift-account-reaper on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [15:37:26] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on port 8123 [15:37:36] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 195 seconds [15:37:56] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 201 seconds [15:37:57] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 201 seconds [15:38:05] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 204 seconds [15:40:26] PROBLEM - swift-account-reaper on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [15:40:38] New patchset: Mark Bergsma; "Implement If-Cached Etag matching for upload frontends as well" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46951 [15:41:18] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46951 [15:47:51] paravoid: hi...ready? [15:49:08] yes [15:49:15] mark: can I? [15:49:34] cmjohnson1: I want to do it one box at a time if you don't mind [15:50:03] i agree...so swap card, reinstall than do another? [15:50:27] yeah go a head [15:50:34] no [15:50:35] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [15:50:36] no need for a reinstall [15:50:44] h710p/h710 are compatible [15:50:52] make sure swift doesn't auto-out them during the swap [15:50:53] er [15:50:54] ceph [15:50:55] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [15:50:58] it will auto-out them [15:51:04] change that? [15:51:05] but it should in them again when they boot up [15:51:17] no need for that, needless copying around [15:51:25] I'd like to simulate a machine failure [15:51:29] under normal ops [15:51:31] ok [15:51:38] now that we have sane raid controllers [15:51:44] maybe it's not too bad [15:51:45] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 1 seconds [15:51:53] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [15:51:58] I'd like to make sure we don't have the whol cluster locked up because of a machine swap [15:52:02] yes [15:52:59] cmjohnson1: so, no reinstall needed, but I think the controller might need an "import foreign setup" from its menu [15:53:06] most likely [15:53:22] so, whenever you're ready [15:53:27] go ahead and shutdown 1005 when you are ready [15:53:44] powering off [15:54:35] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [15:55:26] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on port 8123 [15:55:27] ha! 
[15:55:32] 4 stale pgs [15:55:41] their OSD pair was on the same box [15:55:46] how come [15:55:52] 48,57 for 2 of them, 58,51 for the others [15:55:55] PROBLEM - Host ms-be1005 is DOWN: PING CRITICAL - Packet loss = 100% [15:55:56] PROBLEM - Host ms-be1005 is DOWN: PING CRITICAL - Packet loss = 100% [15:56:18] swiftrepl is still going fine [15:56:59] oh [15:57:00] I know [15:57:18] I've changed "rule data" to be rack-based [15:57:22] but not the other rules [15:57:33] and that's 2.* pgs, i.e. pool 2 [15:57:58] rbd [15:58:05] RECOVERY - mysqld processes on db1047 is OK: PROCS OK: 1 process with command name mysqld [15:58:06] aha [15:58:19] good to know though [15:58:35] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [15:58:56] RECOVERY - mysqld processes on db1047 is OK: PROCS OK: 1 process with command name mysqld [15:59:25] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on port 8123 [16:01:05] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 2492 seconds [16:01:09] New patchset: Mark Bergsma; "Ensure If-Cached requests don't pollute the frontend caches" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46952 [16:01:49] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46952 [16:02:56] paravoid: ms-be1005 came back ok...no foreign cfg problem [16:03:10] even better [16:03:23] RECOVERY - Host ms-be1005 is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms [16:03:32] give me a min before you move on [16:03:39] let me know [16:03:44] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 2465 seconds [16:03:45] RECOVERY - Host ms-be1005 is UP: PING OK - Packet loss = 0%, RTA = 26.50 ms [16:03:48] it's recovering [16:04:18] swiftrepl is getting tons of 500s now :( [16:04:24] yeah [16:04:29] stuck at peering [16:04:45] 2013-01-31 16:04:36.519846 mon.0 [INF] pgmap v2211464: 16952 pgs: 34 active, 13106 active+clean, 929 active+recovery_wait, 2 stale+active+clean, 1642 peering, 1203 active+degraded, 9 active+clean+scrubbing, 27 active+recovering; 25504 GB data, 53498 GB used, 185 TB / 238 TB avail; 5022732/140246486 degraded (3.581%) [16:05:02] it's going, but slow [16:07:09] 0MB/s traffic on ms-fe1001 [16:07:14] that is problematic [16:07:16] yes [16:07:43] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Jan 31 16:07:33 UTC 2013 [16:08:34] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [16:08:34] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Jan 31 16:08:32 UTC 2013 [16:09:12] still peering [16:09:13] so strange [16:09:20] yes [16:09:25] it's not network limited now either [16:09:26] no [16:09:33] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [16:12:08] cmjohnson1: sorry for the delay, please bear with us :) [16:12:46] paravoid: i am following along...lmk ..thx [16:20:04] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Jan 31 16:19:58 UTC 2013 [16:20:33] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [16:22:55] cmjohnson1: next! 
:) [16:22:59] yes [16:23:16] awesome...paravoid plz shut it down so I know you are ready...thx [16:23:29] mark: you only won because I had to type "ceph osd set noout" first [16:23:35] no [16:23:40] I waited a bit for you first [16:23:42] but you were just too slow [16:25:11] PROBLEM - Host ms-be1006 is DOWN: PING CRITICAL - Packet loss = 100% [16:25:54] PROBLEM - Host ms-be1006 is DOWN: PING CRITICAL - Packet loss = 100% [16:27:03] RECOVERY - MySQL Slave Delay on db1047 is OK: OK replication delay 16 seconds [16:27:16] RECOVERY - MySQL Slave Delay on db1047 is OK: OK replication delay 4 seconds [16:31:03] RECOVERY - Host ms-be1006 is UP: PING OK - Packet loss = 0%, RTA = 0.48 ms [16:31:10] RECOVERY - Host ms-be1006 is UP: PING OK - Packet loss = 0%, RTA = 26.83 ms [16:31:29] paravoid ^ [16:31:37] yep [16:31:40] it's recovering [16:32:24] mark: ha, noout didn't help [16:32:27] not much [16:35:06] come on [16:35:07] peer [16:35:33] I can set nodown too [16:35:43] not sure what that will do, we'll see [16:36:02] it still will need to peer [16:36:04] so it won't help [16:37:01] so it was slightly faster this time [16:37:17] nothing to write home about [16:40:22] it's a bit concerning that recovery takes as long as the outage while the box wasn't out [16:40:28] so it only needs to recover changed objects [16:40:37] like 11k of them [16:41:05] note that it's recovery, not backfill [16:41:23] that's why [16:41:44] if we had three replicas it'd probably be faster too [16:41:59] I don't mind that [16:42:01] I do mind the outage [16:42:08] yes [16:42:15] no requests during the long peering process is a big problem [16:42:31] it's strange that peering is slow btw [16:42:59] hm, let me upgrade the other two monitors [16:43:04] and radosgw [16:44:37] mark: recovery threads is in its default value, "1" [16:44:51] up [16:44:57] up? [16:45:03] no problems during recovery [16:45:07] it's peering that does it [16:45:13] yes [16:45:29] so recovery could be made faster (but doesn't need to be) [16:45:29] that's why I said before that I don't mind recovery [16:45:34] agreed [16:47:16] New patchset: Ottomata; "Removing apache_site 000_default* definitions in webserver::apache::site define." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46954 [16:47:30] paravoid: ready for the last one? [16:47:40] not yet [16:47:46] waiting for recovery [16:47:48] but close [16:47:48] almost [16:48:09] mark: don't give the go ahead, I want to restart mon/radosgw with 0.56.2 first [16:48:30] you better be done then ;p [16:48:53] waiting for the dist-upgrade on ms-fe1002 [16:48:54] it's slllow [16:49:01] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46954 [16:50:40] RECOVERY - Puppet freshness on stat1001 is OK: puppet ran at Thu Jan 31 16:50:14 UTC 2013 [16:52:00] restarting radosgw [16:54:14] ok, both gw and mons done [16:54:27] 12 pgs to recover and we should be ready to go [16:55:02] cmjohnson1: next! 
:) [16:57:27] cmjohnso_: powering of ms-be1007 [16:57:34] thx [16:58:53] PROBLEM - Host ms-be1007 is DOWN: PING CRITICAL - Packet loss = 100% [16:59:56] New patchset: Mark Bergsma; "First naive attempt at fetching objects from Varnish" [operations/software] (master) - https://gerrit.wikimedia.org/r/46950 [16:59:58] PROBLEM - Host ms-be1007 is DOWN: PING CRITICAL - Packet loss = 100% [17:05:49] RECOVERY - Host ms-be1007 is UP: PING OK - Packet loss = 0%, RTA = 26.52 ms [17:06:33] RECOVERY - Host ms-be1007 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [17:06:53] let's see [17:07:00] damn [17:08:43] sigh [17:09:20] it always gets stuck for a little while [17:10:02] how crazy is it that it doesn't flinch when the whole box gets powered off [17:10:10] but it's a full scale outage when it's back on? [17:10:18] exactly [17:11:26] I think this is the issue we were having last week [17:11:36] when this is done I'm going to stop all of them [17:11:46] aaand we're back in business [17:11:58] why do you think so? [17:12:10] peering shouldn't take thsi long [17:12:31] does inktank say so? [17:12:34] yes [17:12:36] ok [17:12:53] i've only ever seen it take long [17:15:10] i'm going to play with my script [17:15:13] nevermind me though [17:17:08] can I stop swiftrepl? [17:17:43] cmjohnson1: thanks! [17:17:54] paravoid...yw [17:19:03] mark: can I stop swiftrepl? [17:19:56] I'll do it [17:20:31] so what happened last time is [17:20:36] set noup [17:20:41] stopped all the osds, started them again [17:20:48] then all of them booted up immediately [17:20:54] except 2-3 of them in random boxes [17:20:58] plus *all* of ms-be1001's [17:21:08] which were at 100% for several hours [17:21:10] weird [17:21:13] yes [17:22:45] go ahead [17:23:53] 5-7 rebooted immediately [17:24:10] 9-10-11 took something like a minute [17:24:13] 12 a bit more [17:24:46] uhm [17:24:49] they didn't peer? [17:24:49] wth? [17:25:04] all at the same epoch or whatever? [17:25:30] maybe [17:25:45] lemme know when I can restart [17:25:48] I will [17:28:58] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [17:29:47] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on port 8123 [17:29:52] mark: gotta minute? [17:54:11] mark: uhm [17:54:18] mark: ms-be1012 was quite slower than the rest [17:54:25] then I ran a df [17:54:31] what. the. hell. [18:01:01] paravoid: do you have a second? [18:01:25] kind of [18:02:56] just a question: this is going to go out soon: https://gerrit.wikimedia.org/r/#/c/46942/ it involves a change to the formatting of the nginx log formatting that is used for udp2log stuffs. based on your work earlier with ryan's nginx patch, do you think this is safe? [18:03:06] I am assuming yes, but, easier to ask first :) [18:03:55] I think it is [18:03:58] but stage it first [18:04:05] udp2log isn't exactly in a great shape, so you never know [18:04:10] I don't expect any trouble at all [18:04:13] but won't hur [18:04:16] *hurt [18:04:33] applying the patch into one of the boxes, restarting, then look if nginx is segfaulting, logging correctly etc. [18:04:39] mark: ready to restart swiftrepl again [18:04:57] paravoid: yep! sounds good [18:04:58] thanks! [18:07:03] ah! 
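The controller swaps walked through above lean on Ceph's noout flag ("ceph osd set noout", mentioned around 16:23) so that the OSDs on a powered-down box are not marked out and rebalanced while the hardware is being worked on, and on noup when all OSDs are stopped and restarted together. The small wrapper below is only an illustration of that pattern, not the procedure actually used; the ceph subcommands themselves are standard CLI.

```python
# Illustrative sketch of the maintenance pattern used for the ms-be100x swaps:
# set "noout" so a powered-off box is not auto-rebalanced, do the hardware work,
# then clear the flag and wait for the cluster to report healthy again.
# `ceph osd set/unset noout` and `ceph health` are standard Ceph commands;
# the work() callable is a hypothetical placeholder for the manual steps.
import subprocess
import time

def ceph(*args):
    return subprocess.check_output(('ceph',) + args).decode().strip()

def maintenance_window(work):
    ceph('osd', 'set', 'noout')       # don't mark stopped OSDs out during the swap
    try:
        work()                        # e.g. power off the box, swap the controller, power on
    finally:
        ceph('osd', 'unset', 'noout')
    while 'HEALTH_OK' not in ceph('health'):
        time.sleep(30)                # wait out peering/recovery before the next box
```

As the log shows, the expensive part is not the outage itself but the peering phase once the box comes back, which this wrapper simply waits out.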
[18:07:07] I know why ms-be1012 has so much data [18:07:08] damn [18:08:46] i've tested the tab change on nginx on a labs instance [18:08:48] and it seems to work ok [18:08:51] ok [18:08:54] go ahead then [18:08:57] New patchset: Mattflaschen; "Add enableTooltip gate, default false and en, test, test2 true." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46962 [18:09:33] New review: Mattflaschen; "Do not submit until E3 deployment window." [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/46962 [18:09:52] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [18:10:50] ottomata1: yeah, I'm not tooooo worried. but I did manage to take down https by merging a seemingly reasonable and innocuous logging change to nginx :) [18:11:13] haha, nice [18:11:52] ottomata1: what of this all are you going to do and what to you want me to do? [18:12:29] i think I can do everything except for the squid change, and making sure that puppet runs ok on varnish and nginx [18:12:40] I actually have to amend that patchset with some udp2log filter changes [18:12:41] and [18:12:48] we need to get the new udp-filter out first [18:12:54] we're about to get that [18:13:17] ok, cool [18:13:42] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on port 8123 [18:13:44] New patchset: Mattflaschen; "Add enableTooltip gate, default false and en, test, test2 true." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46962 [18:13:56] Until recently, you could go to https://noc.wikimedia.org/conf/highlight.php?file=db.php to see which slave databases belonged to which wiki. That page no longer has the information though. Where do you find that out now? [18:14:33] sorry, I meant master databases, not slave dbs [18:15:32] Now db.php just says "$secretSitePassword = 'jgmeidj28gms';". lol [18:16:37] kaldari: there's now db-eqiad.php and db-pmtpa.php [18:16:59] ah, thanks! [18:17:03] that works [18:17:34] I'll update the documentation on wikitech.wiki [18:18:54] i still want to change that to password123 ;) [18:19:16] <^demon> I don't think it'll stop people from e-mailing us thinking they found something secret :p [18:20:16] ^demon, as if you don't want to laugh at those emails :P [18:20:53] <^demon> I just copy+paste the same response I always send. [18:20:54] who knows something about VUMI which is running on silver and zhen? [18:21:02] PROBLEM - Puppet freshness on stat1001 is CRITICAL: Puppet has not run in the last 10 hours [18:21:03] <^demon> Something about "Congrats about finding our little easter egg!" [18:29:06] that is not true! puppet just ran on stat1001 an hour ago [18:36:10] New patchset: Ottomata; "Now using tab as field delimiter in webrequest frontend cache logs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46942 [18:37:18] mark: you should 'ceph osd tree' :) [18:40:43] New patchset: Ottomata; "Now using tab as field delimiter in webrequest frontend cache logs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46942 [18:46:55] looks nice [18:48:34] yeah! [18:49:15] let's see how long will it take to sync [18:50:20] mark, just to confirm, the sqstat script is no longer used to supply page view metrics to ganglia, so we can disable it? [18:50:53] drdee: i didn't even know it ever was, so wouldn't know [18:51:16] ok, who might know? 
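On staging the Gerrit 46942 delimiter change discussed above ("applying the patch into one of the boxes, restarting, then look if nginx is segfaulting, logging correctly etc."): one cheap check is to confirm that every emitted line still splits into the expected number of tab-separated fields. The field count below is a made-up placeholder, since the real webrequest format is not spelled out in this log.

```python
# Toy sanity check for a log delimiter change: read lines from a staged host's
# log stream and verify each one splits into the expected number of tab-separated
# fields. EXPECTED_FIELDS is hypothetical; substitute the real field count.
import sys

EXPECTED_FIELDS = 14  # placeholder, not the actual webrequest field count

def check(stream):
    bad = 0
    for n, line in enumerate(stream, 1):
        if len(line.rstrip('\n').split('\t')) != EXPECTED_FIELDS:
            bad += 1
            print('line %d has an unexpected field count' % n)
    return bad

if __name__ == '__main__':
    sys.exit(1 if check(sys.stdin) else 0)
```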
[18:51:19] just did ceph osd tell \* injectargs '--osd-recovery-max-active 20' [18:51:43] drdee: i don't know either, sorry :) [18:51:45] no idea who installed it [18:51:57] ok, thx [18:52:51] mark: so, ms-be1012 was the only one up in rack A2, so it has three times the data the other ones have [18:52:59] paravoid: does ceph -s show any problems on boxes? [18:53:17] * Aaron|home always has "pgs stuck unclean" types of message [18:53:25] mark: this means that it'll both be the most busy in syncing to the new boxes, as well as having most radosgw writes go there [18:53:36] Aaron|home: no [18:53:44] Aaron|home: this is a bug, I've had many of those [18:53:47] * Aaron|home wonders why the osdmap still things there are 3 osds [18:53:53] *thinks [18:54:07] health HEALTH_WARN 15751 pgs backfill; 617 pgs backfilling; 16353 pgs degraded; 1 pgs recovering; 1 pgs recovery_wait; 16370 pgs stuck unclean; recovery 69210298/211761342 degraded (32.683%) [18:54:11] monmap e13: 3 mons at {ms-be1003=10.64.0.175:6789/0,ms-fe1001=10.64.0.167:6789/0,ms-fe1002=10.64.0.168:6789/0}, election epoch 23810, quorum 0,1,2 ms-be1003,ms-fe1001,ms-fe1002 [18:54:15] osdmap e130500: 144 osds: 144 up, 144 in [18:54:17] pgmap v2216582: 16760 pgs: 388 active+clean, 10 active+remapped+wait_backfill, 4185 active+degraded+wait_backfill, 1 active+recovery_wait, 5 active+remapped+backfilling, 411 active+degraded+backfilling, 11556 active+degraded+remapped+wait_backfill, 201 active+degraded+remapped+backfilling, 2 active+clean+scrubbing+deep, 1 active+recovering; 25524 GB data, 54297 GB used, 208 TB / 261 TB avail; 69210298/211761342 degraded (32.683%) [18:54:23] mdsmap e1: 0/0/1 up [18:54:26] that's because I added more boxes now and increased replica count to 3 [18:55:28] * Aaron|home wishes there was more helpful info [18:56:01] it's just a personal install, but still, it would be nice to know how to make those go away or what the problem is [18:56:12] upgrade to 0.56.2 for starters [18:56:20] I'm on bobtail already [18:57:22] it's probably because I went from 2->3->2 osds [18:57:35] though that was ages ago [18:59:02] ceph can be like that :-) [19:02:13] mark: swiftrepl? [19:03:16] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [19:05:28] paravoid: i'm testing it in varnish mode [19:05:34] ah [19:06:59] paravoid: what's this for? [19:07:14] what? [19:07:21] "varnish mode"? [19:07:44] mark implemented a mode to sync thumbs using varnish from source, if it's in cache [19:08:17] https://ganglia.wikimedia.org/latest/?c=Ceph%20eqiad&h=ms-be1008.eqiad.wmnet&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2 [19:08:23] filling up the gigabit, how nice [19:08:44] there is still no deletion handling though? [19:09:23] what do you mean? [19:09:26] in swiftrepl? [19:09:27] no [19:09:32] there is [19:09:38] but it's not a journal [19:09:40] well, we're not running it like that now though [19:09:43] yeah [19:10:16] how does it work? [19:10:37] swiftrepl or varnish mode? [19:11:07] the deletion handling in swiftrepl [19:12:26] like rsync -d [19:12:34] ahh [19:12:56] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [19:14:46] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on port 8123 [19:15:51] ok it seems to work but hit rate is really low [19:16:09] let me get stats on that... [19:26:04] hmm apparently hit rate just sucks in the beginning [19:26:41] oh? 
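The "like rsync -d" answer above is a compact description of swiftrepl's deletion handling: it is based on comparing container listings rather than on a journal of changes. A generic sketch of that idea (not swiftrepl's actual code) is to diff the two listings and delete whatever exists only on the destination:

```python
# Generic sketch of "rsync -d"-style deletion handling for object containers:
# list source and destination, then delete destination objects that no longer
# exist on the source. The three callables are hypothetical stand-ins for
# whatever client library is in use; this is not swiftrepl itself.
def sync_deletes(list_src_names, list_dst_names, delete_dst_object):
    src = set(list_src_names())
    dst = set(list_dst_names())
    extra = dst - src
    for name in sorted(extra):
        delete_dst_object(name)
    return len(extra)
```

The trade-off implied in the log is that a listing diff catches deletions only on a full pass, whereas a journal would catch them as they happen.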
[19:26:54] and I shouldn't run it on hume [19:26:57] :) [19:28:04] hehehe [19:28:05] I guess not :) [19:28:19] why not on ms-fe1002? [19:28:27] can do [19:28:30] it's around 13-14% [19:28:35] so not sure that's worth it [19:28:39] wow [19:28:46] for enwiki thumbs [19:28:48] not sure about commons [19:28:50] lemme run that too [19:30:25] 0.9% down in 30' [19:30:54] ~18h for all in if the current rate persists (unlikely I think) [19:33:54] notpeter, ottomata: did you guys deploy that? [19:34:16] do any of you know how to do the squid change? [19:34:27] mark, asher, ryan and myself are all going to be away the next days [19:34:54] paravoid: yes, I do [19:35:00] oh [19:35:07] ah, great [19:35:07] we have not deployed yet [19:35:07] but yes [19:35:08] the docs are correct [19:35:17] and clear, even :) [19:35:17] great [19:35:28] http://wikitech.wikimedia.org/view/Squids [19:35:44] I knew us four knew how to do that [19:35:48] wasn't sure for anyone else [19:35:52] yeah, I have done so before [19:35:56] okay [19:36:01] (sorry 1on1) [19:36:11] I don't like to, as it's by far the easiest way to break the site.... [19:36:20] nonsense [19:36:23] but, totally can [19:36:23] DNS is a much better way [19:36:27] oh, true [19:36:29] and easier [19:36:34] no linting at all [19:36:34] but that's ryan's thing [19:36:46] with squid at least we have some rudimentary syntax checking [19:36:51] heh, true [19:36:59] :-) [19:37:04] databases are a much easier way [19:37:13] haha [19:37:45] so when is that changed rolled out? [19:37:46] today or tomorrow? [19:37:55] well, I meant "break the site" not "cause a catastrophic data loss and need to flee the country/internet" [19:38:01] hahahaha [19:38:17] <^demon> s/country\/internet/world/ [19:38:18] paravoid: well, the deployment window was about an hour ago ;) [19:38:25] paravoid: we are trying today [19:38:34] ^demon: basically, yeah :) [19:38:45] okay [19:38:49] I'll be around for a bit longer [19:39:04] I haven't packed yet [19:39:17] sad I won't be joining you :( [19:39:31] me too [19:39:59] moving to ms-fe1002 [19:40:36] does ganglia monitor loopback traffic? [19:41:57] don't think so [19:42:13] maybe have it connect to ms-fe1001 then? [19:43:06] there's still incoming traffic [19:43:17] true [19:43:29] so what we're seeing now is before I implement connection pooling [19:43:52] 6.8M/s? [19:43:54] New patchset: Mark Bergsma; "First naive attempt at fetching objects from Varnish" [operations/software] (master) - https://gerrit.wikimedia.org/r/46950 [19:43:55] New review: Spage; "LGTM but I've never made a config change." 
[operations/mediawiki-config] (master); V: 0 C: 1; - https://gerrit.wikimedia.org/r/46962 [19:44:02] seems so [19:44:34] but let it run for a while until it's actually copying instead of scanning [19:48:06] I suppose I could also just let varnish fetch the object from swift for me and not cache it [19:48:45] getting 500s now [19:49:02] nothing strange on the ceph side so far [19:50:20] Change merged: Kaldari; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46962 [19:53:25] https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=network_report&s=by+name&c=Ceph+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [19:53:28] wow [19:53:35] easily filled up gigabits [19:53:54] amazing [19:54:14] told you, we need more ;) [19:54:17] yep [19:54:25] you did [19:54:39] I doubt we can fill 2xGbE per box [19:54:51] but even a 30-50% increase would be nice [19:55:06] this of course isn't a very normal circumstance [19:55:30] no [20:00:08] New patchset: Mattflaschen; "Add GuidedTour, with gating." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46973 [20:05:12] 2% in an hour [20:05:13] cool! [20:05:22] 30.436% to go [20:05:44] I can't wait to do a rados bench on that [20:07:53] RECOVERY - Puppet freshness on stat1001 is OK: puppet ran at Thu Jan 31 20:07:42 UTC 2013 [20:07:53] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Jan 31 20:07:44 UTC 2013 [20:08:13] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [20:08:22] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Jan 31 20:08:20 UTC 2013 [20:09:13] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [20:09:22] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Jan 31 20:09:16 UTC 2013 [20:10:12] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [20:10:46] ottomata: why is there a js.orig.otto in ganglia-web? [20:12:45] one sec, 1on1 meeting still [20:12:53] k [20:15:47] here's a message i sent to leslie and peter [20:15:52] Yoooo Leslie, [20:15:52] Ganglia got all funky today.  It was 404ing on a bunch of js/ and css/ requests.  I'm not really sure what's going on, but to make it work again I copied /srv/org/wikimedia/ganglia-web-3.5.1/{js,css} to /srv/org/wikimedia/ganglia-web-3.5.4/.  The 3.5.4 directory didn't have a css/ directory, and I renamed the existingjs/ directory to js.orig.otto. [20:15:52] Things look ok, except for there is a single CSS file still 404ing: css/jquery.flot.events.css [20:15:52] Just thought you should know.  There is probably a better way to fix this. [20:15:53] Thanks! 
[20:16:00] 12/20/2012 [20:16:10] this was after a recent ganglia upgrade i think [20:16:22] at the same time I was trying to get analytics nodes into ganglia [20:16:49] Leslie's response: [20:16:49] Thanks - that's probably how I would fix it ;) [20:16:55] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 195 seconds [20:16:56] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 195 seconds [20:17:57] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 220 seconds [20:17:57] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 220 seconds [20:17:58] and from SAL 12/20: [20:18:00] 20:16 ottomata: copied js/ and css/ dirs from nickel:/srv/org/wikimedia/ganglia-web-3.5.1 to nickel:/srv/org/wikimedia/ganglia-web-3.5.4 [20:18:02] ^ paravoid [20:18:47] yep [20:20:30] New review: Spage; "The ordering is random, but so is the rest of the file :o)" [operations/mediawiki-config] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/46973 [20:20:41] New patchset: Mark Bergsma; "Quick and Really Dirty connection pooling for Varnish connections" [operations/software] (master) - https://gerrit.wikimedia.org/r/46976 [20:20:56] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46973 [20:22:52] cloudfiles idiots use a FIFO queue for connection pooling [20:23:05] whereas a LIFO queue makes a whole lot more sense [20:26:24] Just want to note I'm working on an E3 deployment. We have permission to use E2's window too. [20:28:07] mark: let me guess, everything times out instead of some stuff or deadlocks? [20:28:58] well there's no point in waiting with a connection that is idle [20:29:10] it makes a lot of sense to use the most recent one isn't it, highst probability it hasn't timed out [20:30:25] this is really becoming a filthy "get the job done" script hehe [20:30:43] where I had to rewrite or override half of the cloudfiles code i'm using [20:31:06] New patchset: Faidon; "Switch Ganglia to a 3.5.4 + security patches root" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46984 [20:31:21] mark: where is the queue code? [20:31:23] paravoid: so if you don't want it to use varnish, don't pass "-v" at the end [20:31:33] ok [20:31:34] thanks :) [20:31:42] gerrit is sssslllow today [20:31:51] Aaron|home: the varnish queue code is in varnish_object_stream_prepare [20:32:10] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46984 [20:32:20] mark: I mean the connection pool? [20:32:25] Aaron|home: or do you mean cloudfiles? bottom of connection.py [20:32:55] * Aaron|home looks at https://github.com/rackspace/python-cloudfiles/blob/master/cloudfiles/connection.py [20:33:26] should change that into 'LifoQueue' [20:33:55] there was something else wrong with it too, why I descended from it: [20:33:56] class WorkingConnectionPool(cloudfiles.connection.ConnectionPool): [20:34:08] couldn't login on non-rackspace cluster or something silly like that [20:35:09] varnish has only 8-9% hit rate for commons it seems [20:35:27] enwiki is double that [20:36:35] PROBLEM - Puppet freshness on virt1000 is CRITICAL: Puppet has not run in the last 10 hours [20:36:56] ironic [20:38:39] PROBLEM - Puppet freshness on virt1000 is CRITICAL: Puppet has not run in the last 10 hours [20:39:15] mark: so, wait [20:39:33] varnish has 8-9% of the content and yet it has 99% hit rates [20:39:52] should we even copy thumbs that are not in varnish? 
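On the FIFO-versus-LIFO point made earlier in this stretch (cloudfiles' ConnectionPool sits on top of Queue.Queue, and the suggestion is to "change that into 'LifoQueue'"): handing out the most recently returned connection first means long-idle connections sink to the bottom of the pool and quietly time out, instead of being dealt out round-robin. A standalone Python 2 sketch of that idea follows; it deliberately does not subclass the cloudfiles classes, and the `factory` argument is a hypothetical connection constructor.

```python
# Standalone LIFO connection pool sketch (Python 2, Queue.LifoQueue) illustrating
# the point made in the channel: the most recently used connection is the one
# least likely to have timed out, so hand that one out first.
# `factory` is a hypothetical callable that opens a new connection.
import Queue

class LifoConnectionPool(object):
    def __init__(self, factory, poolsize=10):
        self._factory = factory
        self._pool = Queue.LifoQueue(poolsize)

    def get(self):
        try:
            return self._pool.get(block=False)   # newest connection first
        except Queue.Empty:
            return self._factory()               # pool empty: open a fresh one

    def put(self, conn):
        try:
            self._pool.put(conn, block=False)
        except Queue.Full:
            pass                                 # surplus connections are simply dropped
```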
[20:40:02] if they're not used I mean... [20:40:03] for purging [20:40:17] what do you mean? [20:40:38] could be in other (squid) caches [20:40:40] but I agree [20:40:43] we should not copy all thumbs [20:40:59] this 1% that varnish doesn't have can be regenerated via imagescalers [20:41:04] maybe [20:42:13] so 85-90% of 50% (thumbs) of our storage is basically not used [20:42:31] okay, that's a bit of an overestimation [20:42:34] but still [20:42:54] yeah no surprise [20:43:34] ok [20:43:36] i've had enough [20:43:43] i'll be back briefly tomorrow [20:44:40] seriously, I think this whole waiting maybe be for nothing [20:44:55] we should seriously consider copying just what's in varnish [20:45:08] you have the tools now ;) [20:45:16] needs just a few tiny changes in swiftrepl [20:45:23] and do the rest on demand, either via imagescalers or via fetching from swift [20:46:02] I'm flying in like 9h :) [20:46:09] but, tuesday [20:46:12] sure [20:46:45] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 181 seconds [20:46:49] New patchset: Mark Bergsma; "Accept -v to enable fetching from Varnish" [operations/software] (master) - https://gerrit.wikimedia.org/r/47023 [20:46:55] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 186 seconds [20:46:56] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 185 seconds [20:47:18] ok [20:47:24] have a nice flight [20:47:40] !log deploying patched ganglia and removing .htaccess. Ganglia is public again! [20:47:42] Logged the message, Master [20:48:15] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 208 seconds [20:49:27] yay [20:49:38] yay indeed [20:50:02] paravoid: expiring object support would help, though when heavily used stuff falls out of varnish there would be minor cache stampedes [20:50:24] Aaron|home: you mean in general? swift's expiry support? [20:50:30] yes [20:50:33] swifts/cephs [20:50:36] yeah, we've discussed this before [20:50:45] I didn't find anything in radosgw about that [20:50:49] I don't think it's supported [20:50:58] but this is different though [20:52:04] mark's If-Cached support lets him basically fetch objects that are in Varnish cache but bail out if they're not [20:52:29] (and fail if they're stale according to ETag too) [20:52:48] http://ceph.com/docs/master/radosgw/swift/ [20:52:53] what I was saying is that we can just sync just *those* objects in ceph [20:52:53] not supported indeed [20:53:17] paravoid: I was talking about the fact that there is a crapload of unused stuff [20:53:25] not swiftrepl [20:53:30] okay [20:54:57] yeah I'm not terribly excited with expiring objects that way [20:55:17] j^: how hard would it be to make UW only set async=true if the file is > x bytes? [20:55:28] it's not any different than randomly purging e.g. 10% of our objects every now and then [20:56:24] well, not too much different [20:56:28] random is a lot faster ;) [20:56:47] paravoid: maybe a script could pick random files periodically and purge those not in varnish :p [20:56:58] yep [20:57:00] that could also work [21:01:17] !log deployed new version (0.3.22) of udp-filter on emery, locke and oxygen. This version accepts —field-delimiter flag. [21:02:50] hmm [21:02:50] no loggy? 
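A rough sketch of the "copy only what Varnish actually has" idea discussed above: probe the cache with the If-Cached-style conditional fetch and replicate only objects that come back from cache, leaving the rest to be regenerated on demand. The header name, the frontend URL and the status convention here are assumptions for illustration, not the real swiftrepl or VCL interface.

    import requests

    VARNISH_FRONTEND = 'http://upload-frontend.example.wmnet'  # hypothetical endpoint

    def is_in_varnish(path):
        # Ask the frontend whether it has the object cached; anything other
        # than a 200 is treated as a cache miss here.
        resp = requests.head(VARNISH_FRONTEND + path,
                             headers={'If-Cached': 'true'},
                             timeout=5)
        return resp.status_code == 200

    def maybe_replicate(path, copy_object):
        # copy_object stands in for whatever swiftrepl-like routine does the copy.
        if is_in_varnish(path):
            copy_object(path)
        # otherwise skip: the thumb can be rescaled on demand by the imagescalers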
[21:03:58] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours [21:04:18] PROBLEM - Varnish traffic logger on cp1044 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [21:05:31] New patchset: Andrew Bogott; "Rework the RT manifests so it can be installed in Labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47026 [21:07:09] PROBLEM - Varnish traffic logger on cp1044 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [21:07:36] <^demon> ottomata: Worked as of 20m ago :\ [21:08:10] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46942 [21:11:16] paravoid: can I close https://rt.wikimedia.org/Ticket/Display.html?id=4137 now that https://gerrit.wikimedia.org/r/#/c/46984/ got in? [21:12:51] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 26 seconds [21:12:58] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 3 seconds [21:12:59] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 1 seconds [21:13:27] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [21:13:33] andre__: just did, thanks for the pointer! [21:13:49] andre__: are you coming to fosdem? [21:13:57] paravoid, cool. I'll close the Bugzilla ticket then. [21:14:00] paravoid, yes, tomorrow [21:14:03] RECOVERY - Varnish traffic logger on cp1044 is OK: PROCS OK: 3 processes with command name varnishncsa [21:14:07] cool [21:14:11] see you there then :) [21:14:18] RECOVERY - Varnish traffic logger on cp1044 is OK: PROCS OK: 3 processes with command name varnishncsa [21:14:40] paravoid, great! You can find me outside tomorrow late evening, having a smoke, probably ;) [21:14:58] we'll both be otherwise busy I guess [21:15:04] but we'll find some time :) [21:17:16] !log deployed version 0.3.22 of udp-filter on emery, locke and oxygen.  This version accepts —field-delimiter flag. [21:17:45] grr [21:18:01] remove the -- [21:18:35] oh hm [21:18:49] !log deployed version 0.3.22 of udp-filter on emery, locke and oxygen.  This version accepts -F - -field-delimiter flag. [21:19:15] !log deployed version 0.3.22 of udp-filter on emery, locke and oxygen.  This version accepts the field-delimiter flag. [21:28:48] heya paravoid, you still around? [21:29:04] i'm trying to deploy squid confs for the first time, (with peter's help) and i'm getting somethign weird [21:36:30] PROBLEM - Backend Squid HTTP on sq63 is CRITICAL: Connection refused [21:36:49] PROBLEM - Backend Squid HTTP on amssq58 is CRITICAL: Connection refused [21:37:29] PROBLEM - Backend Squid HTTP on sq75 is CRITICAL: Connection refused [21:37:44] !log deloyed tab separator log format changes to squids, merged the corresponding puppet changes to varnish and nginx [21:37:45] Logged the message, Master [21:37:49] PROBLEM - Backend Squid HTTP on amssq34 is CRITICAL: Connection refused [21:37:57] uh oh, notpeter [21:38:01] revert [21:38:01] are those squid alerts my fault? [21:38:05] probably [21:38:06] ok [21:38:23] PROBLEM - Backend Squid HTTP on amssq58 is CRITICAL: Connection refused [21:38:59] PROBLEM - Varnish traffic logger on cp1036 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [21:39:13] ok, reverting... [21:40:19] ah! varnish too? 
[21:40:20] PROBLEM - Backend Squid HTTP on amssq34 is CRITICAL: Connection refused [21:40:20] PROBLEM - Backend Squid HTTP on sq63 is CRITICAL: Connection refused [21:40:29] PROBLEM - Backend Squid HTTP on sq75 is CRITICAL: Connection refused [21:40:32] no, don't worry about the varnish thing. that's just hte logger [21:40:35] !log reverted tab separartor change to squid [21:40:35] Logged the message, Master [21:40:51] i took out the tabs and made it just dash instead of X-CS [21:41:12] can we get a google hangout? [21:41:20] yup [21:41:59] PROBLEM - Varnish traffic logger on cp1036 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [21:42:30] PROBLEM - Backend Squid HTTP on sq57 is CRITICAL: Connection refused [21:42:30] PROBLEM - Backend Squid HTTP on amssq51 is CRITICAL: Connection refused [21:42:30] PROBLEM - Backend Squid HTTP on amssq39 is CRITICAL: Connection refused [21:42:49] PROBLEM - Backend Squid HTTP on amssq46 is CRITICAL: Connection refused [21:42:59] PROBLEM - Backend Squid HTTP on amssq42 is CRITICAL: Connection refused [21:43:17] ahhhhhhhhh [21:43:19] PROBLEM - Backend Squid HTTP on cp1012 is CRITICAL: Connection refused [21:43:20] PROBLEM - Backend Squid HTTP on cp1013 is CRITICAL: Connection refused [21:43:20] PROBLEM - Backend Squid HTTP on cp1009 is CRITICAL: Connection refused [21:43:24] notpeter: https://plus.google.com/hangouts/_/2da993a9acec7936399e9d78d13bf7ec0c0afdbc [21:44:05] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [21:44:05] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [21:44:05] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours [21:44:05] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [21:44:05] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours [21:44:06] PROBLEM - Backend Squid HTTP on amssq51 is CRITICAL: Connection refused [21:44:14] PROBLEM - Backend Squid HTTP on amssq46 is CRITICAL: Connection refused [21:44:23] PROBLEM - Backend Squid HTTP on cp1013 is CRITICAL: Connection refused [21:45:18] RECOVERY - Backend Squid HTTP on amssq58 is OK: HTTP OK HTTP/1.0 200 OK - 660 bytes in 0.236 seconds [21:45:35] PROBLEM - Backend Squid HTTP on sq57 is CRITICAL: Connection refused [21:45:36] PROBLEM - Backend Squid HTTP on cp1009 is CRITICAL: Connection refused [21:45:36] PROBLEM - Backend Squid HTTP on cp1012 is CRITICAL: Connection refused [21:45:36] RECOVERY - Backend Squid HTTP on amssq34 is OK: HTTP OK HTTP/1.0 200 OK - 1414 bytes in 0.526 seconds [21:45:44] PROBLEM - Backend Squid HTTP on amssq42 is CRITICAL: Connection refused [21:45:49] RECOVERY - Backend Squid HTTP on amssq58 is OK: HTTP OK: Status line output matched 200 - 660 bytes in 0.184 second response time [21:45:50] RECOVERY - Backend Squid HTTP on amssq34 is OK: HTTP OK: HTTP/1.0 200 OK - 1423 bytes in 0.185 second response time [21:46:02] PROBLEM - Backend Squid HTTP on amssq39 is CRITICAL: Connection refused [21:46:02] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours [21:46:29] RECOVERY - Backend Squid HTTP on amssq51 is OK: HTTP OK: Status line output matched 200 - 660 bytes in 0.182 second response time [21:47:29] RECOVERY - Backend Squid HTTP on sq57 is OK: HTTP OK: Status line output matched 200 - 495 bytes in 0.054 second response time [21:47:30] RECOVERY - Backend Squid HTTP on amssq39 is OK: HTTP OK: 
HTTP/1.0 200 OK - 1414 bytes in 0.475 second response time [21:47:32] RECOVERY - Backend Squid HTTP on amssq42 is OK: HTTP OK HTTP/1.0 200 OK - 1414 bytes in 0.528 seconds [21:47:41] RECOVERY - Backend Squid HTTP on amssq51 is OK: HTTP OK HTTP/1.0 200 OK - 660 bytes in 0.238 seconds [21:47:49] RECOVERY - Backend Squid HTTP on amssq46 is OK: HTTP OK: HTTP/1.0 200 OK - 1414 bytes in 0.470 second response time [21:47:51] RECOVERY - Backend Squid HTTP on amssq39 is OK: HTTP OK HTTP/1.0 200 OK - 1423 bytes in 0.237 seconds [21:47:51] RECOVERY - Backend Squid HTTP on amssq46 is OK: HTTP OK HTTP/1.0 200 OK - 1422 bytes in 0.238 seconds [21:47:59] RECOVERY - Backend Squid HTTP on amssq42 is OK: HTTP OK: HTTP/1.0 200 OK - 1423 bytes in 0.185 second response time [21:49:02] RECOVERY - Backend Squid HTTP on sq57 is OK: HTTP OK HTTP/1.0 200 OK - 495 bytes in 0.004 seconds [21:50:21] wooo! [21:50:24] no pages :) [21:50:53] lmk if there's anything the rest of the team can do to help [21:51:07] we're totes down to jump in, even if there's just grunt work [21:51:19] nope, it's recovering on its own [21:51:22] and never actually got bad [21:51:23] awesome [21:51:33] always happy to hear "there *is* no work" [21:51:35] ;) [21:56:14] RECOVERY - Backend Squid HTTP on sq75 is OK: HTTP OK HTTP/1.0 200 OK - 1249 bytes in 0.087 seconds [21:56:29] RECOVERY - Backend Squid HTTP on sq75 is OK: HTTP OK: HTTP/1.0 200 OK - 1258 bytes in 0.054 second response time [21:56:30] RECOVERY - Backend Squid HTTP on sq63 is OK: HTTP OK: HTTP/1.0 200 OK - 1249 bytes in 0.107 second response time [21:56:38] hey [21:56:43] what's going on? [21:56:45] * paravoid is packing [21:56:55] things are ok now, from what I understand... [21:57:05] squid ./deploy all will bork if there is a syntax error [21:57:14] but still push the change to puppet volatile [21:57:27] which causes puppet to do some stuff on the squids before you are ready [21:57:36] fix the script please [21:57:38] yeah.... [21:57:42] needs better ordering [21:57:44] aaand, notpeter, something about something cache rebuilding something? 
[21:57:53] RECOVERY - Backend Squid HTTP on sq63 is OK: HTTP OK HTTP/1.0 200 OK - 1258 bytes in 0.004 seconds [21:57:53] RECOVERY - Backend Squid HTTP on cp1012 is OK: HTTP OK HTTP/1.0 200 OK - 1249 bytes in 0.054 seconds [21:58:00] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 195 seconds [21:58:11] RECOVERY - Backend Squid HTTP on cp1009 is OK: HTTP OK HTTP/1.0 200 OK - 1249 bytes in 0.055 seconds [21:58:19] RECOVERY - Backend Squid HTTP on cp1013 is OK: HTTP OK: HTTP/1.0 200 OK - 1249 bytes in 0.002 second response time [21:58:20] RECOVERY - Backend Squid HTTP on cp1012 is OK: HTTP OK: HTTP/1.0 200 OK - 1258 bytes in 0.001 second response time [21:58:20] RECOVERY - Backend Squid HTTP on cp1009 is OK: HTTP OK: HTTP/1.0 200 OK - 1257 bytes in 0.002 second response time [21:58:21] RECOVERY - Backend Squid HTTP on cp1013 is OK: HTTP OK HTTP/1.0 200 OK - 1257 bytes in 0.053 seconds [21:58:22] so, the squid service is subscribed to the file in the volatile repo [21:58:30] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 204 seconds [21:58:30] which seems like a poor choice, as well [21:58:47] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 210 seconds [21:58:48] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 210 seconds [21:58:48] because it increases risk of squid restarting when you don't want it to [21:58:54] New patchset: Mattflaschen; "Add GuidedTour to extension-list" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47028 [22:00:37] we ready to try again? [22:00:41] yep! [22:02:57] New review: Spage; "Never done it before, but this looks like the rest of the lines." [operations/mediawiki-config] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/47028 [22:03:13] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47028 [22:04:01] New patchset: Jgreen; "switching fundraising dumps to --single-transaction" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47029 [22:04:02] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 15.6643542857 (gt 8.0) [22:04:34] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [22:04:58] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47029 [22:08:50] RECOVERY - Varnish traffic logger on cp1036 is OK: PROCS OK: 3 processes with command name varnishncsa [22:08:53] RECOVERY - Varnish traffic logger on cp1036 is OK: PROCS OK: 3 processes with command name varnishncsa [22:12:23] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours [22:17:01] New patchset: Pyoungmeister; "remove squid subscribe to config file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47030 [22:18:34] PROBLEM - Varnish traffic logger on cp1043 is CRITICAL: PROCS CRITICAL: 1 process with command name varnishncsa [22:18:37] New patchset: Ottomata; "Saving new tab separated output into sampled-1000.tab.log" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47031 [22:18:59] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47031 [22:19:56] Just started the scap for E3's deploy. 
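On the deploy-script ordering problem described above (a syntax error should stop the push to the puppet volatile repo, not be discovered after puppet has already shipped the file to the caches), a sketch of the safer order could look like the following. squid -k parse is squid's own config syntax check; push_to_volatile is a hypothetical stand-in for whatever the real ./deploy script does at that step.

    import subprocess
    import sys

    def config_is_valid(conf_path):
        # "squid -k parse -f <file>" exits non-zero on syntax errors.
        return subprocess.run(['squid', '-k', 'parse', '-f', conf_path]).returncode == 0

    def push_to_volatile(conf_path):
        # Stand-in for syncing the generated config into the volatile repo.
        print('would push %s to the volatile repo' % conf_path)

    def deploy(conf_path):
        if not config_is_valid(conf_path):
            sys.exit('refusing to deploy: %s has syntax errors' % conf_path)
        push_to_volatile(conf_path)
        # Restarting or reloading squid would then be an explicit, separate
        # step rather than a puppet subscribe on the config file.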
[22:20:31] !log deployed tab delimiter log format change on squids, varnishes and nginxes [22:20:32] Logged the message, Master [22:20:50] !restarted udp2log instances with new filters that use tab as separator [22:21:44] PROBLEM - Varnish traffic logger on cp1043 is CRITICAL: PROCS CRITICAL: 1 process with command name varnishncsa [22:24:28] hm... [22:24:34] RECOVERY - Varnish traffic logger on cp1043 is OK: PROCS OK: 3 processes with command name varnishncsa [22:24:35] restarted varnishncsa there [22:24:40] nice that's right! behave! [22:25:17] * Damianz gives ottomata a cookie [22:25:31] thanks, thought it was just me and the machines... [22:25:54] RECOVERY - Varnish traffic logger on cp1043 is OK: PROCS OK: 3 processes with command name varnishncsa [22:26:35] New patchset: Andrew Bogott; "Assume that irc messages are utf8." [operations/debs/adminbot] (master) - https://gerrit.wikimedia.org/r/47032 [22:26:53] !log mflaschen Started syncing Wikimedia installation... : Deploy E3Experiments, GettingStarted, GuidedTour (new), MoodBar, and WikiEditor [22:26:54] Logged the message, Master [22:27:31] are the bots talking to each other? [22:27:53] <^demon> logmsgbot says stuff to morebots. [22:28:02] hahah, lovr it [22:28:04] love it [22:28:37] I did a scap (from fenari), and I'm getting: [22:28:40] "mflaschen@mw28's password:" [22:28:44] <^demon> Bahhhh. [22:29:04] <^demon> superm401: Did you forward your agent to fenari when ssh'ing? [22:29:04] superm401 that can happen. Let me try [22:29:57] I can ssh without prompting. ^demon, superm401 also had problems fetching from gerrit on fenari. [22:30:21] <^demon> Sounds like you might've not forwarded your agent to fenari. [22:30:46] superm401 can you e.g. ssh to mw27 from fenari (in another terminal, obviously) ? [22:31:32] ^demon, no, I'm connected as "ssh -A fenari.wikimedia.org -v -v -v" [22:31:42] <^demon> :\ [22:31:46] spagewmf is right that I had problems earlier with the fetch. [22:32:01] Should I be specifying the username explicitly. [22:32:20] My local username, prod ssh username (mflaschen), and Gerrit are all different (mattflaschen), unfortunately. [22:32:32] <^demon> Ah, the gerrit username is going to be annoying then. [22:32:40] Does scap use that? [22:32:44] <^demon> No. [22:32:49] <^demon> But it'll make the fetching difficult. [22:33:08] <^demon> s/difficult/have to manually specify the remote/ [22:33:13] Yeah [22:33:27] Alright, I'll kill the scap and let spagewmf do it this time. [22:33:58] spagewmf, killed. Go for it. [22:34:32] OK. [22:35:24] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused [22:35:43] New review: Asher; "It would be better to continue subscribing to the conf but ensure the resulting action on file chang..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/47030 [22:35:57] alright I'm running the scap job. [22:36:51] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused [22:37:32] !log spage Started syncing Wikimedia installation... : take 2: Deploy E3Experiments, GettingStarted, GuidedTour (new), MoodBar, and WikiEditor [22:37:33] Logged the message, Master [22:37:34] PROBLEM - LDAP on virt0 is CRITICAL: Connection refused [22:37:53] PROBLEM - LDAPS on virt0 is CRITICAL: Connection refused [22:37:54] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 2.75354467626 [22:39:15] PROBLEM - LDAP on virt0 is CRITICAL: Connection refused [22:39:19] scap working for me, a few timeouts (srv266, srv278, mw1041). 
[22:40:00] PROBLEM - LDAPS on virt0 is CRITICAL: Connection refused [22:40:21] scap seems much faster! Did someone change it? [22:44:13] !log spage Finished syncing Wikimedia installation... : take 2: Deploy E3Experiments, GettingStarted, GuidedTour (new), MoodBar, and WikiEditor [22:44:14] Logged the message, Master [22:48:32] spagewmf: I made it network-aware [22:49:42] Fun times, labsconsole is down: http://paste.marktraceur.info/23 [22:50:16] is that a rare thing? [22:50:26] looks like it did about 3.2 GBps that time [22:50:33] It is for me... [22:50:34] well, Gbps rather [22:50:59] ganglia shows a spike of ~400 mbytes/s [22:51:31] New patchset: Cmjohnson; "adding db1051-1060 to dhcpd" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47034 [22:51:32] timstarling, Well thanks! scap finished in 10 minutes. \o/ [22:51:55] Holy eff. [23:01:15] New patchset: Ottomata; "Reenabling sqstat." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47035 [23:01:27] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47035 [23:03:16] Change merged: Cmjohnson; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47034 [23:03:29] TimStarling: speaking of spikes, this is the DB who run the worst of those special pages updates http://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&c=MySQL+pmtpa&h=db60.pmtpa.wmnet&tab=m&vn=&mc=2&z=medium&metric_group=ALLGROUPS [23:03:39] is that bad? [23:04:33] on a server that is not used for anything else, it is fine [23:10:05] does 5-10 % wait CPU mean heavily busy disk or is it too unreliable a piece of data [23:10:06] hm, 10-20 even [23:10:07] wait CPU is not really a kind of CPU, it's a complex thing to interpret [23:10:07] http://ganglia.wikimedia.org/latest/graph.php?h=db60.pmtpa.wmnet&m=cpu_report&r=week&s=by%20name&hc=4&mc=2&st=1359673416&g=cpu_report&z=medium&c=MySQL%20pmtpa [23:10:08] you see how when there was a user CPU spike, the top of the wait CPU seemed to stay level, rather than rising up as if it were stacked on top of the user CPU? [23:10:08] that's because there is I/O activity in that entire ~15% regardless of what the CPU is doing [23:10:09] I see [23:10:09] is that because the CPU was waiting for data and starts working only after a delay... or what? [23:10:42] wait CPU is any time when the CPU is idle and there is an I/O operation pending [23:10:43] PROBLEM - SSH on virt0 is CRITICAL: Connection refused [23:10:53] PROBLEM - Puppetmaster HTTPS on virt0 is CRITICAL: Connection refused [23:11:00] New patchset: Stefan.petrea; "Separtor change spaces => tabs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47036 [23:11:24] if the system had a single thread doing some I/O task, then that would be a reasonable metric of the time that thread spent waiting for the disk [23:12:02] http://etherpad.wmflabs.org/ is down. 
[23:12:03] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47036 [23:12:20] but mysql has lots of threads doing lots of different things [23:13:00] PROBLEM - Puppetmaster HTTPS on virt0 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:13:09] PROBLEM - SSH on virt0 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:13:10] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [23:13:20] PROBLEM - Host virt0 is DOWN: PING CRITICAL - Packet loss = 100% [23:14:03] disk utilisation (as seen with iostat -xd) is a related metric, it doesn't suffer from the stacking problem [23:14:16] !log rebooting virt0 in a fit of optimism and/or desperation [23:14:17] Logged the message, Master [23:14:19] I made a ganglia plugin to measure it, it's a pity it's not installed anywhere anymore [23:14:40] RECOVERY - SSH on virt0 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [23:14:47] I was just wondering why I didn't see it [23:14:49] RECOVERY - SSH on virt0 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [23:14:50] RECOVERY - Host virt0 is UP: PING OK - Packet loss = 0%, RTA = 26.59 ms [23:14:57] I thought it was on some other group only [23:14:58] but even that is a pretty poor measure of the actual I/O capacity of a server since throughput can continue to increase after 100% utilisation is reached [23:15:10] RECOVERY - LDAP on virt0 is OK: TCP OK - 0.027 second response time on port 389 [23:15:11] RECOVERY - LDAPS on virt0 is OK: TCP OK - 0.027 second response time on port 636 [23:15:15] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.004 second response time on port 11000 [23:15:21] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.027 second response time on port 11000 [23:15:30] RECOVERY - Puppetmaster HTTPS on virt0 is OK: HTTP OK: Status line output matched 400 - 336 bytes in 1.045 second response time [23:15:33] RECOVERY - LDAP on virt0 is OK: TCP OK - 0.010 second response time on port 389 [23:16:18] RECOVERY - LDAPS on virt0 is OK: TCP OK - 0.018 second response time on port 636 [23:16:18] RECOVERY - Puppetmaster HTTPS on virt0 is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.058 seconds [23:17:30] because a DB server has multiple spindles, so it can handle having more than one I/O request being active at a time [23:20:00] PROBLEM - Varnish traffic logger on cp1044 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [23:21:33] PROBLEM - Varnish traffic logger on cp1044 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [23:27:26] PROBLEM - Puppet freshness on dataset1001 is CRITICAL: Puppet has not run in the last 10 hours [23:27:53] RECOVERY - Varnish traffic logger on cp1044 is OK: PROCS OK: 3 processes with command name varnishncsa [23:28:00] RECOVERY - Varnish traffic logger on cp1044 is OK: PROCS OK: 3 processes with command name varnishncsa [23:33:20] PROBLEM - Auth DNS on labs-ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [23:38:51] RECOVERY - Auth DNS on labs-ns1.wikimedia.org is OK: DNS OK: 6.261 seconds response time. nagiostest.beta.wmflabs.org returns 208.80.153.219 [23:50:46] PROBLEM - Auth DNS on labs-ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [23:50:47] RECOVERY - Auth DNS on labs-ns1.wikimedia.org is OK: DNS OK: 2.715 seconds response time. nagiostest.beta.wmflabs.org returns
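The wait-CPU and iostat discussion above boils down to two small calculations: iowait is time the CPU sat idle while an I/O request was outstanding (read from /proc/stat), and disk utilisation, the iostat -xd %util column that the old ganglia plugin measured, is the share of an interval during which the device had at least one request in flight (read from /proc/diskstats). This is an illustrative sketch of those counters, not the original plugin; the device name and sampling intervals are examples.

    import time

    def cpu_deltas(interval=5):
        # Aggregate "cpu" line of /proc/stat: user nice system idle iowait irq softirq ...
        def snapshot():
            with open('/proc/stat') as f:
                return [int(v) for v in f.readline().split()[1:]]
        before = snapshot()
        time.sleep(interval)
        after = snapshot()
        return [a - b for a, b in zip(after, before)]

    def iowait_percent(interval=5):
        deltas = cpu_deltas(interval)
        return 100.0 * deltas[4] / sum(deltas)      # index 4 = iowait jiffies

    def disk_utilisation(device='sda', interval=10):
        # parts[12] of a /proc/diskstats line is "time spent doing I/Os" in ms.
        def io_ticks_ms():
            with open('/proc/diskstats') as f:
                for line in f:
                    parts = line.split()
                    if parts[2] == device:
                        return int(parts[12])
            raise ValueError('device %s not found' % device)
        before = io_ticks_ms()
        time.sleep(interval)
        after = io_ticks_ms()
        return 100.0 * (after - before) / (interval * 1000.0)

As noted in the conversation, %util saturates at 100% even though throughput on a multi-spindle array can keep rising past that point, because the counter only records whether any request was outstanding, not how many.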