[00:05:29] Ryan_Lane: oh yeah, see netvibes.com, that was pretty nice, got reminded by http://web.appstorm.net/roundups/top-10-web-based-rss-readers/
[00:06:50] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 3 processes with command name varnishncsa
[00:08:40] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Thu Mar 14 00:08:28 UTC 2013
[00:09:16] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours
[00:10:05] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Thu Mar 14 00:10:00 UTC 2013
[00:10:16] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours
[00:11:06] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Thu Mar 14 00:11:02 UTC 2013
[00:11:15] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours
[00:12:06] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Thu Mar 14 00:11:59 UTC 2013
[00:12:19] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours
[00:12:45] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Thu Mar 14 00:12:44 UTC 2013
[00:13:15] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours
[00:21:54] New patchset: Ryan Lane; "Fix novaconfig include for labs puppetmaster" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53708
[00:22:30] New patchset: Krinkle; "Integration: Move to wikimedia.org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53513
[00:23:26] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53708
[00:23:44] New review: Krinkle; "@Hashar: Moved update of integration site and doc_index.html to I09225307686dcd07" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53513
[00:25:16] RECOVERY - Puppet freshness on virt0 is OK: puppet ran at Thu Mar 14 00:25:04 UTC 2013
[00:26:04] New patchset: Krinkle; "Integration: Move to wikimedia.org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53513
[00:27:25] RECOVERY - Puppet freshness on virt1000 is OK: puppet ran at Thu Mar 14 00:27:18 UTC 2013
[00:37:27] New patchset: Ryan Lane; "Deny another beam port and amanda by default" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53709
[00:37:35] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 187 seconds
[00:37:35] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 187 seconds
[00:38:23] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53709
[00:41:46] New patchset: Ryan Lane; "Refer to inetd for amanda service" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53710
[00:42:36] PROBLEM - Varnish traffic logger on cp1028 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[00:42:39] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53710
[00:48:25] PROBLEM - Puppet freshness on europium is CRITICAL: Puppet has not run in the last 10 hours
[00:52:47] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[00:53:07] !log reedy synchronized wmf-config/
[00:53:13] Logged the message, Master
[00:56:47] PROBLEM - Varnish traffic logger on cp1033 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[00:59:25] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Thu Mar 14 00:59:18 UTC 2013
[00:59:54] !log reedy synchronized php-1.21wmf11/extensions/Scribunto
[01:00:00] Logged the message, Master
[01:00:15] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours
[01:07:36] RECOVERY - Varnish traffic logger on cp1028 is OK: PROCS OK: 3 processes with command name varnishncsa
[01:08:43] !log reedy synchronized php-1.21wmf11/cache/l10n/
[01:10:32] Logged the message, Master
[01:14:25] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: Puppet has not run in the last 10 hours
[01:14:25] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: Puppet has not run in the last 10 hours
[01:14:26] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: Puppet has not run in the last 10 hours
[01:14:26] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours
[01:14:26] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours
[01:16:48] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 3 processes with command name varnishncsa
[01:18:33] New patchset: Ryan Lane; "Include the icinga system user in nrpe" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53711
[01:21:18] New patchset: Hashar; "contint: xdebug + code coverage directory" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53531
[01:21:18] New patchset: Hashar; "contint: move apache proxy configuration to module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53425
[01:21:18] New patchset: Hashar; "contint: move tmpfs disk to the module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53424
[01:21:19] New patchset: Hashar; "contint: get rid of Sun JDK" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53423
[01:21:47] RECOVERY - Varnish traffic logger on cp1033 is OK: PROCS OK: 3 processes with command name varnishncsa
[01:21:54] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53711
[01:22:18] paravoid: I have rebased my contint module to get rid of the only change you have rejected ( https://gerrit.wikimedia.org/r/#/c/53422/ which made contint depend on geoip )
[01:22:19] ;)
[01:22:50] I'll have a look tomorrow
[01:22:51] now sleep
[01:22:59] thanks :-]
[01:23:08] will patch the testswarm one on top of that
[01:23:27] PROBLEM - Varnish traffic logger on cp1026 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[01:23:46] hashar: Hi again, still adjusting time rotation?
[01:24:04] Krinkle: yeah. Went to sleep at 7pm completely wasted
[01:24:19] and had some meal at 1am :/
[01:24:40] I am going to work overnight and have a nap in the morning :-]
[01:24:47] New review: Hashar; "I have rebased the other contint patches to get rid of this dependency." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53422
[01:25:19] Does anyone know what version of Redis we run in production?
[01:27:58] 2.6 something
[01:28:38] Thanks, AaronSchulz.
[01:30:25] New patchset: Hashar; "migrate testswarm module to contint module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53712
[01:30:39] 2.6.3-wmf1
[01:31:22] New review: Krinkle; "(1 comment)" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/53712
[01:32:01] New review: Hashar; "Faidon, here is the patch we talked about this afternoon. I guess we will reinstate the testswarm m..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53712
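For reference, the Redis version asked about above can be read straight off a server; a minimal sketch, assuming shell access to a host running the production Redis (the host, package name, and shown output are illustrative, not confirmed by this log):

    $ redis-cli INFO server | grep redis_version   # INFO is a standard Redis command
    redis_version:2.6.3
    $ dpkg -l 'redis*'                             # hypothetical package query; a -wmf1 suffix would come from the locally built package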
[01:32:16] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Thu Mar 14 01:32:06 UTC 2013
[01:32:17] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours
[01:32:32] New review: Hashar; "(1 comment)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53712
[01:33:45] PROBLEM - Varnish traffic logger on cp1033 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[01:33:51] New patchset: Hashar; "migrate testswarm module to contint module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53712
[01:34:02] New patchset: Krinkle; "contint: xdebug + code coverage directory" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53531
[01:34:25] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Thu Mar 14 01:34:24 UTC 2013
[01:35:17] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours
[01:35:22] !log running puppet on celsus to investigate puppet freshness reports from icinga - it started parsoid, but also celsus is a Wikimedia DECOMMISSIONED server (base::decommissioned).
[01:35:28] Logged the message, Master
[01:36:15] ACKNOWLEDGEMENT - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours daniel_zahn stop channel spam - celsus is a Wikimedia DECOMMISSIONED server (base::decommissioned).
[01:36:25] RECOVERY - Varnish traffic logger on cp1026 is OK: PROCS OK: 3 processes with command name varnishncsa
[01:40:47] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[01:48:35] PROBLEM - Varnish traffic logger on cp1028 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[01:49:45] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 3 processes with command name varnishncsa
[01:53:26] PROBLEM - Puppet freshness on kuo is CRITICAL: Puppet has not run in the last 10 hours
[01:54:46] RECOVERY - Varnish traffic logger on cp1033 is OK: PROCS OK: 3 processes with command name varnishncsa
[01:55:58] Change merged: Ryan Lane; [operations/debs/ircecho] (master) - https://gerrit.wikimedia.org/r/53360
[01:56:25] PROBLEM - Puppet freshness on virt1005 is CRITICAL: Puppet has not run in the last 10 hours
[01:56:28] Change merged: Ryan Lane; [operations/debs/ircecho] (master) - https://gerrit.wikimedia.org/r/53361
[01:56:42] Change merged: Ryan Lane; [operations/debs/ircecho] (master) - https://gerrit.wikimedia.org/r/53363
[01:57:28] New patchset: Hashar; "move geoip to a module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53714
[02:01:49] tfinc: you know from here, it *almost* looks like Jon is here at first glance
[02:02:10] AaronSchulz: its pretty compelling isn't it?
[02:02:29] i had to add the hat to complete it
[02:02:55] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Thu Mar 14 02:02:52 UTC 2013
[02:03:16] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours
[02:03:25] PROBLEM - Puppet freshness on hume is CRITICAL: Puppet has not run in the last 10 hours
[02:06:25] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Puppet has not run in the last 10 hours
[02:09:46] PROBLEM - Varnish traffic logger on cp1033 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[02:12:26] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours
[02:13:26] PROBLEM - MySQL Replication Heartbeat on db69 is CRITICAL: CRIT replication delay 204 seconds
[02:13:26] PROBLEM - MySQL Slave Delay on db69 is CRITICAL: CRIT replication delay 207 seconds
[02:13:35] RECOVERY - Varnish traffic logger on cp1028 is OK: PROCS OK: 3 processes with command name varnishncsa
[02:14:26] PROBLEM - Puppet freshness on tola is CRITICAL: Puppet has not run in the last 10 hours
[02:14:46] PROBLEM - Varnish traffic logger on cp1025 is CRITICAL: PROCS CRITICAL: 1 process with command name varnishncsa
[02:17:00] TimStarling: expect a huge redis queue patch sometime, just saying :)
[02:17:25] Change restored: Hashar; "(no reason)" [operations/debs/ircecho] (master) - https://gerrit.wikimedia.org/r/53371
[02:17:25] PROBLEM - Puppet freshness on capella is CRITICAL: Puppet has not run in the last 10 hours
[02:17:30] New patchset: Hashar; "Jenkins job validation (DO NOT SUBMIT)" [operations/debs/ircecho] (master) - https://gerrit.wikimedia.org/r/53371
[02:18:26] RECOVERY - MySQL Replication Heartbeat on db69 is OK: OK replication delay 0 seconds
[02:18:37] RECOVERY - MySQL Slave Delay on db69 is OK: OK replication delay 0 seconds
[02:18:58] Change abandoned: Hashar; "(no reason)" [operations/debs/ircecho] (master) - https://gerrit.wikimedia.org/r/53371
[02:20:50] !log jenkins: made Zuul block changes on operations/debs/ircecho whenever pep8/pyflakes fails.
[02:20:58] Logged the message, Master
[02:22:46] RECOVERY - Varnish traffic logger on cp1025 is OK: PROCS OK: 3 processes with command name varnishncsa
[02:25:20] !log LocalisationUpdate completed (1.21wmf11) at Thu Mar 14 02:25:19 UTC 2013
[02:25:26] Logged the message, Master
[02:27:26] PROBLEM - MySQL Slave Delay on db66 is CRITICAL: CRIT replication delay 188 seconds
[02:27:45] RECOVERY - Varnish traffic logger on cp1033 is OK: PROCS OK: 3 processes with command name varnishncsa
[02:27:45] PROBLEM - MySQL Replication Heartbeat on db66 is CRITICAL: CRIT replication delay 213 seconds
[02:28:27] PROBLEM - Varnish traffic logger on cp1034 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[02:29:33] New review: Hashar; "That may be working. We cant really test GeoIP in labs since virt0 does not have the GeoIP files und..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53714
[02:33:47] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Thu Mar 14 02:33:42 UTC 2013
[02:34:15] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours
[02:39:27] RECOVERY - Varnish traffic logger on cp1034 is OK: PROCS OK: 3 processes with command name varnishncsa
[03:04:46] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Thu Mar 14 03:04:41 UTC 2013
[03:05:16] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours
[03:06:25] RECOVERY - MySQL Slave Delay on db66 is OK: OK replication delay 0 seconds
[03:06:47] RECOVERY - MySQL Replication Heartbeat on db66 is OK: OK replication delay 0 seconds
[03:14:47] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[03:25:48] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 3 processes with command name varnishncsa
[03:34:09] New patchset: Hashar; "migrate testswarm module to contint module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53712
[03:35:27] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Thu Mar 14 03:35:21 UTC 2013
[03:36:15] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours
[03:40:48] New review: Hashar; "I will take care of integrating this change in the puppet 'contint' module whenever it is merged in ..." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/53513
[03:45:36] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 189 seconds
[03:45:36] PROBLEM - MySQL Replication Heartbeat on db1020 is CRITICAL: CRIT replication delay 182 seconds
[03:45:45] PROBLEM - MySQL Slave Delay on db1020 is CRITICAL: CRIT replication delay 188 seconds
[03:45:58] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 195 seconds
[03:58:36] RECOVERY - MySQL Replication Heartbeat on db1020 is OK: OK replication delay 0 seconds
[03:58:46] RECOVERY - MySQL Slave Delay on db1020 is OK: OK replication delay 1 seconds
[04:00:06] New patchset: Krinkle; "Integration: Move to wikimedia.org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53513
[04:03:25] PROBLEM - Puppet freshness on colby is CRITICAL: Puppet has not run in the last 10 hours
[04:07:15] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Thu Mar 14 04:07:10 UTC 2013
[04:07:15] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours
[04:10:55] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 181 seconds
[04:11:36] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 188 seconds
[04:20:25] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[04:43:05] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Thu Mar 14 04:43:02 UTC 2013
[04:43:15] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours
[04:50:53] New patchset: Hashar; "(bug 44061) initial release" [operations/debs/python-voluptuous] (master) - https://gerrit.wikimedia.org/r/44408
[04:53:09] New patchset: Hashar; "(bug 44061) initial release" [operations/debs/python-voluptuous] (master) - https://gerrit.wikimedia.org/r/44408
[04:53:16] New review: Hashar; "Patchset 9 was a mistake" [operations/debs/python-voluptuous] (master) - https://gerrit.wikimedia.org/r/44408
[04:53:36] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds
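The python-voluptuous build attempts around this point would normally be reproduced locally with the stock Debian toolchain; a minimal sketch, assuming the repo carries a debian/ directory and the usual Gerrit SSH clone form (both are assumptions, not confirmed by the log):

    $ git clone ssh://gerrit.wikimedia.org:29418/operations/debs/python-voluptuous
    $ cd python-voluptuous
    $ debuild -us -uc   # unsigned build; the .deb lands in the parent directory on success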
[04:53:40] New review: Hashar; "PS10:" [operations/debs/python-voluptuous] (master) - https://gerrit.wikimedia.org/r/44408
[04:53:46] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds
[04:56:22] New review: Hashar; "I could not manage to build the package. I am giving up any attempt to build the package and will le..." [operations/debs/python-voluptuous] (master) - https://gerrit.wikimedia.org/r/44408
[05:13:35] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Thu Mar 14 05:13:29 UTC 2013
[05:14:17] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours
[05:24:07] New patchset: Krinkle; "Initial release" [operations/debs/python-voluptuous] (master) - https://gerrit.wikimedia.org/r/44408
[05:44:17] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Thu Mar 14 05:44:05 UTC 2013
[05:44:17] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours
[06:07:25] PROBLEM - Puppet freshness on search1019 is CRITICAL: Puppet has not run in the last 10 hours
[06:14:35] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Thu Mar 14 06:14:32 UTC 2013
[06:15:15] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours
[06:45:05] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Thu Mar 14 06:44:57 UTC 2013
[06:45:15] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours
[07:15:45] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Thu Mar 14 07:15:38 UTC 2013
[07:16:16] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours
[07:46:05] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Thu Mar 14 07:46:02 UTC 2013
[07:46:16] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours
[08:13:25] PROBLEM - Puppet freshness on amslvs1 is CRITICAL: Puppet has not run in the last 10 hours
[08:16:35] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Thu Mar 14 08:16:26 UTC 2013
[08:17:16] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours
[08:26:26] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 192 seconds
[08:26:55] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 204 seconds
[08:38:56] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Thu Mar 14 08:38:46 UTC 2013
[08:39:15] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours
[08:45:25] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours
[08:50:00] mark: apergos: I have noticed that niobium.wikimedia.org (eqiad bits cache) went to swap yesterday. Maybe because of all wikis switching to a new wmf branch
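A runaway logger like the one dealt with just below can be spotted from the shell; a minimal sketch using standard procps flags (only the process name is taken from the log):

    $ ps -C varnishncsa -o pid,rss,etime,args   # rss is resident memory in KB
    # roughly 9 GB resident would show up here as an rss around 9000000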
[08:53:07] but only that one
[09:03:38] !log restarted varnishncsa.vanadium on niobium, it was using 9gb memory or so
[09:03:45] Logged the message, Master
[09:07:41] \O/
[09:09:25] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Thu Mar 14 09:09:24 UTC 2013
[09:10:15] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours
[09:11:42] hrm
[09:11:48] memory leak in varnishncsa :/
[09:14:40] ori-l: it went up in just a few minutes
[09:14:48] probably has another cause
[09:15:02] hm
[09:15:36] anyways, apergos, hashar -- thanks
[09:16:03] sure
[09:16:17] yeah I found a few refs to memory leaks but it was a very sudden jump
[09:16:52] yeah, looking at http://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&h=niobium.wikimedia.org&m=cpu_report&s=descending&mc=2&g=mem_report&c=Bits+caches+eqiad
[09:17:20] it's a boa constrictor digesting an elephant
[09:20:56] PROBLEM - SSH on amslvs1 is CRITICAL: Server answer:
[09:21:55] RECOVERY - SSH on amslvs1 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[09:28:13] New patchset: Matthias Mullie; "Prepare AFTv5 config for deployment new features" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50744
[09:40:56] \O/
[09:41:06] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Thu Mar 14 09:40:55 UTC 2013
[09:41:15] New patchset: Hashar; "puppet now manage jenkins ssh authorized_keys" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53736
[09:41:18] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours
[10:11:35] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Thu Mar 14 10:11:31 UTC 2013
[10:12:15] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours
[10:13:25] PROBLEM - Puppet freshness on strontium is CRITICAL: Puppet has not run in the last 10 hours
[10:28:59] New review: Faidon; "I don't really see the need for all these classes like geoip::packages and geoip::packages::python (..." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/53714
[10:30:44] :-D
[10:30:55] paravoid: that's how you end up maintaining the geoip module hehe
[10:31:12] haha
[10:31:21] I said to get ottomata on board
[10:31:41] I might do it myself, but I thought I should give otto the benefit of review and fixes
[10:32:18] that is a great way to level us up
[10:32:44] my idea was to do the grunt work and let otto take care of it :-]
[10:33:26] the other contint changes have been rebased on production (i.e. they no longer depend on the change that has geoip)
[10:33:28] https://gerrit.wikimedia.org/r/#/q/status:open+project:operations/puppet+branch:production+topic:contintrefactor,n,z
[10:33:50] that is where I would love gerrit to support feature branches properly
[10:34:10] or maybe I should have created a branch :-]
[10:42:07] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Thu Mar 14 10:42:00 UTC 2013
[10:42:15] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours
[10:49:25] PROBLEM - Puppet freshness on europium is CRITICAL: Puppet has not run in the last 10 hours
[10:56:25] paravoid: this is what happens when you talk about puppet code review on a public channel!
[10:56:36] (i.e. i notice and then add you to my outstanding review requests :P)
[11:00:06] New review: Mark Bergsma; "So because generic-definitions.pp stuff lives outside a module, it shouldn't be used within modules,..." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/49710
[11:12:35] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Thu Mar 14 11:12:27 UTC 2013
[11:13:15] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours
[11:15:25] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: Puppet has not run in the last 10 hours
[11:15:25] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: Puppet has not run in the last 10 hours
[11:15:25] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: Puppet has not run in the last 10 hours
[11:15:25] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours
[11:15:25] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours
[11:25:09] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53423
[11:25:33] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53424
[11:26:08] New review: Faidon; "geoip module is coming soon" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/53422
[11:26:16] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53422
[11:26:37] \O/
[11:27:06] New review: Hashar; "thanks! :-]" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53422
[11:28:02] ah crap
[11:28:14] wasn't a reviewer for 53425
[11:28:17] and it's a dependency
[11:28:21] I have to review it now I guess :)
[11:28:45] oooops
[11:29:13] oh that one is a bit crazy :/
[11:30:04] New review: Faidon; "not terribly excited, but meh" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/53425
[11:30:19] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53425
[11:31:02] https://gerrit.wikimedia.org/r/#/c/53531/ says needs rebase or has dependency
[11:31:09] but the dep is merged
[11:31:09] wtf?
[11:31:23] New patchset: Hashar; "contint: xdebug + code coverage directory" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53531
[11:31:28] if in doubt
[11:31:30] press [Rebase]
[11:32:07] I know that but I was wondering what happened
[11:32:37] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53531
[11:32:45] New patchset: Hashar; "migrate testswarm module to contint module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53712
[11:32:55] and the last one gets rid of the testswarm module
[11:33:34] waiting for V+1
[11:33:50] yeah there is a bug in Zuul
[11:34:05] that makes it wait for all jobs currently running to complete before reporting
[11:34:12] that is supposedly fixed upstream
[11:34:14] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53712
[11:36:09] all done and merged in sockpuppet
[11:36:32] hashar: I don't mind much but contint might benefit from a different split
[11:36:38] rather than packages that install a bunch of unrelated packages
[11:36:57] do you mean the huge package class?
[11:37:02] yes
[11:37:12] hm
[11:37:14] like a class for testing mediawiki, another for testing mobile apps etc.
[11:37:18] yeah I am not sure what to do with it
[11:37:40] another for testing debian packages (package builder)
[11:37:44] i think something is funky with the permissions on fenari: 'error: unable to unlink old '.gitmodules' (Permission denied)' when git-pull; reedy owns all files
[11:38:12] ori-l: path?
[11:38:15] erm, this is common/php-1.21wmf11
[11:38:20] sorry, it's late
[11:38:27] isn't set-group-write supposed to fix that ? :-D
[11:39:00] ori-l: 3am30 ? :(
[11:39:05] ori-l: you should probably sleep a bit
[11:39:07] 4:39
[11:39:12] ah even worse
[11:39:12] the clock changed
[11:39:28] ori-l: I refuse to fix it for you, go to sleep
[11:39:28] on the plus side, if you wait a couple hours you can prepare breakfast for your family
[11:39:34] ;-)
[11:39:46] O^o
[11:39:56] blah. FINE :P
[11:40:16] good night / morning / whatever the hell it is
[11:40:23] good night ori!
[11:40:44] haha
[11:42:53] paravoid: success. puppet ran properly
[11:42:57] and zuul is still running
[11:42:58] ;-)
[11:43:06] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Thu Mar 14 11:42:58 UTC 2013
[11:43:15] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours
[11:43:34] apachectl configtest
[11:43:35] Syntax OK
[11:43:42] I am so happy
[11:44:48] paravoid: I got another one to publish jenkins ssh public key on servers that have the jenkins user : https://gerrit.wikimedia.org/r/#/c/53736/
[11:45:01] that is to let me set up jenkins slave boxes easily
[11:45:17] ideally the UID should be the same everywhere but I have no idea how to do that
[11:45:27] beside creating an admin user
[11:45:33] sec
[11:54:27] PROBLEM - Puppet freshness on kuo is CRITICAL: Puppet has not run in the last 10 hours
[11:57:25] PROBLEM - Puppet freshness on virt1005 is CRITICAL: Puppet has not run in the last 10 hours
[12:04:25] PROBLEM - Puppet freshness on hume is CRITICAL: Puppet has not run in the last 10 hours
[12:07:25] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Puppet has not run in the last 10 hours
[12:13:26] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours
[12:13:45] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Thu Mar 14 12:13:36 UTC 2013
[12:14:15] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours
[12:15:26] PROBLEM - Puppet freshness on tola is CRITICAL: Puppet has not run in the last 10 hours
[12:16:02] New patchset: Hashar; "remove contint web material" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53744
[12:16:51] !log gallium setting up /srv/ as a copy of integration/docroot.git
[12:16:57] Logged the message, Master
[12:18:25] PROBLEM - Puppet freshness on capella is CRITICAL: Puppet has not run in the last 10 hours
[12:21:47] mark: I am getting the files that serve https://integration.mediawiki.org/ out of puppet into a new independent repo. Would you mind merging/sockpuppeting https://gerrit.wikimedia.org/r/#/c/53744/ please ? :-]
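The unlink error ori-l hit above is a shared-checkout group-write problem; one possible manual fix, assuming deployers share a common group and that the set-group-write helper mentioned wraps something similar (the path is the one named later in this log):

    $ chmod -R g+w /home/wikipedia/common/php-1.21wmf11   # give the group write access recursively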
[12:41:05] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Thu Mar 14 12:41:04 UTC 2013
[12:41:25] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours
[13:11:35] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Thu Mar 14 13:11:31 UTC 2013
[13:12:25] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours
[13:12:30] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53744
[13:12:37] mark: solved by faidon :-]
[13:14:54] :)
[13:15:12] !log Fixed OSPF on cr2-knams:xe-0/0/0 - csw1-esams:e8/2 (earlier)
[13:15:19] Logged the message, Master
[13:15:31] !log Set OSPF cost to 10 on csw1-esams:ve7 to facilitate ip multipath
[13:15:37] Logged the message, Master
[13:21:58] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 17 seconds
[13:22:35] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds
[13:29:48] New patchset: Matthias Mullie; "Prepare AFTv5 config for deployment new features" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50744
[13:41:56] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Thu Mar 14 13:41:52 UTC 2013
[13:42:25] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours
[13:53:39] paravoid: wanna talk about the python-voluptuous packaging ? :-]
[13:56:35] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 182 seconds
[13:56:56] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 185 seconds
[14:04:25] PROBLEM - Puppet freshness on colby is CRITICAL: Puppet has not run in the last 10 hours
[14:12:35] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Thu Mar 14 14:12:25 UTC 2013
[14:13:05] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 30 seconds
[14:13:26] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours
[14:16:36] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 21 seconds
[14:21:25] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[14:28:26] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 8.74939815603 (gt 8.0)
[14:42:23] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 3.73463129496
[14:42:54] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Thu Mar 14 14:42:50 UTC 2013
[14:43:23] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours
[15:14:38] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Thu Mar 14 15:13:16 UTC 2013
[15:14:38] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours
[15:14:38] hello good people, I know you're busy but to whom should I speak about an RT ticket?
[15:14:39] milimetric: which ticket?
[15:14:39] #4730
[15:14:39] the one you filed yesterday for us
[15:14:56] it's linked to a deliverable for us
[15:15:08] ah this one, yeah
[15:15:26] lemme harass some other opsen about it and see if I can get some feedback
[15:15:50] cool, thanks Jeff
[15:16:01] sure
[15:22:18] New review: Daniel Kinzler; "After some discussion: I fear it won't help much, but it's better then nothing. So let's give it a try." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/52797
[15:36:54] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds
[15:37:04] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds
[15:42:32] robla: you've been paged recently?
[15:43:30] jeremyb_: yeah, I have. I was about to describe it, but then I realized that I should double check the assumption that I shouldn't be getting these before getting aggressive about getting unsubscribed
[15:43:56] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Thu Mar 14 15:43:50 UTC 2013
[15:44:24] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours
[15:44:26] why doesn't it let me know when you've reopened?
[15:44:33] mysterious software
[15:46:15] hashar: fyi, performance not performances. (for computers at least. performances is e.g. opera or shakespeare)
[15:46:58] unless you mean individual jobs. but that sounds weird
[15:51:37] jeremyb_: where ?
[15:54:29] hashar: 4733
[15:54:38] 42 ;)
[15:54:47] I am out for now, going to get my daughter :-]
[16:02:23] New patchset: Demon; "Don't announce comments on drafts to IRC" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53759
[16:03:09] New review: Demon; "Needs upstream change merged & deployed first: https://gerrit-review.googlesource.com/#/c/43490/" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/53759
[16:08:23] PROBLEM - Puppet freshness on search1019 is CRITICAL: Puppet has not run in the last 10 hours
[16:14:13] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Thu Mar 14 16:14:12 UTC 2013
[16:14:24] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours
[16:14:24] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Thu Mar 14 16:14:17 UTC 2013
[16:15:23] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours
[16:21:26] Change merged: Matthias Mullie; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50744
[16:21:44] seems to be having a problem opening pt.wikipedia pages with the math tag. a user reported to me that opening the page http://pt.wikipedia.org/wiki/Pi for example, gets the error Error: 1048 Column 'math_outputhash' cannot be null (10.64.16.23)
[16:22:08] this was reported from brasil, i'm in portugal, and i can read it perfectly
[16:22:38] New patchset: Mark Bergsma; "Make ganglia-monitor-aggregator not fail if an instance is running" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53761
[16:23:43] PROBLEM - Host mw1041 is DOWN: PING CRITICAL - Packet loss = 100%
[16:24:09] New patchset: Mark Bergsma; "Aggregators should not collect metrics themselves" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53762
[16:24:58] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53761
[16:25:24] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53762
[16:26:01] !log mlitn synchronized wmf-config 'Prepare AFTv5 config for deployment new features'
[16:26:08] Logged the message, Master
[16:26:46] quit
[16:31:09] New patchset: Mark Bergsma; "Invert" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53763
[16:31:13] Alchimista: Known bug
[16:31:36] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53763
[16:32:35] Reedy: kenair is discussing it on tech, thanks
[16:38:31] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Thu Mar 14 16:38:21 UTC 2013
[16:38:31] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours
[16:39:40] RECOVERY - Host mw1041 is UP: PING OK - Packet loss = 0%, RTA = 1.22 ms
[16:42:05] New patchset: Mark Bergsma; "Manage aggregator instances through upstart directly" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53765
[16:42:33] PROBLEM - Apache HTTP on mw1041 is CRITICAL: Connection refused
[16:43:32] New patchset: Mark Bergsma; "Manage aggregator instances through upstart directly" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53765
[16:44:28] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53765
[16:49:36] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 189 seconds
[16:49:40] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 191 seconds
[16:53:58] New patchset: Asher; "pulling db1009 for upgrade" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53766
[16:54:31] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 183 seconds
[16:54:41] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 185 seconds
[16:56:15] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53766
[16:56:27] New patchset: Asher; "db1009 -> mariadb, max_cons -> 1000" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53767
[16:56:30] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 195 seconds
[16:56:51] I'm attempting to git pull on fenari, but I'm getting "error: unable to unlink old '.gitmodules' (Permission denied)"
[16:56:59] anyone here who does have sufficient permissions to pull it in?
[16:57:41] !log adding mw1041 back to dsh groups
[16:57:47] Logged the message, Master
[16:58:14] mlitn: me, probably...
[16:58:21] binasher: is it okay to move labsdb1002 and 3?
[16:58:39] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53767
[16:58:39] cmjohnson1: yep, at any time
[16:58:49] cool... will do that now
[16:59:07] !log asher synchronized wmf-config/db-eqiad.php 'pulling db1009 for mariadb migration'
[16:59:14] Logged the message, Master
[16:59:19] mlitn: AFTv5 submodule updated, NavigationTiming (no idea whose that is) running --init now
[16:59:39] !log powering down labsdb1002 and labsdb1003
[16:59:45] Logged the message, Master
[17:06:10] PROBLEM - Host labsdb1003 is DOWN: PING CRITICAL - Packet loss = 100%
[17:08:45] New patchset: Jgreen; "puppetizing apache-fast-test qa script" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53773
[17:08:51] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Thu Mar 14 17:08:46 UTC 2013
[17:09:30] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours
[17:13:17] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53773
[17:13:37] Anyone know what's happened to wikibugs
[17:13:39] ?
[17:13:57] James_F: yes
[17:14:10] it got kickbanned
[17:14:40] [09:13:54 AM] ^demon sets mode +b wikibugs!*@*
[17:14:40] [09:14:10 AM] ^demon kicked wikibugs from the channel. (wikibugs)
[17:14:45] (CDT)
[17:14:57] New patchset: Aaron Schulz; "Set job queue aggregator to redis." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53774
[17:15:09] legoktm: OK. Interesting.
[17:15:22] <^demon> Join #mediawiki-feed :)
[17:15:33] ^demon: Oh, we've switched to that? Yay!
[17:15:41] Finally.
[17:15:45] <^demon> I was tired of bikeshedding.
[17:15:50] <^demon> So I was bold.
[17:16:08] * YuviPanda adds appropriate number of quotes to ^demon
[17:16:09] ^demon: Go you.
[17:16:10] Someone should confiscate your paintbrush
[17:16:11] <^demon> (Plus, we had someone actively complaining that "This obviously isn't the channel to ask questions about MediaWiki" due to the noise from bots.)
[17:16:31] ^demon: Is there a possibility that gerrit-wm will do the same?
[17:16:36] <^demon> Possibility.
[17:16:39] (clap clap)
[17:16:42] Yay!
[17:16:52] <^demon> gerrit-wm isn't nearly as noisy though, since we differentiate by project.
[17:16:58] <^demon> wikibugs was worse since it was a firehose.
[17:17:15] Oh, indeed.
[17:17:45] Though if we could get wikibugs to work like gerrit-wm (so we could get a feed of bugs in the -visualeditor channel like we have of commits) I'd be delirious. :-)
[17:17:47] New patchset: Mark Bergsma; "Revert "Manage aggregator instances through upstart directly"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53775
[17:18:28] ^demon: To be fair, I had tried to differentiate by project, but nobody would deploy wikibugs so the patch never got merged.
[17:18:42] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53774
[17:18:43] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53775
[17:19:13] <^demon> marktraceur: Because it's only half-puppetized.
[17:19:14] More accurately, nobody wanted to deploy wikibugs in its current state.
[17:19:21] Right, exactly.
[17:20:17] <^demon> wm-bot also does per-project notifs, so it can be introduced to any channels that want selective notification.
[17:22:32] notpeter: Hi, got a question about package installation. On host gallium (integration.wikimedia.org) I need package 'jsduck' (a ruby gem) soon. What is the procedure for this, do we have our own repo? Can we use gem install from puppet? Do we need to fork it?
[17:23:01] It is a package that generates documentation for javascript files, which we'll publish on doc.wikimedia.org (like doxygen for php docs)
[17:23:49] !log shutting down mysql on db1009, upgrading to mariadb
[17:23:55] Logged the message, Master
[17:24:40] MariaDB 5.5.30
[17:25:40] i still need to repackage 5.5.30
[17:27:11] yay for just re-using other people's packages ;)
[17:27:12] !log reedy synchronized php-1.21wmf11/extensions/Math/
[17:27:19] Logged the message, Master
[17:29:31] PROBLEM - mysqld processes on db1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[17:31:00] New patchset: Aaron Schulz; "Set serializer to php instead." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53777
[17:31:22] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53777
[17:32:26] Krinkle: we have our own apt-repo
[17:32:31] Reedy: is that NULL column bug fixed?
[17:32:45] Aaron|home: It should be fixed now..
[17:33:08] but, as far as I know, ruby likes getting gems its own way...
[17:34:22] RobH: do you know if ram for the rdb upgrade arrived at eqiad?
[17:34:43] Aaron|home: serializer to php instead why?
[17:35:06] Krinkle: but just writing a puppet class to install gems the ruby way will mean external software sources, which we don't like
[17:36:39] notpeter: meeting, sorry. I'll get back to you in a few minutes
[17:39:11] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Thu Mar 14 17:39:07 UTC 2013
[17:39:22] Krinkle: ok
[17:39:30] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours
[17:41:32] RECOVERY - mysqld processes on db1009 is OK: PROCS OK: 1 process with command name mysqld
[17:41:55] !log aaron synchronized wmf-config 'Set redis job aggregator on for testwiki.'
[17:42:02] Logged the message, Master
[17:42:52] binasher: the ram arrived in eqiad
[17:43:52] PROBLEM - MySQL Slave Delay on db1009 is CRITICAL: CRIT replication delay 1131 seconds
[17:44:00] cmjohnson1: great, would it be possible to get it installed today?
[17:44:19] yes, i will get it done today
[17:45:34] hah, does icinga really say "Love, Icinga"?
[17:47:19] ottomata: do you want to take 4724?
it does :)
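The "gem install from puppet" option Krinkle asks about above does exist as Puppet's built-in gem package provider; a minimal sketch (the one-off puppet apply invocation is illustrative, and as notpeter notes it pulls from an external source, rubygems.org):

    $ sudo puppet apply -e "package { 'jsduck': ensure => installed, provider => 'gem' }"

The alternative raised in the discussion is building a .deb and shipping it through the local apt repo instead.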
[17:47:48] !log labsdb1001 powering down
[17:47:54] Logged the message, Master
[17:48:32] jeremyb-phone, sure, i will get to it next week
[17:49:05] !log aaron synchronized wmf-config 'Enabled redis job aggregator fully'
[17:49:12] Logged the message, Master
[17:49:20] PROBLEM - Host rdb1001 is DOWN: PING CRITICAL - Packet loss = 100%
[17:49:32] ottomata: danke
[17:49:40] PROBLEM - Host rdb1002 is DOWN: PING CRITICAL - Packet loss = 100%
[17:50:21] hm
[17:53:31] binasher: seems to work from all I can tell
[17:53:42] * Aaron|home looks at those host notices
[17:54:57] oh, nvm
[17:55:06] that's something totally different
[17:55:07] aaron|home: labsdb's and rdb's are me
[17:55:33] * Aaron|home thought that was redis for a second ;)
[17:55:51] RECOVERY - MySQL Slave Delay on db1009 is OK: OK replication delay NULL seconds
[17:57:30] PROBLEM - MySQL Slave Running on db1009 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error Table ./nlwiki/moodbar_feedback is marked as crashed and sho
[17:59:40] paravoid: No real rush, but would appreciate a glance at this: https://gerrit.wikimedia.org/r/#/c/43886/
[17:59:54] Mostly wondering what should be in the module vs. what should be outside of it
[18:02:35] New patchset: Aaron Schulz; "Enabled $wgEnableAsyncUploads on all wikis." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53780
[18:02:48] Aaron|home: tired of waiting? :)
[18:02:53] paravoid: some error about the igbinary constant not being defined for srv193
[18:03:12] I'm talking about the swift double-get
[18:03:15] and async uploads
[18:04:12] paravoid: ha, you said that right as I was responding to some backscroll
[18:04:39] marked as crashed sounds like MyISAM :/
[18:05:01] paravoid: things probably won't get worse than now, but I want to get this aspect stabilized in the meantime instead of waiting
[18:05:30] RECOVERY - MySQL Slave Running on db1009 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error:
[18:05:43] we still have the 500mb limit to keep things in check
[18:06:31] jeremyb_: yeah, moodbar did that :/
[18:06:43] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53780
[18:07:09] binasher: hrm?
[18:07:23] nothing new
[18:08:01] PROBLEM - MySQL Slave Delay on db1009 is CRITICAL: CRIT replication delay 494 seconds
[18:08:12] I don't think I've heard of that before, just the aft one
[18:08:15] !log cleaned up myisam cruft (unused tables) on s2
[18:08:21] Logged the message, Master
[18:08:37] moodbar was created as myisam on all wikis
[18:09:28] Aaron|home: re: 500mb limit, were you talking about redis on mc1001?
[18:09:40] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Thu Mar 14 18:09:33 UTC 2013
[18:09:52] RECOVERY - MySQL Slave Delay on db1009 is OK: OK replication delay seconds
[18:09:54] binasher: no, this is about upload file size limits
[18:10:16] oh! ok
[18:10:31] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours
[18:11:50] PROBLEM - Host db1009 is DOWN: PING CRITICAL - Packet loss = 100%
[18:12:14] !log aaron synchronized wmf-config/InitialiseSettings.php 'Enabled $wgEnableAsyncUploads on all wikis'
[18:12:20] Logged the message, Master
[18:13:30] PROBLEM - Puppet freshness on amslvs1 is CRITICAL: Puppet has not run in the last 10 hours
[18:14:00] RECOVERY - Host db1009 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms
[18:14:40] RECOVERY - Full LVS Snapshot on rdb1001 is OK: OK no full LVM snapshot volumes
[18:14:40] RECOVERY - MySQL Slave Delay on rdb1001 is OK: OK replication delay seconds
[18:14:51] RECOVERY - Host rdb1001 is UP: PING OK - Packet loss = 0%, RTA = 0.48 ms
[18:15:01] RECOVERY - MySQL Idle Transactions on rdb1001 is OK: OK longest blocking idle transaction sleeps for seconds
[18:15:01] RECOVERY - MySQL Recent Restart on rdb1001 is OK: OK seconds since restart
[18:15:01] RECOVERY - MySQL Slave Running on rdb1001 is OK: OK replication
[18:15:01] RECOVERY - MySQL disk space on rdb1001 is OK: DISK OK
[18:15:30] RECOVERY - MySQL Replication Heartbeat on rdb1001 is OK: OK replication delay seconds
[18:16:31] PROBLEM - mysqld processes on db1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[18:17:33] RECOVERY - mysqld processes on db1009 is OK: PROCS OK: 1 process with command name mysqld
[18:20:30] RECOVERY - Full LVS Snapshot on rdb1002 is OK: OK no full LVM snapshot volumes
[18:20:31] RECOVERY - MySQL Idle Transactions on rdb1002 is OK: OK longest blocking idle transaction sleeps for seconds
[18:20:31] RECOVERY - MySQL Slave Running on rdb1002 is OK: OK replication
[18:20:38] uh
[18:20:42] RECOVERY - Host rdb1002 is UP: PING OK - Packet loss = 0%, RTA = 0.98 ms
[18:20:42] RECOVERY - MySQL Recent Restart on rdb1002 is OK: OK seconds since restart
[18:20:42] RECOVERY - MySQL disk space on rdb1002 is OK: DISK OK
[18:21:00] RECOVERY - MySQL Slave Delay on rdb1002 is OK: OK replication delay seconds
[18:21:00] RECOVERY - MySQL Replication Heartbeat on rdb1002 is OK: OK replication delay seconds
[18:21:55] hah, guess i need to change all "node /db1../" stanzas to "node /^db1../"
[18:22:23] haha
[18:23:08] Can someone fix some file permissions for me please? chmod g+w -R /home/wikipedia/common/php-1.21wmf11/
[18:23:47] New patchset: Asher; "s@node /db@node /^db@g" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53782
[18:24:09] set-group-write?
[18:24:57] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53782
[18:26:30] RECOVERY - Puppet freshness on search1019 is OK: puppet ran at Thu Mar 14 18:26:23 UTC 2013
[18:28:29] binasher: rdb1001/2 upgraded
[18:28:35] New patchset: Asher; "Revert "pulling db1009 for upgrade"" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53783
[18:28:39] in case you didn't see all the noise
[18:28:41] cmjohnson1: thanks!
[18:30:40] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53783
[18:31:47] !log asher synchronized wmf-config/db-eqiad.php 'returning db1009 at a low weight for warmup'
[18:31:53] Logged the message, Master
[18:34:15] binasher: sorry about the lack of ^
[18:34:20] little hats for all the nodes!
[18:34:34] New review: Dzahn; "i took a look with Chris Steipp, and while this does not look like a risk of SQL injection, please e..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53387
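For context on the rdb1001/rdb1002 check noise: Puppet node regexes are Ruby regexes and match anywhere in the hostname unless anchored, which is what the s@node /db@node /^db@g patch above fixes. A quick illustration (hostnames picked for the demo):

    $ ruby -e 'puts %w[db1009 rdb1001].grep(/db1../).inspect'    # unanchored: rdb1001 matches too
    ["db1009", "rdb1001"]
    $ ruby -e 'puts %w[db1009 rdb1001].grep(/^db1../).inspect'   # anchored with ^ ("little hats")
    ["db1009"]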
[18:35:01] New review: Dzahn; "i took a look with Chris Steipp, and while this does not look like a risk of SQL injection, please e..." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/53387
[18:40:02] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Thu Mar 14 18:39:57 UTC 2013
[18:40:30] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours
[18:41:07] LeslieCarr: Any idea why it keeps doing that? It says "puppet freshness is OK, puppet run at $NOW" followed by "CRITICAL, puppet hasn't run in the last 10h"
[18:41:30] oh that's due to how naggen runs and then the decommissioned server script
[18:41:36] !log asher synchronized wmf-config/db-eqiad.php 'returning db1009 to full weight'
[18:41:41] and sadly neon hasn't been able to get a puppet run since the mysql manifest change
[18:41:43] Logged the message, Master
[18:41:50] and i keep working on other stuffs instead of fixing that
[18:42:00] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Thu Mar 14 18:41:59 UTC 2013
[18:42:17] (That was me running puppet)
[18:42:21] Oooooh wait
[18:42:31] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours
[18:42:34] You're saying it's confusing celsus.pmtpa.wmnet with celsus.wm.o ?
[18:42:42] The new one vs the decommissioned one
[18:42:59] well it hasn't had a puppet run since the change
[18:43:16] What hasn't?
[18:43:20] celsus has, I just ran it
[18:43:42] ^demon: can you cr https://gerrit.wikimedia.org/r/#/c/53785/ ?
[18:44:40] !log mlitn synchronized php-1.21wmf11/extensions/ArticleFeedbackv5/ 'Update ArticleFeedbackv5 to master'
[18:44:47] Logged the message, Master
[18:45:31] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours
[18:46:05] New patchset: Aaron Schulz; "Enable wgEnableAsyncUploads only on testwikis for now." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53786
[18:48:40] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53786
[18:48:44] neon
[18:48:53] Oh right
[18:49:27] !log aaron synchronized wmf-config/InitialiseSettings.php 'Enable wgEnableAsyncUploads only on testwikis for now'
[18:49:35] Logged the message, Master
[18:49:50] PROBLEM - Host rdb1001 is DOWN: PING CRITICAL - Packet loss = 100%
[18:51:42] PROBLEM - MySQL Slave Delay on db66 is CRITICAL: CRIT replication delay 205 seconds
[18:52:02] PROBLEM - MySQL Replication Heartbeat on db66 is CRITICAL: CRIT replication delay 223 seconds
[18:52:40] PROBLEM - Host rdb1002 is DOWN: PING CRITICAL - Packet loss = 100%
[18:55:01] RECOVERY - Host rdb1001 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[18:57:01] PROBLEM - MySQL disk space on rdb1001 is CRITICAL: Connection refused by host
[18:57:01] PROBLEM - MySQL Recent Restart on rdb1001 is CRITICAL: Connection refused by host
[18:57:01] PROBLEM - MySQL Slave Running on rdb1001 is CRITICAL: Connection refused by host
[18:57:01] PROBLEM - MySQL Idle Transactions on rdb1001 is CRITICAL: Connection refused by host
[18:57:31] PROBLEM - MySQL Replication Heartbeat on rdb1001 is CRITICAL: Connection refused by host
[18:57:51] PROBLEM - SSH on rdb1001 is CRITICAL: Connection refused
[18:57:51] PROBLEM - MySQL Slave Delay on rdb1001 is CRITICAL: Connection refused by host
[18:57:51] PROBLEM - Full LVS Snapshot on rdb1001 is CRITICAL: Connection refused by host
[18:57:51] RECOVERY - Host rdb1002 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms
[19:00:00] PROBLEM - SSH on rdb1002 is CRITICAL: Connection refused
[19:00:01] PROBLEM - MySQL Slave Delay on rdb1002 is CRITICAL: Connection refused by host
[19:00:01] PROBLEM - MySQL Replication Heartbeat on rdb1002 is CRITICAL: Connection refused by host
[19:00:30] PROBLEM - MySQL Slave Running on rdb1002 is CRITICAL: Connection refused by host
[19:00:31] PROBLEM - MySQL Idle Transactions on rdb1002 is CRITICAL: Connection refused by host
[19:00:31] PROBLEM - Full LVS Snapshot on rdb1002 is CRITICAL: Connection refused by host
[19:00:40] PROBLEM - MySQL Recent Restart on rdb1002 is CRITICAL: Connection refused by host
[19:00:41] PROBLEM - MySQL disk space on rdb1002 is CRITICAL: Connection refused by host
[19:09:41] PROBLEM - NTP on rdb1001 is CRITICAL: NTP CRITICAL: No response from NTP server
[19:09:43] RECOVERY - SSH on rdb1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[19:10:00] RECOVERY - SSH on rdb1002 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[19:10:41] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Thu Mar 14 19:10:33 UTC 2013
[19:11:30] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours
[19:24:02] PROBLEM - NTP on rdb1002 is CRITICAL: NTP CRITICAL: No response from NTP server
[19:32:01] !log aaron synchronized php-1.21wmf11/includes/job/jobs 'deployed 6f76ede163cb114724bce7a1c1b8938e1e30606f '
[19:32:08] Logged the message, Master
[19:32:41] RECOVERY - MySQL Slave Delay on db66 is OK: OK replication delay 0 seconds
[19:33:01] RECOVERY - MySQL Replication Heartbeat on db66 is OK: OK replication delay 0 seconds
[19:41:10] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Thu Mar 14 19:41:04 UTC 2013
[19:41:31] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours
[19:43:40] PROBLEM - MySQL Slave Delay on db66 is CRITICAL: CRIT replication delay 203 seconds
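A hypothetical spot-check that the testwiki-only sync above took effect, piping into MediaWiki's eval.php REPL via the multiversion mwscript wrapper (the wiki name comes from the log; the shown output is merely the expected value, not observed):

    $ echo 'var_dump( $wgEnableAsyncUploads );' | mwscript eval.php --wiki=testwiki
    bool(true)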
[19:44:04] PROBLEM - MySQL Replication Heartbeat on db66 is CRITICAL: CRIT replication delay 223 seconds
[19:45:32] RECOVERY - NTP on rdb1001 is OK: NTP OK: Offset 0.0008583068848 secs
[20:05:27] !log olivneh synchronized php-1.21wmf11/includes/WikiPage.php 'Adds CategoryAfterPageAdded / CategoryAfterPageRemoved hooks'
[20:05:36] Logged the message, Master
[20:09:43] LeslieCarr: there's a ticket for that :) rt 4727
[20:09:54] * jeremyb_ pastes LeslieCarr into the ticket
[20:11:00] RECOVERY - NTP on rdb1002 is OK: NTP OK: Offset -0.006113052368 secs
[20:11:40] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Thu Mar 14 20:11:34 UTC 2013
[20:12:30] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours
[20:13:30] PROBLEM - Puppet freshness on strontium is CRITICAL: Puppet has not run in the last 10 hours
[20:13:45] binasher: do you want me to set up the raid on the arrays for labsdb100x?
[20:14:39] cmjohnson1: that would be great, raid-10 with no read-ahead
[20:14:47] k
[20:19:23] ah more tickets to look at ?
[20:19:24] noooes
[20:19:33] also i just forgot what i was meaning to work on this afternoon
[20:19:58] !log olivneh synchronized php-1.21wmf11/extensions/EventLogging 'Log ['HTTP_HOST'] as webHost'
[20:20:07] Logged the message, Master
[20:20:48] stupid bash variable expansion
[20:21:12] and stupid double double quotes
[20:22:40] RECOVERY - MySQL Slave Delay on db66 is OK: OK replication delay 0 seconds
[20:23:01] RECOVERY - MySQL Replication Heartbeat on db66 is OK: OK replication delay 0 seconds
[20:25:50] mutante: pa.us.wikimedia.org might be a good first target being a closed wiki - https://bugzilla.wikimedia.org/show_bug.cgi?id=38763
[20:25:53] LeslieCarr: no, you don't have to look :) i just pasted you into it
[20:26:13] hehe
[20:26:27] Reedy: then what am i going to use when i need a broken domain to take screenshots of cert errors?
[20:26:36] New patchset: Asher; "initial redisdb role config" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53800
[20:26:45] :)
[20:26:54] jeremyb_: https://arbcom.de.wikipedia.org/
[20:26:58] Until we fix that one too ;)
[20:27:05] bad Reedy
[20:27:20] Gotta fix 'em all!
[20:29:09] * Aaron|home grins at binasher
[20:29:51] ori-l: logged the wrong thing?
[20:30:37] !log Added article edit event to EventLogging, which upped events/sec from ~3.5 to ~30. Expect increased load on vanadium and db1047; bits caches unaffected.
[20:30:44] Logged the message, Master
[20:31:05] $ echo $'foo \'$bar\' baz'
[20:31:05] foo '$bar' baz
[20:31:05] jeremyb_: nah, it just swallowed '$_SERVER' (expanded it to empty string) because i was silly
[20:31:05] yeah, i know.
[20:31:31] ori-l: very silly!
[20:34:08] !log installed WikiLove on wikitech
[20:34:15] Logged the message, Master
[20:34:31] lolol
[20:34:49] Reedy: ?
[20:35:12] hah
[20:35:18] wikitech, srsbizness
[20:36:14] :)
[20:36:43] !log DNS update - add zuul as CNAME for gallium
[20:36:48] Logged the message, Master
[20:37:22] hashar: zuul.wikimedia.org. 3600 IN CNAME gallium.wikimedia.org.
[20:37:46] mutante: you are the best. WFM :-]
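What bit ori-l above: in bash, double quotes expand $variables, single quotes do not, and $'...' (ANSI-C quoting) acts like single quotes but allows escaped quotes inside. A minimal demonstration (the variable is invented):

    $ bar=expanded
    $ echo "foo '$bar' baz"       # double quotes: $bar expands
    foo 'expanded' baz
    $ echo 'foo $bar baz'         # single quotes: no expansion
    foo $bar baz
    $ echo $'foo \'$bar\' baz'    # ANSI-C quoting: \' works, $bar stays literal
    foo '$bar' baz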
[20:41:21] * mutante sends wikilove to Ryan for enabling wikilove
[20:42:23] New patchset: Reedy; "Bug 38763 - Our *.wikimedia.org cert doesn't properly cover https://pa.us.wikimedia.org/" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53807
[20:43:16] Reedy: omg, you're renaming wikis
[20:43:18] <3
[20:43:55] Reedy: let's start the wiki renaming sprint , heh
[20:44:01] ready now
[20:45:40] !log deleting the jenkins user from ldap
[20:45:46] Logged the message, Master
[20:49:30] PROBLEM - Puppet freshness on europium is CRITICAL: Puppet has not run in the last 10 hours
[20:50:04] !log authdns update (labsdb1002-3 vlan)
[20:50:11] Logged the message, Master
[20:50:58] New patchset: Krinkle; "Integration: Move to wikimedia.org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53513
[20:51:30] New patchset: Ryan Lane; "Allow multiple users/groups in access.conf" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53844
[20:52:09] Anyone know why pa.us.wikimedia is not actually closed?
[20:52:21] :p
[20:52:40] trying #wikimedia-pa
[20:52:47] jeremyb_: ^^
[20:52:51] lol, fail
[20:52:54] 14:01 -!- mutante was kicked from #wikimedia-pa by ChanServ [Invite only channel]
[20:53:15] hah
[20:53:28] mutante: is it not closed?
[20:53:36] casey would be the best person to ask
[20:53:40] http://pa.us.wikimedia.org/w/index.php?title=Special:RecentChanges&days=30&from=
[20:53:49] Minimal activity, userpages and such
[20:53:58] i think casey was one of the board members
[20:54:05] I've just poked him
[20:54:16] Ryan_Lane: re: deleting jenkins user from ldap.. is jenkins down right now?
[20:54:39] nah, hashar created that user a couple of hours ago
[20:54:44] and we agreed it shouldn't be there
[20:54:47] ah
[20:54:58] jeremyb_: Looks like a yes
[20:55:01] jenkins is at least being really slow
[20:55:03] yeah the jenkins user is a local user on the continuous integration box
[20:55:04] yeah
[20:55:18] I would love to have all our users in LDAP one day though :-]
[20:56:41] New patchset: Reedy; "Close pa_uswikimedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53857
[20:57:01] hashar: definitely not for system users
[20:57:15] !log reedy synchronized wmf-config/InitialiseSettings.php
[20:57:15] system users should always be local
[20:57:22] Logged the message, Master
[20:57:25] hashar, wanna repeat that part about packaging zuul and the bug you told me the other day (re: jenkins slowness)
[20:57:30] for binasher
[20:57:31] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53857
[20:57:40] sure
[20:57:42] so zuul has several bugs
[20:58:08] bleh. jenkins is definitely being really slow :(
[20:58:24] one of them is that the reporting is done after the loop that triggers the jobs, or something similar. So there is some kind of delay before Zuul reports back to Gerrit
[20:58:43] and there is a nasty bug that causes Zuul to sometimes mistake the commits to trigger :-]
[20:58:49] annnd
[20:59:16] there are too many jobs running during peak hours (aka right now, when European volunteers code, Reedy merges, and all the ops do their puppet stuff)
[20:59:20] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Thu Mar 14 20:59:11 UTC 2013
[20:59:21] New patchset: Asher; "initial redisdb role config patch set 2: ganglia config Change-Id: I819c79ba048fc14538db06a7aff956a9ca7f6460" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53800
[20:59:30] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours
[21:00:23] hashar: is this something we can get a second box and load balance (easily at least) ?
[21:01:04] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53800
[21:01:05] jenkins can remote control other boxes
[21:01:16] !log DNS update - add pa-us wiki to replace pa.us
[21:01:22] Logged the message, Master
[21:01:30] LeslieCarr: going to test that using a labs instance, and will probably order you another box. Robh told me there are some misc boxes assigned to platform that we could potentially use.
[21:02:13] New patchset: Reedy; "Bug 38763 - Our *.wikimedia.org cert doesn't properly cover https://pa.us.wikimedia.org/" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53807
[21:04:21] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53844
[21:06:36] New patchset: Asher; "fix class scoping" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53858
[21:08:51] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53807
[21:10:59] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53858
[21:11:01] RECOVERY - Puppet freshness on capella is OK: puppet ran at Thu Mar 14 21:10:55 UTC 2013
[21:11:13] New patchset: Dzahn; "redirect pa.us chapter wiki to pa-us (bug 38763)" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/53859
[21:12:17] oh, should i redirect to https ?
[21:13:36] New patchset: Krinkle; "Add sudo user "krinkle" on gallium." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53861
[21:15:31] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: Puppet has not run in the last 10 hours
[21:15:31] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: Puppet has not run in the last 10 hours
[21:15:31] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: Puppet has not run in the last 10 hours
[21:15:31] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours
[21:15:31] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours
[21:16:05] kibble!
[21:16:15] :o !
[21:16:23] Hey Jeremy. :-)
[21:16:37] PA is dead, long live NYC!
[21:16:39] mutante
[21:16:44] Reedy
[21:16:52] Yeaah
[21:16:58] We still need something to test on ;)
[21:17:01] :), already added pa-us to DNS
[21:17:09] apache redirect is pending jenkins
[21:17:14] and it's closed
[21:17:26] gave up on the wikimedia.us idea
[21:18:49] mutante: good :)
[21:18:53] re gave up
[21:20:11] RECOVERY - Puppet freshness on tola is OK: puppet ran at Thu Mar 14 21:20:00 UTC 2013
[21:20:31] RECOVERY - Puppet freshness on hume is OK: puppet ran at Thu Mar 14 21:20:27 UTC 2013
[21:21:35] Aaron|home: rdb1001 is ready for use, with rdb1002 slaving it
[21:23:41] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 12 seconds
[21:24:01] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds
[21:24:36] binasher: cool, maybe they can be used in wmf12
[21:25:15] oh, wow, renaming wiki tickets :o .. so "als" was once supposed to be Swiss German (Alemannic?), but it actually is Albanian. result: "the most article in the albanian wikipedia are write in Swiss
[21:25:19] German
[21:26:44] Aaron|home: i am now officially excited about the next mediawiki release.
[21:26:45] uhhh, wow
[21:26:51] i'll admit, this is a little weird.
[21:27:03] binasher: point or major?
[21:27:11] point
[21:27:15] binasher: there is some rewriting pending in gerrit
[21:30:09] mutante, basically, but a little more complicated. There was no code for Alemannic (which is kinda like Swiss German, but a slightly wider group of languages of which Swiss German is the biggest, I think), so they went with als. Then the ISO gave ALS to Tosk Albanian, which is a dialect of Albanian, not Albanian proper. So both of the languages are "smallish", but the main issue is that there's a conflict.
[21:30:32] Krinkle, jshint is blocking a submodule update for deployment: https://gerrit.wikimedia.org/r/#/c/53840/
[21:30:37] It's unrelated core code.
[21:30:50] It looks like jshint was non-voting when it was added, but voting here.
[21:31:00] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Thu Mar 14 21:30:58 UTC 2013
[21:31:30] superm401: no, jshint was always voting
[21:31:30] for a while
[21:31:30] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours
[21:31:35] and failed both times
[21:32:07] Krinkle, so is it voting for regular core commits right now?
[21:32:14] for months
[21:32:41] Then how did the failure at https://gerrit.wikimedia.org/r/#/c/53840/ get in?
[21:32:43] someone must've broken it in wmf
[21:32:49] it can be overridden
[21:32:59] I'll see how easy it is to fix.
[21:33:03] and some stubborn people refuse to let jenkins run before merging
[21:33:19] the history is hard to trace where it started breaking, because in those cases it couldn't leave a vote
[21:33:38] superm401: Hopefully it'll be an easy cherry-pick from core
[21:33:45] Change merged: Dzahn; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/53859
[21:33:49] superm401: anyway, jshint in the core repo doesn't run over extensions
[21:33:59] Krinkle, yeah, I know.
[21:34:01] superm401: so you can just ignore it, knowing the other tests succeeded (view the comment)
[21:34:04] binasher: they are both 72gb?
[21:34:14] yup
[21:34:20] superm401: will you find the lint error / cherry-pick?
[21:34:31] using append only logging?
[21:34:41] Krinkle, I'm not seeing it locally.
[21:34:45] or ram only?
[21:34:47] It might be fixed, so I'll rebase.
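binasher's note that rdb1002 is slaving rdb1001, and Aaron's questions about memory and persistence mixed in above, can all be answered from the stock redis-cli; a minimal sketch, assuming the default port:

    # On the master: expect role:master and connected_slaves:1.
    redis-cli -h rdb1001 info replication
    # On the slave: expect role:slave, master_host:rdb1001,
    # and master_link_status:up.
    redis-cli -h rdb1002 info replication
    # The memory cap Aaron asks about next:
    redis-cli -h rdb1001 config get maxmemory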
[21:34:51] Aaron|home: though they currently have maxmemory 58Gb
[21:35:31] RECOVERY - Puppet freshness on kuo is OK: puppet ran at Thu Mar 14 21:35:25 UTC 2013
[21:35:47] superm401: jenkins tests postmerge locally
[21:35:59] it essentially rebases it onto latest master before testing
[21:36:15] that way it ensures that it passes in the state it would be in after the merge
[21:36:27] New patchset: Ori.livneh; "Enable NavigationTiming on test2 w/sampling factor of 10000" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53865
[21:36:34] Aaron|home: rdb snapshot, but not append only logging
[21:36:43] Then why does my jshint pass locally when I pull latest wmf/1.21wmf11?
[21:37:02] binasher: does that "stop the world"?
[21:37:55] and how often does it run?
[21:38:55] Aaron|home: it should happen asynchronously via a background thread, currently set to snapshot every 5 minutes, or every 100 key changes
[21:39:05] would you prefer aof?
[21:39:20] !log Zuul is hanging out not reporting changes :/
[21:39:26] Logged the message, Master
[21:39:32] i didn't want it for the session redis store, but it could be more appropriate here
[21:41:12] Aaron|home: read http://redis.io/topics/persistence
[21:41:31] I was looking at that again
[21:42:55] so yeah, only the fork() stops the world, and the delay is proportional to the amount of ram
[21:43:01] * Aaron|home wonders if that could cause swapping
[21:43:23] well, hopefully we won't have that much data at once
[21:44:23] binasher: anyway, my preference was for the log, but we can try either
[21:45:07] Aaron|home: let's go for the aof with every 1-sec fsyncs
[21:46:30] ok
[21:48:40] !log apparently Gerrit takes ages to submit a change, which in turn causes a hard lock on Zuul, which is waiting for the merge to happen.
[21:48:47] Logged the message, Master
[21:51:17] !log Gerrit took roughly 5 minutes to merge the change https://gerrit.wikimedia.org/r/#/c/53412/ ( mediawiki/core ), which means Zuul does nothing during that time waiting for the merge to happen.
[21:51:24] Logged the message, Master
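The trade-off Aaron and binasher settle here: RDB snapshotting persists via a fork() whose pause grows with resident memory and can lose up to a full snapshot interval of writes, while an append-only file with a once-per-second fsync bounds the loss to roughly one second at a small steady I/O cost. A sketch of the two setups they describe, using runtime CONFIG SET (the equivalent redis.conf directives are save, appendonly, and appendfsync):

    # Snapshotting as binasher describes it: dump after 300 s,
    # or sooner once 100 keys have changed.
    redis-cli config set save "300 100"
    # The mode they pick for the new boxes: append-only file,
    # fsync'd once per second.
    redis-cli config set appendonly yes
    redis-cli config set appendfsync everysec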
[21:52:39] New patchset: Rfaulk; "add. Some new global settings for metrics-api project." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53868
[21:57:30] PROBLEM - Puppet freshness on virt1005 is CRITICAL: Puppet has not run in the last 10 hours
[21:57:38] !log olivneh synchronized php-1.21wmf11/extensions/NavigationTiming 'Update of extension to use EventLogging'
[21:57:45] Logged the message, Master
[21:58:33] Change merged: Ori.livneh; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53865
[22:01:40] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Thu Mar 14 22:01:38 UTC 2013
[22:02:30] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours
[22:02:36] !log olivneh synchronized wmf-config 'Enable NavigationTiming on test2'
[22:02:42] Logged the message, Master
[22:05:30] PROBLEM - Puppet freshness on sq83 is CRITICAL: Puppet has not run in the last 10 hours
[22:05:30] PROBLEM - Puppet freshness on tmh1002 is CRITICAL: Puppet has not run in the last 10 hours
[22:07:30] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Puppet has not run in the last 10 hours
[22:08:21] New patchset: Ori.livneh; "$wmgUseNavigationTiming['default'] = true" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53872
[22:11:32] Change merged: Ori.livneh; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53872
[22:13:30] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours
[22:13:52] !log olivneh synchronized wmf-config/InitialiseSettings.php 'Enabling NavigationTiming at 1:10,000 sampling factor across cluster.'
[22:13:59] Logged the message, Master
[22:15:02] New patchset: Asher; "make redis persistence model configurable, use aof in role::redisdb and rdb for sessions" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53873
[22:16:37] http://www.bbc.co.uk/news/world-europe-21793224
[22:16:44] that title is great
[22:19:34] binasher: hey, you guys can use this for analytics! https://wiki.openstack.org/wiki/EHO
[22:19:35] :D
[22:20:45] ori-l: so far, no data for event_connecting or event_dnsLookup, interesting
[22:21:27] well, dns lookup isn't super-surprising because people will most likely have it cached
[22:21:35] but connecting is strange, yeah
[22:21:40] Ryan_Lane: please tell me it replaces hdfs with gluster
[22:21:51] no. even better
[22:21:52] swift
[22:23:12] Ryan_Lane: of course!
[22:23:19] binasher: btw, I realized too that the rate of events will be less than we initially thought, because you need to win the 1:10,000 lottery *and* have a browser w/nav timing support (66% according to caniuse)
[22:23:48] New patchset: Dzahn; "better redirect for pa.us to pa-us chapter wiki (bug 38763)" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/53874
[22:24:09] we got a DNS look-up, too
[22:24:34] oh yeah, and some connecting now
[22:25:41] i wonder if we should collect urls with this, or if it would be too much noise
[22:26:28] New patchset: Hashar; "jenkins::slave to setup a Jenkins agent" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53875
[22:31:50] New review: Ryan Lane; "(1 comment)" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/53736
[22:32:13] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Thu Mar 14 22:32:06 UTC 2013
[22:32:28] New review: Dzahn; "apache-fast-test pa-us.wikimedia.url mw1044" [operations/apache-config] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/53874
[22:32:28] Change merged: Dzahn; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/53874
[22:32:30] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours
[22:33:38] binasher: IIRC the path + query of NavigationTiming events is currently ~400 bytes, so still well under the <1024 limit currently imposed by varnish. But the full request URL can add quite a lot of bloat, especially when you consider that code points above the ascii range get URL-encoded twice
[22:34:28] !log disabling notifications for puppet freshness on celsus
[22:34:34] Logged the message, Master
[22:35:48] ori-l: i don't think it would be helpful, except for a very small set of common pages like en.wikipedia.org/wiki/Main_Page
[22:36:52] binasher: ah, well we could easily log the numeric article ID; it's already available client-side as wgArticleId
[22:38:35] ooh
[22:38:39] yeah!
[22:39:37] ori-l: is wgArticleId page.page_id, and not tied to revision?
[22:40:00] yeah. revision ID is available too -- we could log both
[22:40:42] dzahn is doing a graceful restart of all apaches
[22:41:23] !log dzahn gracefulled all apaches
[22:41:29] Logged the message, Master
[22:41:31] RECOVERY - Apache HTTP on mw1041 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.067 second response time
[22:42:31] New patchset: Hashar; "puppet now manage jenkins ssh authorized_keys" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53736
[22:42:32] New patchset: Hashar; "systemuser learned 'managehome' (default true)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53879
[22:42:32] New patchset: Hashar; "create jenkins user with systemuser" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53880
[22:43:00] New review: Hashar; "(1 comment)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53736
[22:43:09] !log pa.us.wikimedia.org now redirects to pa-us.wikimedia.org (Wikimedia Pennsylvania) (no sub.sub domains)
[22:43:15] Logged the message, Master
[22:43:36] New review: Hashar; "Now depends on:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53736
[22:44:46] mutante: I wonder what we can do next..
[22:45:12] ori-l: just page-id would be a good start. revision lifetime might be too short
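ori-l's rate correction at 22:23, made concrete: an event fires only if the client both wins the 1:10,000 sample and runs a Navigation Timing capable browser (about 66% at the time), so the effective rate works out to roughly one event per 15,000 page views:

    # Effective sampling = base rate x browser support share.
    echo '(1/10000) * 0.66' | bc -l        # -> .0000660...
    echo '1 / ((1/10000) * 0.66)' | bc -l  # -> ~15151 page views per event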
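With the apaches gracefully restarted, the redirect mutante logs at 22:43 can be spot-checked from any client; a minimal check with curl (the apache-fast-test invocation in the review above exercises the same thing against one specific backend):

    # Headers only; expect a 301 whose Location points at pa-us.wikimedia.org.
    curl -sI http://pa.us.wikimedia.org/ | grep -iE '^(HTTP|Location)'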
[22:45:27] rfaulkner: you have mail
[22:47:06] Reedy: close "nrm" ? :)
[22:47:15] jeremyb_: thanks, got it
[22:47:17] Needs to go via langcom first
[22:47:29] binasher: right, but suppose you're seeing high event_rendering values for a particular page, but when you check, it renders very quickly for you. It's useful to be able to look at the history and realize, oh, this massive template got removed, etc.
[22:47:30] Though I can't imagine it being controversial with the amount of activity
[22:47:32] quoting DannyB: "Even easier would be to close it, as it is inactive. Only bots, globalmaintenance scripts, vandalisms (and reverts of it), interwiki changes. Nocontent progress."
[22:48:04] Do we have that list of those double-domain wikis? Arbcoms, wg.en.wiki, noboard.chapters.wikimedia
[22:48:28] Shall I go ahead and mark the labs.wikimedia.org wikis as deleted?
[22:48:39] That wikitech redirect went in, right?
[22:48:50] Yup, it has
[22:48:53] any wiki to be closed should first be exported to incubator
[22:49:10] Doesn't really need to be done first
[22:49:13] yes, they redirect.. as here: https://gerrit.wikimedia.org/r/#/c/53478/3/redirects.conf
[22:49:15] If it's closed, it's still there
[22:49:23] ori-l: good point
[22:49:26] New patchset: Reedy; "Kill wikimedialabs wiki configs!" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53487
[22:49:34] )(de|en|flaggedrevs|liquidthreads|readerfeedback)
[22:49:47] * Reedy grins
[22:49:49] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53487
[22:52:00] !log reedy synchronized wmf-config/InitialiseSettings.php
[22:52:06] Logged the message, Master
[22:54:11] New patchset: Reedy; "Remove pmtpa from output" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53884
[22:54:33] Reedy: can you look at the fix in https://gerrit.wikimedia.org/r/#/c/53882/ ?
[22:54:50] New patchset: Reedy; "Remove pmtpa from output" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53884
[22:56:24] Reedy: writing to ops list about killing the rest that are sub.sub of labs in DNS but not wikis
[22:59:55] Haha. noboard.chapters.wikimedia.org has a whole 720 revisions
[23:00:04] :D
[23:00:52] In 3 years
[23:00:57] They obviously really needed their own wiki
[23:02:10] well they had no vandalism or spam!
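For context on Reedy's earlier "Close pa_uswikimedia" change: in operations/mediawiki-config, closed wikis are tracked in a dblist that the configuration consults to lock them read-only. A rough sketch of that workflow, with the file name as used in that repository but the exact steps otherwise an assumption here:

    # Mark the wiki closed and keep the list sorted.
    echo 'pa_uswikimedia' >> closed.dblist
    sort -o closed.dblist closed.dblist
    # Then push the configuration out, producing a log line like
    # "reedy synchronized wmf-config/InitialiseSettings.php" above:
    sync-file wmf-config/InitialiseSettings.php 'Close pa_uswikimedia'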
[23:10:43] Reedy: would be nice to deploy that if it's fine
[23:11:27] New patchset: Reedy; "Update wgServer and wgCanonicalServer for multi subdomain wikis" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53885
[23:16:43] New patchset: Reedy; "Remove elwikinews from wgServer and wgCanonicalServer (unneeded)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53886
[23:17:01] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53886
[23:24:22] New patchset: awjrichards; "Updating MobileFrontend copyright logo path on testwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53888
[23:27:20] Change merged: awjrichards; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53888
[23:30:16] !log awjrichards synchronized wmf-config/InitialiseSettings.php 'Updating MF copyright logo for testwiki'
[23:30:22] Logged the message, Master
[23:32:11] New patchset: awjrichards; "Update MF custom logo path for enwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53890
[23:41:15] New patchset: Reedy; "Bug 39482 - Rename "chapcomwiki" to "affcomwiki"" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53892
[23:43:55] !log installing package upgrades on sockpuppet
[23:43:58] New patchset: Reedy; "Bug 39482 - Rename "chapcomwiki" to "affcomwiki"" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53892
[23:44:00] Logged the message, Master
[23:44:46] Change merged: awjrichards; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53890
[23:46:46] !log awjrichards synchronized php-1.21wmf11/extensions/MobileFrontend/ 'Updating MobileFrontend per https://www.mediawiki.org/wiki/Extension:MobileFrontend/Deployments/2013-03-14'
[23:46:47] New patchset: Reedy; "Bug 39482 - Rename "chapcomwiki" to "affcomwiki"" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/53894
[23:46:53] Logged the message, Master
[23:47:20] !log awjrichards synchronized wmf-config/InitialiseSettings.php 'Updating MF copyright logo path for enwiki'
[23:47:26] Logged the message, Master
[23:48:18] !log DNS update - add affcom in prep. to replace chapcom (bug 39482)
[23:48:25] Logged the message, Master
[23:56:34] Started scap for both E3 and mobile (though the latter has no i18n changes)
[23:57:26] Reedy: oooh.. these renaming tickets even lead to stuff like: https://bugzilla.redhat.com/show_bug.cgi?id=677570 hah
[23:57:55] Status: CLOSED RAWHIDE
[23:57:56] o_0
[23:58:12] but also "fixed in upstream"
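The deploy commands behind the !log lines through this hour: scap is the full-cluster deploy and rebuilds localisation caches, which is why it is the tool reached for when i18n changes ship (as in the E3 push above), while sync-dir and sync-file push a single directory or file. The invocation shapes below reflect the 2013-era tooling as best understood and should be treated as approximate; the messages are placeholders:

    # Full deploy, including the l10n cache rebuild:
    scap 'E3 and mobile deployment'
    # Targeted pushes, which produce the 'synchronized ...' entries above:
    sync-dir php-1.21wmf11/extensions/MobileFrontend 'Updating MobileFrontend'
    sync-file wmf-config/InitialiseSettings.php 'Updating MF copyright logo path for enwiki'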