[00:07:00] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[00:09:00] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Thu Mar 28 00:08:57 UTC 2013
[00:09:10] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1
[00:10:02] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[00:12:41] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Thu Mar 28 00:12:35 UTC 2013
[00:13:01] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[00:14:02] New patchset: Reedy; "(bug 46004) Set $wgCategoryCollation to 'uca-be' on be.wikipedia and be.wikisource" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/54365
[00:14:54] RobH: whatever happened with professor? errors never popped up again?
[00:16:00] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Thu Mar 28 00:15:49 UTC 2013
[00:16:00] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[00:16:20] jeremyb_: which one
[00:16:42] mutante: 4619
[00:17:24] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/54365
[00:18:14] jeremyb_: i think it says it all, the fan was replaced but not yet the memory
[00:18:18] !log reedy synchronized wmf-config/InitialiseSettings.php
[00:18:25] Logged the message, Master
[00:18:43] mutante: the reason to replace the fan and not the memory was in order to see the memory errors without the flood of fan errors
[00:18:55] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Thu Mar 28 00:18:41 UTC 2013
[00:18:55] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[00:19:12] jeremyb_: yes, i saw that, so?
[00:19:47] so, i was just wondering if the new errors that we were waiting for had appeared or not
[00:19:49] i wouldnt expect them to replace it without commenting
[00:20:16] (it's nearly 2 weeks and seems like it would be more than enough time to get more errors. but maybe i'm wrong)
[00:20:54] i don't know, so it was just a "bump?"
[00:21:10] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Thu Mar 28 00:21:07 UTC 2013
[00:21:27] lol, that is nice, since db11 was supposed to be decom'ed
[00:21:50] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[00:23:50] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Thu Mar 28 00:23:49 UTC 2013
[00:24:50] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[00:25:50] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Thu Mar 28 00:25:45 UTC 2013
[00:27:09] !log once again disabling notifications for db11
[00:27:15] Logged the message, Master
[00:43:24] New patchset: Aklapper; "bugzilla_report.php: Add query and formatting for list of urgent issues" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56348
[00:52:11] PROBLEM - Puppet freshness on virt2 is CRITICAL: Puppet has not run in the last 10 hours
[00:57:11] PROBLEM - Puppet freshness on virt3 is CRITICAL: Puppet has not run in the last 10 hours
[01:05:00] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[01:07:10] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1
[01:25:52] New patchset: Ram; "Bug: 43544: Improve error handling to not hide internal errors." [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/56354
[01:55:29] New patchset: Reedy; "Remove some comments that just get in the way" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56356
[01:55:47] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56356
[02:04:51] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[02:07:01] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1
[02:19:24] !log LocalisationUpdate completed (1.21wmf12) at Thu Mar 28 02:19:24 UTC 2013
[02:19:31] Logged the message, Master
[02:25:33] New review: Faidon; "Looks OK to me. Ping me or someone else in ops to merge it when you're around, just in case." [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/54324
[02:30:52] New review: Faidon; "(3 comments)" [operations/debs/python-voluptuous] (master) C: -1; - https://gerrit.wikimedia.org/r/56168
[02:35:58] New review: Faidon; "The package itself is maintained in git in" [operations/debs/python-jsonschema] (debian/experimental) - https://gerrit.wikimedia.org/r/56064
[02:39:40] New review: Faidon; "(4 comments)" [operations/debs/python-statsd] (master) C: -1; - https://gerrit.wikimedia.org/r/55069
[02:43:11] New review: Faidon; "I think it's okay. But please figure out the answer to your TODO/FIXME and fix it, instead of adding..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55302
[02:58:13] New patchset: Reedy; "(bug 46081) Set $wgCategoryCollation to 'uca-default' on Polish Wiktionary" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/54367
[02:58:23] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/54367
[02:59:20] !log reedy synchronized wmf-config/InitialiseSettings.php
[02:59:25] New patchset: Reedy; "cswikinews: Set autopatrolled group" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55441
[02:59:27] Logged the message, Master
[02:59:32] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55441
[03:00:48] New patchset: Reedy; "(bug 46589) Add localised/v2 logos for Wikipedias without one (second installment)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56097
[03:01:04] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56097
[03:01:22] New patchset: Reedy; "(bug 43863) Enabled wgImportSources on the Spanish Wikivoyage. Added eswiki, meta, commons, en.voy, de.voy, fr.voy, it.voy, nl.voy, pt.voy, ru.voy, and sv.voy" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56113
[03:01:30] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56113
[03:01:46] New patchset: Reedy; "(bug 46461) Set $wgAutoConfirmCount to 50 for Wikidata" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56150
[03:01:52] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56150
[03:02:06] New patchset: Reedy; "(bug 45638) Modify user group rights on it.wikivoyage Modified wgAddGroups and wgRemoveGroups; changed user rights for autoconfirmed, added patroller group." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56118
[03:02:12] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56118
[03:02:31] New patchset: Reedy; "Add tz database time zone settings for wikis in Maldivian language" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56098
[03:02:37] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56098
[03:02:55] New patchset: Reedy; "(bug 46182) Set LQT as opt-out on se.wikimedia (chapter wiki)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56130
[03:03:01] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56130
[03:03:24] New patchset: Reedy; "(bug 44285) config changes for eswikivoyage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56055
[03:03:31] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56055
[03:04:16] !log reedy synchronized wmf-config/InitialiseSettings.php
[03:04:22] Logged the message, Master
[03:06:08] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[03:08:21] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1
[03:12:18] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours
[03:26:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:28:07] New review: Faidon; "I like this very much." [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/49710
[03:28:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time
[03:30:02] New review: Faidon; "Oh and the answer to your 0444 question is that when you vi and says it's read-only is a helpful hin..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49710
[03:50:34] New review: Faidon; "So, this is a very nice effort. It feels a bit too complicated and I was confused a lot while review..." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/53714
[03:51:45] paravoid: s/ger-orig-source/get-orig-source/
[04:06:32] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[04:08:02] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Thu Mar 28 04:07:58 UTC 2013
[04:08:37] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[04:08:42] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1
[04:09:32] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Thu Mar 28 04:09:29 UTC 2013
[04:10:32] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[04:10:42] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Thu Mar 28 04:10:41 UTC 2013
[04:11:32] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[04:11:52] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Thu Mar 28 04:11:49 UTC 2013
[04:12:32] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[04:12:53] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Thu Mar 28 04:12:48 UTC 2013
[04:13:33] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[04:13:42] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Thu Mar 28 04:13:40 UTC 2013
[04:14:32] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[04:15:12] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Thu Mar 28 04:15:02 UTC 2013
[04:15:32] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[04:15:34] New patchset: Faidon; "New upstream release" [operations/debs/ruby-jsduck] (master) - https://gerrit.wikimedia.org/r/56362
[04:16:24] Change merged: Faidon; [operations/debs/ruby-jsduck] (master) - https://gerrit.wikimedia.org/r/56362
[04:19:37] !log updating jsduck in apt and upgrading it on gallium
[04:19:39] Krinkle: ^^^
[04:19:44] Logged the message, Master
[04:21:42] paravoid: Thanks
[04:21:53] paravoid: btw, what does this find/print command do? https://gerrit.wikimedia.org/r/#/c/56362/1/debian/rules
[04:23:16] Wondering what could cause chmod to be wrong by default.
[04:25:24] why not use -exec ? is it not guaranteed to be in every /usr/bin/find ?
[04:26:13] (or for that matter chmod -R)
[04:26:24] oh, -type f
[04:26:35] still, idk. X is useful :)
[04:37:29] Krinkle: the gifs were 755
[04:37:45] paravoid: orly, that's messed up.
[04:37:53] jeremyb_: chmod -R is for dirs; -exec vs. xargs is the difference between running (fork/exec) N chmods vs. one
[04:37:53] I'll file a bug.
[04:38:29] paravoid: well i'd do -exec ... + not -exec ... \;
[04:39:26] Krinkle: hmm
[04:39:29] they're not in the tarball
[04:39:54] oh nevermind
[04:39:55] they are
[04:40:03] 2821891 4 -rwxr-xr-x 1 www-data www-data 856 Mar 28 05:59 ./extjs/resources/themes/images/default/util/splitter/mini-bottom.gif
[04:40:06] 2821892 4 -rwxr-xr-x 1 www-data www-data 856 Mar 28 05:59 ./extjs/resources/themes/images/default/util/splitter/mini-top.gif
[04:40:10] etc.
[04:40:20] > find debian/ruby-jsduck/usr/share/ruby-jsduck/ -type f -exec chmod 644 {} +
[04:40:56] oh hah
[04:41:02] I didn't know +
[04:41:03] that's nifty
[04:41:08] must be relatively new
[04:42:10] core bins like that usually take very long to get updates spread
[04:42:28] but I guess 1997 is new in that case :P
[04:42:52] 2005-01-15 James Youngman
[04:42:55] First working version of -exec ...+
[04:43:23] so, yeah, not exactly new
[04:43:26] but much newer than 1997
[04:43:54] funny
[04:43:57] changelog goes back to...
[04:44:00] 87/02/21 22:19:25 22:19:25 cire (Eric B. Decker)
[04:44:04] 1987!
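The alternatives compared in the exchange above can be sketched on a throwaway directory. This is our own illustration, not the actual debian/rules recipe: the paths are invented, and only the `find ... -type f -exec chmod 644 {} +` form is taken from the log.

```shell
#!/bin/sh
# Sketch of the permission fix discussed above, on a throwaway tree.
set -e
dir=$(mktemp -d)
mkdir -p "$dir/images"
touch "$dir/images/mini-top.gif" "$dir/images/mini-bottom.gif"
chmod 755 "$dir/images"/*.gif    # simulate the wrongly-executable gifs

# -exec ... \; forks one chmod per file; -exec ... + batches many files
# into a single chmod invocation (the nicety jeremyb_ points out), much
# like piping to xargs but without a second process. -type f keeps
# directories, which need their x bit, out of the match -- something a
# plain `chmod -R 644` could not do.
find "$dir" -type f -exec chmod 644 {} +

ls -l "$dir/images/mini-top.gif" | cut -c1-10    # prints -rw-r--r--
rm -rf "$dir"
```

The `{} +` terminator has been in POSIX find for a long time; as the log notes, GNU findutils only gained a working implementation in 2005.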
[04:45:49] oh god
[04:45:51] ariel woke up
[04:45:54] and I still haven't gone to bed
[04:48:59] tha's a bad sign
[04:49:06] shoo!
[04:49:36] I wouldnt' say I "woke up" exactly, 'groggily sitting at keyboard" more like
[04:49:51] cursig gnome-shell 3 yet again
[05:06:53] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[05:08:26] Could someone merged https://gerrit.wikimedia.org/r/#/c/38252/ please?
[05:08:27] ;p
[05:09:03] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1
[05:09:12] I'm one step ahead of you paravoid, I'm in bed, but not asleep... Checked IRC and logged 2 new bugs...
[05:09:59] but we also have 2h time difference, so I win anyway :)
[05:10:28] uhm, you've commented that it's buggy
[05:12:16] "buggy"
[05:12:26] fatalmonitor has similar little flaws
[05:12:59] For all intents and purposes it works
[05:14:55] e.g.
[05:14:56] 84 Exception from line 637 of /usr/local/apache/common-local/php-1.21wmf12/includes/cache/MessageCache.php: Message key 'Filepage.css' does not appear to be a full key.
[05:14:56] 2 Exception from line 637 of /usr/local/apache/common-local/php-1.21wmf12/includes/cache/MessageCache.php: Message key 'Handheld.css' does not appear to be a full key.
[05:21:31] ping
[05:26:57] TimStarling: ping
[05:27:10] hello preilly
[05:27:20] TimStarling: May I PM
[05:27:24] yes
[05:37:14] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: Puppet has not run in the last 10 hours
[05:37:14] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: Puppet has not run in the last 10 hours
[05:37:14] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: Puppet has not run in the last 10 hours
[05:45:20] New patchset: Aude; "Update fywiki sort order, add note about default Wikibase settings" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56367
[06:06:44] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[06:08:54] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1
[06:09:14] PROBLEM - Puppet freshness on cp3010 is CRITICAL: Puppet has not run in the last 10 hours
[06:21:14] PROBLEM - Puppet freshness on virt1005 is CRITICAL: Puppet has not run in the last 10 hours
[06:25:57] ori-l: ping
[06:26:08] hey, preilly
[06:27:36] ori-l: I just saw your Vagrant change
[06:27:46] ori-l: have you thought about using sshfs?
[06:28:31] No, Vagrant doesn't support it by default
[06:28:41] instead of?
[06:28:56] How's that possible?
[06:29:00] VirtualBox Shared Folders and NFS are the options it supports out of the box
[06:29:13] ori-l: have you tried the VMWare driver yet?
[06:30:07] No -- it isn't free or open-source, so I don't expect it to be very popular with our community
[06:30:33] ori-l: hmm
[06:30:33] Jasper_Deng_busy: look up?
[06:30:43] ori-l: I just meant have you tried it?
[06:30:43] * Aaron|home admires http://www.time.com/time/photogallery/0,29307,2036928_2218542,00.html
[06:30:50] no, I haven't
[06:31:03] Have you? How does it compare to VirtualBox?
[06:31:23] Aaron|home: is that server porn?
[06:31:35] ori-l: it seems much much faster to me
[06:31:46] ori-l: I just started playing with it today
[06:33:14] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Thu Mar 28 06:33:08 UTC 2013
[06:33:45] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[06:34:24] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Thu Mar 28 06:34:14 UTC 2013
[06:34:26] Honestly, I'm mostly hoping that VMWare support motivates Oracle to make the integration with VirtualBox better
[06:34:44] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[06:34:57] their stewardship of open-source projects hasn't been inspiring
[06:35:05] ori-l: makes total sense to me
[06:35:14] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Thu Mar 28 06:35:09 UTC 2013
[06:35:14] ori-l: yeah totally
[06:35:44] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[06:35:54] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Thu Mar 28 06:35:53 UTC 2013
[06:36:39] based on oracle's current philosophy I'm surprised virtualbox still exists
[06:36:44] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[06:36:50] Ryan_Lane: yeah totally
[06:37:10] they have hardly touched it since they acquired sun
[06:37:18] Ryan_Lane: yeah
[06:37:39] Ryan_Lane: and the delta between it and VMWare is growing all the time
[06:37:42] indeed
[06:37:47] on Linux it's not a big deal
[06:37:55] on OS X and Windows it's a pain in the ass
[06:38:16] http://download.virtualbox.org/favicon.ico
[06:38:17] yeah
[06:38:28] ori-l: :D
[06:38:29] ha ha ha
[06:38:43] ori-l: that's hilarious
[07:06:42] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[07:08:52] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1
[08:06:24] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[08:07:35] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1
[08:07:54] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Thu Mar 28 08:07:53 UTC 2013
[08:08:25] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[08:08:54] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Thu Mar 28 08:08:52 UTC 2013
[08:09:24] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[08:09:54] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Thu Mar 28 08:09:47 UTC 2013
[08:10:24] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[08:14:38] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Thu Mar 28 08:14:31 UTC 2013
[08:15:24] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[08:17:24] PROBLEM - Puppet freshness on mw1160 is CRITICAL: Puppet has not run in the last 10 hours
[09:05:16] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[09:07:26] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1
[09:10:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:11:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time
[09:14:41] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Thu Mar 28 09:14:30 UTC 2013
[09:15:16] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[09:22:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:23:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time
[10:04:54] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[10:06:34] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 32479 MB (3% inode=99%):
[10:07:04] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1
[10:14:03] hi apergos
[10:14:09] yo
[10:14:35] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Thu Mar 28 10:14:31 UTC 2013
[10:14:42] apergos: did greg tell you that you are going to watch my deployment?
[10:14:54] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[10:15:07] no, he asked if I was available and I said I'd be happy to
[10:15:13] :-P
[10:15:22] how long from now is that window?
[10:15:26] apergos: it's now
[10:15:31] all righty
[10:16:07] apergos: to put it short... I was not able to reach working solution for https://gerrit.wikimedia.org/r/#/c/56345/ so I'm not deploying it, only Translate which is https://gerrit.wikimedia.org/r/56379
[10:17:42] ok
[10:33:45] Nikerabbit: but no revert of the other either? (would be useful to tell somewhere)
[10:39:12] !log nikerabbit synchronized php-1.21wmf12/extensions/Translate/ 'Translate to master'
[10:39:20] Logged the message, Master
[10:39:34] Nemo_bis: Siebrand is updating the bugs
[10:40:11] New patchset: Hashar; "package-builder learned 'cowbuilder'" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56382
[10:40:30] All information should be in bugs https://bugzilla.wikimedia.org/show_bug.cgi?id=1495 and https://bugzilla.wikimedia.org/show_bug.cgi?id=46579#c19 now.
[10:40:54] I'm writing my last email to some Wikimedia folks now, and will continue with other things.
[10:41:34] last email on the subject.
[10:41:46] * Nemo_bis just received bugmail
[10:53:11] PROBLEM - Puppet freshness on virt2 is CRITICAL: Puppet has not run in the last 10 hours
[10:58:07] PROBLEM - Puppet freshness on virt3 is CRITICAL: Puppet has not run in the last 10 hours
[10:58:14] I hate you puppet
[10:58:14] really
[10:58:22] most of the time
[11:02:25] New review: Hashar; "I am not paying attention:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56382
[11:06:46] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[11:08:26] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 31767 MB (3% inode=99%):
[11:08:56] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1
[11:24:34] ssh: connect to host gerrit.wikimedia.org port 29418: Connection timed out
[11:31:04] thanks freenode
[11:33:26] PROBLEM - search indices - check lucene status page on search20 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern found - 62805 bytes in 0.118 second response time
[11:33:46] PROBLEM - search indices - check lucene status page on search19 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern found - 62805 bytes in 0.110 second response time
[11:35:44] pmtpa right? guess I'm going to not care about those
[11:38:30] apergos: the search indices comes from a patch I wrote and got deployed yesterday
[11:38:56] ok
[11:39:15] apergos: we have an Icinga check for each of the search servers that verify whether each indices are fine. Some search boxes have issues though :(
[11:39:34] we should probably have fixed the issues before enabling the module. I will have a look at search19 and search20
[11:39:59] all right. we're basically serving search out of eqiad though, right?
[11:40:08] no idea
[11:40:22] ram said he will have a look at them anyway
[11:40:27] great
[11:41:22] ahh
[11:41:27] that is the enwiki.prefix db that failed
[11:41:29] when I look at network traffic for the search clusters there's steady to eqiad and not really to pmtpa
[11:42:08] !log search19 and search20 have enwiki.prefix marked as FAILED. (see: curl --silent http://search19.pmtpa.wmnet:8123/status |grep FAILED and curl --silent http://search20.pmtpa.wmnet:8123/status |grep FAILED).
[11:42:15] Logged the message, Master
[11:42:27] curl --silent http://search20.pmtpa.wmnet:8123/status |grep FAILED
[11:42:28] [FAILED] enwiki.prefix
[11:42:28] ;)
[11:42:37] I guess this might be a not peter thing
[11:42:48] yup and ram or ^demon
[11:42:53] ok
[11:43:33] I will fill a RT ticket
[11:43:53] thanks
[11:48:06] apergos: can you possibly acknowledge both errors in Icinga and refer to RT #4845 ?
[11:48:38] ah lemme see, last time I tried to ack things there I didn't have permission
[11:48:53] !log search19 and search20 issue with enwiki.prefix is in {{rt|4845}}
[11:49:00] Logged the message, Master
[11:50:43] I will try
[11:51:14] not authorized :(
[11:51:20] link is https://icinga-admin.wikimedia.org/cgi-bin/icinga/cmd.cgi?cmd_typ=34&host=search19&service=search+indices+-+check+lucene+status+page
[11:51:32] and https://icinga-admin.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=search20&service=search+indices+-+check+lucene+status+page
[11:53:35] yeah I'm not authorizd still
[11:53:36] sorry
[11:53:54] at least you tried, thank you for that :-]
[11:53:57] sure
[11:54:03] I am sure someone will ping notpeter
[11:54:40] hopefully not for several hours!
[11:55:50] maybe ^demon can fix it
[11:56:04] I guess now I shuld be watching the servers for blips (that the lightning deployment is done)
[11:56:09] so far it's been nice and boring
[11:57:20] are you referring to the icinga notifications?
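The one-off `curl | grep FAILED` check logged above can be wrapped in a small helper for reuse. This is our own sketch: the function name is invented, while the status URLs, port, and the `[FAILED] enwiki.prefix` marker format are taken from the log.

```shell
#!/bin/sh
# count_failed: count indices marked [FAILED] on a lucene-search-2 status
# page read from stdin (hypothetical helper; the "[FAILED] <index>" line
# format is as quoted in the log above).
count_failed() {
    grep -c '\[FAILED\]'
}

# Live usage, against the hosts and port from the !log entry:
#   for h in search19.pmtpa.wmnet search20.pmtpa.wmnet; do
#       printf '%s: ' "$h"
#       curl --silent "http://$h:8123/status" | count_failed
#   done

# Offline demonstration against a sample status snippet:
printf '[OK] dewiki\n[FAILED] enwiki.prefix\n' | count_failed    # prints 1
```

Note that `grep -c` exits non-zero when it counts zero matches, so under `set -e` a fully healthy host would abort the loop; dropping `-c` and testing the exit status is the stricter alternative.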
[11:57:26] no
[11:57:36] nike rabbit's deployment
[11:59:17] ohhh
[12:05:29] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[12:07:09] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 31258 MB (3% inode=99%):
[12:07:41] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1
[12:07:49] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Thu Mar 28 12:07:48 UTC 2013
[12:08:29] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[12:08:52] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Thu Mar 28 12:08:40 UTC 2013
[12:09:30] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[12:09:31] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Thu Mar 28 12:09:22 UTC 2013
[12:10:29] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[12:14:39] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Thu Mar 28 12:14:28 UTC 2013
[12:15:29] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[12:33:02] New patchset: Hashar; "package-builder learned 'cowbuilder'" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56382
[12:33:54] notice: /Stage[main]/Misc::Package-builder/Misc::Package-builder::Builder[pbuilder]/Misc::Package-builder::Image[pbuilder-lucid]/Exec[imaging lucid for pbuilder]/returns: E: Could not perform immediate configuration on 'util-linux'.Please see man 5 apt.conf under APT::Immediate-Configure for details. (2)
[12:33:57] that is more and more cryptic
[12:34:01] i guess I should remove lucid
[13:05:46] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[13:07:56] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1
[13:08:26] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 30645 MB (3% inode=99%):
[13:12:36] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours
[13:56:15] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49710
[13:59:33] New patchset: Ottomata; "Fixing README and comments" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56395
[13:59:53] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56395
[14:05:40] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[14:07:50] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1
[14:08:21] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 30099 MB (3% inode=99%):
[14:17:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:18:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.142 second response time
[14:40:38] New patchset: Matmarex; "(bug 45776) Set $wgCategoryCollation to 'uca-uk' on all Ukrainian-language wikis" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56400
[14:40:53] New patchset: Ottomata; "Fixing some more comments" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56401
[14:41:09] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56401
[14:50:48] New patchset: Demon; "Switch nostalgiawiki to use Nostalgia from extension" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56402
[14:51:28] New patchset: Hashar; "package-builder learned 'cowbuilder'" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56382
[14:51:55] New patchset: Ottomata; "Adding misc/limn.pp to manage setup of WMF hosted limn sites. Installing reportcard.wikimedia.org on stat1001." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56403
[14:56:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:56:32] New patchset: Ottomata; "Adding misc/limn.pp to manage setup of WMF hosted limn sites. Installing reportcard.wikimedia.org on stat1001." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56403
[14:57:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.162 second response time
[14:58:43] New patchset: Ottomata; "Adding misc/limn.pp to manage setup of WMF hosted limn sites. Installing reportcard.wikimedia.org on stat1001." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56403
[15:06:06] New patchset: Ottomata; "Adding misc/limn.pp to manage setup of WMF hosted limn sites. Installing reportcard.wikimedia.org on stat1001." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56403
[15:09:32] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[15:11:41] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1
[15:12:11] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 29559 MB (3% inode=99%):
[15:12:34] New patchset: Hashar; "package-builder learned 'cowbuilder'" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56382
[15:22:40] New patchset: Hashar; "package-builder learned 'cowbuilder'" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56382
[15:31:34] New review: Hashar; "Deployed on integration-jobbuilder instance using puppetmaster::self. That is generating the images ..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56382
[15:32:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:33:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.131 second response time
[15:35:12] New patchset: Demon; "In sync-dir, actually perform the syntax check" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56105
[15:35:12] New patchset: Demon; "Move scap source location from fenari to tin" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56104
[15:35:16] New patchset: Demon; "Basic puppetization of dsh" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56107
[15:35:17] New patchset: Demon; "Remove some node lists" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56108
[15:38:11] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: Puppet has not run in the last 10 hours
[15:38:11] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: Puppet has not run in the last 10 hours
[15:38:11] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: Puppet has not run in the last 10 hours
[15:46:07] New patchset: Ottomata; "Adding misc/limn.pp to manage setup of WMF hosted limn sites. Installing reportcard.wikimedia.org on stat1001." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56403
[15:46:17] hi hashar!
[15:46:20] would you look that last one over for me?
[15:46:28] https://gerrit.wikimedia.org/r/56403
[15:46:49] my main question is whether or not misc/limn.pp makes sense
[15:47:32] i mean, the puppet stuff there makes sense, but i'm not sure if I should create a file called misc/limn.pp
[15:47:35] or maybe just limn.pp
[15:47:42] or can I put a define in role class?
[15:47:44] role/limn.pp?
[15:50:44] New review: Hashar; "(2 comments)" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/56348 [15:52:06] ottomata: hhhhhiiii [15:52:35] ottomata: don't you have a limn module nowadays ? [15:53:00] ah you have [15:53:04] ottomata: that should be a role class [15:53:11] ottomata: manifests/role/limn.pp [15:53:23] ottomata: consider having a role for production (default) and another for labs [15:53:37] something like: role::limn and role::limn::labs [15:56:00] New review: Ottomata; "Well, this could be an artifact of the way I created this branch." [operations/debs/python-jsonschema] (debian/experimental) - https://gerrit.wikimedia.org/r/56064 [15:56:15] hashar [15:56:20] andrew [15:56:20] well, i don't want a role::limn class directly [15:56:23] :) [15:56:29] but [15:56:32] why not ? [15:56:44] limn is just a piece of software, not a functional thing [15:56:49] that'd be like having a role::apache class [15:57:00] makes sense [15:57:03] a role::reportcard class will make sense [15:57:13] (reportcard is a limn site) [15:57:24] essentially, that's what misc::statistics::sites::reportcard is [15:57:30] I kept that there for consistency [15:57:30] also [15:57:41] i wasn't sure if defines belonged in a role class [15:57:47] define role::limn::instance? [15:57:58] basically, I want a limn instance define that abstracts out WMF specific settings for limn instances [15:58:35] :-o what does it mean when ls shows a dir like this: [15:58:35] and, I think the way I'm doing the $::realm conditional inside of that is nicer than separate labs/production classes [15:58:36] d????????? ? ? ? ? ? instances [15:58:37] ? [15:58:38] this way you don't have to think about it [15:58:45] just [15:58:56] include misc::statistics::sites::reportcard [15:59:01] andrewbogott: that is a gluster issue [15:59:01] will work in labs and production [15:59:02] and do the right thing [15:59:36] andrewbogott: I just reboot the instance in such a case.
I haven't found out how to properly restart Gluster. If reboot fails, you want to look at the Gluster volume which might be corrupted. [16:00:10] ottomata: yeah I understand your POV. I had the same :-] [16:01:00] so, i like the content of misc/limn.pp [16:01:10] hashar: s/loosed/lost/ FYI [16:01:12] i'm just not sure if we are trying to stop creating files in misc/ [16:01:18] ottomata: but the recommended way nowadays is to have a role for prod and one for labs. For example the role::gerrit::labs and role::gerrit::production [16:01:37] if role::gerrit will work in labs and production, isn't that nicer? [16:01:49] so modules should be as non WMF specific as possible and the roles used to instantiate the module with WMF settings. [16:01:54] misc/* must die (eventually) [16:02:34] hmm, ok, more general question [16:02:39] where do WMF specific defines belong? [16:02:58] that is a good question :-] In misc? [16:03:08] ahhhh [16:03:09] azeaze [16:03:15] I got it [16:03:24] so by using a role class for prod and another one for labs [16:03:28] you no longer need your define :-] [16:03:40] but then i'd have to duplicate the same logic for every limn instance [16:03:51] there will be more [16:03:56] global dev wants one, mobile, etc.
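(Editor's note: the define ottomata is arguing for might look roughly like the sketch below. Every name here — the `limn::instance` and `limn::instance::proxy` resources, the parameter names, the wmflabs hostname scheme — is an illustrative assumption reconstructed from this conversation, not the actual contents of misc/limn.pp.)

```puppet
# Hypothetical sketch of a WMF-specific wrapper define for limn sites.
# It hides the $::realm conditional so callers don't have to think
# about labs vs. production.
define misc::limn::instance($port = 8081) {
  # One limn server process per instance name (assumed module interface).
  limn::instance { $name:
    port => $port,
  }

  # Pick the public hostname based on the realm, so the same resource
  # declaration works unchanged in labs and in production.
  $server_name = $::realm ? {
    'labs'  => "${name}.${::instancename}.wmflabs.org",
    default => "${name}.wikimedia.org",
  }

  limn::instance::proxy { $name:
    limn_port   => $port,
    server_name => $server_name,
  }
}
```

With something like this, each new limn site is one line — `misc::limn::instance { 'reportcard': }`, or with a `port =>` override for a second instance on the same node — which is the DRY property being discussed.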
[16:04:21] unless role::limn::labs crafts the server alias by using $::instancename [16:04:42] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [16:04:58] but still, role::limn doesn't make sense I think [16:05:04] role::reportcard is what makes sense [16:05:22] RECOVERY - DPKG on virt2 is OK: All packages OK [16:05:41] also, you can install multiple instances on a single node [16:05:46] so I need a define to abstract the logic [16:06:22] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 31028 MB (3% inode=99%): [16:06:43] ottomata: something like http://dpaste.com/1037795/ [16:06:52] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [16:06:56] ah multiple instances on a node need a bit more work yeah :( [16:07:06] probably want to give a path or a different port [16:07:28] which should in turn be part of the $name to make sure the limn::instance::proxy() always has a unique name [16:08:09] something like 'limn-server001-port8080' 'limn-server001-port8081' [16:08:12] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Thu Mar 28 16:08:05 UTC 2013 [16:08:43] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [16:08:47] right [16:08:53] that's what misc::limn::instance does [16:09:03] yup :) [16:09:53] so right now [16:09:58] all that I have to do to set up a new limn instance [16:10:00] in labs or in production [16:10:01] is [16:10:10] misc::limn::instance { 'reportcard': } [16:10:12] PROBLEM - Puppet freshness on cp3010 is CRITICAL: Puppet has not run in the last 10 hours [16:10:12] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Thu Mar 28 16:10:10 UTC 2013 [16:10:18] or to install another instance on the same machine [16:10:34] misc::limn::instance { 'global-dev': port => 8082 } [16:10:42] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [16:10:47] so, a role class [16:11:11] would only be
necessary per limn site [16:11:22] role::limn::production { 'reportcard': } [16:11:22] role::limn::labs { 'reportcard': } [16:11:23] role::limn::labs { 'reportcard': port => 8082, } [16:11:33] that is essentially the same [16:11:46] but instead of using some if( $::realm ) , you have the realm set per the name of the class [16:11:52] but that's a define, is it ok to have a 'role define' [16:12:00] you can even have ::labs and ::production subclass to extend a ::common class. [16:12:02] so, 2 arguments here [16:12:09] 1: where should this define live [16:12:22] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Thu Mar 28 16:12:13 UTC 2013 [16:12:22] 2: should I have classes named after realms [16:12:26] ah the define is used to get a uniq name isn't it ? [16:12:30] I think 2 is a bigger argument, [16:12:36] yes [16:12:42] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [16:13:02] role::limn::labs { 'reportcard': } [16:13:02] role::limn::labs { 'reportcard-sandbox': port => 8082 } [16:13:03] :( [16:13:44] is that an argument for 1. or 2.? :) [16:13:53] let's debate those separately :) [16:14:02] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Thu Mar 28 16:13:52 UTC 2013 [16:14:36] (i wonder if paravoid is around to chime in :) ) [16:14:37] so yeah a define in role, maybe that is possible [16:14:42] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [16:14:59] yeah, i started with this in a role [16:15:10] but I stopped because I realized that the define itself isn't really a role [16:15:22] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Thu Mar 28 16:15:19 UTC 2013 [16:15:27] it's not like I would say "the role of this machine is apache" [16:15:42] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [16:15:48] rather, I would say "the role of this machine is the wikimedia blog webhost" [16:16:03] yeah that does not really make sense.
So you would want a role::reportcard ? [16:16:03] the actual use of apache is irrelevant [16:16:05] yeah, role::reportcard is cool [16:16:12] but as a class [16:16:14] not a define [16:16:35] that would then directly call limn::instance::proxy [16:16:36] i'd still need a define to abstract the WMF specific details of setting up limn for hosting the reportcard class [16:16:41] yeah, I could do that [16:16:42] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Thu Mar 28 16:16:32 UTC 2013 [16:16:42] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [16:16:48] the reason I have the misc::limn::instance define [16:16:59] is just to abstract the logic of doing that multiple times [16:16:59] so [16:17:00] yes. [16:17:02] RECOVERY - Puppet freshness on virt2 is OK: puppet ran at Thu Mar 28 16:16:54 UTC 2013 [16:17:13] yeah I understand your point [16:17:20] but if you ever need to add something specific to prod or labs [16:17:23] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Thu Mar 28 16:17:20 UTC 2013 [16:17:28] the logic inside of misc::limn::instance could move inside of role::reportcard, and would make sense there, if it were the only limn instance we were going to set up [16:17:30] you will need to add some more if( $::realm ) [16:17:42] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [16:17:47] hmmm [16:18:01] i can't see what else we'd need at the moment [16:18:02] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Thu Mar 28 16:17:59 UTC 2013 [16:18:30] role::reportcard { limn::instance::proxy { 'reportcard-limn': limn_port => 8082, server_name => 'reportcard.wikimedia.org' } } [16:18:35] then copy paste for the labs part :-] [16:18:42] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [16:18:46] yeah that'd be fine, i'm just trying to DRY as much as possible [16:18:52] easy to add a new one (just copy paste the lines, replace
the instance name and server_name. done [16:18:58] any time i'm supposed to copy/paste I think 'hmmm, this could be done better' [16:19:02] honestly, both approaches are fine [16:19:17] seems faidon prefers the role based one and avoids having a class react differently based on the $::realm [16:19:22] but also, that would require a different class include in labs vs in prod [16:19:33] like in labsconsole, i'd have to select the ::labs class explicitly [16:19:40] indeed [16:19:48] i feel like labs and production are just different environments, they shouldn't have that stuff reflected in class names, but that's just a feeling [16:19:54] that also lets you add things specific to labs [16:20:14] yeah, i see your point, if there were lots of things specific to labs, this would get messy [16:20:15] but [16:20:20] hm [16:20:22] what I would do in that case [16:20:27] I had the exact same approach when I started adapting the production class for the beta cluster [16:20:28] use whatever's easiest [16:20:29] is not make the user of a class choose [16:20:40] no one approach is the best for all circumstances [16:20:45] i would do a conditional include of ::labs or ::production classes [16:20:56] in an interface class, so the user doesn't have to know [16:21:00] like [16:21:18] that's really the same as doing a big case statement, honestly [16:21:31] class role::reportcard { [16:21:31] if production include role::reportcard::prod [16:21:31] else if labs include role::reportcard::labs [16:21:31] } [16:21:31] or even [16:21:35] just [16:21:35] although then you CAN override it [16:21:47] class { "role::reportcard::$::realm": } :p [16:22:12] PROBLEM - Puppet freshness on virt1005 is CRITICAL: Puppet has not run in the last 10 hours [16:22:15] mark, you mean you can override it if it is a class? [16:22:16] yeah.
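(Editor's note: spelled out, the realm-dispatch pattern sketched in the pseudocode above might look like the following. This is a hedged illustration only — the `::production`/`::labs` class bodies and the labs hostname are invented for the example, not code from the repository.)

```puppet
# Illustrative only: an interface class that dispatches to a
# realm-specific role class, so node definitions just say
# "include role::reportcard" in either realm.
class role::reportcard {
  case $::realm {
    'production': { include role::reportcard::production }
    'labs':       { include role::reportcard::labs }
    default:      { fail("unknown realm ${::realm}") }
  }
  # or, equivalently (and overridable, as mark notes):
  #   class { "role::reportcard::${::realm}": }
}

class role::reportcard::production {
  limn::instance::proxy { 'reportcard-limn':
    limn_port   => 8082,
    server_name => 'reportcard.wikimedia.org',
  }
}

class role::reportcard::labs {
  # Hypothetical labs variant: same proxy, labs-specific hostname.
  limn::instance::proxy { 'reportcard-limn':
    limn_port   => 8082,
    server_name => "reportcard.${::instancename}.wmflabs.org",
  }
}
```

The trade-off debated in the log is visible here: the two realm classes duplicate the proxy declaration, which is exactly what the `misc::limn::instance` define with an internal `$::realm` conditional avoids.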
[16:22:26] but that's not very important [16:22:38] yeah, in this case probably not, since the labs vs production differences are very small [16:22:45] use case when you have a lot of overlap, and just minor differences [16:22:59] use different classes if they're very different, or all you're doing is calling other classes with different parameters [16:23:02] something like that [16:23:24] yeah hm [16:23:30] puppet's support for class inheritance etc is pretty funky and limited, so sometimes case statements and the like are just easier [16:23:45] yeah, especially with parameterized classes [16:23:54] yes [16:24:11] so that's why I said, use whatever's easiest for every situation [16:24:40] ok cool, i think i'll stick with this then, i'll add a bit of this discussion to the comments in the patchset [16:24:46] the pragmatic approach ;) [16:24:46] quick q mark, since you are chiming in [16:24:50] New patchset: Hashar; "0.6.1-2 gbp.conf and tweaks" [operations/debs/python-voluptuous] (master) - https://gerrit.wikimedia.org/r/56168 [16:24:59] i've got a define in misc/limn.pp, misc::limn::instance [16:25:08] New review: Hashar; "(3 comments)" [operations/debs/python-voluptuous] (master) - https://gerrit.wikimedia.org/r/56168 [16:25:09] that abstracts out the logic needed to set up a limn instance [16:25:15] i'm just not sure if misc/limn.pp is the proper place for that [16:25:29] i like the define, just not sure of the best place to put it [16:25:57] I'd say it's probably slightly better in misc/limn.pp than in the reportcard role class [16:26:14] it's much like varnish::instance probably [16:26:27] yeah [16:26:45] is manifests/misc/limn.pp ok? or would just manifests/limn.pp be better?
[16:26:51] New patchset: Aklapper; "bugzilla_report.php: Add query and formatting for list of urgent issues" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56348 [16:26:54] misc is fine [16:27:01] limn isn't exactly a big part of our infrastructure ;) [16:27:17] that's sort of where I used to draw the line, but it's a bit cluttered [16:28:53] New review: Aklapper; "hashar: Gosh, thanks for spotting this. So much for testing on another machine and failing to do pro..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56348 [16:30:09] ahh [16:30:17] mark: thank for the line drawing :-] [16:30:45] ottomata: so I guess you can take the pragmatic approach [16:31:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:32:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [16:35:18] ottomata: it's not in the right place ;) [16:35:21] it should be a module [16:36:55] no, ryan there is a module [16:37:02] this is the WMF abstraction of the settings for the module [16:37:12] then shouldn't it be in role? [16:37:17] its a define [16:37:57] (and all of our roles should really be in a module, but I digress) [16:38:55] misc is full of crap that is full manifests [16:40:08] * jeremyb_ just found Ryan_Lane at the bottom of http://zomobo.net/Wikipedia:About [16:41:46] ah, yeah, the sf php group meetup I did [16:42:19] but the link doesn't work! :-P [16:44:59] * jeremyb_ sees LeslieCarr's on this week... 
[16:45:11] * jeremyb_ points LeslieCarr to RT 4761 [16:47:47] jeremyb_: it worked for me [16:48:44] Ryan_Lane: i meant that it doesn't go to an actual video about you [16:48:56] yes it does [16:49:06] i got further on second try [16:49:16] idk what was wrong before [16:55:03] Change abandoned: Hashar; "(no reason)" [operations/debs/python-voluptuous] (master) - https://gerrit.wikimedia.org/r/56172 [16:56:52] New review: Andrew Bogott; "> The *-labs-proxy files are still in the module." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43886 [16:57:39] New patchset: Demon; "Show notice to users who are using legacy skins" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56408 [17:01:21] New review: Ottomata; "DOHp, sorry. Dunno why I thought they were, read that completely wrong." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43886 [17:05:12] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [17:07:20] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [17:07:50] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 31425 MB (3% inode=99%): [17:10:24] New review: Ottomata; "I discussed this for a while with hashar and mark in #ops." [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/56403 [17:10:27] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56403 [17:22:45] ottomata: congrats :) [17:24:17] New review: Dzahn; "there are 3 conditions but just a single [OR]" [operations/apache-config] (master) C: -1; - https://gerrit.wikimedia.org/r/49069 [17:24:33] thanks! [17:32:13] and I am off for today *wave* [17:33:13] New patchset: Jeremyb; "fix gerrit header logo link" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56413 [17:40:26] oh noes [17:40:29] why did you do that ? 
[17:40:55] mark, I have a rough version of MobileFrontend that does not vary HTML by 21 device types, can we discuss it? [17:41:12] yes [17:41:14] New patchset: Reedy; "Latin -> Cyrillic for ukwikivoyage aliases" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56415 [17:41:32] although an email discussion will also be good [17:41:44] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56415 [17:42:20] !log reedy synchronized wmf-config/InitialiseSettings.php [17:42:27] Logged the message, Master [17:43:06] mark, ok. I'll email the full information (to ops@?), but now I just want a quick sanity check [17:43:11] New patchset: Ottomata; "Removing logic to install and setup apache from limn::instance::proxy. You must do this yourself" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56416 [17:43:26] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56416 [17:44:22] ok [17:45:23] yes, ops@ and/or a public list (wikitech-l?) [17:45:36] mark, basically, this design introduces a header that tells whether the phone is WAP-only, and we vary on it. device-specific ResourceLoader modules are implemented with an autodetect module which varies the load.php by X-Device (we will need to send these requests to m instead of bits for this) [17:46:15] this does not add extra RL variance: what was varied by URL now gets varied by a header [17:46:46] and because of the header variance we need to send the RL stuff to m. you mean? [17:47:46] because we don't have the device detection on bits [17:47:54] yes, because only m gets X-Device [17:47:57] right [17:47:58] New patchset: Reedy; "(bug 45776) Set $wgCategoryCollation to 'uca-uk' on all Ukrainian-language wikis" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56400 [17:48:04] yeah that sounds a lot better [17:48:10] so then we only vary over wap/non-wap?
[17:48:23] for main html [17:48:26] and on device for RL stuff [17:48:49] yes [17:49:13] VCL change: https://gerrit.wikimedia.org/r/#/c/32866/ [17:50:14] well [17:50:20] wouldn't it be better to send that header from mediawiki? [17:50:47] like? [17:51:09] no, sorry brainfart [17:51:18] yes, this is fine I think [17:52:01] mark, thanks - I'll write a more detailed email now:) [17:52:02] I think this is a good step [17:52:14] perhaps in the future would it be possible to reduce variance even for resourceloader modules? [17:52:19] i'm not sure what that would involve [17:53:10] mark, we considered this, but ancient phones also have a crappy CSS support, so not so quickly:) [17:53:42] right [17:53:49] but then for newer phones you might be able to [17:54:30] also, what's the current hit rate for mobile requests on bits? [17:54:49] no idea, but the overall hit rate is very high [17:54:50] 99.5% or so [17:54:58] so indeed [17:55:01] might not be an issue at all [17:55:17] Asher said it was like 97% seconds after a flush [17:55:23] that's right [17:55:53] !log creating search indices for new wikivoyages [17:55:59] right now it's around 99.7% hit rate for everything [17:56:00] Logged the message, Master [17:57:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:58:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.189 second response time [17:59:31] awjr / MaxSem: as I indicated in our monday meeting, I noticed that currently the Cache-Control header sent by mobile varnish doesn't allow any client side caching [17:59:39] is that intended? [17:59:53] it looks to me to be an artifact of how varnish was setup rather than a conscious choice [18:00:07] * MaxSem looks [18:00:09] mark oh i misunderstood what jon explained to me [18:00:19] mark is this in the mobile varnish backend? 
[18:00:19] I think we should allow client side caching with revalidate [18:00:35] awjr: it's now in the frontend vcl, I moved it last week [18:00:49] it used to be set in the backend when sending to the frontend, to control the frontend's caching behaviour [18:01:06] I thought that was silly and changed it so that the frontend sets it when sending out to clients, but I think we should change it [18:02:09] interesting - am i reading correctly that this essentially makes it so clients will cache if the request gets handled by the backend (eg MW sets s-maxage) but otherwise disables client-side caching? [18:02:18] so, the varnish backend used to set that header to set the ttl for the frontend, but then that header is passed on to clients as well [18:02:35] awjr: yes, except for shared proxies (squids out there etc) [18:02:41] so those will cache for max 5 mins [18:03:00] i guess a good portion of mobile clients actually are behind proxies, but still [18:03:11] why not allow clients to cache for much longer as long as they revalidate [18:03:16] eek Cache-Control:private, s-maxage=0, max-age=0, must-revalidate [18:03:28] i think that makes sense mark [18:03:28] yep, I see no rational reason for this [18:03:29] last monday I also put in some temp hacks to enable caching for some assets [18:03:42] like favicon.ico and resource loader resources that have a cache timestamp anyway [18:04:06] in /*Temp test */ [18:04:09] yep that [18:04:09] that looks reasonable to me too [18:04:27] mark what would you recommend as a sane max age?
[18:04:28] nicest would be if we could have mediawiki give separate instructions for *our varnish cluster* and the outside world [18:04:32] for client side [18:04:44] awjr: I believe we normally use one month [18:04:52] but it doesn't matter much [18:04:55] we can start slowly [18:04:55] yeah that's what MW sets as s-maxage [18:04:57] and increase it [18:05:23] sounds reasonable [18:05:31] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [18:05:33] i don't think mobile phones have very big caches, but it's up to them to decide ;) [18:05:39] :p [18:05:57] cool [18:06:29] i think this is also a platform engineering question, but nicest would be if mediawiki could control this separately [18:06:42] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56400 [18:06:42] now in squid (and varnish) we override the cache-control header with some fixed settings, and it's kinda hacky [18:06:53] blech [18:06:59] if mediawiki could tell varnish separately what it should do from what should be sent as Cache-Control to clients, then... that would be very flexible [18:07:11] that would be nice [18:07:21] varnish can look at any arbitrary header of course [18:07:37] yeah, i think it would be pretty easy to implement [18:07:41] for now though, will you update the vcl mark or do you want one of us to take a stab at it?
[18:07:41] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [18:07:57] !log reedy synchronized wmf-config/InitialiseSettings.php [18:07:59] why don't you take a stab at it as you know your application best, but i'm happy to review and comment on it [18:08:04] Logged the message, Master [18:08:11] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 30799 MB (3% inode=99%): [18:08:15] sounds good mark, thanks [18:12:24] !log adding reportcard and analytics .wikimedia.org CNAMEs to stat1001 [18:12:30] Logged the message, Master [18:15:56] New patchset: Ottomata; "Requiring that limn::instance is set up before limn::instance::proxy in misc::limn::instance" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56419 [18:16:16] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56419 [18:17:09] New patchset: Odder; "(bug 46489) Set wmgBabelCategoryNames for Ukrainian Wikinews" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56420 [18:18:23] PROBLEM - Puppet freshness on mw1160 is CRITICAL: Puppet has not run in the last 10 hours [18:21:11] New patchset: Ottomata; "Fixing path to limn proxy vhost file since I've removed the dependence on the apache module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56421 [18:23:39] !log restarting lucene on pool4 search nodes to pickup new wikivoyage indices [18:23:46] Logged the message, Master [18:24:38] ops: would you like to have a copy of https://bugzilla.wikimedia.org/show_bug.cgi?id=46530 (Search index not updating on en.wikipedia.org) in RT? 
Wondering if it's somehow related to https://rt.wikimedia.org/Ticket/Display.html?id=4845 [18:24:41] New patchset: Odder; "(bug 46639) Add flood to wgAddGroups for bureaucrats on Meta" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56422 [18:24:56] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56421 [18:26:11] PROBLEM - search indices - check lucene status page on search13 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:26:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:26:51] PROBLEM - search indices - check lucene status page on search14 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:27:06] hey paravoid, thanks very much for the +2 on I834416683. I'm around now so we can merge it. [18:27:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.180 second response time [18:28:56] New patchset: Odder; "(bug 46489) Set wmgBabelCategoryNames for Ukrainian Wikinews" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56420 [18:30:41] RECOVERY - search indices - check lucene status page on search19 is OK: HTTP OK: HTTP/1.1 200 OK - 60006 bytes in 0.125 second response time [18:31:59] New patchset: Ottomata; "Making sure mod rewrite is installed for stat1001 sites" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56424 [18:32:01] RECOVERY - search indices - check lucene status page on search20 is OK: HTTP OK: HTTP/1.1 200 OK - 60006 bytes in 0.125 second response time [18:32:14] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56424 [18:33:42] greg-g: sorry I wasn't on IRC. INvestigating now [18:33:54] LeslieCarr: ^^ [18:34:07] thanks kaldari [18:34:12] any idea when this started? 
[18:34:21] New patchset: MaxSem; "Add HTTP header X-WAP for MobileFrontend" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32866 [18:34:53] https://graphite.wikimedia.org/dashboard/ [18:35:08] LeslieCarr: which login do I use for that site? [18:35:10] lemme get exact times [18:35:13] labsconsole/gerrit [18:35:17] k [18:35:20] thanks [18:36:12] actually looks like there was a big spike on 3/18 and then continued high numbers since then [18:37:12] whoa, that was a while ago [18:37:18] hmm [18:37:42] LeslieCarr: I'll need to debug with you about logging in, I'm using the same username (Greg Grossmeier) and password that I use on wikitech, but no luck. [18:37:47] I think it might be related to the watchlist feature in AFT [18:38:29] greg-g: no, use your labs shell username [18:38:40] oh [18:39:10] greg-g: gjg? [18:39:26] yeah [18:39:36] are you in? [18:40:08] I had a B of a time setting up labsconsole, uhhh, who helped me, that was my first day... py? I was getting stupid errors (he agreed) [18:40:16] jeremyb_: not yet, doesn't make sense [18:40:33] i can't get in now actually [18:40:47] let me try wikitech itself [18:41:12] LeslieCarr: can you tell if the errors are coming from de.wiki, en.wiki or both? [18:41:21] i can't get in [18:41:30] (to graphite. wikitech works) [18:41:38] hrm, maybe there's some ldap thing [18:41:46] i don't think i can see on graphite which wiki [18:41:50] I can get into graphite, but I don't know what I'm looking for [18:42:03] jeremyb_: yeah, I can logout/in with my username (Greg Grossmeier on wikitech, paired with gjg shell) but not graphite [18:43:27] hrm, can't see those stats i think [18:43:32] up by the top [18:43:35] 500 and 5xx [18:43:55] there's a lot of confusing data [18:44:18] RECOVERY - search indices - check lucene status page on search14 is OK: HTTP OK: HTTP/1.1 200 OK - 52931 bytes in 2.930 second response time [18:44:23] ah, I see the graph now [18:44:39] greg-g should be in wmf, right? 
[18:45:33] jeremyb_: ? [18:45:44] greg-g: ldap :) [18:45:57] oh, right, you're not asking me [18:45:59] ;) [18:46:04] should be [18:46:04] i think [18:46:20] (he's not) [18:46:40] $ groups gjg [18:46:40] gjg : wikidev project-bastion [18:46:52] :( [18:47:01] ah [18:49:41] https://gerrit.wikimedia.org/r/50297 [18:49:55] i knew it was changed for ishmael, didn't realize graphite was done too [18:50:23] * jeremyb_ wonders if we can find some broader group than wmf. at least for graphtie [18:50:48] yeah [18:51:08] LeslieCarr: https://wikitech.wikimedia.org/wiki/Manual_for_ops_on_duty#LDAP_group_changes [18:51:11] https://wikitech.wikimedia.org/w/index.php?title=Help:Access&oldid=5461#Giving_users_Labs_access.2C_if_they_already_have_an_SVN_account [18:51:18] found it in history, docs changed [18:51:33] are you guys sure it's AFTv5 and not the old ArticleFeedback? [18:52:05] I see some errors from the old ArticleFeedback in fatal.log, but nothing from aftv5 [18:52:14] greg-g, LeslieCarr: ^ [18:52:37] hrm [18:52:41] i am not certain [18:52:51] the timing seems to work out is why [18:52:54] !log adding gpg to wmf ldap group [18:52:57] greg said AFTv5 in his email [18:53:01] Logged the message, Master [18:53:05] make that "gjg" :p [18:53:15] that's because i said aftv5 i think [18:53:16] jeremyb_: greg-g ^ [18:53:55] is fatal.log the right place to look for 500 errors? [18:53:58] $ groups gjg [18:53:58] gjg : wikidev wmf project-bastion [18:54:26] i'm about to start checking out from locke [18:55:11] New review: Thehelpfulone; "Simple change" [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/56422 [18:55:31] mutante: heh, I'll go by gpg, too, I guess :) [18:55:44] ^ isn't that comment inviting for someone to take a look for +2? 
;) [18:55:47] [28-Mar-2013 18:25:01] Fatal error: Class 'SpecialArticleFeedback' not found at /home/wikipedia/common/php-1.21wmf12/extensions/ArticleFeedback/populateAFStatistics.php on line 294 [18:55:47] #0 /home/wikipedia/common/php-1.21wmf12/extensions/ArticleFeedback/populateAFStatistics.php(294): PopulateAFStatistics::populateHighsLows() [18:55:47] #1 /home/wikipedia/common/php-1.21wmf12/extensions/ArticleFeedback/populateAFStatistics.php(154): PopulateAFStatistics->populateHighsLows() [18:55:47] #3 /home/wikipedia/common/php-1.21wmf12/extensions/ArticleFeedback/populateAFStatistics.php(613): require_once('/home/wikipedia...') [18:56:01] kaldari: Are you looking at cronjob fail? [18:56:03] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56413 [18:56:03] greg-g: the typo was just in the !log line though:) modify-ldap-group --addmembers=gjg wmf [18:56:40] Reedy: I was just grepping fatal.log for 'ArticleFeedback' on fluorine [18:56:41] mutante: I still can't log into graphite, is there a replication delay? [18:56:54] welll, it looks like populateAFStatistics.php is a maintenance job [18:57:02] Reedy: how do you look at cronjob fail? [18:57:07] Nothing else in the apache logs on fenari either [18:57:23] kaldari: I'm presuming that's run as a cronjob, as it's there every hour at 25 past [18:57:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:57:52] mutante: no delay... maybe it's cached [18:58:05] greg-g: i dunno then :/ just followed docs [18:58:13] no, he's in the group [18:58:16] you did right [18:58:32] good [18:58:57] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56422 [18:59:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [19:00:21] greg-g, LeslieCarr: unfortunately, matthias, who is the dev for AFT, is asleep right now.
Since this has already been going on for a while, would there be any objection to waiting until we can hear back from him before doing any reverting or disabling? [19:00:37] New patchset: Asher; "- fix issue where export wouldn't work under high load, leading to graphs flipping between no data and huge spikes when a response was received minutes later" [operations/software] (master) - https://gerrit.wikimedia.org/r/56428 [19:01:19] kaldari: Asleep? [19:01:29] Reedy: it's a thing people do [19:01:30] :P [19:01:34] especially as it isn't clear if it's coming from old AFT or AFTv5 [19:01:36] At 8 or 9pm? [19:01:37] ori-l so funny [19:01:54] i <3 Reedy, he knows that [19:01:56] I dunno, maybe not asleep. Just resting his eyes :) [19:02:06] ori-l: I was up all night, then apparently slept most of the day.. [19:02:16] Change merged: Asher; [operations/software] (master) - https://gerrit.wikimedia.org/r/56428 [19:02:23] people get confused when Europeans work at night and sleep during the day:) [19:03:28] I get confused when I do it [19:04:16] Reedy: i created those search indices for new wikivoyages, but i'm not sure yet if it worked right :p [19:04:21] binasher: would you mind if I truncated the NavTiming log tables? There is a bug in IE9's implementation that I fixed a few days ago. Prior to that IE9 was generating skewed measurements. Rather than carefully prune those away, I'd prefer to just truncate the table. [19:04:31] mutante: IIRC it takes 24 hours or something [19:04:46] mutante: Besides, it'll take them a few weeks to notice :D [19:05:01] Reedy: aah, that would explain it. well, i will claim it is resolved because i did what the docs said [19:05:54] icinga thinks manganese hasn't had a puppet run for nearly 3 hours? [19:06:11] 3 hours? it wouldn't report until it's been 10 hours?! [19:06:16] The last Puppet run was at Thu Mar 28 18:39:59 UTC 2013 (26 minutes ago). 
[19:06:19] icinga lies [19:07:24] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [19:07:38] mutante: i was using https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=manganese&service=Puppet+freshness ... [19:07:39] and it just can't forget db11.. meeeh [19:08:48] load average: 281.37, 107.56, 53.82 [19:08:49] yay [19:09:18] whoa [19:09:20] which box? [19:09:24] neon [19:09:37] !log stopping gmetad and ganglia-parser on neon [19:09:49] Logged the message, Master [19:09:50] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [19:10:00] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 31069 MB (3% inode=99%): [19:10:04] and i predict now you will soon see some recoveries [19:10:09] hah [19:10:15] !log replaced collector on professor with locally built version from I2bc117d5620b9e545a8e5a6cf3b654ee835b75d7 (testing fix for issues only under very high load) [19:10:22] Logged the message, Master [19:12:14] !log pkill -u snmptt on neon, restarting snmptt deamon [19:12:20] Logged the message, Master [19:12:30] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Thu Mar 28 19:12:20 UTC 2013 [19:12:30] kaldari: so looking through the 503's only ... [19:12:45] actually, anyone else want to help me ? someone with better knowledge skills at this ? [19:13:20] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [19:13:45] and it's back to load average: 5.76 and decreasing [19:13:50] LeslieCarr: I'm happy to help if I can [19:14:10] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Thu Mar 28 19:14:07 UTC 2013 [19:14:18] but ganglia_parser just came back, started by puppet.. so... 
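neon's spike above (load average 281.37 falling back to 5.76 once gmetad was killed) is easiest to reason about per CPU; a small sketch of parsing the three load-average fields the way they appear in /proc/loadavg (the threshold here is an arbitrary illustrative value, not a Wikimedia alerting rule):

```python
def load_averages(loadavg_line):
    """Parse the 1/5/15-minute fields of a /proc/loadavg line."""
    return tuple(float(x) for x in loadavg_line.split()[:3])

def overloaded(loadavg_line, ncpus, per_cpu_threshold=4.0):
    """Flag a box whose 1-minute load per CPU exceeds the threshold."""
    one_min = load_averages(loadavg_line)[0]
    return one_min / ncpus > per_cpu_threshold
```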
[19:14:20] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [19:14:27] let me look at the logs again [19:14:37] thanks ori-l [19:14:44] mutante: if it keeps dying, just kill it in the puppet manifest [19:14:49] and let analytics know [19:14:59] LeslieCarr: merged on sockpuppet not just gerrit, right? [19:15:00] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Thu Mar 28 19:14:51 UTC 2013 [19:15:07] mutante: btw, still no go with me logging into graphite, anything else I should test to see if the wmf group works on another system? [19:15:20] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [19:15:30] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Thu Mar 28 19:15:28 UTC 2013 [19:15:37] New patchset: Demon; "Updating for 2.6-rc0-73-gaab0ec6" [operations/debs/gerrit] (master) - https://gerrit.wikimedia.org/r/56485 [19:15:41] greg-g: try ishmael.wikimedia.org [19:16:09] although that's the same box... 
[19:16:20] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [19:16:21] jeremyb_: no go [19:17:05] New patchset: Demon; "Updating for 2.6-rc0-76-g52fb5ae" [operations/debs/gerrit] (master) - https://gerrit.wikimedia.org/r/56485 [19:17:06] !log manually deleting db11 references from icinga configs and restarting it [19:17:13] Logged the message, Master [19:17:40] New review: Demon; "War available: https://integration.wikimedia.org/nightly/gerrit/wmf/gerrit-2.6-rc0-76-g52fb5ae.war" [operations/debs/gerrit] (master) - https://gerrit.wikimedia.org/r/56485 [19:17:40] New patchset: Ottomata; "Maxing user mod proxy is enabled for reportcard" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56493 [19:18:07] greg-g: https://icinga-admin.wikimedia.org/icinga [19:18:10] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56493 [19:18:13] hrm actually there's another thought/possibility i am thinking [19:18:23] jeremyb_: nope [19:18:29] greg-g: this is fun! [19:18:36] so varnish sends a 500 (am i right?) when doesn't have something cached [19:18:37] LeslieCarr: ok [19:18:44] jeremyb_: how many places can we get greg to type his password?! [19:18:51] jeremyb_: ah, thanks, icinga was a good idea [19:19:04] mutante: was just grepping for virt0 [19:20:16] maybe something changed which isn't properly getting cache hits [19:21:01] LeslieCarr: why would it do that? [19:21:06] (why send a 500) [19:21:21] New patchset: MaxSem; "Sync X-Device rules with MobileFrontend" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56502 [19:21:22] LeslieCarr: I have to finish up an Echo deployment and then would like to eat something. Is it OK if we stay in a holding pattern for a bit on the AFT errors?
[19:21:34] can someone please review ^^^ [19:21:52] back to the backend servers [19:21:54] yeah we can hold [19:21:57] thanks [19:23:56] kaldari / LeslieCarr: https://gerrit.wikimedia.org/r/#/c/56503/ [19:24:01] !log kaldari synchronized php-1.21wmf12/extensions/Echo 'syncing Echo on wmf12' [19:24:08] Logged the message, Master [19:24:37] ori-l: oh, nice [19:25:57] New patchset: Ottomata; "Fixing path to apache vhost error log now that I am not using apache module." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56504 [19:26:15] ori-l, LeslieCarr: I'll go ahead and deploy this fix for the old ArticleFeedback [19:26:18] oh actually squid (forgot text is still on squids… sigh) [19:26:24] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56504 [19:26:41] mutante: since we're talking greg's logins, how do I get access to RT? I've found a few situations where it would've been useful for me to have access. [19:27:01] although I don't know if this is the issue that Leslie is flagging or not [19:27:01] ah never mind [19:27:04] http://squid-web-proxy-cache.1019090.n4.nabble.com/How-to-disable-TCP-NEGATIVE-HIT-td1031337.html [19:27:05] i was wrong [19:27:14] so this is squid getting 503's form the mediawiki servers [19:28:37] LeslieCarr: are we looking at the same problem? [19:28:49] i'm looking at the ArticleFeedback fatals [19:28:50] if this is relevant [19:28:59] ah i am looking at squid errors in general [19:29:03] this is the url that is 500ing the most: [19:29:03] http://commons.wikimedia.org/w/index.php?title=MediaWiki:Filepage.css&action=raw&maxage=2678400&usemsgcache=yes&ctype=text%2Fcss&smaxage=2678400 [19:29:11] LeslieCarr: Do you recall why you thought the errors were coming from AFTv5 specifically? 
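The "napkin calculation" that follows (roughly 99% of ~10000 current 500s being the Filepage.css URL) is just a status/URL tally; a sketch over simplified log lines of the form "<status> <url>" (real squid access-log lines carry many more fields, so the split would need adjusting):

```python
from collections import Counter

def top_error_urls(lines, status="500", n=3):
    """Count URLs per status code and return the n most frequent offenders."""
    counts = Counter()
    for line in lines:
        parts = line.split(None, 1)  # "<status> <url>"
        if len(parts) == 2 and parts[0] == status:
            counts[parts[1]] += 1
    return counts.most_common(n)
```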
[19:29:14] as in like 99% of the time [19:29:19] 99% of the 500s [19:29:21] are that url [19:29:24] because of the timing [19:29:33] (rough napkin calculation based on 10000 current 500s) [19:29:45] why does this cause fatal exceptions ? [19:29:50] because i can occasionally get it to work [19:29:58] oh, i just got an exception, ok [19:30:55] haha, binasher we both just ran pretty much the same command :p [19:31:13] ahha - https://bugzilla.wikimedia.org/show_bug.cgi?id=46612 [19:31:20] Reedy: you around ? [19:31:21] look up the exception in the log [19:31:33] Mmm [19:31:41] i'm not seeing it anywhere; the hash is 210df484 [19:31:45] ottomata: i ran it a couple days ago tho too and it's the same now as then.. LeslieCarr what did you see that made you think aft? [19:32:00] the timing of the deploy and the 500 increase [19:32:23] Commons doesn't use AFT [19:33:08] LeslieCarr: good to check what requests are throwing errors before pointing :) [19:33:32] !log starting gmetad on neon again [19:33:34] Logged the message, Master [19:34:11] so reedy i pinged you because you reported the bug -- is anyone checking out the bug ? [19:34:18] Don't think so [19:34:28] haha of course :) [19:34:40] I just noted it to be prevalent so logged it [19:35:50] !log olivneh synchronized php-1.21wmf12/extensions/ArticleFeedback/populateAFRevisions.php 'Fix fatal error triggered by populateAFRevisions.php cronjob when AFT is not loaded.' [19:35:57] Logged the message, Master [19:36:12] Aaron|home: Reedy: I3066d8dbebc97abcc0567d71625f995d62549b4c maybe [19:37:53] eh, probably not [19:39:57] though it probably did start with march 22 21:03 logmsgbot: aaron rebuilt wikiversions.cdb and synchronized wikiversions files: Set all non-Wikipedias back to 1.21wmf12 again. [19:40:05] not sure if there were other relevant changes in wmf12 [19:40:36] binasher, while you are here, are we monitoring dropped udp packet counters, esp.
on udp2log machines [19:40:57] either from /proc/net/udp or from netstat -s p packet receive errors? [19:41:08] and, if not, shall I? [19:41:32] can we try wmf12 over again, plz? :/ [19:41:56] PROBLEM - Apache HTTP on mw1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:41:56] PROBLEM - Apache HTTP on mw1018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:42:01] ottomata: probably just the janky crap from udplog samples.. i asked in an rt ticket maybe a year ago to use /proc/net/udp data instead but i didn't personally do it [19:42:06] PROBLEM - LVS HTTPS IPv4 on wikivoyage-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:42:16] PROBLEM - Apache HTTP on mw1063 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:42:16] PROBLEM - Apache HTTP on mw1083 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:42:16] PROBLEM - Apache HTTP on mw1047 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:42:16] PROBLEM - Apache HTTP on mw1176 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:42:16] PROBLEM - Apache HTTP on mw1163 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:42:16] PROBLEM - Apache HTTP on mw1076 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:42:16] PROBLEM - Apache HTTP on mw1099 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:42:17] PROBLEM - Apache HTTP on mw1046 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:42:17] PROBLEM - Apache HTTP on mw1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:42:18] PROBLEM - Apache HTTP on mw1057 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:42:18] PROBLEM - Apache HTTP on mw1055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:42:19] PROBLEM - Apache HTTP on mw1052 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:42:19] PROBLEM - Apache HTTP on mw1168 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:42:20] ok cool, i'm 
going to do it [19:42:20] PROBLEM - Apache HTTP on mw1181 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:42:20] PROBLEM - Apache HTTP on mw1090 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:42:21] PROBLEM - Apache HTTP on mw1092 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:42:21] aahhh! [19:42:24] oh [19:42:25] oh noes [19:42:26] PROBLEM - Apache HTTP on mw1215 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:42:26] PROBLEM - Apache HTTP on mw1179 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:42:27] PROBLEM - Apache HTTP on mw1042 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:42:27] PROBLEM - Apache HTTP on mw1069 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:42:27] PROBLEM - Apache HTTP on mw1174 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:42:27] PROBLEM - Apache HTTP on mw1082 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:42:36] PROBLEM - Apache HTTP on mw1028 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:42:37] PROBLEM - Apache HTTP on mw1098 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:42:37] PROBLEM - Apache HTTP on mw1019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:42:37] PROBLEM - Apache HTTP on mw1101 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:42:37] PROBLEM - Apache HTTP on mw1056 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:42:37] PROBLEM - Apache HTTP on mw1185 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:42:37] PROBLEM - Apache HTTP on mw1038 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:42:37] PROBLEM - Apache HTTP on mw1054 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:42:37] ottomata: we should probably do https://rt.wikimedia.org/Ticket/Display.html?id=1873 [19:42:46] ερ? [19:42:48] sigh [19:42:49] and could then nagios the ganglia [19:42:55] but uh.. 
yeah, ^^ [19:43:04] oh, killing gnglia again and then killing it in puppet [19:43:05] picked random one, mw1054, apache procs running [19:43:14] New patchset: Spage; "Disable E3Experiments extension on all wikis" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56509 [19:43:23] ah great, thanks binasher [19:43:38] errors reported in wikitech [19:43:40] RR_CANNOT_FORWARD, errno (11) Resource temporarily unavailable at Thu, 28 Mar 2013 19:40:14 GMT [19:43:44] RECOVERY - Apache HTTP on mw1185 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 4.818 second response time [19:43:44] RECOVERY - Apache HTTP on mw1030 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 5.483 second response time [19:43:44] RECOVERY - Apache HTTP on mw1081 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.055 second response time [19:43:44] RECOVERY - Apache HTTP on mw1066 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.066 second response time [19:43:44] RECOVERY - Apache HTTP on mw1038 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 6.846 second response time [19:43:45] RECOVERY - Apache HTTP on mw1074 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 7.315 second response time [19:43:45] RECOVERY - Apache HTTP on mw1071 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.631 second response time [19:43:46] RECOVERY - Apache HTTP on mw1217 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 8.752 second response time [19:43:46] RECOVERY - Apache HTTP on mw1101 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 8.818 second response time [19:43:47] RECOVERY - Apache HTTP on mw1209 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 5.260 second response time [19:43:47] RECOVERY - LVS HTTP IPv4 on appservers.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 62802 bytes in 0.208 second response time [19:43:48] RECOVERY - Apache HTTP on mw1061 is OK: 
HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 5.125 second response time [19:43:48] RECOVERY - Apache HTTP on mw1026 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 5.500 second response time [19:43:49] RECOVERY - Apache HTTP on mw1213 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 1.750 second response time [19:43:49] RECOVERY - Apache HTTP on mw1166 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 4.657 second response time [19:43:49] that was fast [19:43:50] RECOVERY - Apache HTTP on mw1097 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 2.781 second response time [19:43:50] RECOVERY - Apache HTTP on mw1171 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.069 second response time [19:43:51] RECOVERY - Apache HTTP on mw1048 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.108 second response time [19:43:51] RECOVERY - Apache HTTP on mw1110 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.070 second response time [19:43:52] RECOVERY - Apache HTTP on mw1167 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.072 second response time [19:43:52] RECOVERY - Apache HTTP on mw1036 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.071 second response time [19:43:53] RECOVERY - Apache HTTP on mw1084 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 1.274 second response time [19:43:53] RECOVERY - Apache HTTP on mw1164 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.060 second response time [19:43:54] RECOVERY - Apache HTTP on mw1080 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.599 second response time [19:43:54] RECOVERY - Apache HTTP on mw1029 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 2.925 second response time [19:43:55] RECOVERY - Apache HTTP on mw1018 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.980 second response time [19:43:59] RECOVERY - Apache 
HTTP on mw1094 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 4.134 second response time [19:43:59] RECOVERY - Apache HTTP on mw1100 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 5.339 second response time [19:43:59] RECOVERY - Apache HTTP on mw1087 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 5.833 second response time [19:43:59] RECOVERY - Apache HTTP on mw1059 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 6.203 second response time [19:43:59] RECOVERY - Apache HTTP on mw1020 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 4.837 second response time [19:44:00] RECOVERY - Apache HTTP on mw1079 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 9.540 second response time [19:44:00] RECOVERY - Apache HTTP on mw1022 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 8.908 second response time [19:44:01] RECOVERY - Apache HTTP on mw1037 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 9.168 second response time [19:44:09] good start, ok cool, so /proc/net/snmp is the proper place then? [19:44:13] better than netstat -s? [19:44:23] !log killed gmetad on neon [19:44:30] !log asher synchronized wmf-config/db-eqiad.php 'pulling db1051' [19:44:30] and those are more global numbers then looking for processes/ports in /proc/net/udp [19:44:30] right? [19:44:30] Logged the message, Mistress of the network gear. 
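ottomata's question just above (is /proc/net/snmp the proper place, better than netstat -s?) comes down to reading the paired "Udp:" header and value lines of that file; a hedged sketch of the parsing, which also surfaces the NoPorts and RcvbufErrors counters discussed later in the channel:

```python
def udp_counters(snmp_text):
    """Pair up the 'Udp:' header and value lines from /proc/net/snmp."""
    udp_lines = [l.split()[1:] for l in snmp_text.splitlines() if l.startswith("Udp:")]
    header, values = udp_lines[0], udp_lines[1]
    return dict(zip(header, (int(v) for v in values)))
```

Since InErrors counts a superset of RcvbufErrors, the difference isolates the non-buffer failures (bad checksums and, on v6, IPsec rejections, per the kernel-source reading below).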
[19:44:30] RECOVERY - Apache HTTP on mw1055 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 4.522 second response time [19:44:36] Logged the message, Master [19:44:39] reported recovery in wikitech [19:44:44] *wikimedia-tech [19:44:49] RECOVERY - Apache HTTP on mw1039 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 9.432 second response time [19:44:50] RECOVERY - Apache HTTP on mw1075 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 2.809 second response time [19:44:59] RECOVERY - Apache HTTP on mw1017 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 4.566 second response time [19:44:59] RECOVERY - Apache HTTP on mw1107 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 8.552 second response time [19:45:27] die gmetad ... [19:45:35] it's a bit screwy that in MessageCache::get both $langcode and $isFullKey are used to signal whether the key is complete [19:45:41] that inconsistency is the source of the error here [19:45:44] oh look in tech [19:45:45] New patchset: Lcarr; "deactivating ganglios" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56510 [19:45:47] but i still don't know where it's coming from [19:45:53] that wasn't just neon [19:46:00] that was actual issues [19:46:11] thanks apergos for the heads up for -tech [19:46:16] yw [19:46:52] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56510 [19:47:24] New review: Dzahn; "yea, ganglia_parser and gmetad keep messing up neon" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56510 [19:47:46] Aaron|home: there are some *huge* inserts into the job table on enwiki that are blocking up all the slaves [19:47:54] uh oh [19:47:59] lagged slaves = apaches not using them = site down [19:48:05] please not have to do with wikidata [19:48:11] * apergos twitches [19:48:13] apergos: probably [19:48:19] grumbfgirg [19:48:23] apergos: can you kill the wikidata job cron? 
[19:48:41] man I keep getting distracted today [19:48:59] binasher: i sync'd a change to make the aft populate statistics script die early if the ArticleFeedbackSpecial class is not loaded. i hope it is not getting inserted in a loop somewhere. [19:49:12] uh lemme look, just a sec [19:50:06] looks like refreshLinks [19:51:16] LeslieCarr: binasher: I'm only just catching up - is there still any indication AFTv5 is related to the issues? or anythine else I can do? [19:51:22] !log asher synchronized wmf-config/db-eqiad.php 'returning db1051' [19:51:29] Logged the message, Master [19:51:34] mlitn: no.. leslie - can you reply to that aft thread? [19:51:45] done on hume temporarily (live hack) [19:51:54] Aaron|home: how many refreshLInks jobs can be added per single job insert? [19:52:19] did it not go through ? [19:52:48] LeslieCarr: oh yeah, it did [19:53:24] Change merged: Mattflaschen; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56509 [19:53:39] around $wgUpdateRowsPerJob [19:53:40] that should do it until puppet updates over there again [19:53:43] Aaron|home: the current enwiki master binlog is almost entirely refreshlinks job queue inserts [19:54:06] if only there was some other place to put the queue [19:54:13] Has any user actually reported a problem related to this message cache error? seems to be 9-10% of all exception log lines are related [19:54:28] Reedy: which one? [19:55:06] Message key '(Filepage|Common|Handheld|Print|Monobook).css' does not appear to be a full key [19:55:29] they're all like: INSERT /* JobQueueDB::{closure} */ [19:55:45] what's with the no ip address or user in the comment? 
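Aaron's "around $wgUpdateRowsPerJob" answer above describes how refreshLinks work gets batched into job-table inserts; a toy sketch of that partitioning (the batch size is a MediaWiki configuration variable; 500 here is illustrative, and `partition_jobs` is not an actual MediaWiki function):

```python
def partition_jobs(page_ids, rows_per_job=500):
    """Split a backlink list into refreshLinks-style batches of at most rows_per_job."""
    return [page_ids[i:i + rows_per_job]
            for i in range(0, len(page_ids), rows_per_job)]
```

At this batch size, a template edit touching 1,200 pages would enqueue three jobs; a heavily used enwiki template can thus flood the master binlog with inserts, as binasher observed.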
[19:56:13] when the job runners run queries, 127.0.0.1 appears [19:56:13] that means it's from a job runner [19:57:00] Reedy: if you look at includes/actions/RawAction.php:140, you'll see MessageCache::singleton()->get() called with $title->getDBkey() as the key, and the final parameter ($isFullKey) set to true, which is meant to indicate that the key contains the language code in the format 'en/foo' [19:57:15] Reedy: $title->getDBkey() clearly does not return values in that format, but what confuses me is how that ever worked [19:57:26] Reedy: most are "INSERT /* JobQueueDB::{closure} 127.0.0.1 */ " which is what i'd expect from a job runner [19:59:02] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [19:59:43] ori-l, thanks for python-udp-gmond ! [19:59:56] mind if I take it and stick it in puppet? [20:00:03] gonna add a few more metrics to the generic udp one I think [20:00:12] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [20:00:24] ottomata: sure, go for it [20:00:26] glad it's useful [20:00:42] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 30716 MB (3% inode=99%): [20:01:06] yeah totally awesome, gonna save me tons of time [20:02:22] I also have a udp sequence id skip ganglia plug-in somewhere if you want it, though perhaps that metric is already collected by an existing script [20:02:56] skip ganglia plugin? [20:03:08] do you mean the PacketLossLog tailer? [20:03:37] i don't remember what that is actually polling; if it's looking for gaps in sequence IDs then you don't need my script [20:03:42] New patchset: Rfaulk; "mod. reduce max process count for user metrics project hosted on stat1001."
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/56512 [20:04:01] misc/maintenance.pp class misc::maintenance::wikidata if you decide you need them off or altered longer than the next puppet run [20:08:02] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Thu Mar 28 20:08:01 UTC 2013 [20:08:18] Heya, E3 is starting its deployment. Nothing big, just turning off one extension and changes to some others. [20:08:54] spagewmf: hang on a sec & wait for a green light -- there are some site issues [20:09:00] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56512 [20:09:02] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [20:09:45] ori-l np [20:09:53] yea, ori-l, that's what it is doing, [20:10:02] but in C [20:10:06] part of udp2log package [20:10:09] packet-loss.cpp [20:10:11] (C++*) [20:11:11] !log add hostname samarium.wikimedia.org and pdns-update [20:11:16] Logged the message, Master [20:14:42] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Thu Mar 28 20:14:32 UTC 2013 [20:15:02] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [20:15:18] spagewmf: well, seems quiet, so go ahead [20:16:27] ori-l, do you know what NoPorts is? [20:16:31] OH [20:16:32] haha [20:16:38] i see your description in the code right now [20:16:40] ha [20:16:46] nm [20:16:59] was googling/manpaging, but your code documentation is better :) [20:19:21] binasher: do you know: in /proc/net/snmp, is InErrors the same number as RcvbufErrors [20:20:25] ah hm, i think they aren't [20:29:28] hiaayy notpeter or maybe Ryan_Lane, whatcha think? [20:29:45] should udp2log specific ganglia plugins go in files/udp2log or files/ganglia/plugins? 
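ori-l's "udp sequence id skip" plugin and udp2log's packet-loss.cpp both boil down to counting holes in a monotonically increasing sequence-number stream; a toy Python restatement of that gap-counting idea (not the actual C++ collector):

```python
def count_gaps(seq_ids):
    """Count messages presumed lost: holes between consecutive sequence IDs."""
    lost, prev = 0, None
    for s in seq_ids:
        if prev is not None and s > prev + 1:
            lost += s - prev - 1
        prev = s
    return lost

def loss_rate(seq_ids):
    """Fraction of the expected stream that never arrived."""
    received, lost = len(seq_ids), count_gaps(seq_ids)
    total = received + lost
    return lost / total if total else 0.0
```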
[20:29:59] files ganglia plugins [20:30:07] until we split things apart into modules [20:31:38] mk cool [20:31:42] danke [20:31:44] ottomata: grepping through the relevant files it looks like InErrors is incremented on ENOBUFS ("No buffer space available"), ENOMEM ("Out of Memory"), when the checksum is 0 or otherwise invalid, and when IPSec policy specifies the packet should be rejected. but the last one applies only to ipv6 [20:32:14] hm, aye, and RcvBufErrors is more specific? [20:32:18] so, are we ok with the 500 issue now, LeslieCarr ? [20:32:39] also, the debug of the jobqueue thing related to hume dying there for a bit: https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&c=Miscellaneous+pmtpa&h=hume.wikimedia.org&tab=m&vn=&mc=2&z=medium&metric_group=ALLGROUPS ? [20:33:04] I think LeslieCarr is at lunch, but [20:33:08] still tons of 500s [20:33:15] k [20:33:23] mainly from this: [20:33:23] http://commons.wikimedia.org/w/index.php?title=MediaWiki:Filepage.css&action=raw&maxage=2678400&usemsgcache=yes&ctype=text%2Fcss&smaxage=2678400 [20:36:06] ottomata: looking at http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/net/ipv4/udp.c (grep for 'UDP_MIB_RCVBUFERRORS ') it looks like RcvBufErrors is incremented specifically on failure to allocate a buffer in memory [20:36:14] so yes, it's more specific [20:36:23] hmm k [20:36:24] cool [20:36:26] good to have both then [20:36:27] danke! [20:37:18] yeah, i guess InErrors - RcvbufErrors = invalid checksums and ipsec rejections [20:43:00] greg-g: https://gerrit.wikimedia.org/r/#/c/56519/ should fix it, but I'd be reluctant to move forward with it without input from platform peeps.. I still don't really get how this was never an issue before. [20:43:18] ori-l: understand, thanks [20:43:21] New patchset: Mattflaschen; "Enable GuidedTour on mwlwiki and ptwiki." 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56520 [20:44:41] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56520 [20:47:06] ok Ryan_Lane, another puppet q, [20:47:11] i want to monitor generic udp stats [20:47:32] on a good number of machines (front caches, apaches, udp2log machines) [20:47:46] I just need to install 2 ganglia plugin files [20:47:59] should I make a new .pp file (udp_monitoring.pp???) [20:48:15] or is there a good place for generic os/network level monitoring stuff? [20:48:22] i'm looking around but not finding anything [20:48:35] ummm [20:50:06] greg-g: I think this change exposed the bug: https://gerrit.wikimedia.org/r/#/c/44224/ [20:50:11] even though it did not introduce it [20:50:38] binasher: did you see that patch just now? [20:50:39] ori-l: hmmmmmm [20:51:00] ori-l: mind commenting on https://bugzilla.wikimedia.org/show_bug.cgi?id=46612 ? [20:51:02] Reedy: https://gerrit.wikimedia.org/r/#/c/56522/ [20:52:58] New patchset: Ottomata; "UDP and udplog socket stats into ganglia, yay! See: RT 1873" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56524 [20:56:06] New review: Ottomata; "NOT READY YET!" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/56524 [20:56:23] ottomata: sorry, working on something distracting [20:56:40] s'ok [20:56:43] no hurry [20:56:50] greg-g: done [20:57:00] ori-l: thanks much [20:59:02] PROBLEM - Puppet freshness on virt3 is CRITICAL: Puppet has not run in the last 10 hours [21:04:44] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [21:06:24] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 31269 MB (3% inode=99%): [21:06:54] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [21:08:56] New patchset: Ottomata; "UDP and udplog socket stats into ganglia, yay!" 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/56524 [21:31:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:32:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [21:35:44] New patchset: Ottomata; "udp2log socket stats into ganglia, yay!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56524 [21:36:01] !log olivneh synchronized php-1.21wmf12/extensions/NavigationTiming [21:36:08] Logged the message, Master [21:37:19] New review: Ottomata; "more generic UDP stats to come in a separate commit." [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/56524 [21:37:24] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56524 [21:40:46] starting scap [21:44:20] ori-l, there's an unrelated wmf-config change showing up that you +2'd, "(bug 46639) Add flood to wgAddGroups for bureaucrats on Meta" https://gerrit.wikimedia.org/r/#/c/56422/ [21:44:33] is it OK to deploy that along with E3's config changes? 
[21:44:39] yep [21:45:41] !log olivneh synchronized php-1.21wmf12/extensions/NavigationTiming/modules/ext.navigationTiming.js 'Fix mobileMode property name (was: 'mobile')' [21:45:48] Logged the message, Master [21:50:02] scap is reporting a few permission problems, 'Copying to fenari from 10.0.5.8...rsync: failed to set times on "/usr/local/apache/common-local/live-1.5": Operation not permitted (1)' [21:50:26] New review: Faidon; ".PHONY usually goes at the end, but meh :)" [operations/debs/python-voluptuous] (master) C: 2; - https://gerrit.wikimedia.org/r/56168 [21:50:54] * Aaron|home still grins at the word "voluptuous" [21:51:18] likes how jenkins-bot is sorry [21:51:43] Aaron|home: heh [21:52:27] paravoid: feel free to submit https://gerrit.wikimedia.org/r/#/c/54324/; i'm around [21:53:23] scap reporting several unreadable index.lock files in /home/wikipedia/common/php-1.21wmf11/.git/modules/extensions/*/index.lock [21:54:22] ori-l: path conflict, needs manual rebase [21:54:30] spagewmf: you can disregard, i think [21:54:41] paravoid: ah, ok. i'll make the small additional change you requested too, then. [21:55:58] the index.lock files are over a week old, owned by bsitu, roan, mlitn. I'm going to rm -f them [21:57:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:58:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 1.064 second response time [22:02:06] !log spage Started syncing Wikimedia installation... 
: E3: Updated GettingStarted and GuidedTour extensions
[22:02:13] Logged the message, Master
[22:03:42] New patchset: Ori.livneh; "Add 'eventlogging' puppet module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54324
[22:04:21] faidon: have to go interview a candidate, but go ahead and submit if it's ok
[22:06:16] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[22:06:48] icinga-wm: that's just .. i manually deleted that from your config.. so you are still actively re-creating that
[22:07:26] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1
[22:07:56] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 30715 MB (3% inode=99%):
[22:10:34] !log bugzilla admin stuff: noc@w.o is now a disabled account cuz im tired of folks who arent on the alias attempting to sign up with it
[22:10:41] Logged the message, RobH
[22:11:01] New patchset: Ottomata; "Puppetizing udp2log instances on analytics nodes." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56537
[22:12:26] !log removed some log files from ms1004 - it was full
[22:12:33] Logged the message, Mistress of the network gear.
[22:13:04] New review: Ottomata; "Faidon, this is not necessary to merge if you feel like we shouldn't. I can do what I need to do ri..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56537
[22:13:19] mutante: I ran puppetstoredconfigclean.rb db11.pmtpa.wmnet
[22:13:20] for you
[22:13:56] j^: have you had a chance to mess around with UW on test2wiki?
[22:17:45] I think scap's 'failed to set times on "/usr/local/apache/common-local/live-1.5": Operation not permitted' are because it's a symlink to w, and rsync can't update the fs attrs of a symlink.
[22:18:28] paravoid: thanks :) does that clean it from db?
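[Editor's sketch] The stale-lock cleanup spagewmf describes in the log (index.lock files "over a week old" under .git/modules, removed with rm -f) can be done more cautiously with find. The directory layout below is a hypothetical stand-in for /home/wikipedia/common/php-1.21wmf11/.git/modules/extensions/*/index.lock, and -mtime +7 corresponds to "over a week old":

```shell
# Demo in a throwaway directory; the real paths were under
# /home/wikipedia/common/php-1.21wmf11/.git/modules/extensions/*/index.lock
tmp=$(mktemp -d)
mkdir -p "$tmp/modules/SomeExtension"            # hypothetical extension dir
# create a lock file and back-date it to March 15 2013, i.e. ~2 weeks stale
touch -t 201303150000 "$tmp/modules/SomeExtension/index.lock"
# list index.lock files older than 7 days; after reviewing the list,
# rerunning the same find with -delete instead of -print removes them
stale=$(find "$tmp" -name index.lock -mtime +7 -print)
echo "$stale"
rm -rf "$tmp"
```

Listing with -print before switching to -delete avoids removing a lock that a live git process is actually holding.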
looking
[22:19:08] scap is reporting dozens of "mw121: Copying to mw121 from mw10.pmtpa.wmnet...failed", perhaps this symlink problem is the only issue
[22:20:20] !log spage Finished syncing Wikimedia installation... : E3: Updated GettingStarted and GuidedTour extensions
[22:20:27] Logged the message, Master
[22:23:32] K4-713 scap finished, sorry to delay you. If you're going to run scap note my comments ^ about copy errors
[22:24:00] spagewmf: Thanks!
[22:24:41] greg-g: I'm starting now.
[22:26:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:27:05] great, thanks spagewmf and K4-713
[22:27:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.122 second response time
[22:31:26] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Thu Mar 28 22:31:16 UTC 2013
[22:32:11] binasher: when might the https://gerrit.wikimedia.org/r/#/c/35139/ index changes be applied?
[22:32:16] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[22:32:46] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Thu Mar 28 22:32:40 UTC 2013
[22:33:16] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[22:33:16] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Thu Mar 28 22:33:09 UTC 2013
[22:33:16] RECOVERY - Puppet freshness on ms1004 is OK: puppet ran at Thu Mar 28 22:33:15 UTC 2013
[22:33:45] Aaron|home: oh.. it was never added to the schema_changes page, has it just been waiting for me to apply in prod? is mainly needed on commons or all?
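[Editor's sketch] spagewmf's diagnosis above, that rsync's "failed to set times" on live-1.5 happens because it is a symlink to w, comes down to the fact that updating a symlink's own timestamps needs a no-follow variant of utimes(), which an unprivileged user on some kernel/filesystem combinations may not be allowed to use. GNU touch -h exercises the same operation (temp paths here are hypothetical, not the production layout):

```shell
tmp=$(mktemp -d)
ln -s w "$tmp/live-1.5"          # dangling symlink, like live-1.5 -> w
# -h (--no-dereference) acts on the symlink itself rather than its
# missing target; this is the step that failed for scap with
# "Operation not permitted" when rsync tried to preserve link times
touch -h -t 201303280000 "$tmp/live-1.5"
readlink "$tmp/live-1.5"         # link target itself is untouched: prints w
rm -rf "$tmp"
```

If I recall correctly, later rsync releases grew an --omit-link-times option for exactly this situation; on 2013-era versions the error was effectively cosmetic when the symlink content itself was copied correctly.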
[22:33:50] !log khorn synchronized php-1.21wmf12/languages/Language.php 'Reverting changes in language fallback behavior'
[22:33:57] Logged the message, Master
[22:34:43] !log khorn synchronized php-1.21wmf12/includes/cache/MessageCache.php 'Reverting changes in language fallback behavior'
[22:34:50] Logged the message, Master
[22:35:00] binasher: all (commons,en in particular)
[22:35:07] *enwiki
[22:36:39] Aaron|home: how about on monday?
[22:36:45] sure
[23:06:32] !log cvn Installing subversion from apt on cvn-app1 (for pywikipedia)
[23:06:38] Logged the message, Master
[23:06:41] * Krinkle undoes
[23:06:56] wrong chan
[23:07:53] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[23:09:03] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1
[23:09:33] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 30145 MB (3% inode=99%):
[23:12:22] New patchset: Ori.livneh; "Remove configs for LastModified and E3Experiments extensions" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56542
[23:14:34] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Thu Mar 28 23:14:31 UTC 2013
[23:14:53] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[23:18:54] LeslieCarr, (or any opsen) am I correct in thinking that if you add an alias to mchenry, you could reroute traffic sent to an OTRS address else where? Say for example if we had an email address that's currently an OTRS queue, email@wikipedia.org, if you added the alias in mchenry would you be able to temporarily send emails to that address elsewhere instead (a mailman mailing list)?
[23:20:34] Thehelpfulone: please ask what you want to do, not theoretical questions
[23:21:24] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56542
[23:21:51] heh sorry, okay so when OTRS is upgraded I presume it needs to go down for sometime.
The enwiki oversight email address is currently an OTRS queue, but when OTRS is down for the upgrade they want to reroute that to their mailman mailing list so that people can still email the same email address. Is that possible, and does it need work from both ops and OTRS admins, or just ops?
[23:23:56] !log olivneh synchronized wmf-config/CommonSettings.php 'Remove E3Experiments and LastModified (1/2)'
[23:24:03] Logged the message, Master
[23:24:09] !log olivneh synchronized wmf-config/InitialiseSettings.php 'Remove E3Experiments and LastModified (2/2)'
[23:24:16] Logged the message, Master
[23:30:27] New patchset: Reedy; "Remove wrong comments under wmgClickTracking" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56549
[23:31:40] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56549
[23:32:09] paravoid, ^ (see above)
[23:38:24] Thehelpfulone: we can just defer/4xx mails for the period that it's down
[23:39:37] paravoid, that's fine for the normal OTRS emails, but for the oversight one (which has emails with private information on enwiki that needs to be oversighted quickly) we can't defer it, hence the need to send it elsewhere
[23:40:08] how long is the transition going to take?
[23:40:13] who's doing it?
[23:43:38] I know Martin (the OTRS creator) is testing some stuff this week and I imagine he'll be doing the transition too (with help from someone in ops, not sure who, maybe Daniel or Jeff?). I'm asking on behalf of the oversighters, and we don't really know how long it would take, that's probably something we'd need to ask him.
[23:44:08] Would the approach be different for a few hours compared to a couple of days?
[23:44:38] I would assume people could wait a few hours to have stuff OS'ed.
[23:49:52] If the transition is going to take a couple of days though, would we be able to forward the emails?
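[Editor's sketch] The two options weighed above (alias-redirect vs. defer/4xx) could each be roughly a one-liner on the relay, assuming mchenry runs exim with a conventional aliases-file router; every address and domain below is a hypothetical placeholder, not the real oversight queue:

```
# aliases file on the relay (hypothetical entry): while OTRS is down,
# deliver the queue's mail to a mailman list instead
oversight-en: oversight-l@lists.wikimedia.org

# alternatively, paravoid's defer/4xx idea as an exim RCPT ACL statement;
# sending MTAs queue and retry (typically for days), so mail is delayed,
# not lost -- which is why it suits normal queues but not urgent oversight mail
defer condition = ${if eq{$domain}{otrs.example.org}}
      message   = OTRS maintenance in progress, please retry later
```

The trade-off in the conversation follows directly: the defer keeps mail out of third-party hands but adds latency, while the alias keeps latency low at the cost of routing private reports through a mailing list.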
[23:50:29] Anyone mind if I do a single sync-file for https://gerrit.wikimedia.org/r/#/c/56546/1/GuidedTour.php ?