[00:00:28] officeit-bgp , heh
[00:01:04] cajoel: nice
[00:01:38] jeremyb: wanna review https://gerrit.wikimedia.org/r/#/c/96915/ ?
[00:01:45] jeremyb: you got all the reply about dickson, right
[00:03:18] (03PS1) 10Cmjohnson: Removing mgmt for sq44 & sq48 [operations/dns] - 10https://gerrit.wikimedia.org/r/96917
[00:04:33] (03CR) 10MZMcBride: "I think I'd personally prefer to hold off on deploying this to private wikis until it's been stress-tested a bit more, but I trust you two" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96890 (owner: 10CSteipp)
[00:04:49] (03CR) 10Yurik: [C: 04-2] "fixed by another patch, this will break it :)" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96824 (owner: 10Dr0ptp4kt)
[00:05:41] (03PS2) 10Cmjohnson: Removing all dns for sq44 & sq48 [operations/dns] - 10https://gerrit.wikimedia.org/r/96917
[00:05:58] (03CR) 10Aaron Schulz: [C: 031] Hack: cron job to clean up tifs from /tmp on app servers [operations/puppet] - 10https://gerrit.wikimedia.org/r/96915 (owner: 10Ori.livneh)
[00:06:07] thanks AaronSchulz
[00:06:29] (03CR) 10Cmjohnson: [C: 032] Removing all dns for sq44 & sq48 [operations/dns] - 10https://gerrit.wikimedia.org/r/96917 (owner: 10Cmjohnson)
[00:06:39] (03CR) 10Ori.livneh: [C: 032] Hack: cron job to clean up tifs from /tmp on app servers [operations/puppet] - 10https://gerrit.wikimedia.org/r/96915 (owner: 10Ori.livneh)
[00:06:40] I thought you needed to escape \, but I was misreading.
[00:06:54] (03CR) 10Chad: [C: 032] Increase Cirrus pool counter for new servers [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96064 (owner: 10Manybubbles)
[00:07:34] (03Merged) 10jenkins-bot: Increase Cirrus pool counter for new servers [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96064 (owner: 10Manybubbles)
[00:07:44] !log Built texvc for 1.23wmf5
[00:08:00] Logged the message, Master
[00:08:36] !log demon synchronized wmf-config/PoolCounterSettings-pmtpa.php 'New pool counter settings for Cirrus'
[00:08:52] Logged the message, Master
[00:09:07] !log demon synchronized wmf-config/PoolCounterSettings-eqiad.php 'New pool counter settings for Cirrus'
[00:09:21] Logged the message, Master
[00:14:31] (03CR) 10Dzahn: [C: 032] fix and remove various planet feed URLs [operations/puppet] - 10https://gerrit.wikimedia.org/r/96914 (owner: 10Dzahn)
[00:15:35] mutante: no, busy, didn't read it yet; :) /me runs off again
[00:17:40] (03CR) 10Dzahn: [C: 032] etherpad - tabbing, quoting & aligning [operations/puppet] - 10https://gerrit.wikimedia.org/r/96354 (owner: 10Dzahn)
[00:22:45] (03CR) 10Dzahn: [C: 032] retab, quoting, linting of ishmael.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/96362 (owner: 10Dzahn)
[00:28:51] (03PS7) 10Mwalker: Initial Puppet Try for OCG::Collection Role [operations/puppet] - 10https://gerrit.wikimedia.org/r/96811
[00:31:44] (03CR) 10Dzahn: "the dependency has been merged meanwhile. still good to go?" [operations/puppet] - 10https://gerrit.wikimedia.org/r/96424 (owner: 10Dzahn)
[00:32:57] ^d is your stuff on tin already?
[00:33:05] I'm about to scap
[00:33:07] <^d> I already sync'd it.
[00:33:14] cool :)
[00:33:14] <^d> Fire away
[00:35:21] !log ori synchronized php-1.23wmf4/extensions/MobileFrontend/javascripts/loggingSchemas/MobileWebInfobox.js 'Cherry-pick I3efc1fa64'
[00:35:36] Logged the message, Master
[00:36:20] !log csteipp started scap: Updating WikimediaMessages to master for OAuth message
[00:36:36] Logged the message, Master
[00:36:47] yay. i'm not going to miss the ellipsis.
[00:37:04] :)
[00:39:18] Hmm. "mw25: rsync: send_files failed to open "/php-1.23wmf5/includes/Sanitizer.php.save""
[00:40:15] csteipp: !log ?
[00:44:49] (03PS1) 10Cmjohnson: Removing DNS entries for ms7 and ms8 [operations/dns] - 10https://gerrit.wikimedia.org/r/96927
[00:44:50] going to touch and sync wmf4's startup.js; it won't interfere with scap
[00:45:13] Hey Reedy, want to remove your temp file :)
[00:46:13] !log ori synchronized php-1.23wmf4/resources/startup.js 'touch'
[00:46:17] ty git am
[00:46:28] Logged the message, Master
[00:50:21] !log ori synchronized php-1.23wmf4/extensions/MobileFrontend/javascripts/loggingSchemas/MobileWebInfobox.js 'Cherry-pick I3efc1fa64'
[00:50:26] (03CR) 10Cmjohnson: [C: 032] Removing DNS entries for ms7 and ms8 [operations/dns] - 10https://gerrit.wikimedia.org/r/96927 (owner: 10Cmjohnson)
[00:50:29] !log csteipp finished scap: Updating WikimediaMessages to master for OAuth message
[00:50:36] Logged the message, Master
[00:50:44] csteipp: did you get timing info?
[00:50:48] in stdout?
[00:50:51] Logged the message, Master
[00:50:58] ori-l: scap completed in 16m 09s.
[00:51:04] That's amazing
[00:51:15] (compared with previous)
[00:51:32] :)
[00:52:01] cmprss all the things
[00:52:20] --go-faster
[00:53:19] <^d> Who needs a new deployment tool? We'll just let ori-l make us `faster-scap` ;-)
[00:53:25] is it like -v in lspci? the more -f (--faster) you put, the faster it goes? -fffffffffffffff
[00:53:55] i was actually asking just to make sure the timing is outputted correctly
[00:54:12] <^d> greg-g: Not quite.
[00:54:21] <^d> --fffffuuuuuuuu
[00:54:53] ^d: that's the one that crashes all mwXXXX's, right?
[00:55:33] <^d> Yep.
[00:56:25] !log ori synchronized php-1.23wmf4/resources/startup.js 'touch'
[00:56:39] Logged the message, Master
[00:58:11] (03PS4) 10CSteipp: Enable OAuth on all public wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96890
[00:59:54] (03CR) 10CSteipp: [C: 032] Enable OAuth on all public wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96890 (owner: 10CSteipp)
[01:00:05] (03Merged) 10jenkins-bot: Enable OAuth on all public wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96890 (owner: 10CSteipp)
[01:09:30] Dangit. You know, I loved scap so much, I'm going to do it again..
[01:11:36] scap: it runs twice as fast, *twice as often*
[01:11:39] that's 4x!
[01:15:39] (03PS1) 10Legoktm: Undeploy AssertEdit (merged into core) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96931
[01:16:04] !log csteipp started scap: Really send the new messages this time..
[01:16:21] Logged the message, Master
[01:23:02] Ryan_Lane, paravoid: if I have a package in a PPA (it's a backport of texlive 2012 into ubuntu 12.04) -- how does that work with puppetization? can we import it into our local repos?
[01:24:16] mwalker: i think that has been done? i guess partly would depend on where it's from (anyone can make a ppa!)
[01:24:37] it's from the maintainers of texlie
[01:24:40] *texlive
[01:37:11] !log csteipp finished scap: Really send the new messages this time..
[01:37:27] Logged the message, Master
[01:38:04] (03CR) 10Dzahn: "one inline comment and fyi the patch it depended on has been merged" (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/96403 (owner: 10Dzahn)
[01:40:42] !log csteipp synchronized wmf-config/CommonSettings.php
[01:40:57] Logged the message, Master
[01:41:11] !log csteipp synchronized wmf-config/InitialiseSettings.php
[01:41:26] Logged the message, Master
[01:45:36] (03CR) 10Dzahn: [C: 04-1] "thanks, it's nice that you are starting as a module right away, please see inline comments though and add some more reviewers to talk abou" (038 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/96552 (owner: 10Addshore)
[01:46:32] greg-g: Deployment is done ^ Sorry it took so long
[01:48:00] (03PS2) 10Dzahn: qualify vars planet_domain_name, planet_languages [operations/puppet] - 10https://gerrit.wikimedia.org/r/96225 (owner: 10ArielGlenn)
[01:50:35] (03PS2) 10Dzahn: role classes for download servers [operations/puppet] - 10https://gerrit.wikimedia.org/r/96415
[01:55:56] (03CR) 10Dzahn: [C: 031] "this should be fine, it's more a matter of taste, merge if Ariel likes it since Ariel is kind of the owner of this role" [operations/puppet] - 10https://gerrit.wikimedia.org/r/96415 (owner: 10Dzahn)
[02:09:00] (03CR) 10Dzahn: [C: 032] qualify vars planet_domain_name, planet_languages [operations/puppet] - 10https://gerrit.wikimedia.org/r/96225 (owner: 10ArielGlenn)
[02:16:36] !log LocalisationUpdate completed (1.23wmf4) at Fri Nov 22 02:16:36 UTC 2013
[02:16:53] Logged the message, Master
[02:21:06] (03CR) 10Dzahn: "thanks Ariel, no issues at all with this" [operations/puppet] - 10https://gerrit.wikimedia.org/r/96225 (owner: 10ArielGlenn)
[02:32:21] (03PS1) 10Dzahn: more broken planet feeds [operations/puppet] - 10https://gerrit.wikimedia.org/r/96935
[02:33:13] (03CR) 10Dzahn: [C: 031] "these things are handled on https://meta.wikimedia.org/wiki/Planet_Wikimedia" [operations/puppet] - 10https://gerrit.wikimedia.org/r/96935 (owner: 10Dzahn)
[02:35:52] !log LocalisationUpdate completed (1.23wmf5) at Fri Nov 22 02:35:51 UTC 2013
[02:36:08] Logged the message, Master
[02:48:54] (03PS1) 10Dzahn: fix moved feed URLs [operations/puppet] - 10https://gerrit.wikimedia.org/r/96938
[02:50:04] (03PS2) 10Dzahn: fix moved feed URLs [operations/puppet] - 10https://gerrit.wikimedia.org/r/96938
[02:50:45] (03CR) 10Dzahn: [C: 032] fix moved feed URLs [operations/puppet] - 10https://gerrit.wikimedia.org/r/96938 (owner: 10Dzahn)
[03:23:34] !log LocalisationUpdate ResourceLoader cache refresh completed at Fri Nov 22 03:23:33 UTC 2013
[03:23:50] Logged the message, Master
[03:41:02] (03PS1) 10Tim Starling: Normalise the path part of URLs in the text frontend [operations/puppet] - 10https://gerrit.wikimedia.org/r/96941
[03:41:43] (03CR) 10Dzahn: "maybe, i'd like to keep possible changes to the reporter scripts separate from this specific change though and hear andre if we want to ac" [operations/puppet] - 10https://gerrit.wikimedia.org/r/94075 (owner: 10Dzahn)
[03:53:42] (03CR) 10Dzahn: "Ariel, re: install_certificate{ $svc_name: } in apache.pp etc. no, actually don't expect those certs to be moved into the module, at least" [operations/puppet] - 10https://gerrit.wikimedia.org/r/94075 (owner: 10Dzahn)
[04:49:52] (03PS2) 10Springle: remove bellin/blondel references, they don't exist [operations/puppet] - 10https://gerrit.wikimedia.org/r/92989 (owner: 10Dzahn)
[04:52:47] (03CR) 10Springle: [C: 032] remove bellin/blondel references, they don't exist [operations/puppet] - 10https://gerrit.wikimedia.org/r/92989 (owner: 10Dzahn)
[05:33:38] (03PS1) 10Ori.livneh: graphite::web: parametrize site_name; declare in role class [operations/puppet] - 10https://gerrit.wikimedia.org/r/96952
[06:28:27] (03CR) 10ArielGlenn: "OK, it makes sense to leave certs where they are for now. But at some point we should have certs live in the (role) modules where they ar" [operations/puppet] - 10https://gerrit.wikimedia.org/r/94075 (owner: 10Dzahn)
[06:53:09] !log neon back to read-only root filesystem, investigating
[06:53:22] Logged the message, Master
[06:54:44] (03CR) 10Ori.livneh: [C: 032] graphite::web: parametrize site_name; declare in role class [operations/puppet] - 10https://gerrit.wikimedia.org/r/96952 (owner: 10Ori.livneh)
[06:58:33] (03PS2) 10Mattflaschen: Remove "Your cache administrator is nobody" joke. [operations/puppet] - 10https://gerrit.wikimedia.org/r/95147
[06:59:53] (03PS1) 10Ori.livneh: graphite::web: Correct parameter name [operations/puppet] - 10https://gerrit.wikimedia.org/r/96959
[07:00:52] (03CR) 10Ori.livneh: [C: 032] graphite::web: Correct parameter name [operations/puppet] - 10https://gerrit.wikimedia.org/r/96959 (owner: 10Ori.livneh)
[07:03:27] (03PS1) 10Ori.livneh: rewrite nginx module [operations/puppet] - 10https://gerrit.wikimedia.org/r/96961
[07:06:46] ! rebooting neon, no underlying errors found except for the initial ext3 'deleted inode reference' which caused / to be remounted r/o
[07:06:54] !log rebooting neon, no underlying errors found except for the initial ext3 'deleted inode reference' which caused / to be remounted r/o
[07:07:09] Logged the message, Master
[07:07:37] (03PS8) 10Mwalker: Initial Puppet Try for OCG::Collection Role [operations/puppet] - 10https://gerrit.wikimedia.org/r/96811
[07:08:05] * apergos sighs
[07:14:29] well that worked but I sure wish I knew what is actually broken, because otherwise we'll have this again in not too long
[07:16:30] faulty RAM?
[07:16:55] possible
[07:17:13] bad sector on disk?
[07:17:25] didn't see any disk-related errors at a lower level
[07:17:42] first error was ext3 whining that a deleted inode was referenced
[07:19:21] its a bit farfetched at ground level -- but I always love to blame things on high energy particles flipping bits
[07:19:41] twice in two days... mmmm
[07:19:58] that's a bit more unlikely though
[07:20:44] elves?
[07:20:51] communists?
[07:20:57] pixies!
[07:21:24] de-ba-ser
[07:21:28] did any other application complain about not being able to access a file?
[07:21:38] there could be something holding a handle open
[07:21:47] and I would expect it to throw an error
[07:22:39] no, I didn't see anything like that
[07:22:45] I'm on another track right now, hold on
[07:23:07] * mwalker watches for oncoming trains
[07:25:21] are you implying that communists are as nonexistent as pixies? :-P
[07:25:52] oh no; just that they're as likely a candidate for mysterious bit flipping
[07:26:06] hahaha
[07:29:08] ok so what's interesting about these two fs issues is they were both the same time of day and the *same inode*
[07:29:21] Nov 22 06:33:22 neon kernel: [76964.339556] EXT3-fs error (device md0): ext3_lookup: deleted inode referenced: 21531
[07:29:23] today's
[07:29:42] Nov 21 06:32:16 neon kernel: [1285011.010472] EXT3-fs error (device md0): ext3_lookup: deleted inode referenced: 21531
[07:29:45] yesterday's
[07:34:03] well; if you wanted to wait a while; you could try to `find -printf "%i:\t%p\n" | egrep "^21531\t"`
[07:34:23] aka; try and see if there's a file that matches that inode
[07:34:45] I guess inum finds by inode
[07:34:56] but I was going to see if there's some quicker way to get info about it
[07:53:42] I have managed to trigger the error already by the find
[07:54:27] /lib/modules/3.2.0-45-generic/kernel/net/netfilter/xt_hashlimit.ko this
[07:54:52] there are four files that trigger these errors. I wonder what I can do about it
[07:55:08] find: `/lib/modules/3.2.0-45-generic/kernel/net/netfilter/xt_hashlimit.ko': Input/output error
[07:55:08] find: `/lib/modules/3.2.0-45-generic/kernel/net/netfilter/xt_u32.ko': Input/output error
[07:55:08] find: `/lib/modules/3.2.0-45-generic/kernel/net/netfilter/xt_esp.ko': Input/output error
[07:55:08] find: `/lib/modules/3.2.0-45-generic/kernel/net/netfilter/xt_socket.ko': Input/output error
[07:55:16] the find completes after these.
[07:55:25] recompile the netfilter module?
[07:55:55] or actually; easier -- downgrade the kernel
[07:56:02] 3.2.0-53-generic this is what we are running
[07:56:14] I need to get rid of those somehow, but in theory they are already gone?
[07:56:18] oh...
[07:56:20] hmm
[07:56:33] ext3_lookup: deleted inode referenced: 21525 and so on
[07:56:37] for all four of them
[07:56:58] in the meantime we are read only again on neon until the next reboot
[08:00:57] can you touch and then delete the files?
[08:01:42] I can't ls them so I would guess I can't touch them anything
[08:01:48] s/anything/either/
[08:02:10] I mean any filesystem operation will involve referencing the inode which will cause barf -> r/o mode
[08:02:27] I was wondering if touching it would create the inode
[08:05:01] * mwalker wonders if you could breed them by hardlinking
[08:05:14] * apergos shudders
[08:08:12] I'm sorta scared to try; but ln has a -f option which 'removes existing destination files'
[08:08:25] if it doesn't look first you might be able to use that
[08:11:39] well for right now I'm going to reboot this so we're back in r/w with monitoring again
[08:14:26] !log rebooting neon again, found four filenames with deleted inode but side effect is back in r/o for /
[08:14:43] Logged the message, Master
[08:28:44] (03CR) 10Odder: "Nameserver settings for wikimedia.pl were changed with bug 33509, so I guess it wouldn't be too hard to change them to ns*.wikimedia.org." [operations/dns] - 10https://gerrit.wikimedia.org/r/86659 (owner: 10Dzahn)
[08:33:54] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 3 hours
[08:33:54] PROBLEM - Puppet freshness on mc1003 is CRITICAL: No successful Puppet run in the last 3 hours
[08:33:54] PROBLEM - Puppet freshness on mw16 is CRITICAL: No successful Puppet run in the last 3 hours
[08:33:54] PROBLEM - Puppet freshness on pc2 is CRITICAL: No successful Puppet run in the last 3 hours
[08:33:54] PROBLEM - Puppet freshness on sq67 is CRITICAL: No successful Puppet run in the last 3 hours
[08:33:54] PROBLEM - Puppet freshness on stafford is CRITICAL: No successful Puppet run in the last 3 hours
[08:34:08] ughh
[08:34:54] PROBLEM - Puppet freshness on cp1007 is CRITICAL: No successful Puppet run in the last 3 hours
[08:34:54] PROBLEM - Puppet freshness on cp3003 is CRITICAL: No successful Puppet run in the last 3 hours
[08:34:54] PROBLEM - Puppet freshness on cp4011 is CRITICAL: No successful Puppet run in the last 3 hours
[08:34:54] PROBLEM - Puppet freshness on fenari is CRITICAL: No successful Puppet run in the last 3 hours
[08:34:54] PROBLEM - Puppet freshness on labsdb1002 is CRITICAL: No successful Puppet run in the last 3 hours
[08:34:54] PROBLEM - Puppet freshness on labsdb1003 is CRITICAL: No successful Puppet run in the last 3 hours
[08:34:54] PROBLEM - Puppet freshness on mc1015 is CRITICAL: No successful Puppet run in the last 3 hours
[08:34:55] PROBLEM - Puppet freshness on mw1018 is CRITICAL: No successful Puppet run in the last 3 hours
[08:34:55] PROBLEM - Puppet freshness on mw1088 is CRITICAL: No successful Puppet run in the last 3 hours
[08:34:56] PROBLEM - Puppet freshness on mw1099 is CRITICAL: No successful Puppet run in the last 3 hours
[08:34:56] PROBLEM - Puppet freshness on mw1100 is CRITICAL: No successful Puppet run in the last 3 hours
[08:34:57] PROBLEM - Puppet freshness on mw1103 is CRITICAL: No successful Puppet run in the last 3 hours
[08:34:57] PROBLEM - Puppet freshness on mw1117 is CRITICAL: No successful Puppet run in the last 3 hours
[08:34:58] PROBLEM - Puppet freshness on mw1164 is CRITICAL: No successful Puppet run in the last 3 hours
[08:34:58] PROBLEM - Puppet freshness on mw1176 is CRITICAL: No successful Puppet run in the last 3 hours
[08:34:59] PROBLEM - Puppet freshness on mw1217 is CRITICAL: No successful Puppet run in the last 3 hours
[08:34:59] PROBLEM - Puppet freshness on mw51 is CRITICAL: No successful Puppet run in the last 3 hours
[08:35:00] PROBLEM - Puppet freshness on search1010 is CRITICAL: No successful Puppet run in the last 3 hours
[08:35:00] PROBLEM - Puppet freshness on sq71 is CRITICAL: No successful Puppet run in the last 3 hours
[08:35:01] PROBLEM - Puppet freshness on sq83 is CRITICAL: No successful Puppet run in the last 3 hours
[08:35:01] PROBLEM - Puppet freshness on sq86 is CRITICAL: No successful Puppet run in the last 3 hours
[08:35:02] PROBLEM - Puppet freshness on srv287 is CRITICAL: No successful Puppet run in the last 3 hours
[08:35:02] PROBLEM - Puppet freshness on srv300 is CRITICAL: No successful Puppet run in the last 3 hours
[08:35:46] (03CR) 10Odder: "All except stuartgeiger.com work for me, I'll contact Stuart on-wiki and ask him to fix his website." [operations/puppet] - 10https://gerrit.wikimedia.org/r/96935 (owner: 10Dzahn)
[08:35:54] PROBLEM - Puppet freshness on amssq52 is CRITICAL: No successful Puppet run in the last 3 hours
[08:35:54] PROBLEM - Puppet freshness on cp1012 is CRITICAL: No successful Puppet run in the last 3 hours
[08:35:54] PROBLEM - Puppet freshness on cp1020 is CRITICAL: No successful Puppet run in the last 3 hours
[08:35:54] PROBLEM - Puppet freshness on cp1056 is CRITICAL: No successful Puppet run in the last 3 hours
[08:35:54] PROBLEM - Puppet freshness on cp1061 is CRITICAL: No successful Puppet run in the last 3 hours
[08:35:54] PROBLEM - Puppet freshness on cp3007 is CRITICAL: No successful Puppet run in the last 3 hours
[08:35:54] PROBLEM - Puppet freshness on cp4003 is CRITICAL: No successful Puppet run in the last 3 hours
[08:36:24] those are lies.
[08:37:54] PROBLEM - Puppet freshness on amssq57 is CRITICAL: No successful Puppet run in the last 3 hours
[08:37:54] PROBLEM - Puppet freshness on cp1006 is CRITICAL: No successful Puppet run in the last 3 hours
[08:37:54] PROBLEM - Puppet freshness on cp1051 is CRITICAL: No successful Puppet run in the last 3 hours
[08:37:54] PROBLEM - Puppet freshness on cp1069 is CRITICAL: No successful Puppet run in the last 3 hours
[08:37:54] PROBLEM - Puppet freshness on cp4001 is CRITICAL: No successful Puppet run in the last 3 hours
[08:38:54] PROBLEM - Puppet freshness on cp1046 is CRITICAL: No successful Puppet run in the last 3 hours
[08:38:54] PROBLEM - Puppet freshness on analytics1015 is CRITICAL: No successful Puppet run in the last 3 hours
[08:38:54] PROBLEM - Puppet freshness on cp1050 is CRITICAL: No successful Puppet run in the last 3 hours
[08:38:54] PROBLEM - Puppet freshness on cp4007 is CRITICAL: No successful Puppet run in the last 3 hours
[08:38:54] PROBLEM - Puppet freshness on gadolinium is CRITICAL: No successful Puppet run in the last 3 hours
[08:38:55] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 3 hours
[08:38:55] PROBLEM - Puppet freshness on lvs3 is CRITICAL: No successful Puppet run in the last 3 hours
[08:38:56] PROBLEM - Puppet freshness on lvs4 is CRITICAL: No successful Puppet run in the last 3 hours
[08:38:56] PROBLEM - Puppet freshness on lvs4003 is CRITICAL: No successful Puppet run in the last 3 hours
[08:38:57] PROBLEM - Puppet freshness on ms-fe3 is CRITICAL: No successful Puppet run in the last 3 hours
[08:38:57] PROBLEM - Puppet freshness on mw1111 is CRITICAL: No successful Puppet run in the last 3 hours
[08:38:58] PROBLEM - Puppet freshness on mw1124 is CRITICAL: No successful Puppet run in the last 3 hours
[08:38:58] PROBLEM - Puppet freshness on mw1125 is CRITICAL: No successful Puppet run in the last 3 hours
[08:38:58] (03CR) 10Odder: "All fine for me except the removal of wikiźródła.pl, which works for me; might this be related to the fact that it's an IDN domain and Ven" [operations/puppet] - 10https://gerrit.wikimedia.org/r/96914 (owner: 10Dzahn)
[08:38:59] PROBLEM - Puppet freshness on mw1130 is CRITICAL: No successful Puppet run in the last 3 hours
[08:38:59] PROBLEM - Puppet freshness on mw1147 is CRITICAL: No successful Puppet run in the last 3 hours
[08:39:00] PROBLEM - Puppet freshness on mw1161 is CRITICAL: No successful Puppet run in the last 3 hours
[08:39:00] PROBLEM - Puppet freshness on mw119 is CRITICAL: No successful Puppet run in the last 3 hours
[08:39:01] PROBLEM - Puppet freshness on mw1190 is CRITICAL: No successful Puppet run in the last 3 hours
[08:39:01] PROBLEM - Puppet freshness on mw38 is CRITICAL: No successful Puppet run in the last 3 hours
[08:39:02] PROBLEM - Puppet freshness on sq51 is CRITICAL: No successful Puppet run in the last 3 hours
[08:39:02] PROBLEM - Puppet freshness on srv290 is CRITICAL: No successful Puppet run in the last 3 hours
[08:39:03] PROBLEM - Puppet freshness on tin is CRITICAL: No successful Puppet run in the last 3 hours
[08:39:03] PROBLEM - Puppet freshness on wtp1024 is CRITICAL: No successful Puppet run in the last 3 hours
[08:39:49] ugh
[08:39:54] PROBLEM - Puppet freshness on amssq44 is CRITICAL: No successful Puppet run in the last 3 hours
[08:39:54] PROBLEM - Puppet freshness on cp1063 is CRITICAL: No successful Puppet run in the last 3 hours
[08:39:54] PROBLEM - Puppet freshness on db1004 is CRITICAL: No successful Puppet run in the last 3 hours
[08:39:54] PROBLEM - Puppet freshness on db1020 is CRITICAL: No successful Puppet run in the last 3 hours
[08:39:54] PROBLEM - Puppet freshness on es6 is CRITICAL: No successful Puppet run in the last 3 hours
[08:40:02] it thinks all the results for these are stale
[08:41:54] PROBLEM - Puppet freshness on analytics1014 is CRITICAL: No successful Puppet run in the last 3 hours
[08:41:54] PROBLEM - Puppet freshness on cp1010 is CRITICAL: No successful Puppet run in the last 3 hours
[08:41:54] PROBLEM - Puppet freshness on db1001 is CRITICAL: No successful Puppet run in the last 3 hours
[08:41:54] PROBLEM - Puppet freshness on db1031 is CRITICAL: No successful Puppet run in the last 3 hours
[08:41:54] PROBLEM - Puppet freshness on db63 is CRITICAL: No successful Puppet run in the last 3 hours
[08:41:54] PROBLEM - Puppet freshness on elastic1008 is CRITICAL: No successful Puppet run in the last 3 hours
[08:41:54] PROBLEM - Puppet freshness on helium is CRITICAL: No successful Puppet run in the last 3 hours
[08:41:55] PROBLEM - Puppet freshness on mc1007 is CRITICAL: No successful Puppet run in the last 3 hours
[08:41:55] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: No successful Puppet run in the last 3 hours
[08:41:56] PROBLEM - Puppet freshness on mw1032 is CRITICAL: No successful Puppet run in the last 3 hours
[08:41:56] PROBLEM - Puppet freshness on mw124 is CRITICAL: No successful Puppet run in the last 3 hours
[08:41:57] PROBLEM - Puppet freshness on mw43 is CRITICAL: No successful Puppet run in the last 3 hours
[08:41:57] PROBLEM - Puppet freshness on mw57 is CRITICAL: No successful Puppet run in the last 3 hours
[08:41:58] PROBLEM - Puppet freshness on potassium is CRITICAL: No successful Puppet run in the last 3 hours
[08:41:58] PROBLEM - Puppet freshness on searchidx1001 is CRITICAL: No successful Puppet run in the last 3 hours
[08:41:59] PROBLEM - Puppet freshness on sq54 is CRITICAL: No successful Puppet run in the last 3 hours
[08:41:59] PROBLEM - Puppet freshness on sq58 is CRITICAL: No successful Puppet run in the last 3 hours
[08:42:00] PROBLEM - Puppet freshness on srv255 is CRITICAL: No successful Puppet run in the last 3 hours
[08:42:00] PROBLEM - Puppet freshness on wtp1015 is CRITICAL: No successful Puppet run in the last 3 hours
[08:42:54] Nov 22 08:40:44 neon icinga: Warning: The results of service 'Puppet freshness' on host 'snapshot1' are stale by 0d 0h 0m 54s (threshold=0d 3h 0m 0s). I'm forcing an immediate check of the service. how is 54 seconds past the threshhold??
[08:42:54] PROBLEM - Puppet freshness on db1022 is CRITICAL: No successful Puppet run in the last 3 hours
[08:42:54] PROBLEM - Puppet freshness on calcium is CRITICAL: No successful Puppet run in the last 3 hours
[08:42:54] PROBLEM - Puppet freshness on db1044 is CRITICAL: No successful Puppet run in the last 3 hours
[08:42:54] PROBLEM - Puppet freshness on db48 is CRITICAL: No successful Puppet run in the last 3 hours
[08:42:54] PROBLEM - Puppet freshness on ms-fe1001 is CRITICAL: No successful Puppet run in the last 3 hours
[08:44:54] PROBLEM - Puppet freshness on amssq48 is CRITICAL: No successful Puppet run in the last 3 hours
[08:44:54] PROBLEM - Puppet freshness on cp3011 is CRITICAL: No successful Puppet run in the last 3 hours
[08:44:54] PROBLEM - Puppet freshness on cp4002 is CRITICAL: No successful Puppet run in the last 3 hours
[08:44:54] PROBLEM - Puppet freshness on cp4014 is CRITICAL: No successful Puppet run in the last 3 hours
[08:44:54] PROBLEM - Puppet freshness on db1021 is CRITICAL: No successful Puppet run in the last 3 hours
[08:45:54] PROBLEM - Puppet freshness on antimony is CRITICAL: No successful Puppet run in the last 3 hours
[08:45:54] PROBLEM - Puppet freshness on bast1001 is CRITICAL: No successful Puppet run in the last 3 hours
[08:45:54] PROBLEM - Puppet freshness on cp1018 is CRITICAL: No successful Puppet run in the last 3 hours
[08:45:54] PROBLEM - Puppet freshness on cp1058 is CRITICAL: No successful Puppet run in the last 3 hours
[08:45:54] PROBLEM - Puppet freshness on cp4019 is CRITICAL: No successful Puppet run in the last 3 hours
[08:45:54] PROBLEM - Puppet freshness on dataset1001 is CRITICAL: No successful Puppet run in the last 3 hours
[08:45:54] PROBLEM - Puppet freshness on db9 is CRITICAL: No successful Puppet run in the last 3 hours
[08:46:54] PROBLEM - Puppet freshness on amslvs1 is CRITICAL: No successful Puppet run in the last 3 hours
[08:46:54] PROBLEM - Puppet freshness on amssq51 is CRITICAL: No successful Puppet run in the last 3 hours
[08:46:54] PROBLEM - Puppet freshness on amssq56 is CRITICAL: No successful Puppet run in the last 3 hours
[08:46:54] PROBLEM - Puppet freshness on analytics1009 is CRITICAL: No successful Puppet run in the last 3 hours
[08:46:54] PROBLEM - Puppet freshness on capella is CRITICAL: No successful Puppet run in the last 3 hours
[08:46:54] PROBLEM - Puppet freshness on analytics1022 is CRITICAL: No successful Puppet run in the last 3 hours
[08:46:54] PROBLEM - Puppet freshness on cp1016 is CRITICAL: No successful Puppet run in the last 3 hours
[08:47:54] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: No successful Puppet run in the last 3 hours
[08:47:54] PROBLEM - Puppet freshness on amssq31 is CRITICAL: No successful Puppet run in the last 3 hours
[08:47:54] PROBLEM - Puppet freshness on amssq35 is CRITICAL: No successful Puppet run in the last 3 hours
[08:47:54] PROBLEM - Puppet freshness on amssq37 is CRITICAL: No successful Puppet run in the last 3 hours
[08:47:54] PROBLEM - Puppet freshness on cp1062 is CRITICAL: No successful Puppet run in the last 3 hours
[08:48:54] PROBLEM - Puppet freshness on analytics1008 is CRITICAL: No successful Puppet run in the last 3 hours
[08:48:54] PROBLEM - Puppet freshness on brewster is CRITICAL: No successful Puppet run in the last 3 hours
[08:48:54] PROBLEM - Puppet freshness on cp1038 is CRITICAL: No successful Puppet run in the last 3 hours
[08:48:54] PROBLEM - Puppet freshness on db1058 is CRITICAL: No successful Puppet run in the last 3 hours
[08:48:54] PROBLEM - Puppet freshness on db57 is CRITICAL: No successful Puppet run in the last 3 hours
[08:49:54] PROBLEM - Puppet freshness on analytics1011 is CRITICAL: No successful Puppet run in the last 3 hours
[08:49:54] PROBLEM - Puppet freshness on analytics1019 is CRITICAL: No successful Puppet run in the last 3 hours
[08:49:54] PROBLEM - Puppet freshness on cp4017 is CRITICAL: No successful Puppet run in the last 3 hours
[08:49:54] PROBLEM - Puppet freshness on cp1065 is CRITICAL: No successful Puppet run in the last 3 hours
[08:49:54] PROBLEM - Puppet freshness on dataset2 is CRITICAL: No successful Puppet run in the last 3 hours
[08:49:55] PROBLEM - Puppet freshness on db1010 is CRITICAL: No successful Puppet run in the last 3 hours
[08:49:55] PROBLEM - Puppet freshness on db1024 is CRITICAL: No successful Puppet run in the last 3 hours
[08:49:56] PROBLEM - Puppet freshness on labsdb1001 is CRITICAL: No successful Puppet run in the last 3 hours
[08:49:56] PROBLEM - Puppet freshness on mc1008 is CRITICAL: No successful Puppet run in the last 3 hours
[08:49:57] PROBLEM - Puppet freshness on mw109 is CRITICAL: No successful Puppet run in the last 3 hours
[08:49:57] PROBLEM - Puppet freshness on mw1138 is CRITICAL: No successful Puppet run in the last 3 hours
[08:49:58] PROBLEM - Puppet freshness on mw117 is CRITICAL: No successful Puppet run in the last 3 hours
[08:49:58] PROBLEM - Puppet freshness on mw26 is CRITICAL: No successful Puppet run in the last 3 hours
[08:49:59] PROBLEM - Puppet freshness on mw33 is CRITICAL: No successful Puppet run in the last 3 hours
[08:49:59] PROBLEM - Puppet freshness on mw36 is CRITICAL: No successful Puppet run in the last 3 hours
[08:50:00] PROBLEM - Puppet freshness on mw64 is CRITICAL: No successful Puppet run in the last 3 hours
[08:50:00] PROBLEM - Puppet freshness on snapshot4 is CRITICAL: No successful Puppet run in the last 3 hours
[08:50:01] PROBLEM - Puppet freshness on ssl1007 is CRITICAL: No successful Puppet run in the last 3 hours
[08:50:01] PROBLEM - Puppet freshness on tarin is CRITICAL: No successful Puppet run in the last 3 hours
[08:50:02] PROBLEM - Puppet freshness on testsearch1002 is CRITICAL: No successful Puppet run in the last 3 hours
[08:50:02] PROBLEM - Puppet freshness on wtp1013 is CRITICAL: No successful Puppet run in the last 3 hours
[08:50:54] PROBLEM - Puppet freshness on aluminium is CRITICAL: No successful Puppet run in the last 3 hours
[08:50:54] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: No successful Puppet run in the last 3 hours
[08:50:54] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: No successful Puppet run in the last 3 hours
[08:50:54] PROBLEM - Puppet freshness on es1009 is CRITICAL: No successful Puppet run in the last 3 hours
[08:50:54] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 3 hours
[08:50:54] PROBLEM - Puppet freshness on lvs1001 is CRITICAL: No successful Puppet run in the last 3 hours
[08:50:54] PROBLEM - Puppet freshness on mw1001 is CRITICAL: No successful Puppet run in the last 3 hours
[08:50:55] PROBLEM - Puppet freshness on mw1022 is CRITICAL: No successful Puppet run in the last 3 hours
[08:50:55] PROBLEM - Puppet freshness on mw1040 is CRITICAL: No successful Puppet run in the last 3 hours
[08:50:56] PROBLEM - Puppet freshness on mw1062 is CRITICAL: No successful Puppet run in the last 3 hours
[08:50:56] PROBLEM - Puppet freshness on mw107 is CRITICAL: No successful Puppet run in the last 3 hours
[08:50:57] PROBLEM - Puppet freshness on mw1132 is CRITICAL: No successful Puppet run in the last 3 hours
[08:50:57] PROBLEM - Puppet freshness on mw1185 is CRITICAL: No successful Puppet run in the last 3 hours
[08:50:58] PROBLEM - Puppet freshness on mw1218 is CRITICAL: No successful Puppet run in the last 3 hours
[08:50:58] PROBLEM - Puppet freshness on mw40 is CRITICAL: No successful Puppet run in the last 3 hours
[08:50:59] PROBLEM - Puppet freshness on mw53 is CRITICAL: No successful Puppet run in the last 3 hours
[08:50:59] PROBLEM - Puppet freshness on mw70 is CRITICAL: No successful Puppet run in the last 3 hours
[08:51:00] PROBLEM - Puppet freshness on sq55 is CRITICAL: No successful Puppet run in the last 3 hours
[08:51:00] PROBLEM - Puppet freshness on sq64 is CRITICAL: No successful Puppet run in the last 3 hours
[08:51:01] PROBLEM - Puppet freshness on sq81 is CRITICAL: No successful Puppet run in the last 3 hours
[08:51:01] PROBLEM - Puppet freshness on srv296 is CRITICAL: No successful Puppet run in the last 3 hours
[08:51:02] PROBLEM - Puppet freshness on testsearch1001 is CRITICAL: No successful Puppet run in the last 3 hours
[08:51:02] PROBLEM - Puppet freshness on virt2 is CRITICAL: No successful Puppet run in the last 3 hours
[08:51:03] PROBLEM - Puppet freshness on wtp1021 is CRITICAL: No successful Puppet run in the last 3 hours
[08:51:27] still lies >_<
[08:51:54] PROBLEM - Puppet freshness on amssq32 is CRITICAL: No successful Puppet run in the last 3 hours
[08:51:54] PROBLEM - Puppet freshness on amssq40 is CRITICAL: No successful Puppet run in the last 3 hours
[08:51:54] PROBLEM - Puppet freshness on amssq43 is CRITICAL: No successful Puppet run in the last 3 hours
[08:51:54] PROBLEM - Puppet freshness on amssq59 is CRITICAL: No successful Puppet run in the last 3 hours
[08:51:54] PROBLEM - Puppet freshness on cp1015 is CRITICAL: No successful Puppet run in the last 3 hours
[08:53:54] PROBLEM - Puppet freshness on analytics1020 is CRITICAL: No successful Puppet run in the last 3 hours
[08:53:54] PROBLEM - Puppet freshness on cp1001 is CRITICAL: No successful Puppet run in the last 3 hours
[08:53:54] PROBLEM - Puppet freshness on cp1039 is CRITICAL: No successful Puppet run in the last 3 hours
[08:53:54] PROBLEM - Puppet freshness on cp1053 is CRITICAL: No successful Puppet run in the last 3 hours
[08:53:54] PROBLEM - Puppet freshness on cp1054 is CRITICAL: No successful Puppet run in the last 3 hours
[08:54:54] PROBLEM - Puppet freshness on amssq50 is CRITICAL: No successful Puppet run in the last 3 hours
[08:54:54] PROBLEM - Puppet freshness on analytics1021 is CRITICAL: No successful Puppet run in the last 3 hours
[08:54:54] PROBLEM - Puppet freshness on cp1008 is CRITICAL: No successful Puppet run in the last 3 hours
[08:54:54] PROBLEM - Puppet freshness on cp1009 is CRITICAL: No successful Puppet run in the last 3 hours
[08:54:54] PROBLEM - Puppet freshness on cp3019 is CRITICAL: No successful Puppet run in the last 3 hours
[08:55:54] PROBLEM - Puppet freshness on amssq36 is CRITICAL: No successful Puppet run in the last 3 hours
[08:55:54] PROBLEM - Puppet freshness on amssq41 is CRITICAL: No successful Puppet run in the last 3 hours
[08:55:54] PROBLEM - Puppet freshness on cp1064 is CRITICAL: No successful Puppet run in the last 3 hours
[08:55:54] PROBLEM - Puppet freshness on db1015 is CRITICAL: No successful Puppet run in the last 3 hours
[08:55:54] PROBLEM - Puppet freshness on db1018 is CRITICAL: No successful Puppet run in the last 3 hours
[08:55:54] PROBLEM - Puppet freshness on db1049 is CRITICAL: No successful Puppet run in the last 3 hours
[08:55:54] PROBLEM - Puppet freshness on elastic1005 is CRITICAL: No successful Puppet run in the last 3 hours
[08:56:54] PROBLEM - Puppet freshness on amssq60 is CRITICAL: No successful Puppet run in the last 3 hours
[08:56:54] PROBLEM - Puppet freshness on analytics1006 is CRITICAL: No successful Puppet run in the last 3
hours [08:56:54] PROBLEM - Puppet freshness on analytics1024 is CRITICAL: No successful Puppet run in the last 3 hours [08:56:54] PROBLEM - Puppet freshness on cp1019 is CRITICAL: No successful Puppet run in the last 3 hours [08:56:54] PROBLEM - Puppet freshness on cp1057 is CRITICAL: No successful Puppet run in the last 3 hours [08:57:54] PROBLEM - Puppet freshness on amssq58 is CRITICAL: No successful Puppet run in the last 3 hours [08:57:54] PROBLEM - Puppet freshness on analytics1016 is CRITICAL: No successful Puppet run in the last 3 hours [08:57:54] PROBLEM - Puppet freshness on arsenic is CRITICAL: No successful Puppet run in the last 3 hours [08:57:54] PROBLEM - Puppet freshness on cp4013 is CRITICAL: No successful Puppet run in the last 3 hours [08:57:54] PROBLEM - Puppet freshness on db1003 is CRITICAL: No successful Puppet run in the last 3 hours [08:58:54] PROBLEM - Puppet freshness on analytics1013 is CRITICAL: No successful Puppet run in the last 3 hours [08:58:54] PROBLEM - Puppet freshness on analytics1026 is CRITICAL: No successful Puppet run in the last 3 hours [08:58:54] PROBLEM - Puppet freshness on bast4001 is CRITICAL: No successful Puppet run in the last 3 hours [08:58:54] PROBLEM - Puppet freshness on cp1004 is CRITICAL: No successful Puppet run in the last 3 hours [08:58:54] PROBLEM - Puppet freshness on cp1011 is CRITICAL: No successful Puppet run in the last 3 hours [08:59:54] PROBLEM - Puppet freshness on amssq46 is CRITICAL: No successful Puppet run in the last 3 hours [08:59:54] PROBLEM - Puppet freshness on cp1048 is CRITICAL: No successful Puppet run in the last 3 hours [08:59:54] PROBLEM - Puppet freshness on cp4018 is CRITICAL: No successful Puppet run in the last 3 hours [08:59:54] PROBLEM - Puppet freshness on db1030 is CRITICAL: No successful Puppet run in the last 3 hours [08:59:54] PROBLEM - Puppet freshness on db1016 is CRITICAL: No successful Puppet run in the last 3 hours [08:59:55] PROBLEM - Puppet freshness on hooft is 
CRITICAL: No successful Puppet run in the last 3 hours [08:59:55] PROBLEM - Puppet freshness on es8 is CRITICAL: No successful Puppet run in the last 3 hours [08:59:56] PROBLEM - Puppet freshness on lvs2 is CRITICAL: No successful Puppet run in the last 3 hours [08:59:56] PROBLEM - Puppet freshness on ms-be1008 is CRITICAL: No successful Puppet run in the last 3 hours [08:59:57] PROBLEM - Puppet freshness on ms1002 is CRITICAL: No successful Puppet run in the last 3 hours [08:59:57] PROBLEM - Puppet freshness on mw1034 is CRITICAL: No successful Puppet run in the last 3 hours [08:59:58] PROBLEM - Puppet freshness on mc1004 is CRITICAL: No successful Puppet run in the last 3 hours [08:59:58] PROBLEM - Puppet freshness on mw1050 is CRITICAL: No successful Puppet run in the last 3 hours [08:59:59] PROBLEM - Puppet freshness on mw1156 is CRITICAL: No successful Puppet run in the last 3 hours [08:59:59] PROBLEM - Puppet freshness on mw123 is CRITICAL: No successful Puppet run in the last 3 hours [09:00:00] PROBLEM - Puppet freshness on mw1183 is CRITICAL: No successful Puppet run in the last 3 hours [09:00:00] PROBLEM - Puppet freshness on mw4 is CRITICAL: No successful Puppet run in the last 3 hours [09:00:01] PROBLEM - Puppet freshness on mw72 is CRITICAL: No successful Puppet run in the last 3 hours [09:00:01] PROBLEM - Puppet freshness on oxygen is CRITICAL: No successful Puppet run in the last 3 hours [09:00:02] PROBLEM - Puppet freshness on srv263 is CRITICAL: No successful Puppet run in the last 3 hours [09:00:02] PROBLEM - Puppet freshness on ssl1009 is CRITICAL: No successful Puppet run in the last 3 hours [09:00:03] PROBLEM - Puppet freshness on ssl3002 is CRITICAL: No successful Puppet run in the last 3 hours [09:00:03] PROBLEM - Puppet freshness on virt1007 is CRITICAL: No successful Puppet run in the last 3 hours [09:00:04] PROBLEM - Puppet freshness on virt5 is CRITICAL: No successful Puppet run in the last 3 hours [09:00:54] PROBLEM - Puppet freshness on 
amssq38 is CRITICAL: No successful Puppet run in the last 3 hours [09:00:54] PROBLEM - Puppet freshness on amssq62 is CRITICAL: No successful Puppet run in the last 3 hours [09:00:54] PROBLEM - Puppet freshness on cp1055 is CRITICAL: No successful Puppet run in the last 3 hours [09:00:54] PROBLEM - Puppet freshness on db1053 is CRITICAL: No successful Puppet run in the last 3 hours [09:00:54] PROBLEM - Puppet freshness on db1059 is CRITICAL: No successful Puppet run in the last 3 hours [09:02:54] PROBLEM - Puppet freshness on amslvs3 is CRITICAL: No successful Puppet run in the last 3 hours [09:04:05] PROBLEM - Puppet freshness on amssq42 is CRITICAL: No successful Puppet run in the last 3 hours [09:04:05] PROBLEM - Puppet freshness on amssq39 is CRITICAL: No successful Puppet run in the last 3 hours [09:04:05] PROBLEM - Puppet freshness on analytics1004 is CRITICAL: No successful Puppet run in the last 3 hours [09:04:05] PROBLEM - Puppet freshness on amssq45 is CRITICAL: No successful Puppet run in the last 3 hours [09:04:05] PROBLEM - Puppet freshness on amssq49 is CRITICAL: No successful Puppet run in the last 3 hours [09:10:00] ' [09:18:15] RECOVERY - Puppet freshness on mw1008 is OK: puppet ran at Fri Nov 22 09:18:10 UTC 2013 [09:18:16] RECOVERY - Puppet freshness on mw1073 is OK: puppet ran at Fri Nov 22 09:18:10 UTC 2013 [09:18:16] RECOVERY - Puppet freshness on mw1009 is OK: puppet ran at Fri Nov 22 09:18:10 UTC 2013 [09:18:16] RECOVERY - Puppet freshness on mw1120 is OK: puppet ran at Fri Nov 22 09:18:10 UTC 2013 [09:18:16] RECOVERY - Puppet freshness on mw1060 is OK: puppet ran at Fri Nov 22 09:18:10 UTC 2013 [09:18:36] smptt stuck [09:19:06] (03PS1) 10Odder: (bug 57395) Add AbuseFilter rights to sysops on fiwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96967 [09:19:58] !log snmptt stuck waiting for a submit_check_result that had exited (or died), restarted it to clear the snmptt spool queue [09:20:15] Logged the message, Master 
[09:40:06] (03CR) 10Akosiaris: [C: 032] Add icinga monitoring for Gerrit and Gitblit [operations/puppet] - 10https://gerrit.wikimedia.org/r/75777 (owner: 10Chad) [09:50:29] akosiaris: got a few minutes to talk debugfs (or alternatives)? [09:51:27] ??? [09:51:32] yes [09:51:52] I dunno if you saw the backread but we had r/o filesystem on neon again this morning, it's triggered by the locate cron job (updatedb) which does a find on / [09:52:00] there's 4 inodes that have been deleted but are still in a directory entry, I have path names [09:52:40] ok [09:52:40] so for all four of these any attempt to read them will give ext3 whining about attempt to reference a deleted inode and then / will be remounted r/o and then we're hosed [09:52:57] yesterday paravoid did an fsck on neon at reboot [09:53:18] apparently this didn't actually fix the error because that same inode in yesterday's syslog was there today [09:53:47] after we hit the problem... I fscked blind today on reboot not knowing the underlying issue (neon has a console redirection problem) [09:54:03] but I can tell you the issue is still there, so [09:54:26] do we try debugfs undel and hope that works? or do you have some other thoughts? I have not ever used debugfs myself [09:55:10] the four paths are in /lib/modules/3.2.0-45-generic/kernel/net/netfilter/, please don't ls it or we'll be r/o again :-D [09:55:37] ugh. [09:55:38] don't ls the entire directory ? [09:55:58] stupid question [09:56:20] the directory is fine, ls does not touch the inode as long as -l is not passed [09:56:33] the four inodes are 21531 21525 21532 21526 [09:56:45] could you just unlink netfilter and recreate the dir and modules? [09:57:17] and the paths are xt_hashlimit.ko xt_u32.ko xt_esp.ko xt_socket.ko [09:57:47] I doubt we can unlink without ext3 complaining, because it would surely check the reference count of the inode [09:58:04] to decrement it...
and then be sad [10:00:27] and you think debugfs kill_file would just trigger the same fs behaviour? [10:00:59] (03PS3) 10Ori.livneh: log scap timing to graphite; parametrize statsd host/port [operations/puppet] - 10https://gerrit.wikimedia.org/r/96891 [10:01:13] well I think debugfs kill_file might not play well with the live ext3 journal, I really have no idea, first time I am looking at these things [10:01:20] (03PS4) 10Ori.livneh: log scap timing to graphite; parametrize statsd host/port [operations/puppet] - 10https://gerrit.wikimedia.org/r/96891 [10:01:42] this is why I thought an undel (which in theory just marks the inodes and their blocks as active) might be the safest [10:02:03] but I would love it if someone else who knows more would weigh in :-D maybe there's some better approach [10:02:55] kill_file does not remove directory entries for an inode; [10:03:11] it won't in fact change anything, the inodes are currently deleted already [10:03:26] (03CR) 10Ori.livneh: [C: 032] log scap timing to graphite; parametrize statsd host/port [operations/puppet] - 10https://gerrit.wikimedia.org/r/96891 (owner: 10Ori.livneh) [10:03:29] i am gonna get an image of the filesystem and do whatever actions there [10:03:39] ok [10:03:40] let's not play doctor on the live system [10:03:49] a fine idea [10:04:09] I mean it might be that telling fsck the right things would actually fix it too, I don't know [10:04:13] I'm curious about how it got into that state to begin with [10:04:19] you're telling me [10:04:44] also isn't it like 2 am where you are? :-D [10:04:51] 5 [10:04:51] 5 am! [10:04:56] holy crap [10:05:11] I'm all screwed up from being sick and drugged up, my sleep cycles are ruined [10:05:13] what are you doing awake at 5 am and you're sick? [10:05:14] ah [10:05:16] re: I doubt we can unlink without ext3 complaining [10:05:19] https://en.wikipedia.org/wiki/Ironic_process_theory [10:05:48] don't think of a white elephant!
[10:05:53] btw: debugfs unlink specifically doesn't update reference counts, says the manpage. but *shrug* who knows what'll trigger ext3 to complain :) [10:06:07] oh debugfs unlink [10:06:09] sorry :-D [10:06:39] I was thinking you meant unlink(2) which... [10:06:41] anyways [10:06:44] oh :) [10:07:41] dd if=/dev/md0 of=/dev/neon/lala bs=10M [10:07:57] waiting for it to finish and run tests there [10:08:03] great [10:12:46] *yawn* time to get out of the pjs, woke up to this and so... [10:20:31] apergos: so after dd and a noisy fsck -y [10:20:43] mostly because the dd was from a live filesystem [10:20:52] right [10:20:57] stat xt_hashlimit.ko [10:20:57] File: `xt_hashlimit.ko' [10:20:57] Size: 22432 Blocks: 48 IO Block: 4096 regular file [10:20:57] Device: fc03h/64515d Inode: 21531 Links: 1 [10:21:01] blabla [10:21:05] and no complaining [10:21:13] so I suggest we reboot and fsck [10:21:16] but [10:21:22] you said we have no console right ? [10:21:26] serial ? [10:21:43] so back to vga through some crappy activex/java applet... [10:21:49] yes, paravoid did some editing of grub options [10:21:56] lemme find it in the scrollback [10:22:01] yeah [10:22:06] console=ttyS1 [10:22:11] "went to grub, removed console=ttyS1" [10:22:39] so I would like to hear how you get to that (yet another thing I've never done) [10:22:55] thinking about it... [10:27:33] (another thing that would be mighty fine would be to actually *fix* the console issue) [10:52:54] mornin [10:53:09] !log rebooting neon for fsck [10:53:21] down again? [10:53:22] Logged the message, Master [10:53:26] this morning [10:53:29] you missed the backread [10:53:31] I did [10:53:41] now it is a preemptive strike :P [10:53:41] fun!
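The approach taken above — copy the block device with dd, then run fsck and debugfs against the copy instead of the live filesystem — can be sketched with a throwaway file-backed image. The scratch paths below are hypothetical (the real session imaged /dev/md0 onto an LVM volume), but the tool invocations are the same:

```shell
# Sketch of the offline-repair workflow; hypothetical scratch paths.
export PATH="$PATH:/sbin:/usr/sbin"   # e2fsprogs tools often live here

# Stand-in for the dd copy of the damaged filesystem: a small
# file-backed ext2 image, which needs no root privileges to create.
dd if=/dev/zero of=/tmp/neon-image.ext2 bs=1M count=8 2>/dev/null
mke2fs -q -F /tmp/neon-image.ext2

# Offline fsck of the copy, as was done after the dd finished.
fsck.ext2 -f -y /tmp/neon-image.ext2

# debugfs can then inspect suspect inodes by number, e.g. the ones
# ext3 complained about on neon (21531 21525 21532 21526); inode 2
# is the root directory and exists on any ext2/3/4 filesystem.
debugfs -R "stat <2>" /tmp/neon-image.ext2
```

Experimenting on the copy this way means a misstep with undel, kill_file, or unlink cannot remount the live / read-only again.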
[10:53:50] serial console doesn't work btw [10:53:55] he knows [10:53:58] okay [10:54:00] i got vga [10:54:04] k [10:54:07] long live IE 8 with activex [10:54:10] and virtualbox [10:54:15] ugh you are kidding me [10:54:17] worked for me with java 6 and webstart [10:54:42] error: unknown terminal serial [10:54:46] yup [10:54:47] error unknown serial port [10:54:48] nice [10:54:57] so there are 4 inodes that claim to be unreferenced (by ext3) and hence the locate cron job doing its find on / the last two mornings... [10:55:03] that's because in bios the "console redirection after boot" is set to "true" [10:55:13] when I set that to "false", this message stopped appearing [10:55:19] but I couldn't get into the grub menu [11:07:22] (03PS1) 10Akosiaris: Add gerrit ferm rules for production [operations/puppet] - 10https://gerrit.wikimedia.org/r/96980 [11:09:20] akosiaris: did you see the svn ferm patchset? [11:09:35] not studied them yet [11:09:44] I did not study them yet* [11:09:55] but yes [11:21:50] !log neon booted and seems ok. running updatedb manually [11:22:07] Logged the message, Master [11:22:10] oh and i hate grub2 [11:22:20] but don't log that morebots :P [11:22:21] you could just try looking at those four files :-P [11:22:32] i purged them [11:22:39] ah [11:22:40] i did btw [11:22:45] and good riddance to them [11:22:47] yes?
[11:22:52] before purge all non running kernels [11:23:00] no problems [11:23:06] seems like we are ok [11:23:37] great [11:35:36] (03CR) 10Akosiaris: [C: 032] Extract geowiki paramaters into separate class [operations/puppet] - 10https://gerrit.wikimedia.org/r/96538 (owner: 10QChris) [11:35:47] (03CR) 10Akosiaris: [V: 032] Extract geowiki paramaters into separate class [operations/puppet] - 10https://gerrit.wikimedia.org/r/96538 (owner: 10QChris) [11:36:04] (03CR) 10Akosiaris: [C: 032 V: 032] Backup geowiki's data-private bare repository [operations/puppet] - 10https://gerrit.wikimedia.org/r/95363 (owner: 10QChris) [11:39:54] akosiaris: [11:39:55] Error: Could not find any host matching 'gerrit.wikimedia.org' (config file '/etc/icinga/puppet_services.cfg', starting on line 56158) [11:39:57] (neon) [11:40:36] I guess that's probably chad's changeset [11:41:00] I never thought it might depend on that... [11:41:04] damn... lemme check [11:44:50] apergos: Total Warnings: 0 [11:44:51] Total Errors: 0 [11:44:51] Things look okay - No serious problems were detected during the pre-flight check [11:45:12] cleared itself up [11:45:13] weird [11:45:17] huh [11:45:32] maybe in between puppet runs ? [11:45:39] maybe [11:45:44] going to overlook it and move on :-D [11:45:45] more like inside one puppet run [11:46:24] it's commented out [11:46:31] did you do that ? [11:46:34] nope [11:46:41] all I did was check the config [11:48:13] someone vi'ed something over there [11:49:00] but maybe that was you looking [11:49:06] that was me [11:49:15] meh [11:49:16] i reran puppet manually [11:49:23] let's see what will happen [11:49:26] ok [12:03:21] apergos: found it.
gerrit.pp line 73 [12:03:30] a parameter called hostname [12:03:33] * akosiaris sad [12:03:55] and of course monitor_service does not use $::hostname but $hostname [12:04:02] and yada yada yada yada [12:04:14] the first time puppet scoping bites me that badly [12:05:16] I was just fixing those [12:05:36] I'll push this in a minute [12:05:42] ok [12:06:10] lunch time then :-) [12:12:37] (03PS1) 10ArielGlenn: qualify $hostname and $ipaddress references [operations/puppet] - 10https://gerrit.wikimedia.org/r/96982 [12:14:14] apergos: cool! [12:14:22] thx for cp3013/4 btw [12:14:58] yw [12:38:50] who among us knows about getting multiple versions of a package into our apt repo? [12:39:41] springle: no one, because it's impossible :) [12:39:48] reprepro doesn't support that [12:40:07] oh [12:40:22] springle: since you're here (isn't it very late?), have you seen Megan's mail on the ops list? [12:40:37] I'm inclined to follow your lead there [12:40:40] looking for a way to have a newer mariadb in there for testing [12:40:51] hmm no, looking [12:46:50] I saw multiple versions of a few packages [12:47:00] but I didn't check to see if that was per distro [12:47:07] I just looked at the pools [12:49:32] paravoid: responded to megan.
and yes, it's late, but i'm just imitating your own dedicated approach to work ;) [12:49:42] sorry, s/dedicated/crazy/ [12:49:53] lol [12:50:58] (03CR) 10ArielGlenn: [C: 032] qualify $hostname and $ipaddress references [operations/puppet] - 10https://gerrit.wikimedia.org/r/96982 (owner: 10ArielGlenn) [12:56:03] paravoid: You can't call before midnight early [12:56:09] Same as I can't [12:56:10] :P [12:57:20] (03PS2) 10Reedy: (bug 56859) Update $wgCategoryCollation on iswiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94607 (owner: 10Odder) [12:57:25] (03CR) 10Reedy: [C: 032] (bug 56859) Update $wgCategoryCollation on iswiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94607 (owner: 10Odder) [12:57:37] (03Merged) 10jenkins-bot: (bug 56859) Update $wgCategoryCollation on iswiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94607 (owner: 10Odder) [12:59:37] !log reedy synchronized wmf-config/InitialiseSettings.php 'Ic36cd656233dfc959c92a33bade794cc5c1e1bd2' [12:59:52] Logged the message, Master [13:00:20] so reedy are you just getting up or are you still awake, that is the question [13:00:40] Normally I'd still be sleeping... [13:01:01] I've been up about 5 hours. After about 5 hours sleep [13:04:08] (03CR) 10Reedy: "For future notice, there's no reason to list the bug number in the first line of the commit summary" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94607 (owner: 10Odder) [13:04:41] (03CR) 10Faidon Liambotis: [C: 04-1] "I didn't code review the PHP source at all (although I can say an immediate comment saying that there's virtually no comments :)." (035 comments) [operations/apache-config] - 10https://gerrit.wikimedia.org/r/96438 (owner: 10Tim Starling) [13:12:23] (03CR) 10Reedy: "Can we do anything to condense some of the simpler cases?" 
[operations/apache-config] - 10https://gerrit.wikimedia.org/r/96438 (owner: 10Tim Starling) [13:19:23] (03CR) 10Faidon Liambotis: [C: 04-1] "Thanks for attempting to cleanup all this craft, very much appreciated! Comments inline :-)" (036 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/96961 (owner: 10Ori.livneh) [13:35:14] dysprosium grrrr [13:41:13] PROBLEM - Puppet freshness on stat1 is CRITICAL: No successful Puppet run in the last 3 hours [13:43:17] (03CR) 10Faidon Liambotis: [C: 04-1] Normalise the path part of URLs in the text frontend (034 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/96941 (owner: 10Tim Starling) [13:46:39] (03CR) 10Odder: "Why not? A Git Wizard who introduced me to Git+Gerrit told me it was kind of useful for people at the time." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94607 (owner: 10Odder) [13:47:47] (03CR) 10Catrope: "At the time that's what our convention was, but we've since moved to using the Bug: NNNNN footer instead." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94607 (owner: 10Odder) [14:06:51] (03PS2) 10Akosiaris: Cleanup puppetmaster's apache Listen ports [operations/puppet] - 10https://gerrit.wikimedia.org/r/96262 [14:07:47] (03CR) 10jenkins-bot: [V: 04-1] Cleanup puppetmaster's apache Listen ports [operations/puppet] - 10https://gerrit.wikimedia.org/r/96262 (owner: 10Akosiaris) [14:12:41] (03PS1) 10Faidon Liambotis: Capitalize resources, fix deprecation notice [operations/puppet] - 10https://gerrit.wikimedia.org/r/96993 [14:13:06] (03CR) 10Faidon Liambotis: [C: 032] Capitalize resources, fix deprecation notice [operations/puppet] - 10https://gerrit.wikimedia.org/r/96993 (owner: 10Faidon Liambotis) [14:13:14] (03CR) 10jenkins-bot: [V: 04-1] Capitalize resources, fix deprecation notice [operations/puppet] - 10https://gerrit.wikimedia.org/r/96993 (owner: 10Faidon Liambotis) [14:13:38] really?? 
[14:14:28] fuck you puppet [14:14:33] (03PS2) 10Faidon Liambotis: Capitalize resources, fix deprecation notice [operations/puppet] - 10https://gerrit.wikimedia.org/r/96993 [14:15:57] (03CR) 10Faidon Liambotis: [C: 032] Capitalize resources, fix deprecation notice [operations/puppet] - 10https://gerrit.wikimedia.org/r/96993 (owner: 10Faidon Liambotis) [14:21:14] (03PS3) 10Akosiaris: Cleanup puppetmaster's apache Listen ports [operations/puppet] - 10https://gerrit.wikimedia.org/r/96262 [14:28:20] (03CR) 10Akosiaris: [C: 032] Cleanup puppetmaster's apache Listen ports [operations/puppet] - 10https://gerrit.wikimedia.org/r/96262 (owner: 10Akosiaris) [14:30:16] (03PS4) 10Akosiaris: Change the way cron run times are calculated [operations/puppet] - 10https://gerrit.wikimedia.org/r/96247 [14:33:49] (03CR) 10Akosiaris: [C: 032] Change the way cron run times are calculated [operations/puppet] - 10https://gerrit.wikimedia.org/r/96247 (owner: 10Akosiaris) [14:39:12] PROBLEM - SSH on amslvs1 is CRITICAL: Server answer: [14:39:38] (03PS1) 10ArielGlenn: qualify $decommissioned_servers [operations/puppet] - 10https://gerrit.wikimedia.org/r/96995 [14:40:12] RECOVERY - SSH on amslvs1 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [14:40:47] (03CR) 10ArielGlenn: [C: 032] qualify $decommissioned_servers [operations/puppet] - 10https://gerrit.wikimedia.org/r/96995 (owner: 10ArielGlenn) [14:47:30] (03CR) 10Umherirrender: "Looks like this patch set will also undeploy the extension for wikis which does not have reached the necessary mediawiki version. A versio" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96931 (owner: 10Legoktm) [14:49:01] (03PS1) 10Akosiaris: Convert string to integer in puppet.cron.er [operations/puppet] - 10https://gerrit.wikimedia.org/r/96998 [14:49:07] er! 
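On the earlier reprepro question (springle looking for a way to carry a newer mariadb for testing): reprepro stores only one version of each package per distribution, so the usual workaround is to define a separate staging suite in `conf/distributions` and upload the newer build there. A hypothetical stanza, not the actual Wikimedia repo config:

```
Origin: Wikimedia
Label: Wikimedia
Suite: precise-wikimedia-testing
Codename: precise-wikimedia-testing
Architectures: amd64 i386 source
Components: main
Description: staging suite carrying newer package versions for testing
```

Hosts under test would then add this suite to their sources.list alongside the production one; the pool can hold both versions because they live in different distributions.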
[14:50:23] (03CR) 10Akosiaris: [C: 032] Convert string to integer in puppet.cron.er [operations/puppet] - 10https://gerrit.wikimedia.org/r/96998 (owner: 10Akosiaris) [14:59:57] (03PS4) 10Dr0ptp4kt: WIP: DO NOT MERGE YET. Apply FlaggedRevs to metawiki for W0. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95662 [15:00:23] (03PS5) 10Dr0ptp4kt: Apply FlaggedRevs to metawiki for W0. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95662 [15:01:58] (03CR) 10Dr0ptp4kt: Apply FlaggedRevs to metawiki for W0. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95662 (owner: 10Dr0ptp4kt) [15:03:18] (03Abandoned) 10Dr0ptp4kt: W0 globals order. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96824 (owner: 10Dr0ptp4kt) [15:12:43] (03PS1) 10Akosiaris: Qualify puppetmaster backends checks [operations/puppet] - 10https://gerrit.wikimedia.org/r/97003 [15:12:58] (03PS1) 10Dr0ptp4kt: Automatically pull proxies from Wikipedia Zero's config namespace on META. [operations/puppet] - 10https://gerrit.wikimedia.org/r/97004 [15:14:14] (03CR) 10Akosiaris: [C: 032] Qualify puppetmaster backends checks [operations/puppet] - 10https://gerrit.wikimedia.org/r/97003 (owner: 10Akosiaris) [15:14:24] Nikerabbit: around? [15:14:34] (whoops, wrong channel.) 
[15:39:55] (03PS1) 10ArielGlenn: 'qualify' openstack_version [operations/puppet] - 10https://gerrit.wikimedia.org/r/97007 [15:40:31] (03CR) 10jenkins-bot: [V: 04-1] 'qualify' openstack_version [operations/puppet] - 10https://gerrit.wikimedia.org/r/97007 (owner: 10ArielGlenn) [15:42:45] (03PS2) 10ArielGlenn: 'qualify' openstack_version [operations/puppet] - 10https://gerrit.wikimedia.org/r/97007 [15:43:18] (03CR) 10jenkins-bot: [V: 04-1] 'qualify' openstack_version [operations/puppet] - 10https://gerrit.wikimedia.org/r/97007 (owner: 10ArielGlenn) [15:45:34] (03PS3) 10ArielGlenn: 'qualify' openstack_version [operations/puppet] - 10https://gerrit.wikimedia.org/r/97007 [15:58:42] (03CR) 10Cmcmahon: [C: 031] "adding Antoine and Zeljko. I think is the right thing to do, but Antoine would probably know for certain." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93610 (owner: 10MarkTraceur) [16:04:34] (03PS1) 10Faidon Liambotis: Add check_graphite & a reqstats 5xx check [operations/puppet] - 10https://gerrit.wikimedia.org/r/97008 [16:07:13] (03CR) 10Akosiaris: [C: 04-1] "Try to have plugins shipped by us in /usr/local please. 
Other than that LGTM" (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/97008 (owner: 10Faidon Liambotis) [16:08:43] Nemo_bis: ^^^ [16:08:59] Nemo_bis: I'm trying to automate you [16:15:43] (03PS2) 10Faidon Liambotis: Add check_graphite & a reqstats 5xx check [operations/puppet] - 10https://gerrit.wikimedia.org/r/97008 [16:16:05] paravoid: awesomesauce [16:16:18] (03CR) 10Anomie: Normalise the path part of URLs in the text frontend (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/96941 (owner: 10Tim Starling) [16:16:44] (03CR) 10Faidon Liambotis: [C: 032] Add check_graphite & a reqstats 5xx check [operations/puppet] - 10https://gerrit.wikimedia.org/r/97008 (owner: 10Faidon Liambotis) [16:16:59] greg-g: hey :) [16:27:17] (03PS1) 10Faidon Liambotis: Revert geowiki separate class & backup changes [operations/puppet] - 10https://gerrit.wikimedia.org/r/97021 [16:28:03] (03CR) 10Faidon Liambotis: [C: 032] Revert geowiki separate class & backup changes [operations/puppet] - 10https://gerrit.wikimedia.org/r/97021 (owner: 10Faidon Liambotis) [16:31:08] RECOVERY - Puppet freshness on stat1 is OK: puppet ran at Fri Nov 22 16:31:04 UTC 2013 [16:32:09] (03CR) 10Gergő Tisza: [C: 04-1] "I think at the time of writing this, we still thought real Commons is configured as a ForeignAPIRepo. Actually it is a ForeignDBViaLBRepo," [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93610 (owner: 10MarkTraceur) [16:39:06] (03PS3) 10Andrew Bogott: role classes for download servers [operations/puppet] - 10https://gerrit.wikimedia.org/r/96415 (owner: 10Dzahn) [16:39:27] Reedy: so, scap updates the l10n cache, but, when do new translations get pulled from translatewiki? ie: if scap was run now, would it pull in all new translations from translatewiki, or do I need to wait until a certain time (or can I force that part)?
[16:39:53] No [16:40:08] localisation update runs at around 0100 UTC [16:40:20] or 0200 [16:40:20] gotcha [16:40:21] Or whatever [16:40:23] right [16:40:26] That's irrelevant :P [16:40:34] just not "now" [16:40:36] ;) [16:40:39] Exactly [16:40:51] So at that point, it'll update its local repos to their masters [16:40:56] * greg-g nods [16:41:20] so, there's no way to force it in the case of a situation where login on chechen wiki is broken with its current translations? :) [16:41:21] The localisation cache updates against that [16:41:25] * aude still fuzzy on how it works [16:41:26] (03CR) 10Andrew Bogott: [C: 032] role classes for download servers [operations/puppet] - 10https://gerrit.wikimedia.org/r/96415 (owner: 10Dzahn) [16:41:32] Are the translations in master? [16:41:42] we could update master :) [16:41:53] * greg-g is still fuzzy too [16:41:54] but i don't know if that interferes with the automatic stuff [16:42:10] and if it's been broken for months, then what's another couple hours [16:42:21] heh [16:42:24] right [16:42:36] Are aliases exported every night? [16:42:37] * aude still puzzled how it could be broken for months [16:42:42] alright, I'll ignore for now ;) [16:42:48] I'm not 100% sure, or if it's more a manual process [16:42:52] Reedy: no idea [16:42:56] No raymond [16:42:57] we should ask [16:43:00] siebrand: About? [16:43:24] siebrand: Are aliases and such updated every night in master? Or do they need doing separately? [16:43:29] after localisation runs and it's not fixed, then maybe we can update the alias manually [16:43:45] and hope it doesn't break automatic update [16:44:13] Why would it? [16:44:20] no idea [16:44:21] Reedy: sup? [16:45:15] https://bugzilla.wikimedia.org/show_bug.cgi?id=57410 [16:46:13] siebrand: Are special page aliases updated with the normal i18n updates? [16:47:09] I seem to recall maybe for core, but not for extensions [16:47:53] (03CR) 10ArielGlenn: "the second stanza with tesla.wm.o can for sure be gone.
not 100% about the first one, *think* so but better to ask ryan." [operations/puppet] - 10https://gerrit.wikimedia.org/r/96489 (owner: 10Dzahn) [16:48:02] https://github.com/wikimedia/mediawiki-extensions-SecurePoll/commits/master/SecurePoll.alias.php [16:48:48] Reedy: I'd say that the latest special page aliases update was probably months ago. [16:48:58] Reedy: Does that mean it's been broken for months? [16:48:58] Aha [16:49:22] oh, it's only done periodically? [16:49:48] Reedy: Special page aliases for core are exported rarely. For extensions, it's done every time Raymond runs exports. [16:49:51] siebrand: appears it broke in july and no one noticed [16:49:55] apparently [16:50:02] for cewiki [16:50:11] (03Abandoned) 10MarkTraceur: Set up Beta Commons as an API repo for beta sites [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93610 (owner: 10MarkTraceur) [16:50:19] Reedy: So if it is only broken since recently, it could be an issue with an extension special page. [16:50:38] siebrand: it's been months [16:50:55] Reedy: I see in the bug "translatewiki.net" does nothing, but I have no idea what it is that I/we should have done. [16:51:08] aude: okay, then it's probably broken since the last update. [16:51:21] aude: Okay, then it's probably broken since the last update. [16:51:21] i think they thought editing in translate wiki would immediately fix [16:51:52] aude: we have a pretty clearly advertised support page. Issues that are urgent from a production perspective ARE being picked up. [16:52:00] What's on Translate is the same as in master [16:52:01] Белхан [16:52:33] hmmm [16:52:51] Reedy: Lemme do a quick special page aliases export for "ce" only, maybe you can deploy that single revision. [16:52:57] Reedy: Give me 10 mins... 
[16:53:02] https://ce.wikipedia.org/w/index.php?title=%D0%91%D0%B5%D0%BB%D1%85%D0%B0%D0%BD:%D0%A7%D1%83%D0%B2%D0%B0%D0%BB%D0%B0%D1%80/%D1%8F%D0%BB%D0%B0%D1%80&returnto=%D0%9A%D0%BE%D1%8C%D1%80%D1%82%D0%B0+%D0%B0%D0%B3%D3%80%D0%BE [16:53:03] Ahh [16:53:17] No such special page [16:53:17] You have requested an invalid special page. [16:53:38] Чувалар/ялар [16:53:49] what i see in master [16:54:13] it's probably a translation for "create account/login" [16:55:47] This IS pretty stupid: [16:55:47] - 'Userlogin' => array( 'Чувалар/ялар' ), [16:55:48] - 'Userlogout' => array( 'Аравалар/ялар' ), [16:55:53] - 'Userlogin' => array( 'Чувалар/ялар' ), [16:55:54] - 'Userlogout' => array( 'Аравалар/ялар' ), [16:56:23] (03CR) 10Ori.livneh: rewrite nginx module (034 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/96961 (owner: 10Ori.livneh) [16:56:26] ori-l: hey [16:56:56] [[Special:UserLogin|системин чугӀо]] [16:57:18] English all over [16:57:43] hey paravoid [16:57:46] Reedy: Patch being submitted. [16:57:54] wee, thanks siebrand [16:57:56] ori-l: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=tungsten&service=HTTP+5xx+req%2Fmin [16:58:05] Yay [16:58:36] paravoid: oh look, it's there! yay [16:58:42] graphite-based alerting? [16:58:45] yup [16:58:53] and the nagios check I picked has pretty nice arguments too [16:58:57] that's awesome [16:59:00] Reedy: What are the currently deployed branches? [16:59:07] ! [16:59:09] siebrand: 4 and 5 [16:59:10] 1.23wmf4 and 1.23wmf5 [16:59:17] percentiles, holt winters confidence, over/under (so you can say "2xx less than 70k/min? alert!") etc. [16:59:28] greg-g: This is something we should probably bring up with andre [16:59:28] paravoid: love it [16:59:30] https://gerrit.wikimedia.org/r/#/c/97032/ [16:59:33] Reedy: which? 
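The ce.wikipedia.org login link fails at the alias-lookup step: MediaWiki maps a localized special-page title back to a canonical name through the `$specialPageAliases` arrays quoted above, and an unknown title gives the "You have requested an invalid special page" error. A minimal sketch of that lookup (the dict mirrors the shape of MessagesCe.php; the resolver is a simplified stand-in, not MediaWiki's actual code):

```python
# Alias table shaped like the $specialPageAliases arrays in MessagesCe.php;
# canonical English name -> list of localized aliases.
SPECIAL_PAGE_ALIASES = {
    "Userlogin": ["Чувалар/ялар"],
    "Userlogout": ["Аравалар/ялар"],
}

def resolve_special_page(title):
    """Map a localized special-page title to its canonical name.
    Returning None corresponds to the 'invalid special page' error
    seen on ce.wikipedia.org before the alias update was deployed."""
    for canonical, aliases in SPECIAL_PAGE_ALIASES.items():
        if title == canonical or title in aliases:
            return canonical
    return None
```

Until the deployed MessagesCe.php contained the current alias, the localized title simply resolved to nothing, which is why updating the single language file (and rebuilding the l10n cache) fixes it.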
[16:59:57] That these sort of i18n/l10n type bugs need to get triaged by our language people [16:59:59] paravoid: please announce to ops/engineering list with "now go find more things to alert yourselves of!" ;) [17:00:00] the 5xx check is very simple though [17:00:06] Reedy: ah [17:00:09] Though, I note that was only logged today [17:00:10] Reedy: https://gerrit.wikimedia.org/r/#/c/97033/ https://gerrit.wikimedia.org/r/#/c/97034/ [17:00:10] > 250 warn, > 500 crit [17:00:16] So maybe not so much of an issue [17:00:16] if it works well, we can even make it paging [17:00:18] * aude would minimum like jenkins to block on such translations [17:00:26] at minimum* [17:00:26] maybe with an increase threshold [17:00:32] too many 5xx should page [17:00:34] better yet, translate also not accept them [17:00:59] Reedy: I'm waiting for Jenkins to +2 it on master (https://gerrit.wikimedia.org/r/#/c/97032/ ) [17:01:15] Great [17:01:48] Reedy: +2-ed on master. Jenkins will merge soon. [17:01:48] gwicke ^ see above for graphite-based alerting [17:02:04] I'll force through on the deployment branches [17:02:27] Reedy: yeah, that should be safe. [17:02:40] Reedy: I'll double check the diffs on the branches. [17:03:03] (03PS1) 10coren: Tool Labs: Install djvulibre support [operations/puppet] - 10https://gerrit.wikimedia.org/r/97035 [17:06:33] Something up with Jenkins? [17:07:02] Coren: usually just have patience [17:07:44] It's not usually this long. Oh well. [17:08:02] try commenting just "recheck" [17:09:26] !log reedy synchronized php-1.23wmf4/languages/messages/MessagesCe.php 'Update ce language special page aliases' [17:09:43] Logged the message, Master [17:10:31] !log reedy synchronized php-1.23wmf5/languages/messages/MessagesCe.php 'Update ce language special page aliases' [17:10:46] Logged the message, Master [17:13:35] oh hey, siebrand while you're 'here', how'd the language summit go? [17:13:51] siebrand: I heard there was going to be some refactoring somewhere? 
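The graphite-based 5xx check paravoid describes boils down to comparing a req/min rate against the warn/crit thresholds above and returning a Nagios exit code. A sketch with the quoted thresholds (> 250 warn, > 500 crit); the production check is a Nagios plugin that fetches the rate from graphite, so everything here is illustrative:

```python
# Nagios exit-code convention: 0 = OK, 1 = WARNING, 2 = CRITICAL.
def check_5xx_rate(reqs_per_min, warn=250, crit=500):
    """Compare a 5xx req/min rate (as fetched from graphite)
    against the warn/crit thresholds quoted in the discussion."""
    if reqs_per_min > crit:
        return 2, f"CRITICAL: {reqs_per_min} 5xx/min > {crit}"
    if reqs_per_min > warn:
        return 1, f"WARNING: {reqs_per_min} 5xx/min > {warn}"
    return 0, f"OK: {reqs_per_min} 5xx/min"
```

The "percentiles, holt winters confidence, over/under" arguments mentioned above generalize this: instead of a static threshold, the check can alert when a series leaves a predicted band, or when a rate (e.g. 2xx/min) drops *below* a floor.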
anything I should be aware of re deployments? [17:13:53] greg-g: It was pretty nice. [17:14:09] greg-g: More than 60 participants, from 10 or so orgs. [17:14:10] siebrand: I have this on the 'near term' section: "mid-November: i18n refactoring during Language Summit (November 18–19) " [17:14:14] siebrand: awesome! [17:14:37] greg-g: hmm, not sure where that comes from :P [17:14:46] alolita [17:14:50] greg-g: we did prepare two RFCs with the VE team. [17:14:54] oh, cool [17:15:03] so, I'll just remove that line and not worry? :) [17:15:07] greg-g: What you're saying is not true. Nothing has been refactored. [17:15:23] greg-g: We have an RFC draft for changing the localisation format to JSON instead of PHP arrays. [17:15:31] cool [17:15:40] I'm ok with being untrue, as long as I fix it. [17:15:42] :) [17:15:51] greg-g: https://www.mediawiki.org/wiki/Requests_for_comment/Localisation_format James and I hope to announce it this weekend or early next week. [17:15:57] * greg-g nods [17:16:31] * aude assumes we need localisation cache rebuild? [17:16:49] aude: Yup [17:16:51] greg-g: The second RFC is about replacing the front-end i18n with a MediaWiki-independent i18n library (jquery.i18n, see Wikimedia's GitHub), and updating ResourceLoader to make it work in mediawiki, too. [17:17:23] k [17:17:30] greg-g: The use case is basically already implemented in ULS, and Visual Editor also needs a stand-alone i18n solution. [17:17:37] * greg-g nods [17:17:44] siebrand: would be awesome [17:17:47] greg-g: we were planning on replacing jQueryMsg (or whatever it's called now) [17:18:18] VE is ambitious and plans to get a working solution out the door by xmas. With Roan and Timo on it, that should be doable. [17:18:36] It would also mean that jquery.i18n would get a lot more exposure, and hopefully develop much quicker. 
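For context on the Localisation_format RFC siebrand links: the proposal replaces per-language PHP arrays (roughly `$messages['en']['sunday'] = 'Sunday';`) with one JSON message file per language. A hypothetical fragment of what such a file would look like; the format was still a draft at this point, so the metadata keys and message names here are purely illustrative:

```json
{
    "@metadata": {
        "authors": ["..."]
    },
    "sunday": "Sunday",
    "monday": "Monday"
}
```

The draw is that JSON is trivially consumable by a stand-alone front-end library like the jquery.i18n work mentioned below, without a PHP parser in the loop.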
[17:18:51] as you know, we like to make some of the wikidata widgets reusable outside of mediawiki [17:18:58] can you see what I wrote in the line before about exposure and quicker development? [17:19:08] the widgets are tied to mediawiki because of i18n, often [17:19:14] Colloquy sometimes misses lines :( [17:19:35] siebrand: i see [17:19:50] https://etherpad.wikimedia.org/p/i18n-rfc-2013-11 contains the working draft, which still has to be updated and moved to mediawiki.org. [17:19:55] !log reedy synchronized php-1.23wmf4/extensions/EducationProgram/includes/Events/EditEventCreator.php [17:19:57] lines 13-39 [17:20:09] Logged the message, Master [17:20:29] The draft RFC is open for comments, I think. [17:20:33] So please do read. [17:20:43] siebrand: thanks [17:21:09] !log reedy synchronized php-1.23wmf5/extensions/EducationProgram/includes/Events/EditEventCreator.php [17:21:11] The second one at least needs some more preface so we provide a better reason why. [17:21:20] I'm troubleshooting elsewhere, too. [17:21:23] Logged the message, Master [17:21:24] going in idle mode. [17:21:33] Reedy: please let me know when fixed. I'm interested :) [17:22:06] siebrand: wikidata is a good use case for frontend library [17:23:16] Reedy: we're still waiting on the translation fix for cewiki right? [17:23:51] localisation cache rebuild [17:24:09] right, so, we're waiting until the auto one, right? [17:24:12] i don't think it blocks deploying something esle [17:24:16] * greg-g nods [17:24:18] no idea [17:24:22] else* [17:25:15] MaxSem: when would you be ready to deploy that revert? [17:25:42] brb, coffee mug almost broke on my face [17:25:55] greg-g, I'll have a PDF check up now, will be ready after 10 [17:27:14] greg-g: I was planning on scapping to confirm it's fixed [17:27:24] gwicke, RoanKattouw: around? [17:27:34] paravoid: For only a second longer [17:27:35] Why? 
[17:27:41] I have a couple of questions [17:27:52] paravoid: pong [17:27:58] hey [17:28:10] so, I noticed that varnish frontends get a lot of POST api.php?random=NNNN requests [17:28:17] from parsoid servers [17:28:26] paravoid: we have a bug for that already [17:28:44] https://bugzilla.wikimedia.org/show_bug.cgi?id=51273 [17:28:51] didn't get around to it yet [17:28:54] heh [17:28:57] okay, I was about to suggest that [17:28:58] thanks :) [17:29:25] Reedy: aha, see -tech if you want to pull along a revert for the ride [17:29:34] * gwicke does some subject updating with s/Squid/Varnish/ [17:29:51] RoanKattouw: the other thing was more for you personally [17:30:41] RoanKattouw: I noticed a bunch of MediaWiki:* being loaded, among them Common.js [17:30:51] that e.g. for Meta, you're listed as author [17:31:00] these are all with Cache-control: private [17:31:22] completely uncached and "Any JavaScript here will be loaded for all users on every page load" [17:31:42] Ahm, wtf [17:31:54] paravoid: I'm about to head out to dinner so can you email me with more details? [17:31:56] This sounds concerning [17:31:59] not sure if "all users" includes anons, although the req/s for this definitely points that way [17:32:12] (03CR) 10coren: [C: 032] Tool Labs: Install djvulibre support [operations/puppet] - 10https://gerrit.wikimedia.org/r/97035 (owner: 10coren) [17:32:17] dinner? I hadn't realized you're on this side of the pond :) [17:32:22] paravoid: If you could email me URLs and headers that would be great [17:32:26] Yeah, we're in London right now [17:32:30] sure, I can do that [17:32:31] with the VE team [17:32:34] have fun! [17:32:37] Stopping over on the way back from India [17:33:39] paravoid: And the req/s numbers too if you've already got them anyway [17:33:51] okay [17:34:09] what are these MediaWiki:* javascripts? it says it's not gadgets, what are you calling those? 
[17:34:48] I don't know [17:34:50] Depends on the URL [17:34:52] Really gotta go now [17:34:54] okay [17:34:55] bye! [17:35:07] paravoid: are they per-wiki or cross-wiki? [17:35:48] paravoid: importScript ? [17:35:58] or mw.loader.load [17:36:07] MediaWiki:Common.js is the custom JS for each wiki, similar to MediaWiki:Common.css [17:36:21] which can then load more stuff [17:36:38] ah mutante and andrewbogott the download module looks fine, also the puppet runs seem to be unchanged so thanks for that [17:36:57] depends [17:37:18] 3754 Hash c /w/index.php?title=MediaWiki:RefToolbarBase.js&action=raw&ctype=text/javascript [17:37:21] 3754 Hash c en.wikipedia.org [17:37:25] 3754 RxHeader c Referer: http://es.wikipedia.org/wiki/Bruxismo [17:37:29] * aude click [17:37:35] andrewbogott: :) ty [17:37:43] that's one of the top ones [17:38:02] they don't set smaxage in the request [17:38:09] (03PS1) 10coren: Tool Labs: install doxygen{,-latex} [operations/puppet] - 10https://gerrit.wikimedia.org/r/97055 [17:38:13] &smaxage=86480 [17:38:24] (IIRC) [17:39:24] so, eswiki is loading it? [17:39:24] apergos: also thank matanya :) [17:39:30] that one is cached [17:39:38] Cache-control: public, s-maxage=300, max-age=2678400 [17:39:50] others are not [17:39:58] gone, will do next time [17:40:14] ok, I'll figure it out and mail roan [17:40:21] thanks :) [17:40:34] paravoid: https://www.mediawiki.org/wiki/Manual:Parameters_to_index.php#Raw [17:41:16] I wrote that thing a while ago ;) [17:41:51] heh [18:00:54] PROBLEM - SSH on amslvs1 is CRITICAL: Server answer: [18:01:54] RECOVERY - SSH on amslvs1 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [18:04:04] paravoid: hehe, automating me looks like a scary goal :) [18:09:10] Hey Reedy, do you know where the source for our current version of texvc lives? 
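On the action=raw caching question above: the difference between the cached eswiki load and the uncached ones is in the URL itself. As the Manual:Parameters_to_index.php page paravoid links describes, requests that pass `smaxage` come back with a public, cacheable Cache-control header, while those without it are served `Cache-control: private`. A sketch of building such a URL (the host, title, and 86400 value are just taken from or patterned on the examples in the discussion):

```python
from urllib.parse import urlencode

def raw_script_url(host, title, smaxage=86400):
    """Build an index.php?action=raw URL that Squid/Varnish can cache.
    Without the smaxage parameter the response is Cache-control:
    private, as seen for the uncached MediaWiki:*.js loads above."""
    query = urlencode({
        "title": title,
        "action": "raw",
        "ctype": "text/javascript",
        "smaxage": smaxage,
    })
    return f"https://{host}/w/index.php?{query}"
```

This is why user scripts that load site JS via importScript-style helpers (which add smaxage) get cached, while hand-built `action=raw` URLs often do not.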
[18:12:16] (03CR) 10coren: [C: 032] Tool Labs: install doxygen{,-latex} [operations/puppet] - 10https://gerrit.wikimedia.org/r/97055 (owner: 10coren) [18:14:36] ottomata: can you look at https://gerrit.wikimedia.org/r/#/c/96796/ soon? Pretty please? With graphs on top? [18:35:08] bblack, hi, does this puppet look right to you? https://gerrit.wikimedia.org/r/#/c/97004/1/manifests/role/cache.pp [18:54:24] manybubbles: am off today! but i just skimmed that and it looks fine to me [18:54:41] ottomata: cool. don't merge it 'till you come back then:) [18:54:49] have a good day off [18:55:18] i can merge now if you like, looks simple enough [18:55:20] ja? [18:55:26] not going to cause an outage? :p :)( [18:55:48] don't tempt the demons [18:58:45] csteipp: extensions/Math [18:59:04] extensions/Math/math [18:59:47] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (205157) [19:01:34] <^d> The following wikis? [19:01:36] <^d> Which wikis? 
[19:02:19] wheeee [19:02:47] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [19:03:24] <^d> Meh, all better now [19:08:02] (03PS1) 10Faidon Liambotis: Add firewall to holmium/blog [operations/puppet] - 10https://gerrit.wikimedia.org/r/97071 [19:08:03] (03PS1) 10Faidon Liambotis: memcached: use nrpe when binding to localhost [operations/puppet] - 10https://gerrit.wikimedia.org/r/97072 [19:08:29] (03CR) 10Faidon Liambotis: [C: 032] Add firewall to holmium/blog [operations/puppet] - 10https://gerrit.wikimedia.org/r/97071 (owner: 10Faidon Liambotis) [19:09:31] (03PS2) 10Faidon Liambotis: memcached: use nrpe when binding to localhost [operations/puppet] - 10https://gerrit.wikimedia.org/r/97072 [19:10:04] I wonder how much time is wasted overall waiting for jenkins [19:10:25] (03CR) 10Faidon Liambotis: [C: 032] memcached: use nrpe when binding to localhost [operations/puppet] - 10https://gerrit.wikimedia.org/r/97072 (owner: 10Faidon Liambotis) [19:10:32] !log reedy started scap: Rebuild localisation cache to update ce [19:10:43] windows ce that is [19:10:48] Logged the message, Master [19:11:29] (03PS3) 10Faidon Liambotis: memcached: use nrpe when binding to localhost [operations/puppet] - 10https://gerrit.wikimedia.org/r/97072 [19:12:12] (03CR) 10Faidon Liambotis: [V: 032] Add firewall to holmium/blog [operations/puppet] - 10https://gerrit.wikimedia.org/r/97071 (owner: 10Faidon Liambotis) [19:12:22] grr [19:12:32] <^d> akosiaris: Thanks for merging those icinga checks for gerrit/gitblit by the way. That's been on my back burner for waayyyyy too long. 
[19:12:49] (03CR) 10Faidon Liambotis: [C: 032 V: 032] memcached: use nrpe when binding to localhost [operations/puppet] - 10https://gerrit.wikimedia.org/r/97072 (owner: 10Faidon Liambotis) [19:13:52] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (214954) [19:27:12] !log reedy finished scap: Rebuild localisation cache to update ce [19:27:26] ori-l: scap completed in 20m 15s. [19:27:27] Logged the message, Master [19:27:38] For a message cache update of one language in 2 versions... [19:28:06] huh, longer than yesterday... [19:28:24] so we are not doing any updates today, right? :) [19:28:27] greg-g, ? [19:28:42] Just tidying up some outstanding bugs [19:29:05] Reedy: i suspect scap can be choked by one or two unresponsive hosts [19:29:10] * aude can login :) [19:29:15] can we rebuild the cache in 100500 threads? [19:29:42] i soo wanted to make paravoid happy with the proxies-only list of ips :( https://gerrit.wikimedia.org/r/#/c/96686/ [19:29:45] yurik: like any day, high need bug fixes can happen, but not new features [19:29:56] yurik: letting people login sounds like a high need to me :) [19:30:11] nah, that's why we allow anonymous editing :) [19:30:14] would be awesome if just the cache for the one lang could be updated individually [19:30:26] Mmm [19:30:37] * aude knows they are stored in separate cdbs [19:30:38] If there's no changes it shouldn't be updated... 
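aude's wish above, rebuilding the cache for one language only, is plausible in principle because the localisation cache keeps a separate CDB file per language. A toy sketch of the "skip if unchanged" idea Reedy alludes to; the dicts stand in for per-language cache contents, and none of this is MediaWiki's actual rebuild code:

```python
def languages_needing_rebuild(new_messages, cached):
    """Return only the languages whose message blobs changed, so a
    rebuild could in principle skip the rest. Keys are language
    codes; values stand in for per-language CDB file contents."""
    return [
        lang for lang, messages in new_messages.items()
        if cached.get(lang) != messages
    ]
```

In practice the full scap rebuild touched every language that had any change, which is why updating one language's aliases still cost a 20-minute run.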
[19:30:45] hmmmmm [19:30:56] I lost it in the scrollback and then closed the window [19:31:02] it tells you number of messages updated [19:31:07] * Reedy looks at timestamps [19:31:09] * aude went to eat and do other stuff, in all the time :) [19:31:51] they all look to have had some change [19:32:02] oh, ok [19:32:23] No big deal [19:34:03] (03PS1) 10Ori.livneh: Revert "Add mobile views to ganglia" [operations/puppet] - 10https://gerrit.wikimedia.org/r/97076 [19:34:35] (03CR) 10jenkins-bot: [V: 04-1] Revert "Add mobile views to ganglia" [operations/puppet] - 10https://gerrit.wikimedia.org/r/97076 (owner: 10Ori.livneh) [19:34:54] PROBLEM - SSH on aluminium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:34:54] RECOVERY - Memcached on holmium is OK: TCP OK - 0.000 second response time on port 11000 [19:36:46] paravoid: :) [19:36:50] saw that [19:36:52] hm [19:36:56] ssh failing [19:37:10] oh. i meant memcached recovery [19:37:14] yeah I know [19:37:22] (03PS1) 10Hashar: beta: $wgMWOAuthCentralWiki = 'labswiki' [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97079 [19:37:48] what's wrong with aluminium? [19:38:47] console shows login but looks overloaded,, [19:38:54] PROBLEM - Exim SMTP on aluminium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:38:59] (03CR) 10Anomie: [C: 031] "Seems sane. Haven't tested." 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97079 (owner: 10Hashar) [19:39:10] (03PS2) 10Ori.livneh: Revert "Add mobile views to ganglia" [operations/puppet] - 10https://gerrit.wikimedia.org/r/97076 [19:39:13] (03CR) 10Hashar: [C: 032] beta: $wgMWOAuthCentralWiki = 'labswiki' [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97079 (owner: 10Hashar) [19:39:26] cant login on aluminium [19:39:35] (03Merged) 10jenkins-bot: beta: $wgMWOAuthCentralWiki = 'labswiki' [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97079 (owner: 10Hashar) [19:39:48] can now..it's reallly slow [19:39:55] waits for a prompt [19:40:13] (03CR) 10Ori.livneh: [C: 032] Revert "Add mobile views to ganglia" [operations/puppet] - 10https://gerrit.wikimedia.org/r/97076 (owner: 10Ori.livneh) [19:40:27] Jeff_Green: ^^, aluminium, fyi [19:40:48] yeah I'm seeing that too. not sure yet what's up [19:40:52] yes, please role::fundraising::civicrm [19:41:21] mutante: ^^^ huh? [19:41:45] i just got to see the Last login line but i don't get a prompt now [19:42:09] oic [19:42:15] Jeff_Green: just wanted to ping you the same moment, confirming it affects civicrm [19:42:20] surely some user's process has gone awry [19:42:29] mutante: k. appreciated [19:42:43] * mwalker is not guilty of having 9 screen sessions open on al [19:42:53] * mwalker looks innocent [19:42:54] RECOVERY - Exim SMTP on aluminium is OK: SMTP OK - 9.264 sec. response time [19:43:08] mwalker: can one of those screen sessions access 'top' perchance? [19:43:14] RECOVERY - SSH on aluminium is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [19:43:32] it would if I was able to login -- I mostly keep them open for work in progress [19:43:43] I'm not connected right now [19:43:50] mwalker: sadness [19:43:55] ok, what do you want me to do [19:44:14] mutante: you're connected? 
[19:44:15] root@aluminium:~# [19:44:27] you just need a LOT of patience to login [19:44:33] ha load average: 27.35, 44.76, 28.25 [19:46:02] user "sahar" ? [19:46:11] is doing a ton of -bash [19:46:24] the output of ps scrolls slowly down my screen .. :P [19:46:54] PROBLEM - Exim SMTP on aluminium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:47:07] fork bomb [19:47:14] PROBLEM - SSH on aluminium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:47:21] :(){ :|:& };: [19:47:24] tries to kill sahar processes [19:48:00] do you know that user? [19:48:13] he's an analyst w/FR, IIRC [19:48:14] PROBLEM - HTTP on aluminium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:48:34] sahar says he's not doing anything right now [19:49:12] mwalker: where did he say that? [19:49:19] I have him in gchat [19:49:29] http://paste.debian.net/67235/ [19:49:44] RECOVERY - Exim SMTP on aluminium is OK: SMTP OK - 0.093 sec. response time [19:50:00] mutante: forkbomb? [19:50:14] RECOVERY - HTTP on aluminium is OK: HTTP OK: HTTP/1.1 302 Found - 557 bytes in 7.277 second response time [19:50:37] "Sahar M: I ran a script - the same script I run on my own machine. That was about 10 minutes ago, it ran super-slow, so I command-C'd it, which killed it" [19:51:36] after all this time I still do not have a more than 12% functional shell on al [19:51:37] Jeff_Green: i'm sending sig STOP [19:51:42] mutante: k [19:52:14] any other suggestions how to kill it? [19:52:17] hmm [19:52:19] I'm going to drop apache [19:52:35] since it's just barfing out of memory errors anyway [19:53:02] i think i was successful now [19:53:04] RECOVERY - SSH on aluminium is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [19:53:09] making it -9 [19:53:24] root 29299 0.0 0.0 0 0 ? Z 19:53 0:00 [tar] [19:53:25] i killed a bunch of parallel tar procs [19:53:27] root 29301 0.0 0.0 0 0 ? 
Z 19:53 0:00 [tar] [19:53:32] ah,heh [19:53:35] ok [19:53:47] it's coming back to earth [19:53:47] well sahar procs gone now [19:53:50] yep [19:54:08] restarting services [19:54:19] jenkins on top of top now [19:54:29] i restarted it [19:54:45] Hello everyone [19:54:49] !log killed all processes by user sahar on aluminium [19:54:52] hi sahar [19:55:00] bad bad bad man [19:55:04] PROBLEM - HTTP on aluminium is CRITICAL: Connection refused [19:55:05] Logged the message, Master [19:55:09] I heard I destroyed everythign [19:55:30] sahar1: there are a ton of bash and tar procs running as your user [19:55:42] sahar1: we killed this http://paste.debian.net/67235/ [19:55:44] I definitely didn't write anything with tar [19:56:06] Maybe I called something that uses tar at a low-level? [19:56:27] * Jeff_Green double-checking who was running the tar procs [19:56:39] ooh I think I know the problem. [19:56:43] though not the tarring [19:56:55] sorry, I was wrong about the tar stuff--that was root. not sure what that is yet, maybe log rotation [19:56:57] !log starting apache on alumunium [19:57:03] Jeff_Green: fwiw, Jenkins runs tar like a mofo [19:57:04] RECOVERY - HTTP on aluminium is OK: HTTP OK: HTTP/1.1 302 Found - 557 bytes in 0.042 second response time [19:57:07] Yes okay [19:57:13] Logged the message, Master [19:57:14] I see the problem. It was a stupid stupid mistake [19:57:29] awight: to decompress logs? [19:57:49] In the script I was running, I had the name of the script in the first line, instead of saved as just the filename. [19:57:54] Jeff_Green: tar czf /archive/aluminium/jenkins_builds/Donations_queue_consume-2013-11-22_18-49-56.tgz [19:58:03] Totally my fault [19:58:19] awight: nod. 
I think they were backed up because of the RAM starvation [19:59:01] sahar1: ok, glad you found it [20:00:12] (03PS2) 10Ori.livneh: Schema:Echo has outdated revision id [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96901 (owner: 10Bsitu) [20:00:41] IMO we should restart aluminium [20:00:41] I'm sorry for this, everyone. [20:01:01] Jeff_Green: +1 [20:01:05] Jeff_Green: I'll let everybody know. When do you want to do it? Like, now? [20:01:15] yeah. I'm going to do an apt update first [20:01:21] :) [20:01:27] might as well get that done too [20:01:50] you might just wanna check if there are remants now of whatever that script wrote because it was all killed in the process [20:02:03] yep [20:02:04] (03PS1) 10Anomie: Add packages needed for Collection OCG to contint::packages::labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/97082 [20:03:31] dropped apache and jenkins [20:04:09] (03CR) 10Hashar: [C: 031] "indeed needed" [operations/puppet] - 10https://gerrit.wikimedia.org/r/97082 (owner: 10Anomie) [20:04:38] anyone could merge in https://gerrit.wikimedia.org/r/#/c/97082/ please ? that is for a labs instance that runs the PDF sprint tests :-] [20:04:43] !log dist-upgrade aluminium and reboot [20:04:48] (which relies on inkscape) [20:04:51] hashar: looking [20:04:55] Logged the message, Master [20:05:44] (03CR) 10Jgreen: [C: 031 V: 031] Add packages needed for Collection OCG to contint::packages::labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/97082 (owner: 10Anomie) [20:06:02] PROBLEM - HTTP on aluminium is CRITICAL: Connection refused [20:06:07] (03CR) 10Jgreen: [C: 032] Add packages needed for Collection OCG to contint::packages::labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/97082 (owner: 10Anomie) [20:06:58] hashar: merged [20:08:25] Jeff_Green: thank you! 
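The recovery sequence above follows the standard pattern for a runaway process tree: SIGSTOP everything first (a stopped process cannot fork replacements), then SIGKILL. A minimal sketch of that two-phase kill; the actual cleanup on aluminium was done with kill/pkill against the user's processes, so this is just the same idea in Python:

```python
import os
import signal

def freeze_then_kill(pids):
    """Two-phase kill: SIGSTOP freezes the whole set so a fork bomb
    cannot respawn children mid-cleanup, then SIGKILL (kill -9)
    reaps them -- mirroring 'sending sig STOP' ... 'making it -9'."""
    for sig in (signal.SIGSTOP, signal.SIGKILL):
        for pid in pids:
            try:
                os.kill(pid, sig)
            except ProcessLookupError:
                pass  # process already exited
```

Killing in one pass with SIGKILL alone can lose the race against a fork bomb, since children spawned between kills keep the tree alive; the stop phase closes that window.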
[20:08:30] np [20:09:29] sees aluminium being back up [20:09:34] (03CR) 10DarTar: [C: 031] Schema:Echo has outdated revision id [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96901 (owner: 10Bsitu) [20:11:31] anomie: I hate puppet, got a duplicate package definition :D [20:11:45] mutante: yep. looks normalish again [20:13:17] Jeff_Green: :) [20:13:40] yay for package upgrades then [20:14:08] silver lining [20:14:45] (03PS1) 10Hashar: contint: librsvg2-bin duplicate definition [operations/puppet] - 10https://gerrit.wikimedia.org/r/97085 [20:14:56] hashar: puppet can't handle being told twice that the same package should be installed? ugh? [20:15:14] Jeff_Green: another one following up https://gerrit.wikimedia.org/r/97085 (duplicate package definition between contint class ) [20:15:21] anomie: yup [20:15:30] anomie: each declaration is immutable or something like that [20:15:49] anomie: you could have package { 'librsvg2-bin' : ensure => present }  and another one saying ensure => absent [20:16:07] I think puppet doesn't care about the parameters (ensure => present) but ensure the uniqueness based on the name [20:16:11] aka 'librsvg-2-bin' [20:16:14] so it bails out [20:16:16] hashar: Well, complaining about differing ensure or the like would make sense [20:16:18] Import double-acting baking powder to Germany if you want to bake cornbread and such. [20:16:22] wrong paste :p [20:16:33] Puppet does not allow you to declare the same resource twice. This is to prevent multiple conflicting values from being declared for the same attribute. [20:16:36] Puppet uses the title and name/namevar to identify duplicate resources — if either of these is duplicated within a given resource type, the compilation will fail. [20:16:46] hashar: tick tock tick tock jenkins where are you? 
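The failure hashar keeps hitting can be modeled in a few lines: per the docs quoted above, Puppet keys every resource by (type, title) and rejects a second declaration even when the parameters are identical. A toy model of that compile-time check (not Puppet's real compiler):

```python
class Catalog:
    """Toy model of Puppet's compile-time uniqueness check:
    resources are keyed by (type, title), and a second declaration
    fails compilation even if the parameters agree."""
    def __init__(self):
        self.resources = {}

    def declare(self, rtype, title, **params):
        key = (rtype, title)
        if key in self.resources:
            raise ValueError(f"Duplicate declaration: {rtype}[{title}]")
        self.resources[key] = params
```

This is why the docs steer you toward wrapping shared packages in a class (or a virtual resource) that multiple classes can include, rather than declaring `package { 'librsvg2-bin': }` in two places.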
[20:17:03] (03CR) 10Jgreen: [C: 032 V: 031] contint: librsvg2-bin duplicate definition [operations/puppet] - 10https://gerrit.wikimedia.org/r/97085 (owner: 10Hashar) [20:17:06] there we go [20:17:47] thank you! [20:17:49] anomie: hashar their answer is "If multiple classes require the same resource, you can use a class or a virtual resource to add it to the catalog in multiple places without duplicating it." [20:18:14] yeah generic::package::librsvg2-bin [20:18:15] :D [20:18:20] my god [20:18:26] aren't generic::packages being hated too :P [20:18:30] I wish I was able to write tests for that. got another one now [20:18:43] anyone looking for something to do ? because FPL has another link flapping and we really need to get them to fix it.... [20:18:55] and i don't feel like spending phone time ;) [20:19:00] but can if necessary [20:19:43] what are you proposing? someone wander over to their offices and stare menacingly at them until they do something? [20:19:56] well someone be on the phone calling [20:20:00] which is annoying but works [20:20:22] hey! [20:20:24] no outage today? [20:20:29] how come? [20:20:30] none yet [20:20:36] (03PS1) 10Hashar: contint: imagemagick duplicate definition [operations/puppet] - 10https://gerrit.wikimedia.org/r/97089 [20:20:50] Jeff_Green: removed everything so https://gerrit.wikimedia.org/r/97089 :D sorry [20:21:07] argh [20:21:13] hashar: puppet is your enemy today [20:21:23] unzip is already installed got to remove it [20:21:25] paravoid: well.. aluminium [20:21:40] nah, doesn't count [20:23:39] peer: why do you reset my connection please? [20:23:42] k. 
must be because it's Friday [20:23:51] yeah no deploy [20:24:05] I gave a conf early this morning presenting to some devs how we handle deploy [20:24:05] (03PS2) 10Hashar: contint: unzip/imagemagick duplicate definition [operations/puppet] - 10https://gerrit.wikimedia.org/r/97089 [20:24:24] Jeff_Green: https://gerrit.wikimedia.org/r/97089 ::] last I promise [20:24:43] basically concluded with the mw release cycles and the no deploy on friday policy [20:24:44] hashar: go punch jenkins for me [20:24:45] Is there a way to ask puppet if it's explicitly installing a package, versus it being pulled in as a dependency of something else? [20:25:02] someone asked why: my reply? Do you want to waste your friday night ? [20:25:15] LeslieCarr: wooah, they have 14 maintenance announce tickets open!? [20:25:28] anomie: it always explicitly installs a package, but apt-get will pull the dependencies [20:25:40] (03CR) 10Jgreen: [C: 032 V: 031] contint: unzip/imagemagick duplicate definition [operations/puppet] - 10https://gerrit.wikimedia.org/r/97089 (owner: 10Hashar) [20:25:47] anomie: in this case unzip was being installed by some openstack::common class which is applied on all labs instances :( [20:26:00] LeslieCarr: i'm about to call :p [20:26:04] oh thanks [20:26:06] Jeff_Green: you are the chef^Wpuppet [20:26:13] they're flapping again currently [20:26:14] garg. 
$jenkins-bot-- [20:27:01] yeah it is slow [20:27:02] LeslieCarr: no, first i'm going through the tickets to see if they announced this [20:27:06] hashar: Exactly my question, any way to tell if puppet explicitly installed it or if apt-get pulled it in as a dependency of something else [20:27:29] Jeff_Green: I failed the upgrade of Zuul earlier this week, so we are stuck with this slowness for a few more weeks/months unfortunately :-( [20:27:30] need to extract the scheduled times [20:27:38] (03CR) 10Jgreen: [V: 032] contint: unzip/imagemagick duplicate definition [operations/puppet] - 10https://gerrit.wikimedia.org/r/97089 (owner: 10Hashar) [20:27:44] hashar: :-( [20:27:59] anomie: yeah you can query the apt repository to find out whether the package got installed explicitly (aka manually) or via a dependency [20:28:18] anomie: can't remember the exact command though [20:28:20] hashar: What if puppet installed it at some point in the past, then the config was changed so puppet doesn't care anymore? [20:28:31] anomie: then the package is still around [20:28:42] anomie: you have to explicitly purge it using package { 'foobar' : ensure => absent } [20:28:52] LeslieCarr: it's scheduled it seems Scheduled: 22-Nov-2013 03:00:00 GMT to 22-Nov-2013 11:00:00 GMT [20:28:59] anomie: or purge it manually with dpkg --purge foobar [20:29:04] hashar: Exactly. And will show as installed explicitly in apt. But puppet no longer cares and it could theoretically be removed. [20:29:10] Vendor will implement scheduled maintenance to reroute aerial to [20:29:11] underground cable. [20:29:27] in RT 6283 [20:29:27] anomie: yup but that never happens :D [20:30:07] anomie: packages updated on integration-slave01 \O/ [20:30:13] hashar: It theoretically being removed? Yeah. 
Then someone provisions a new box with the puppet definition and finds the package missing ;) [20:30:23] anomie: indeed :-] [20:31:27] hashar: Hence my question as to whether we can ask puppet if it's caring, so we can know if we need to make it care or not when introducing a new dependency without playing the "merge, oops, merge, oops" game [20:31:52] PROBLEM - SSH on amslvs1 is CRITICAL: Server answer: [20:32:52] RECOVERY - SSH on amslvs1 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [20:32:56] anomie: I am not sure how one can do it, probably by testing out the patch on an instance [20:35:05] but that's over [20:35:18] it's like 1900 utc [20:41:00] LeslieCarr: right, that's why i called them. just hung up, we have a ticket number [20:41:05] and expecting to be called back [20:41:13] putting it on 6283 [20:41:54] (03PS9) 10Mwalker: Initial Puppet Try for OCG::Collection Role [operations/puppet] - 10https://gerrit.wikimedia.org/r/96811 [20:42:57] mwalker: wipe the trailing whitespaces in that change :D [20:43:30] ugh; I need to set up my editor for pp files [20:45:06] mwalker: are you using vim ? [20:45:17] sometimes [20:46:08] i got some configuration for vim that play nice with puppet manifests [20:46:33] https://github.com/rodjek/vim-puppet https://github.com/scrooloose/syntastic [20:46:58] autocmd Syntax puppet set foldmethod=indent [20:47:02] autocmd BufRead */projects/operations/puppet/modules* set sw=4 ts=4 et [20:47:31] cool thanks [20:47:33] the first plugin add nice support for puppet (syntax color ..) [20:47:40] mutante: let's hope they actually check it out and call back! 
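Circling back to the apt question above: the command hashar couldn't remember is most likely `apt-mark` — `apt-mark showmanual` lists explicitly installed packages and `apt-mark showauto` lists dependency pulls. What apt cannot tell you is whether *puppet* still cares, which is anomie's real problem. A sketch of classifying a package from those two lists; in real use the sets would come from parsing apt-mark output, and the package names are just the ones from the discussion:

```python
def install_reason(pkg, manual, auto):
    """Classify a package the way `apt-mark showmanual` and
    `apt-mark showauto` would: explicitly installed, pulled in as a
    dependency, or absent."""
    if pkg in manual:
        return "manual"
    if pkg in auto:
        return "auto"
    return "not installed"
```

Since both puppet-driven and hand-run `apt-get install` end up in the "manual" set, this still can't distinguish "puppet wants it" from "someone installed it once", short of testing the patch on an instance as hashar suggests.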
[20:47:42] the latter, "syntastic", is a must-have [20:49:02] LeslieCarr: re-using the maint-announce ticket, just moved it to core-ops for visibility [20:49:07] cool [20:49:08] thanks [20:50:15] (03CR) 10Jgreen: [C: 032 V: 031] Initial Puppet Try for OCG::Collection Role [operations/puppet] - 10https://gerrit.wikimedia.org/r/96811 (owner: 10Mwalker) [21:09:40] !log mediawiki 1.22.0rc3 is out http://dumps.wikimedia.org/mediawiki/1.22/ [21:09:56] Logged the message, Master [21:18:38] (03PS1) 10Yurik: Created mobile portal m.wikipedia.org (will be used for redirects) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97107 [21:18:58] (03CR) 10jenkins-bot: [V: 04-1] Created mobile portal m.wikipedia.org (will be used for redirects) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97107 (owner: 10Yurik) [21:24:50] mutante: :) [21:24:55] (re !log) [21:28:46] Reedy, did i do something wrong with the links in https://gerrit.wikimedia.org/r/#/c/97107/ ? [21:29:06] jenkins is complaining, but it's identical to how other vroots seem to be set up [21:29:29] 21:18:56 php -l docroot/mobileportal/w/mobileredirect.php [21:29:29] 21:18:56 Could not open input file: docroot/mobileportal/w/mobileredirect.php [21:29:44] /usr/local/apache/common/w/mobileredirect.php [21:29:51] You didn't put it in the w folder AFAIK [21:30:18] Reedy, but is that folder manually controlled? [21:30:22] reedy@tin:/a/common$ ls -al mobilelanding.php [21:30:22] -rw-rw-r-- 1 reedy wikidev 839 Nov 21 23:01 mobilelanding.php [21:30:39] symlinks gotta point to a file [21:30:43] It's manual [21:31:11] so I should just go to tin and create that link? [21:31:27] without dirsyncing? [21:32:57] No?
[21:33:03] Fix it and add a new changeset [21:34:08] (03CR) 10Ori.livneh: [C: 032] Schema:Echo has outdated revision id [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96901 (owner: 10Bsitu) [21:34:18] Reedy, don't understand: yurik@tin:/usr/local/apache/common/w$ ls -al extract2.php [21:34:20] lrwxrwxrwx 1 mwdeploy mwdeploy 15 Mar 26 2013 extract2.php -> ../extract2.php [21:34:24] !log ori updated /a/common to {{Gerrit|I615e2b210}}: Schema:Echo has outdated revision id [21:34:29] that was done by some deployment script [21:34:33] it seems [21:34:37] What was? [21:34:39] Logged the message, Master [21:34:44] the link to extract2 [21:34:54] No it wasn't [21:34:59] It's just now owned by it [21:35:10] When we tidied up and moved servers [21:35:40] !log ori synchronized wmf-config/CommonSettings.php 'I615e2b210: correct schema ID of Schema:Echo' [21:35:55] Logged the message, Master [21:36:00] thanks ori-l [21:39:04] Reedy, i can't - don't have permissions for /usr/local/apache/common/w [21:39:06] ori-l: using a 10 sec $wgMaxShellWallClockTime, it definitely gets killed in time [21:39:21] but: [21:39:26] after the memory allocation error, then strace shows lots of crap getting read out [21:39:37] Reedy, this fails: ln -s ../mobilelanding.php mobilelanding.php [21:40:08] from timeout(1): "If no signal is specified, send the TERM signal upon timeout. The TERM signal kills any process that does not block or catch that signal. For other processes, it may be necessary to use the KILL (9) signal, since this signal cannot be caught." 
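The timeout(1) excerpt above in action: the default SIGTERM is catchable, so a wrapper like MediaWiki's limit.sh may need to escalate to SIGKILL for a stuck `identify`. GNU timeout makes the two cases distinguishable by exit status (124 for a TERM-enforced deadline, 128+9=137 after a KILL); `sleep 5` here just stands in for the long-running child:

```shell
#!/usr/bin/env bash
# Default: timeout sends SIGTERM at the deadline; a cooperating child dies
# and timeout itself exits 124.
timeout 1 sleep 5
term_status=$?

# Escalated: SIGKILL cannot be caught or blocked, so even a child that
# ignores TERM dies; the reported status is 128 + 9 = 137.
timeout -s KILL 1 sleep 5
kill_status=$?

echo "TERM: $term_status, KILL: $kill_status"
```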
[21:41:37] but a while after the php script finishes I get '/var/www/DevWiki/core/includes/limit.sh: line 37: echo: write error: Broken pipe [21:41:38] limit.sh: timed out executing command "'/usr/bin/identify' -format "[BEGIN]page=%p\nalpha=%A\nalpha2=%r\nheight=%h\nwidth=%w\ndepth=%z[END]" '/home/aaron/Downloads/import//largetif.tif' 2>&1"' dumped into the terminal [21:49:18] hrm [21:55:19] (03PS2) 10Ori.livneh: rewrite nginx module [operations/puppet] - 10https://gerrit.wikimedia.org/r/96961 [21:58:24] (03PS1) 10Yurik: for m.wikipedia.org [operations/apache-config] - 10https://gerrit.wikimedia.org/r/97115 [21:59:31] (03PS2) 10Yurik: for m.wikipedia.org [operations/apache-config] - 10https://gerrit.wikimedia.org/r/97115 [22:01:38] (03CR) 10Ori.livneh: "For PS2, I made Puppet manage /etc/nginx/conf.d as well. I'd prefer to have roles provision conf.d files rather than clobber the base ngin" [operations/puppet] - 10https://gerrit.wikimedia.org/r/96961 (owner: 10Ori.livneh) [22:08:46] ori-l: php is also using 99% CPU waiting on identify [22:08:51] that seems kind of ... wrong [22:11:49] huh, now it's not this run [22:13:02] (03PS2) 10Yurik: Created mobile portal m.wikipedia.org and zero.wikipedia.org [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97107 [22:13:09] (03CR) 10jenkins-bot: [V: 04-1] Created mobile portal m.wikipedia.org and zero.wikipedia.org [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97107 (owner: 10Yurik) [22:14:27] mutante: did fpl call back at all ? [22:15:51] LeslieCarr: nope [22:15:59] (03PS3) 10Yurik: for m.wikipedia.org and zero.wikipedia.org [operations/apache-config] - 10https://gerrit.wikimedia.org/r/97115 [22:16:20] ori-l: gah, nvm that was some other php proc [22:18:38] LeslieCarr: at least we have a ticket 4782828 [22:18:55] is it time to bug them again right away? 
[22:19:03] how much does this affect us currently [22:19:25] it's been going on for a week and i keep calling them and they keep saying "oh we've figured it out and it's fine now" [22:19:40] so it doesn't kill prod because ospf metrics have been changed [22:19:42] but it does piss me off [22:20:40] :( [22:22:19] i see.. annoying.. and i noticed how many maint-announce tickets they have, i mean .. https://rt.wikimedia.org/Search/Simple.html?q=FPL [22:24:36] (03PS1) 10Yurik: Removed X-DfltLang & X-DfltPage from zero VCLs [operations/puppet] - 10https://gerrit.wikimedia.org/r/97122 [22:26:27] (03PS4) 10Yurik: for m.wikipedia.org and zero.wikipedia.org [operations/apache-config] - 10https://gerrit.wikimedia.org/r/97115 [22:27:02] oh i just realized... [22:27:09] our problems started with the commencement of hunting season [22:27:31] back in $PREVIOUSJOB we noticed many hunters shooting at aerial fiber [22:27:32] paravoid, 3 patches to make you very happy :) [22:29:59] LeslieCarr: haha, you can only pick one root cause, the hunter or the beavers http://www.huffingtonpost.com/2013/06/29/beaver-internet-cellphone-outage_n_3521530.html [22:33:05] yurik: can you get some proper reviews for https://gerrit.wikimedia.org/r/#/c/96654/4/mobilelanding.php ? [22:33:15] why does it Vary on Cookie? [22:33:49] (03PS1) 10Jgreen: fix repo reference for ocg stuff [operations/puppet] - 10https://gerrit.wikimedia.org/r/97126 [22:35:26] or why does it send a content-type for a 302? [22:35:41] but no body as far as I see? [22:36:02] well, the content-type isn't wrong actually [22:36:19] paravoid, our security guru csteipp reviewed it a bit. The reason for varying - I copied all the regular vary headers that we currently generate, and trimmed those that were obviously not needed.
I plan to include a bit more elaborate logic as far as where it should go depending on user's preferences [22:36:35] content-type - yes, i was looking all around the net for that one, [22:36:39] seems like it was a good practice [22:36:47] you know that varying on cookie makes it uncacheable, right? [22:36:57] only for logged in users [22:37:02] which is how we currently handle it :( [22:37:28] ok, i can remove cookie until we decide to implement something with it [22:37:30] not exactly [22:37:43] we have some special VCL logic to filter out unknown cookie values [22:38:05] so this wouldn't have this effect [22:38:06] but you don't filter out user ID stuff, right? [22:38:11] but it's wrong nevertheless [22:38:26] you probably do have to Vary on X-F-Proto though [22:38:33] is your redirect being generated with X-F-P in mind? [22:38:37] (it should) [22:38:55] not yet because X-CS is never set, but it should be [22:39:07] once we go https [22:39:13] X-F-Proto has nothing to do with X-CS [22:39:27] it does - X-CS is only set for non https [22:39:42] so? [22:39:52] this is the landing page for non-zero requests too, isn't it? [22:40:04] (03CR) 10Jgreen: [C: 032 V: 031] fix repo reference for ocg stuff [operations/puppet] - 10https://gerrit.wikimedia.org/r/97126 (owner: 10Jgreen) [22:41:20] if I go to https://m.wikipedia.org/ from my desktop or mobile phone from here, will it redirect me to https://en.m or to http://en.m? [22:42:45] also, I remember you telling me how difficult of a problem this is -- it's 19 lines of PHP, plus another 20 of copy/pasted apache config, isn't it? :-) [22:46:36] paravoid, you know that joke about a retired engineer who charged his old company $20,000 to fix some old factory machine which wasn't working - although all he did was hit the machine with a sledge hammer. The invoice details were $5 - hitting the machine, $19,995 - knowing where to hit it [22:46:54] ...
[22:47:02] I told you exactly what was needed to be done and where [22:47:17] I had a hangout with dr0ptp4kt where we were opening the source together and I was pointing out the lines [22:47:20] paravoid, point being - i tried to implement it without redirects [22:47:29] and i failed [22:47:41] hence - the redirects approach, which is obviously easier :) [22:48:41] it's way worse from a client performance perspective but it's the same as now, so that's okay [22:48:48] we gotta start somewhere [22:49:02] if done right, that is [22:49:12] the missing X-F-P would make it a regression [22:49:43] adding right now [22:49:49] k, perfect [22:50:04] ori-l: https://gerrit.wikimedia.org/r/#/c/96961/2 [22:50:35] so, the donotify hack was there so that applying changes to the ssl infrastructure could be done in multiple stages [22:50:44] basically, apply the config, then do a rolling restart [22:50:48] (03PS1) 10Yurik: Mobile redirect - changed cache Vary header [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97130 [22:50:52] paravoid, ^ [22:51:07] I very badly don't want the services just restarting randomly [22:51:22] yurik: would the $redirect value be different depending on X-F-P? [22:51:36] paravoid, once IPs are launched - yes [22:51:36] yurik: i.e., if X-F-P is https, will the Location header start with https:// ? [22:51:39] (we've had a number of outages due to nginx configs being pushed out and nginx not handling reload properly) [22:51:43] forget zero [22:51:48] oh, that - no, right now it's // [22:51:56] we can change that on the zero side later [22:52:09] if we should explicitly set http or https [22:52:11] I don't think protocol independent redirects work [22:52:19] they do in modern browsers [22:52:29] but we should fix that on backend, agree [22:53:09] it shouldn't be in mobileredirect though [22:53:34] reference? [22:53:38] on the modern browsers?
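What paravoid is asking for, transliterated to shell as a hypothetical sketch (the real code is PHP in mobilelanding.php, and the function name and arguments here are illustrative only): derive the Location scheme from the X-Forwarded-Proto header the SSL terminators set, rather than emitting a protocol-relative `//` target that older clients mishandle:

```shell
#!/usr/bin/env bash
# Build an explicit-scheme redirect target from X-Forwarded-Proto.
# Anything other than "https" falls back to plain http.
redirect_target() {
    local xfp=$1 host=$2 path=$3
    local scheme=http
    [ "$xfp" = "https" ] && scheme=https
    echo "${scheme}://${host}${path}"
}

# e.g.: redirect_target "$HTTP_X_FORWARDED_PROTO" en.m.wikipedia.org /
```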
[22:54:01] also, our site supports a lot more than modern browsers, mobile even more so [22:54:04] i tried it in betalabs - chrome worked [22:54:15] yes yes, i said - i agree it should be fixed - will do it on the backend [22:54:28] okay [22:54:33] Ryan_Lane: hrm, what about not having Puppet manage the service at all, then? [22:54:35] before deploying this you mean? [22:54:49] paravoid, before you change VCL [22:54:51] okay [22:54:53] great [22:54:58] everything else can go out sooner [22:55:06] Ryan_Lane: if it's always a big deal because it's such critical infrastructure, maybe it should be done via salt [22:55:08] as its a noop without VCL :) [22:55:08] I'd be fine if you hacked it up on mobilelanding.php :) [22:55:09] (03CR) 10Ryan Lane: "The donotify hack is there because I didn't want all of the ssl servers randomly restarting nginx. nginx actually has an issue with reload" [operations/puppet] - 10https://gerrit.wikimedia.org/r/96961 (owner: 10Ori.livneh) [22:55:26] i would rather not - all logic should be in one spot [22:55:32] sure, agreed [22:55:34] ori-l: that's how I do it now [22:55:41] we might want to switch from https to http in some cases [22:55:55] I disable notify for ssl servers and do rolling restarts [22:56:08] (03PS1) 10Andrew Bogott: Fix readme so the instructions are less wrong. [operations/puppet] - 10https://gerrit.wikimedia.org/r/97131 [22:56:09] by depooling them, waiting, restarting, repooling [22:56:18] I was actually going to write a salt runner to do this [22:56:21] Ryan_Lane: ok, so yes, I agree, we need a parameter [22:56:32] I don't love 'donotify' though, maybe 'managed' on the nginx class? 
[22:56:34] paravoid, but feel free to start putting it into production otherwise - this way we can already test it with wget against backend servers to see if it's doing the right thing [22:56:48] class { 'nginx': managed => false, } [22:56:55] before the final switchover with VCL patch [22:57:07] (03CR) 10Andrew Bogott: [C: 032] Fix readme so the instructions are less wrong. [operations/puppet] - 10https://gerrit.wikimedia.org/r/97131 (owner: 10Andrew Bogott) [22:57:20] that works for me [22:57:28] I don't care too much what the variable is :) [22:57:47] yurik: protocol-relative URLs for Location header are only becoming standard with HTTPbis [22:58:04] yurik: still in draft status, last draft expired last Sunday actually [22:58:15] good to know, thx [22:58:21] yurik: http://tools.ietf.org/html/draft-ietf-httpbis-p2-semantics-25#page-68 [22:58:27] RFC2616 doesn't allow them [22:58:40] oooo location header will allow protocol relative? :) [22:58:43] yup [22:58:46] \o/ [22:59:02] well, yurik already deployed code that uses them :P [22:59:06] hahaha [22:59:20] i'm ahead of the curve! [22:59:27] gtg [23:11:09] ori-l: odd, running the /usr/bin/identify command directly works, and even in shell_exec(), but not with wfShellExec() (which takes forever and times out, giving boilerplate metadata responses) [23:16:08] setting $wgMaxShellMemory fixes it, having it too low results in a different error: segfault, file size exceeded, Memory allocation failed `Cannot allocate memory' @ fatal/string.c/AcquireStringInfo/183 [23:16:19] * AaronSchulz wonders why it needs so much RAM [23:16:37] seems like it just blocks trying to allocate more ram at some spots in the C code but not others [23:18:11] http://www.imagemagick.org/script/command-line-options.php#list see '-debug' [23:18:32] who here is a ML admin on mediawiki-l?
[23:18:49] also: http://www.imagemagick.org/script/resources.php#environment [23:19:10] specifically: MAGICK_MEMORY_LIMIT, MAGICK_THROTTLE, and MAGICK_TIME_LIMIT [23:19:32] it would be a little unsatisfying to avail ourselves of an application-specific feature for enforcing the timeout [23:20:01] MAGICK_THROTTLE is 'Periodically yield the CPU for at least the time specified in milliseconds.', I wonder if this would allow the limits to be enforced [23:20:06] -debug makes it short-circuit on some random "identify: unrecognized event type `-format' @ error/identify.c/IdentifyImageCommand/467" error that always showed previously [23:20:55] gah [23:21:17] * AaronSchulz typed the cmd wrongly [23:21:36] the error that always shows is "identify: Incompatible type for "RichTIFFIPTC"; tag ignored. `TIFFFetchNormalTag' @ warning/tiff.c/TIFFWarnings/768." ... probably minor [23:21:48] (03PS1) 10Andrew Bogott: Include $hostname in backup filename so we know what we backed up. [operations/puppet] - 10https://gerrit.wikimedia.org/r/97136 [23:22:52] (03CR) 10Andrew Bogott: [C: 032] Include $hostname in backup filename so we know what we backed up. [operations/puppet] - 10https://gerrit.wikimedia.org/r/97136 (owner: 10Andrew Bogott) [23:26:04] (03PS1) 10Jgreen: add role::ocg::collection to rhodium for enhanced testing power [operations/puppet] - 10https://gerrit.wikimedia.org/r/97137 [23:27:44] (03CR) 10Jgreen: [C: 032 V: 031] add role::ocg::collection to rhodium for enhanced testing power [operations/puppet] - 10https://gerrit.wikimedia.org/r/97137 (owner: 10Jgreen) [23:28:59] ori-l: the timeout is likely enforced now correctly [23:29:24] AaronSchulz: hm?
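For reference, the environment knobs being discussed, with illustrative values (not production tuning): ImageMagick enforces these limits internally, which complements the ulimit-based limit.sh wrapper since the tool fails cleanly instead of thrashing or hanging:

```shell
#!/usr/bin/env bash
# ImageMagick reads resource caps from the environment at startup.
export MAGICK_MEMORY_LIMIT=256MiB   # heap ceiling; allocation past it errors out
export MAGICK_TIME_LIMIT=10         # max elapsed seconds before self-abort
export MAGICK_THROTTLE=20           # ms to periodically yield the CPU

# With these set, a pathological TIFF fails fast instead of hanging, e.g.:
#   identify -format '%wx%h' largetif.tif
echo "limits: mem=$MAGICK_MEMORY_LIMIT time=${MAGICK_TIME_LIMIT}s"
```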
[23:29:27] the tiff stuff shells out multiple times, so 50 apiece might get a 504 [23:30:06] but the ulimit -v for memory makes it fail either fast or by timeout depending on what line of code hit the limit [23:30:31] I wonder if the application limits would work better, though as you said, it would suck to rely on that [23:30:46] maybe we can just pass the $wg settings there as well [23:31:13] I think this would be worth doing, yeah [23:31:26] whether or not it counts as a full resolution of the bug is another matter [23:31:31] but I think it'd be worth doing. [23:31:50] the resulting metadata will still be boilerplate rubbish, which would still be a bug of its own [23:32:00] (e.g. 0x0 pixel size showed on the File: page) [23:38:32] actually I didn't have $wgTiffUseTiffinfo set like in prod, so those bugs are different from the prod ones ;)
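The "fail fast" behavior AaronSchulz describes can be reproduced with a plain subshell: `ulimit -v` caps the process's virtual address space (in KiB), so an oversized allocation fails outright rather than thrashing. This sketch uses python3 merely as a convenient allocator, and the figures are arbitrary:

```shell
#!/usr/bin/env bash
# Cap virtual memory at 256 MiB in a subshell (so the limit doesn't stick
# to the calling shell), then try to allocate 1 GiB: the allocation fails
# immediately instead of swapping the machine to death.
if ( ulimit -v 262144 && python3 -c 'x = bytearray(1024 * 1024 * 1024)' ) 2>/dev/null
then
    result="allocation succeeded"
else
    result="allocation failed"
fi
echo "$result"
```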