[00:00:01] what's wrong with forever? just point it at the main cluster once its own cluster is killed [00:00:20] (i.e. at the same IP as commons.wikimedia.org) [00:00:29] PROBLEM - Check status of defined EventLogging jobs on vanadium is CRITICAL: CRITICAL: Stopped EventLogging jobs: consumer/vanadium consumer/server-side-events-log consumer/mysql-db1047 consumer/client-side-events-log consumer/all-events-log multiplexer/all-events processor/server-side-events processor/client-side-events forwarder/8422 forwarder/8421 [00:00:37] jeremyb: I don't see a need for forever [00:00:49] gwicke: but do you see a downside? [00:01:06] sure: complexity, inefficiency etc [00:01:18] for no real gain [00:02:29] RECOVERY - Check status of defined EventLogging jobs on vanadium is OK: OK: All defined EventLogging jobs are runnning. [00:03:23] ok, well here's yet another point: if we go with the current hostname then any future redirection will be more complex than if we pick a more sensible hostname now. we can decide later (in a month or whatever) whether to actually do the redirection [00:05:45] jeremyb: afaik the current hostname can be mapped to whatever we want as well [00:07:29] PROBLEM - Check status of defined EventLogging jobs on vanadium is CRITICAL: CRITICAL: Stopped EventLogging jobs: consumer/vanadium [00:12:29] RECOVERY - Check status of defined EventLogging jobs on vanadium is OK: OK: All defined EventLogging jobs are runnning. [00:16:50] (03PS2) 10Dzahn: tabbing, quoting & aligning [operations/puppet] - 10https://gerrit.wikimedia.org/r/96354 [00:17:37] gwicke: could be AFAIK too. you were talking about complexity/efficiency. if it were redirected with current hostname it would stick out vs. other redirects/hostnames. for reference, my slightly out of date copy of the wikimedia.org dns zone has *no* CNAMEs at all for eqiad. so i guess no special-cased redirects for eqiad. 
[00:17:44] btw, was going to mention earlier: its somewhat mitigated by the fact that you say "alpha" at the top of the page. (actually, i would personally make that stand out more. the whole line is bold so the alpha doesn't stand out so much) OTOH, i don't really buy the we're in contact with everyone bit. you advertised it on a public list. i bet that list has over 200 subscribers. (wow, good guess, I checked and there's 225 subscribers) people ar [00:20:01] i wonder how gwicke ended up sending mail with 2 URLs both at mediawiki.org and one is HTTP and one is HTTPS. [00:21:24] https://www.mediawiki.org/wiki/Parsoid/Todo says no longer in use. http://parsoid-lb.eqiad.wikimedia.org/ says to report issues at :mw:Talk:Parsoid/Todo. [00:22:39] jeremyb: one of these days we'll get around to either remove that entry page or update it / make it prettier [00:24:46] ori-l: Let me know when it's safe to deploy, and I'm not going to get in your way [00:28:12] anyway, i guess parsoid.wikimedia.org is my default but maybe someone has a better idea for a name [00:28:22] !log ori synchronized php-1.23wmf3/extensions/MobileFrontend 'Updating MobileFrontend for I3efc1fa64' [00:28:37] Logged the message, Master [00:28:52] (03PS1) 10Dzahn: role and module structure for ishmael [operations/puppet] - 10https://gerrit.wikimedia.org/r/96403 [00:29:54] (03CR) 10Dzahn: "yea, really calling it role::ishmael" [operations/puppet] - 10https://gerrit.wikimedia.org/r/96403 (owner: 10Dzahn) [00:33:08] !log ori synchronized php-1.23wmf3/resources/startup.js 'touch' [00:33:19] Logged the message, Master [00:37:21] (03PS3) 10Dzahn: retab, quoting, linting of ishmael.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/96362 [00:37:22] (03PS2) 10Dzahn: role and module structure for ishmael [operations/puppet] - 10https://gerrit.wikimedia.org/r/96403 [00:38:46] fatal: Unable to read current working directory .. 
duh, yea, "git review" from within the directory i just created in this patch:) keep doing it [00:43:48] !log csteipp synchronized php-1.23wmf4/extensions/OAuth 'update OAuth to master for last blocker fix' [00:44:03] Logged the message, Master [01:26:28] (03CR) 10Dzahn: "works:" [operations/puppet] - 10https://gerrit.wikimedia.org/r/96267 (owner: 10Hashar) [01:27:38] (03CR) 10Dzahn: "also re: adding all networks to $INTERNAL rather than just 10.0.0.0/8, see an older attempt in https://gerrit.wikimedia.org/r/#/c/88755/ " [operations/puppet] - 10https://gerrit.wikimedia.org/r/96267 (owner: 10Hashar) [01:33:10] !log ori synchronized php-1.23wmf3/extensions/WikimediaEvents 'Update WikimediaEvents to I5b8cfe592' [01:33:24] Logged the message, Master [01:35:41] !log ori synchronized php-1.23wmf3/resources/startup.js 'touch' [01:35:55] Logged the message, Master [01:37:24] !log ori synchronized php-1.23wmf4/extensions/WikimediaEvents 'Update WikimediaEvents to I5b8cfe592' [01:37:37] Logged the message, Master [01:38:11] !log rebooting ms-be1001, i/o stuck kernel bug [01:38:26] Logged the message, Master [01:40:49] PROBLEM - Host ms-be1001 is DOWN: PING CRITICAL - Packet loss = 100% [01:41:40] RECOVERY - DPKG on ms-be1001 is OK: All packages OK [01:41:40] RECOVERY - swift-object-updater on ms-be1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [01:41:40] RECOVERY - swift-account-auditor on ms-be1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [01:41:40] RECOVERY - swift-account-reaper on ms-be1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [01:41:40] RECOVERY - swift-object-server on ms-be1001 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [01:41:40] RECOVERY - swift-account-replicator on ms-be1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator 
[01:41:40] RECOVERY - swift-container-server on ms-be1001 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [01:41:41] RECOVERY - swift-account-server on ms-be1001 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [01:41:41] RECOVERY - RAID on ms-be1001 is OK: OK: optimal, 14 logical, 14 physical [01:41:49] RECOVERY - swift-container-updater on ms-be1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [01:41:49] RECOVERY - swift-container-replicator on ms-be1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [01:41:49] RECOVERY - Host ms-be1001 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [01:41:59] RECOVERY - swift-object-auditor on ms-be1001 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [01:41:59] RECOVERY - swift-container-auditor on ms-be1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [01:42:10] RECOVERY - Disk space on ms-be1001 is OK: DISK OK [01:42:29] RECOVERY - puppet disabled on ms-be1001 is OK: OK [01:42:29] RECOVERY - swift-object-replicator on ms-be1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [01:45:11] (03PS1) 10Dzahn: move dsh to module [operations/puppet] - 10https://gerrit.wikimedia.org/r/96413 [02:02:19] (03CR) 10Faidon Liambotis: "I don't mind much where that file would be, as long as it'd be a separate database. 
But just to be clear: the separate database part & Tru" [operations/puppet] - 10https://gerrit.wikimedia.org/r/88261 (owner: 10Dr0ptp4kt) [02:15:35] !log LocalisationUpdate completed (1.23wmf4) at Wed Nov 20 02:15:34 UTC 2013 [02:15:51] Logged the message, Master [02:29:31] (03PS1) 10Dzahn: role classes for download servers [operations/puppet] - 10https://gerrit.wikimedia.org/r/96415 [02:31:27] !log LocalisationUpdate completed (1.23wmf3) at Wed Nov 20 02:31:26 UTC 2013 [02:31:42] Logged the message, Master [02:31:57] (03CR) 10Dzahn: "follow-up in https://gerrit.wikimedia.org/r/#/c/96415/" [operations/puppet] - 10https://gerrit.wikimedia.org/r/94408 (owner: 10Dzahn) [02:32:16] (03CR) 10Dzahn: "follow-up to https://gerrit.wikimedia.org/r/#/c/94408/" [operations/puppet] - 10https://gerrit.wikimedia.org/r/96415 (owner: 10Dzahn) [02:36:03] (03Abandoned) 10Dzahn: download server module and cleanup - WIP [operations/puppet] - 10https://gerrit.wikimedia.org/r/94408 (owner: 10Dzahn) [03:17:15] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed Nov 20 03:17:15 UTC 2013 [03:17:29] Logged the message, Master [03:42:33] (03CR) 10Amire80: [C: 04-1] "This will restore the RTL problem, which the HTML tag tried to fix. As I noted in the bug report, removing the parentheses from USA will r" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96392 (owner: 10Bsitu) [04:09:30] (03PS1) 10Dzahn: let bastion hosts have base::firewall [operations/puppet] - 10https://gerrit.wikimedia.org/r/96424 [04:48:41] PROBLEM - Disk space on mw1197 is CRITICAL: DISK CRITICAL - free space: /tmp 526 MB (2% inode=92%): [05:01:53] (03PS1) 10Springle: db74 to S6 pmtpa master. pull db50 for decom [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96428 [05:02:55] (03PS2) 10Springle: db74 to S6 pmtpa master. pull db50 for decom [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96428 [05:03:17] (03CR) 10Springle: [C: 032] db74 to S6 pmtpa master. 
pull db50 for decom [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96428 (owner: 10Springle)
[05:04:09] !log springle synchronized wmf-config/db-pmtpa.php
[05:04:22] Logged the message, Master
[05:20:18] PROBLEM - Disk space on mw1194 is CRITICAL: DISK CRITICAL - free space: /tmp 529 MB (2% inode=92%):
[05:21:57] PROBLEM - Disk space on mw1207 is CRITICAL: DISK CRITICAL - free space: /tmp 539 MB (3% inode=92%):
[05:47:56] I'll check out the disk space
[05:48:24] good lord, are there always that many .png files in /tmp, or is that new?
[05:48:49] "this many" == 11,817 on mw1194
[05:48:50] timeline?
[05:49:00] there were a bunch that were never cleaned up
[05:49:10] due to some bug
[05:49:45] I opened one at random, looks math-y
[05:50:29] there are timeline ones too
[05:51:39] 607 timeline-*, 9,748 non-timeline, just 32 hex chars
[05:52:56] yeah, definitely math.
[05:53:38] maths has been timing out a lot lately (apparently)
[05:55:32] One day we'll just use MathJax or something.
[05:55:33] !log /tmp on Apaches filling up with math .pngs; moving some of the oldest away as a stopgap
[05:55:49] Logged the message, Master
[05:55:50] ori-l: are you doing that on all boxes?
[05:56:07] eek
[05:56:13] nah, just the critical ones
[05:56:51] I suppose I should have !logged that; I'll amend later.
[05:59:03] this isn't a new problem; it just reached a watershed
[05:59:07] those 9fd4e54cf03dbeb69ac6273f60d24e1b.png type paths are not coming from TempFSFile
[05:59:16] * Aaron|home wonders
[05:59:26] the files go back to february
[06:04:51] what group of servers are these?
[06:04:59] I know we clear out on the scalers
[06:05:52] app servers. hm
[06:06:00] (03PS6) 10TTO: Clean up wgSiteName in InitialiseSettings [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/86418
[06:06:09] Aaron|home: are the timeline ones definitively safe to delete?
[06:09:05] the old ones yes
[06:09:16] same with math
[06:09:25] which are the sha1.png ones
[06:09:57] well, md5
[06:10:56] how old is old, older than a day let's say?
[06:11:36] for sure
[06:11:53] the files should live for the duration of web requests
[06:12:05] what are these: localcopy_6700be3f23ee-1.tif ?
[06:12:36] copies of original files
[06:12:49] they can be pruned likewise
[06:12:52] great
[06:13:17] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000
[06:13:17] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 200,000
[06:13:46] dsh "${MW_DSH_ARGS[@]}" -- " find /tmp -maxdepth 1 -iname '*.png' -mtime +200 -exec /bin/rm {} \; "
[06:14:03] {{done}}
[06:15:34] oh, older than a day
[06:15:44] i was conservative :P
[06:16:18] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (225121)
[06:16:18] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (224856)
[06:16:30] I mean really older than an hour is prolly fine
[06:16:42] what web request lasts an hour :-P
[06:17:02] don't make Elsie answer that
[06:17:23] a web request lasting an hour is a web request that needs to be shot
[06:20:13] maybe I'll do that (the tifs) on all the mw hosts
[06:22:03] yesh
[06:22:14] yeesh, actually, is what I meant
[06:23:09] the tifs are 300+ megs
[06:24:01] How large is /tmp?
[06:24:34] did them
[06:24:37] RECOVERY - Disk space on mw1197 is OK: DISK OK
[06:24:41] salt ftw
[06:24:57] RECOVERY - Disk space on mw1207 is OK: DISK OK
[06:25:08] salt 'mw*' cmd.run 'find /tmp -name \*tif -mmin +60 -exec rm {} \; ' from salt master
[06:25:17] 19G on random apache
[06:25:18] RECOVERY - Disk space on mw1194 is OK: DISK OK
[06:25:32] do we want the pngs too? I guess they are cruft
[06:26:07] well,
[06:26:13] should we set up a cron job?
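The ad-hoc `dsh`/`salt cmd.run` one-liners in this exchange can be folded into a single helper. A minimal sketch, illustrative only: the function name, the 60-minute default, and the pattern list are assumptions mirroring the cleanup discussed here, not any deployed script.

```shell
#!/bin/sh
# Illustrative sketch (not a deployed script): prune stale temp files
# matching the leak patterns discussed in the log (math/timeline pngs,
# localcopy tifs, map/err files), generalizing the one-liners above.
# Usage: prune_tmp <dir> <max-age-minutes>
prune_tmp() {
    dir=$1
    max_age_min=$2
    # -maxdepth 1 stays out of subdirectories such as lost+found;
    # each pattern matches one known producer of stale files.
    for pat in '*.png' '*.tif' '*.map' '*.err'; do
        find "$dir" -maxdepth 1 -type f -iname "$pat" \
            -mmin "+$max_age_min" -exec rm -f {} +
    done
}
```

Taking a directory argument instead of hard-coding `/tmp` also makes the helper safe to exercise in a sandbox before pointing it at a live host.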
[06:26:17] oh sure
[06:26:23] this can't be the first time in the history of Wikimedia that /tmp has filled up
[06:26:23] this is just 'clean em out for now'
[06:26:32] well the scalers like I say have a cron already
[06:26:50] maybe steal from that
[06:26:55] * ori-l looks
[06:27:09] pngs gone
[06:28:20] map and err also pruned on all mws
[06:28:21] manifests/imagescaler.pp:9: cron { removetmpfiles:
[06:29:35] what are all these mw-cache-1.22wmf8 etc in here
[06:29:53] it's where reedy hides 0-day warez
[06:29:57] we have them as far back as feb
[06:30:09] probably localization cache?
[06:30:21] they are plenty big when you add up 20 of them
[06:30:38] at 50m a pop...
[06:31:45] so I removed map, err, png, tif which seems to do us pretty well on these
[06:31:46] I guess the way to enforce some discipline on the usage of /tmp is to have a blanket policy of deleting anything that hasn't been modified in $DAYS
[06:31:53] 7 days seems pretty generous
[06:32:05] is there anything that *shouldn't* be deleted after seven days?
[06:32:33] well *cough* lost+found of course
[06:32:40] since we're going to be deleting directories
[06:33:33] anything still leaving floods of files should have bug reports
[06:36:15] blargh
[06:36:16] the tifs are the worst offenders for space, followed closely by the mw caches
[06:38:02] mw-caches are configuration caches, created in CommonSettings.php
[06:38:06] i'll file a bug for that
[06:38:55] is it bad UNIX manners to not clean up after yourself in /tmp?
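The blanket "delete anything not modified in $DAYS" policy floated here, with the lost+found exemption, could look roughly like the following. This is a sketch under those assumptions: `sweep_tmp` is a hypothetical name, and the scalers' actual cleanup is the `removetmpfiles` cron defined in manifests/imagescaler.pp, not this.

```shell
#!/bin/sh
# Sketch of the blanket policy discussed above: delete anything in a
# tmp directory (files *or* directories, e.g. the mw-cache-* dirs)
# untouched for N days, leaving lost+found alone. Illustrative only.
# Usage: sweep_tmp <dir> <days>
sweep_tmp() {
    dir=$1
    days=$2
    # -mindepth 1 protects the directory itself; -prune skips
    # lost+found entirely so it is neither descended nor removed.
    find "$dir" -mindepth 1 -maxdepth 1 \
        -name 'lost+found' -prune -o \
        -mtime "+$days" -exec rm -rf {} +
}
```

Wired up from a daily cron with `sweep_tmp /tmp 7`, this would enforce the 7-day suggestion from the conversation; anything that still floods /tmp after that would surface as a bug report rather than a full filesystem.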
[06:39:50] (03PS1) 10Tim Starling: Generate redirects.conf [operations/apache-config] - 10https://gerrit.wikimedia.org/r/96438
[06:42:38] yes, it is
[06:42:59] always
[06:43:27] otherwise you are relying on someone else writing a cron job, or regular reboots, neither of which is nice
[06:44:02] * Aaron|home likes how "naive" has the diacritic
[06:44:31] (03CR) 10Matanya: [C: 031] etherpad - tabbing, quoting & aligning [operations/puppet] - 10https://gerrit.wikimedia.org/r/96354 (owner: 10Dzahn)
[06:44:40] (03CR) 10Dr0ptp4kt: ""I don't mind much where that file would be, as long as it'd be a separate database. But just to be clear: the separate database part & Tr" [operations/puppet] - 10https://gerrit.wikimedia.org/r/88261 (owner: 10Dr0ptp4kt)
[06:45:12] heh
[06:46:13] I guess I should log the cleanup, woops
[06:47:01] !log removed on mw* hosts from /tmp all *png/map/err/tif older than an hour, as some tmpfs were full
[06:47:15] Logged the message, Master
[06:47:33] (03CR) 10Matanya: [C: 031] role classes for download servers [operations/puppet] - 10https://gerrit.wikimedia.org/r/96415 (owner: 10Dzahn)
[06:48:36] aw, Tim's patch is nifty
[06:48:52] what did we say the tifs come from? I mean, a local copy of an original file but what produces it?
[06:50:15] Aaron|home: ?
[06:50:18] includes/filebackend/SwiftFileBackend.php ?
[06:50:28] really?
[06:50:30] 1169: $tmpFile = TempFSFile::factory( 'localcopy_', $ext );
[06:50:39] blah
[06:50:58] you filing that or shall I?
[06:51:06] (03CR) 10Dr0ptp4kt: "The 50,000-60,000 cache hit/200 objects to which I refer are those under /wiki/ on mobile Wikipedia (mdot & zerodot for W0, mdot for non-W" [operations/puppet] - 10https://gerrit.wikimedia.org/r/88261 (owner: 10Dr0ptp4kt)
[06:51:41] go ahead if you're up for it
[06:51:44] doing
[06:51:47] thank you
[06:52:00] (03CR) 10Matanya: role and module structure for ishmael (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/96403 (owner: 10Dzahn)
[06:52:18] apergos: how many tiffs were there?
[06:52:27] ori-l: are you working on a logstash module?
[06:52:37] matanya: nope
[06:52:46] ugh, dunno now, cause I cleared all the old ones out
[06:52:47] you would have legions of adoring fans if you wrote one tho
[06:52:48] could be anything that calls getLocalReference()
[06:53:07] ori-l: are you sure no one is about to? :)
[06:53:15] I see two in there that are less than an hour though (on the one host I'm camped on)
[06:53:16] matanya: bd808 might; I'd ping him
[06:53:54] ori-l: do we have an elasticsearch or redis somewhere already installed?
[06:53:57] could be fatal errors too, which mean any tmp files get left around
[06:54:20] well it would be easy to look for fatals that match up with these two since they are recent;
[06:54:33] mw1208
[06:54:40] -rw-r--r-- 1 apache apache 403688452 Nov 20 05:39 localcopy_a711e74b7da8-1.tif
[06:54:41] -rw-r--r-- 1 apache apache 403688452 Nov 20 05:38 localcopy_acdd87adf505-1.tif
[06:54:44] if that helps
[06:55:42] matanya: plenty o' both
[06:55:49] elasticsearch powers cirrussearch
[06:56:26] so I can rely on them. that is god
[06:56:34] *good, less typing :)
[06:57:36] Aaron|home: mw1194 pre- "-mmin +60" purge: https://dpaste.de/jfCn/raw/
[06:57:43] ori-l: issue with protocol relative redirects that's annoying to fix? rewrite redirects system.
[06:57:57] seem pretty regularly produced, as I look at all the hosts we're getting new ones
[06:58:03] greg-g: heh
[06:58:57] https://bugzilla.wikimedia.org/show_bug.cgi?id=57282
[07:01:48] no fatal corresponding to mw1138 -rw-r--r-- 1 apache apache 403688452 Nov 20 05:57 /tmp/localcopy_72e3d84d3a36-1.tif
[07:02:40] nothing in exception log either
[07:03:01] exceptions wouldn't matter
[07:04:33] tried swift-backend log, nothing there, outa places to look
[07:04:38] can I leave it in your hands?
[07:05:21] sure, i'll poke
[07:05:41] thanks (gonna get to my dailies... downed hosts, broken puppet, etc)
[07:11:35] i converted one to jpg and scp'd it over so i can open it, just out of curiosity
[07:11:56] it looks like a hat made out of tortillas
[07:12:01] :-D
[07:13:01] high calorie fashion
[07:14:22] https://commons.wikimedia.org/wiki/File:Zentralbibliothek_Z%C3%BCrich_-_Heinrich_Bullingers_Westerhemd_-_000012135.jpg
[07:14:41] you have to admit my description was pretty accurate
[07:16:31] apergos: doing <<$be->getLocalReference( array( 'src' => $path ) );>> and exiting work fine...I have the file stat() and path dumped out
[07:16:41] when I leave eval.php, ls -l can't find any file
[07:17:13] I don't think it's SwiftFileBackend
[07:17:21] hrm
[07:17:27] maybe it's some circular references with TempFSFile objects
[07:17:41] ok, maybe I jumped the gun in pointing the finger
[07:19:37] apergos: we have zend.enable_gc on right?
[07:19:51] I have no idea
[07:20:14] gc_enabled() returns true
[07:20:15] (03CR) 10Matanya: "Although dsh_groups seems to be no used, please push the change dsh_groups --> dsh::groups for the sake of completeness." [operations/puppet] - 10https://gerrit.wikimedia.org/r/96413 (owner: 10Dzahn)
[07:21:13] ok
[07:24:40] * Aaron|home wonders why it is just tifs
[07:25:21] 32287_5bef8ef1d55235f6957bd1d449af4eea.dvi
[07:25:23] wtf is that for?
[07:25:36] man this stuff needs prefixes
[07:25:42] I was just going to say
[07:30:19] i deployed and then accidentally un-deployed a two-char fix to a js file earlier
[07:30:33] it's screwing up a live data collection job so i'm going to re-sync it
[07:31:02] there's the pngs, the map and err files, but they are not the big spenders so I didn't list them
[07:31:14] and they aren't local_ * something either
[07:36:25] apergos: so 1138 is not even a scaler
[07:36:50] other servers might get local copies, though it's rarer
[07:37:39] 1142, 1143, 1144 some others that have them
[07:38:51] !log ori synchronized php-1.23wmf4/extensions/WikimediaEvents/modules/ext.wikimediaEvents.moduleStorage.js 'Re-syncing I51d2d6495'
[07:39:07] Logged the message, Master
[07:40:36] !log ori synchronized php-1.23wmf3/extensions/WikimediaEvents/modules/ext.wikimediaEvents.moduleStorage.js 'Re-syncing I51d2d6495'
[07:40:51] Logged the message, Master
[07:41:12] done, sorry for unscheduled sync.
[07:44:19] apergos: no unlink() errors in apache.log
[07:44:46] hmmmm
[07:44:48] gah, wfSuppressWarnings() is used
[07:44:51] so you wouldn't know
[07:44:52] ahhhh
[07:44:54] :-D
[07:50:00] (03CR) 10Akosiaris: "Patch LGTM but let's not merge this before we change the default policy to DROP. It is in" [operations/puppet] - 10https://gerrit.wikimedia.org/r/96424 (owner: 10Dzahn)
[07:51:42] !log rebooting sq37, we had /dev//sdc gone and reappeared as /dev/sdc killing back-end squid
[07:51:58] Logged the message, Master
[07:54:39] that might have cleared it
[08:06:13] (03CR) 10Akosiaris: [C: 04-1] "Very minor nitpicks. Feel free to merge after fixing them."
(032 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/96354 (owner: 10Dzahn) [08:11:21] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [08:11:21] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 200,000 [08:21:20] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (200635) [08:21:21] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (200181) [08:22:21] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [08:22:22] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 200,000 [08:30:21] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (204724) [08:30:21] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (204717) [08:32:21] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 200,000 [08:33:20] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [08:33:47] (03CR) 10Akosiaris: [C: 04-2] "This will be solved better when we change the default policy to DROP" [operations/puppet] - 10https://gerrit.wikimedia.org/r/96040 (owner: 10Hashar) [08:46:21] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (206122) [08:46:21] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (206121) [08:52:22] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [08:53:20] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 200,000 [08:57:42] hello [09:01:07] akosiaris: 
hello :-] [09:01:58] hashar: hello [09:02:28] that week is crazy [09:03:19] yes [09:04:14] I am super happy we managed to get ferm applied for contint [09:04:18] that is a huge improvement [09:04:48] I am going to upgrade Zuul , could use a merge of https://gerrit.wikimedia.org/r/#/c/93457/ [09:04:58] which configure Zuul for gearman [09:05:14] !log stopping Zuul on gallium for upgrading purposes. [09:05:36] Logged the message, Master [09:05:51] ok merging now [09:06:06] (03PS5) 10Akosiaris: zuul: configuration for gearman [operations/puppet] - 10https://gerrit.wikimedia.org/r/93457 (owner: 10Hashar) [09:07:00] huh you stopped zuul .... i need to give verified +2 [09:07:08] I forgot that for a sec [09:07:25] (03CR) 10Akosiaris: [C: 032 V: 032] zuul: configuration for gearman [operations/puppet] - 10https://gerrit.wikimedia.org/r/93457 (owner: 10Hashar) [09:07:40] (03PS2) 10Akosiaris: zuul: refer to puppet variables with a @ [operations/puppet] - 10https://gerrit.wikimedia.org/r/95359 (owner: 10Hashar) [09:07:58] PROBLEM - zuul_service_running on gallium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/local/bin/zuul-server [09:08:22] (03CR) 10Akosiaris: [C: 032 V: 032] zuul: refer to puppet variables with a @ [operations/puppet] - 10https://gerrit.wikimedia.org/r/95359 (owner: 10Hashar) [09:08:57] hashar: done [09:09:02] thanks! [09:11:21] akosiaris: I think that is all what is needed for now :] [09:15:05] !log Zuul: bumped source code in integration/zuul.git to Gearman based version: 1e3adfd...6241272 labs -> master [09:15:20] Logged the message, Master [09:16:10] (03CR) 10Akosiaris: "See comments inline." 
(031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/95363 (owner: 10QChris) [09:16:58] (03PS1) 10ArielGlenn: remove virt1 from dhcp, decommed rt #5645 [operations/puppet] - 10https://gerrit.wikimedia.org/r/96450 [09:19:48] having a list of steps to do is really helpful https://www.mediawiki.org/wiki/Continuous_integration/Zuul/gearman_upgrade#Upgrading :D [09:19:54] I don't even have to think about the upgrade [09:20:23] bah [09:20:36] !log gallium temp install of python-pip to be able to ugprade Zuul [09:20:51] Logged the message, Master [09:21:08] ah no jenkins right now eh? [09:21:15] yeah upgrading it right now [09:21:16] sorry :( [09:21:25] should I wait? it's not a rush [09:22:57] should be finished soon ™ [09:23:06] ok [09:27:37] !log removing all old versions of zuul: rm -fR /usr/local/lib/python2.7/dist-packages/zuul* [09:27:53] Logged the message, Master [09:29:29] snif ? /usr/local/ ???? [09:29:44] I will pretend I did not see that [09:30:14] yeah that is installed via pip / setup.py [09:30:16] * akosiaris will salt '*' cmd.run 'rm -rf /usr/local/*' one day [09:30:27] though with null http/https proxies [09:30:38] stop stop stop I do not want to know [09:30:48] next step is to have it packaged [09:30:51] I want to be ignorant please!!!! [09:30:57] hey [09:31:00] sure :-] [09:31:05] now that is what is like... packages [09:31:18] things in /usr and not /usr/local.... dpkg not pip :-) [09:32:58] PROBLEM - SSH on amslvs1 is CRITICAL: Server answer: [09:33:55] akosiaris: do we use hiera? [09:33:57] !log restarted Zuul with Gearman version [09:33:58] RECOVERY - SSH on amslvs1 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [09:34:11] Logged the message, Master [09:34:41] matanya: nope [09:34:43] !log enabling Gearman in Jenkins, making it register with Zuul new version [09:34:59] Logged the message, Master [09:35:21] thanks akosiaris. 
when you give up, do : salt '*' cmd.run 'rm -rf / [09:35:28] ' [09:35:52] we might need to take his access away on palladium and sockpuppet (saltmasters) :-P [09:35:59] but there's always dsh.. dang :-D [09:36:15] that will be a cron script triggered to run 1 year after I leave [09:36:27] just to see if you will catch it in time :P [09:36:30] and odds are no one will find it before then ;-D [09:36:34] yep [09:36:35] lol [09:37:13] (03Restored) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [operations/puppet] - 10https://gerrit.wikimedia.org/r/89002 (owner: 10Hashar) [09:37:19] (03PS5) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [operations/puppet] - 10https://gerrit.wikimedia.org/r/89002 [09:38:09] !log Zuul upgraded with Gearman support to trigger build in Jenkins. Will monitor for the rest of the morning. [09:38:16] akosiaris: success :-]  THANK YOU !! [09:38:23] Logged the message, Master [09:38:26] so since the salt master is necessary for deployments and other things, I'd like to have some other host than palladium be a second saltmaster (and get off of sockpuppet completely )... where if the host gets slow due to puppetmaster we don't also have salt issues [09:38:43] hashar: wow you are fast [09:38:47] nice !!! 
[09:39:21] apergos: seems reasonable
[09:39:34] since it allows for multimaster setup I am happy with it
[09:39:38] yep
[09:39:49] it's slightly annoying to have to accept keys on both masters but it's not so often
[09:40:09] hashar: potential problem: https://integration.wikimedia.org/ci/job/mwext-VisualEditor-doc-test/6203/console
[09:40:10] I need a good candidate host, obviously salt doesn't need to be the only thing on it
[09:40:20] hashar: It is executing on deployment-parsoid2
[09:40:58] akosiaris: I did the upgrade countless times on labs
[09:40:58] apergos: well there is a patch upstream to have salt use puppet's CA infra
[09:41:03] Krinkle: aarghh
[09:41:10] https://integration.wikimedia.org/ci/job/mwext-VisualEditor-doc-test/6205/console
[09:41:12] deployment-bastion
[09:41:15] ryan had opened it
[09:41:19] so have it be also on strontium?
[09:41:41] of course then if the puppet backends are overloaded our saltmasters go too :-D
[09:41:46] apergos: https://github.com/saltstack/salt/issues/5752
[09:42:08] there you go... if this is fixed we will no longer have to maintain two CAs
[09:42:30] that would be really nice
[09:45:27] (03PS2) 10Akosiaris: Change default ferm policy to DROP [operations/puppet] - 10https://gerrit.wikimedia.org/r/96265
[09:45:57] Krinkle: forgot to restart jenkins ... damn me
[09:46:04] !log restarting Jenkins to update gearman plugin
[09:46:19] Logged the message, Master
[09:48:29] what about carbon as the other multimaster?
[09:48:56] currently: backup::client, misc::install-server::tftp-server
[09:49:12] (03CR) 10Akosiaris: "Moved main-input-default-drop.conf file in-module and using it now, however I am puzzled by" [operations/puppet] - 10https://gerrit.wikimedia.org/r/96265 (owner: 10Akosiaris)
[09:49:19] ugh public ip nm
[09:50:20] also, whatever 'client-side' is in ganglia misc eqiad, it should probably be gone...
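For reference, the multimaster setup being discussed here is driven from the minion side. A sketch of what /etc/salt/minion could carry, assuming hypothetical host names for the second master; each listed master must still separately accept the minion's key, which is the mild annoyance mentioned above.

```yaml
# /etc/salt/minion -- illustrative fragment only; the second host name
# is a placeholder, not an actual Wikimedia saltmaster.
# With a list of masters the minion connects to all of them, so a
# command issued from either master reaches it, and losing one
# master (e.g. an overloaded puppetmaster host) does not take salt down.
master:
  - palladium.eqiad.wmnet
  - second-saltmaster.example.wmnet
```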
[09:55:26] sigh no good candidates, need it to be a roots only host w/o public ip and without critical services like eg dns
[09:58:17] !log shutting down Zuul and reverting upgrade :-(
[09:58:26] ouch
[09:58:29] wtf?
[09:58:32] oh dear
[09:58:33] Logged the message, Master
[09:59:15] I forgot to properly test out labels to tie jobs on specific node
[09:59:40] turns out the default behavior is to run jobs on any slave available, even if the slave is configured to only run jobs that are explicitly assigned to it
[09:59:42] upstream bug :(
[10:02:01] !log reverting Zuul to last known version: wmf-deploy-20131023 + 6241272...1e3adfd wmf-deploy-20131023 -> master (forced update)
[10:02:16] Logged the message, Master
[10:02:38] :-(
[10:05:02] !log restarted Zuul with non gearman version.
[10:05:17] Logged the message, Master
[10:05:57] RECOVERY - zuul_service_running on gallium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/local/bin/zuul-server
[10:06:30] the good thing is that upstream documented their code so I know where the issue is \O/
[10:06:38] will write a retrospective somewhere
[10:12:37] !log restarted again stalled Jenkins
[10:12:53] Logged the message, Master
[10:14:29] !log Jenkins: disabled gearman plugin, Zuul is no more a Gearman server.
[10:14:44] Logged the message, Master
[10:15:17] (03PS6) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [operations/puppet] - 10https://gerrit.wikimedia.org/r/89002
[10:16:47] Jenkins should be back in action now. Sorry. Had to revert my Zuul upgrade :-(
[10:18:00] how can I resubmit a job (patchset went up after the upgrade was in process)?
[10:18:11] recheck
[10:18:50] how?
[10:19:21] just add a comment with the word 'recheck'
[10:19:28] oh a comment
[10:19:31] I see
[10:19:43] i think though it is going to give you a +1 on verified
[10:19:49] meh
[10:19:51] not a +2 ...
for some reason [10:19:57] I'll trivially edit the commit message then [10:20:05] that will work too :-) [10:20:45] (03PS2) 10ArielGlenn: remove virt1 from dhcp, decommed rt #5645 [operations/puppet] - 10https://gerrit.wikimedia.org/r/96450 [10:21:49] (03CR) 10ArielGlenn: [C: 032] remove virt1 from dhcp, decommed rt #5645 [operations/puppet] - 10https://gerrit.wikimedia.org/r/96450 (owner: 10ArielGlenn) [10:25:28] (03PS1) 10ArielGlenn: removing virt1, wiped/decommed in rt #5645 [operations/dns] - 10https://gerrit.wikimedia.org/r/96455 [10:27:25] (03CR) 10ArielGlenn: [C: 032] removing virt1, wiped/decommed in rt #5645 [operations/dns] - 10https://gerrit.wikimedia.org/r/96455 (owner: 10ArielGlenn) [10:56:57] PROBLEM - SSH on amslvs1 is CRITICAL: Server answer: [10:57:57] RECOVERY - SSH on amslvs1 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [11:00:16] (03PS1) 10ArielGlenn: remove mobile1001-3 mgmt, leave asset tag names [operations/dns] - 10https://gerrit.wikimedia.org/r/96464 [11:01:02] (03CR) 10ArielGlenn: [C: 032] remove mobile1001-3 mgmt, leave asset tag names [operations/dns] - 10https://gerrit.wikimedia.org/r/96464 (owner: 10ArielGlenn) [11:01:57] PROBLEM - SSH on amslvs1 is CRITICAL: Server answer: [11:03:02] RECOVERY - SSH on amslvs1 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [11:11:44] (03PS1) 10ArielGlenn: remove caesium from dhcp, dsh files, it's spare [operations/puppet] - 10https://gerrit.wikimedia.org/r/96466 [11:13:34] (03CR) 10ArielGlenn: [C: 032] remove caesium from dhcp, dsh files, it's spare [operations/puppet] - 10https://gerrit.wikimedia.org/r/96466 (owner: 10ArielGlenn) [11:18:24] (03PS1) 10ArielGlenn: remove nonmgmt entries for caesium, it's spare [operations/dns] - 10https://gerrit.wikimedia.org/r/96468 [11:19:10] (03CR) 10ArielGlenn: [C: 032] remove nonmgmt entries for caesium, it's spare [operations/dns] - 10https://gerrit.wikimedia.org/r/96468 (owner: 10ArielGlenn) [11:35:49] (03PS1) 10ArielGlenn: 
remove molybdenum, renamed in rt #2291 [operations/dns] - 10https://gerrit.wikimedia.org/r/96470 [11:37:55] (03CR) 10ArielGlenn: [C: 032] remove molybdenum, renamed in rt #2291 [operations/dns] - 10https://gerrit.wikimedia.org/r/96470 (owner: 10ArielGlenn) [11:38:12] (03PS1) 10Siebrand: Remove underscore from class names LBFactory_* [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96472 [11:43:00] (03CR) 10Siebrand: [C: 04-1] "Sticking a -1 on this for now until it's clear when the core patch can be merged." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96472 (owner: 10Siebrand) [11:48:54] (03PS1) 10Jforrester: Deploy VisualEditor as default to "phase 3" Wikipedias [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96473 [11:49:05] (03CR) 10jenkins-bot: [V: 04-1] Deploy VisualEditor as default to "phase 3" Wikipedias [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96473 (owner: 10Jforrester) [11:51:48] (03PS2) 10Jforrester: Deploy VisualEditor as default to "phase 3" Wikipedias [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96473 [11:52:21] (03PS1) 10ArielGlenn: remove voip, we actually use voip.corp.wm.o handled elsewhere [operations/dns] - 10https://gerrit.wikimedia.org/r/96474 [11:54:45] (03PS3) 10Catrope: Deploy VisualEditor as default to "phase 3" Wikipedias [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96473 (owner: 10Jforrester) [11:54:59] (03CR) 10Catrope: [C: 04-2] Deploy VisualEditor as default to "phase 3" Wikipedias [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96473 (owner: 10Jforrester) [11:55:03] (03CR) 10ArielGlenn: [C: 032] remove voip, we actually use voip.corp.wm.o handled elsewhere [operations/dns] - 10https://gerrit.wikimedia.org/r/96474 (owner: 10ArielGlenn) [12:31:32] !log end of an era: shut down ms7 webserver and opening ticket for decom [12:31:48] Logged the message, Master [12:37:00] (03PS1) 10ArielGlenn: remove last 
entries for ms7, going away at last [operations/puppet] - 10https://gerrit.wikimedia.org/r/96476 [12:39:30] (03CR) 10ArielGlenn: [C: 032] remove last entries for ms7, going away at last [operations/puppet] - 10https://gerrit.wikimedia.org/r/96476 (owner: 10ArielGlenn) [12:51:28] (03PS1) 10ArielGlenn: remoe last entries for ms8, to be decommed [operations/puppet] - 10https://gerrit.wikimedia.org/r/96477 [13:06:48] (03PS1) 10Petrb: inserted QT libraries to resolve b57241 [operations/puppet] - 10https://gerrit.wikimedia.org/r/96478 [13:45:12] apergos: no more solaris? [13:45:31] as far as I know, nope [13:45:52] those boxes have been unused for awhile but this is the official end [13:59:36] !log jenkins: added slave integration-slave01 with label hasNpm. That is a slave running in labs ( integration-slave01.pmtpa.wmflabs ) [13:59:51] Logged the message, Master [14:00:13] (03PS1) 10Hashar: role::ci::slave::labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/96483 [14:00:18] (03PS1) 10Petrb: installed socat to resolve !b 57005 [operations/puppet] - 10https://gerrit.wikimedia.org/r/96484 [14:00:57] apergos: got any spare minutes to review a CI change https://gerrit.wikimedia.org/r/96483 ? 
[14:01:15] that adds a new role to install slaves in labs, would come with some specific packages (pip, npm) we don't want in production [14:01:18] anomie ^^^ [14:01:20] in a few minutes yes indeed [14:01:37] going out for a haircut anyway, so take your time [14:05:29] (03PS1) 10Petrb: ufraw-batch to fix !b 57008 [operations/puppet] - 10https://gerrit.wikimedia.org/r/96486 [14:08:09] (03PS1) 10Petrb: installing package to fix !b 57004 [operations/puppet] - 10https://gerrit.wikimedia.org/r/96487 [14:09:26] off for some haircut, be back in an hour or so [14:12:11] (03PS2) 10coren: Tool Labe: install socat to exec environment [operations/puppet] - 10https://gerrit.wikimedia.org/r/96484 (owner: 10Petrb) [14:12:18] (03CR) 10jenkins-bot: [V: 04-1] Tool Labe: install socat to exec environment [operations/puppet] - 10https://gerrit.wikimedia.org/r/96484 (owner: 10Petrb) [14:12:50] ohhhh a coren, maybe you know about virt13 and 14? by that I mean... [14:13:19] Wait, that was already merged? [14:13:23] * Coren grumbles. [14:13:24] https://rt.wikimedia.org/Ticket/Display.html?id=5673 this says they were supposed to be deployed [14:13:54] and apparently that never happened, do we still want it (tampa... going away someday... etc) [14:13:57] ? [14:14:35] apergos: Probably not, from what I see there were two sets of two nodes put aside for the same use; that ticket was mooted when the other two got deployed. [14:15:27] can we reclaim these as spares... hmm to be either donated or shipped as rob/chris see fit? [14:19:01] (03CR) 10Yuvipanda: [C: 04-1] "No X or related desktop things (dbus?) on toollabs please :)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/96478 (owner: 10Petrb) [14:19:38] Coren: I -1'd it only because I don't have rights to -2, but I guess you know better anyway :) [14:19:40] Coren: ? or should I ask ryan to be 100% sure? [14:22:23] apergos: I expect Ryan or Rob would be authoritative. [14:23:23] (03CR) 10coren: [C: 04-2] "No X11 toolkits on grid nodes."
[operations/puppet] - 10https://gerrit.wikimedia.org/r/96478 (owner: 10Petrb) [14:23:32] ok, I'll check with ryan, thanks! [14:27:43] and hashar is gone [14:28:11] I'll merge it when he's back [14:30:14] (03PS3) 10coren: Tool Labs: install socat to exec environment [operations/puppet] - 10https://gerrit.wikimedia.org/r/96484 (owner: 10Petrb) [14:33:36] (03PS1) 10Dzahn: remove outdated subnets from dhcpd [operations/puppet] - 10https://gerrit.wikimedia.org/r/96489 [14:34:00] apergos: / Coren: ^ dhcpd.conf option domain-name "tesla.wikimedia.org"; :) [14:34:17] I was already looking [14:34:24] apergos: re RT #3801, heh [14:34:26] yep [14:34:33] I opened a few today [14:34:39] getting closer! [14:41:19] (03CR) 10coren: [C: 032] Tool Labs: install socat to exec environment [operations/puppet] - 10https://gerrit.wikimedia.org/r/96484 (owner: 10Petrb) [14:43:48] (03PS2) 10coren: Tool Labs: install rrdtool [operations/puppet] - 10https://gerrit.wikimedia.org/r/96487 (owner: 10Petrb) [14:46:28] (03CR) 10coren: [C: 032] Tool Labs: install rrdtool [operations/puppet] - 10https://gerrit.wikimedia.org/r/96487 (owner: 10Petrb) [14:58:05] (03PS2) 10coren: Tool Labs: install ufraw-batch [operations/puppet] - 10https://gerrit.wikimedia.org/r/96486 (owner: 10Petrb) [14:58:55] (03PS1) 10Ottomata: Adding $net_topology_script_template parameter to make Hadoop rack/row aware [operations/puppet/cdh4] - 10https://gerrit.wikimedia.org/r/96490 [14:58:58] (03CR) 10ArielGlenn: [C: 031] "I'm ok with this in light of the explanation given on the bug." 
[operations/apache-config] - 10https://gerrit.wikimedia.org/r/92925 (owner: 10Dzahn) [14:59:43] (03CR) 10Ottomata: [C: 032 V: 032] Adding $net_topology_script_template parameter to make Hadoop rack/row aware [operations/puppet/cdh4] - 10https://gerrit.wikimedia.org/r/96490 (owner: 10Ottomata) [15:00:15] (03CR) 10coren: [C: 032] Tool Labs: install ufraw-batch [operations/puppet] - 10https://gerrit.wikimedia.org/r/96486 (owner: 10Petrb) [15:01:09] (03CR) 10Nemo bis: "Worth fixing the commit message here too given it's -2'ed... Ahem, I got an internal server error and the -2 disappeared, that shouldn't h" [operations/puppet] - 10https://gerrit.wikimedia.org/r/96478 (owner: 10Petrb) [15:01:46] (03CR) 10coren: [C: 032] "Just putting the -2 back." [operations/puppet] - 10https://gerrit.wikimedia.org/r/96478 (owner: 10Petrb) [15:01:48] (03CR) 10Jeremyb: [C: 04-1] "fix commit msg to match format in Icf5c2be75d6442aed81bfadea72be27d583cc86c" [operations/puppet] - 10https://gerrit.wikimedia.org/r/96478 (owner: 10Petrb) [15:01:55] (03CR) 10coren: [C: 04-2] Inserted QT libraries [operations/puppet] - 10https://gerrit.wikimedia.org/r/96478 (owner: 10Petrb) [15:02:59] (03PS1) 10Petrb: inserted lynx and links to fix !b 56997 [operations/puppet] - 10https://gerrit.wikimedia.org/r/96491 [15:03:53] (03PS9) 10ArielGlenn: beta: symlink /a/common [operations/puppet] - 10https://gerrit.wikimedia.org/r/65254 (owner: 10Hashar) [15:04:03] eww [15:04:08] (/a/common) [15:04:41] not going to make him fix production in order to get his change in though [15:05:19] (03CR) 10Aude: [C: 031] "looks good to me :)" [operations/apache-config] - 10https://gerrit.wikimedia.org/r/92925 (owner: 10Dzahn) [15:05:28] ^ +4 :D [15:06:16] (03CR) 10ArielGlenn: [C: 032] beta: symlink /a/common [operations/puppet] - 10https://gerrit.wikimedia.org/r/65254 (owner: 10Hashar) [15:06:50] no, just fix betalabs to not depend on it [15:06:56] what toollabs change didn't get merged on paladium? 
[15:07:14] production doesn't have /a anymore [15:07:14] class toollabs::exec_environ [15:07:16] fwiw [15:07:18] * apergos looks at Coren [15:07:40] tin has it [15:08:00] yes, as a workaround for some things that were broken [15:08:12] That also makes puppet ensure /a/common exists in production. [15:08:13] jesus [15:08:20] on a change labeled "beta" [15:08:55] apergos: Hmm? [15:09:23] I merged your ufraw-batch change on palladium [15:11:37] (03PS2) 10coren: Tool Labs: install links and lynx to dev_environ [operations/puppet] - 10https://gerrit.wikimedia.org/r/96491 (owner: 10Petrb) [15:18:31] so which hosts should have /a/common and which should not? [15:18:35] paravoid: [15:18:41] why are you asking me? :) [15:19:24] because you know at least that the/some production hosts shouldn't have it [15:19:36] I know they don't [15:19:40] I don't know if they shouldn't :) [15:19:51] (some production hosts = appservers) [15:20:22] ok, well that's 'some information', so good :-D [15:20:28] paravoid: mind if i take https://rt.wikimedia.org/Ticket/Display.html?id=6344 ? [15:20:57] matanya: I don't, but I know it's going to be complicated [15:21:15] it's a new major release where they replaced their javascript engine [15:21:31] not nodejs anymore? [15:21:58] (03CR) 10Faidon Liambotis: "INVALID is more than just non-SYNs; for example, it applies to ICMP & UDP traffic as well, e.g. unsolicited UDP packets." [operations/puppet] - 10https://gerrit.wikimedia.org/r/96265 (owner: 10Akosiaris) [15:22:31] (03CR) 10coren: [C: 032] Tool Labs: install links and lynx to dev_environ [operations/puppet] - 10https://gerrit.wikimedia.org/r/96491 (owner: 10Petrb) [15:22:32] Replace Esprima.js with pure-ruby JS parser RKelly (for now we're using our own fork of RKelly). [15:23:51] so, yeah, complicated [15:24:04] probably [15:24:52] it's actually called RKelly? oh my.. 
[15:24:53] I'd be more than happy to not have it on my plate, I'm just warning you/giving you all the data before you start working on it :-) [15:27:52] thanks paravoid if i see gem2deb will fail me, i'll bug you :) [15:27:58] it will [15:28:02] don't start from gem2deb [15:28:09] start from the previous version we have in the repo [15:28:11] I did that [15:28:30] you see it is good to ask before? [15:29:03] how would you recommend to do it then? by hand? [15:29:18] we have the 4.8.something version in the repo already [15:29:24] (apt.wikimedia.org) [15:29:38] (so if you do apt-get source ruby-jsduck from a Labs instance, you'll get that) [15:30:09] then bump the version number in the changelog, fetch the orig.tar.gz, check/bump/replace the dependencies, build, test [15:30:52] so paravoid you bet import-orig --pristine-tar --uscan and the link won't work? [15:31:03] *like [15:31:15] we have no git repository for it iirc [15:31:20] and that's fine [15:31:41] ok. the hard way. i'll try to find time for it this weekend [15:31:50] wish me luck :) [15:32:04] good luck :) [15:37:08] hashar: there's no chance that the contint stuff and the beta autoupdater will be on the same instance is there? (referring to https://gerrit.wikimedia.org/r/#/c/96483/ ) [15:38:03] apergos: checking, they should be different kind of slaves [15:38:19] the beta auto updater run on the deployment-prep project bastion [15:38:38] I ask cause they both declare npm [15:38:40] ahh [15:39:39] role::ci::slave::labs is never going to be used on the deployment-prep project. 
so that is fine [15:39:52] ok [15:41:24] ok I'll merge this now [15:41:34] thankkks [15:41:40] after rebase [15:41:40] will run puppet on labs [15:42:11] I don't think the subdir issue you raise will be a problem [15:43:09] (03CR) 10ArielGlenn: [C: 032] role::ci::slave::labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/96483 (owner: 10Hashar) [15:43:48] done [15:43:51] your turn [15:45:39] (03CR) 10Akosiaris: "I was aware that INVALID meant a lot more things, I guess I did not make that clear. Anyway that is why I suggested it." [operations/puppet] - 10https://gerrit.wikimedia.org/r/96265 (owner: 10Akosiaris) [15:45:55] (03PS3) 10Akosiaris: Change default ferm policy to DROP [operations/puppet] - 10https://gerrit.wikimedia.org/r/96265 [15:46:20] notice: /Stage[main]/Contint::Packages::Labs/Package[python-pip]/ensure: ensure changed 'purged' to 'present' [15:46:24] apergos: thank you very much :-] [15:46:32] yw [15:46:32] pip ? [15:46:35] i did not see that [15:46:36] beta [15:46:37] don't ask [15:46:43] er I mean labs, don't ask [15:46:45] anyways [15:46:57] as I said... I did NOT see that [15:46:59] nope [15:47:58] akosiaris: on labs [15:48:12] akosiaris: need it to be able to HTTP_PROXY=. HTTPS_PROXY=. python setup.py [15:49:09] and to fetch some dependencies for javascript jobs using npm :-( [15:50:30] I have added some if $::realm == 'production' {  fail("can't be applied in production") } [15:51:02] hashar: please stop mentioning language-derived distribution utilities. My brain hurts [15:51:40] I hate it as well and would love us to package every single dependency around [15:52:09] in case of npm, that is not that trivial though since direct dependencies can themselves depend on the same module albeit with different version [15:52:50] well aware of that [15:53:06] my feelings the first time i discovered what npm does [15:53:18] were "Brilliant and stupid at the same time" [15:53:31] hey, that's more charitable than I was...
[15:53:59] I was not forced to deal with it, that is why [15:54:11] I could afford to be charitable [15:54:23] :-D [15:54:25] the more however I have to deal with it, the less I like it [15:54:49] so this is why I merge and grit my teeth and you put your headphones on and work on more production stuff :-D [15:55:04] exactly :-) [15:55:07] ah which reminds me [15:55:25] and I get my node modules dependencies deployed and focus on writing mooaar jenkins jobs :] [15:55:43] then we are all being granted badges for being pragmatic [15:55:49] (03CR) 10Akosiaris: [C: 032] "Let's see what this breaks" [operations/puppet] - 10https://gerrit.wikimedia.org/r/96265 (owner: 10Akosiaris) [15:56:00] hashar: that bell tolls for you :P ^ [15:56:08] :] [15:56:20] no damnit, I will have to restart a bunch of things on vanadium to see if they are broken [15:56:22] can't check it out right now though [15:56:24] well I am going to, fed up [15:56:46] apergos: I haven't merged yet [15:56:51] Want me to revert ? [15:56:53] no no [15:57:08] talking about removing the, well some of, /usr/local/lib on vanadium [15:57:15] ok ok [15:57:17] thanks [15:57:23] I can't remove it all because puppet puts some of it back apparently >_< [15:59:20] /usr/local/lib/nagios/plugins... really? and used, too [15:59:33] that one is on purpose [15:59:35] my doing [15:59:45] * apergos raises an eyebrow [15:59:59] but you will have the guilty conscience at night so.. :-D [16:00:00] the idea is to not ship our own home made plugins [16:00:12] in the same directory as the system ones [16:00:17] do not rebuild nagios-plugins just to make it icinga-plugins .. come on:) [16:00:28] mostly filesystem hygiene [16:00:35] mutante: I am not that crazy [16:00:39] hehe, ok [16:00:39] we already removed the nagios.wikimedia.org CNAME might as well rename the plugin :D [16:00:40] and the ipython stuff? :-P [16:00:55] I have nothing to do with that [16:01:01] I was hoping not! [16:01:06] ok, well that's likely....
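For context on the "Change default ferm policy to DROP" change (96265) merged above: the shape of such a policy in ferm is roughly the following. This is an illustrative sketch, not the actual main-input-default-drop.conf; the monitoring-host variable and its address are placeholders:

```
# ferm sketch (illustrative, not the real Wikimedia config)
@def $MONITORING_V4 = (192.0.2.10);   # placeholder, would hold the icinga host's IP

domain ip table filter chain INPUT {
    policy DROP;
    mod state state INVALID DROP;              # INVALID also catches unsolicited UDP/ICMP
    mod state state (ESTABLISHED RELATED) ACCEPT;
    interface lo ACCEPT;
    # per-service holes, e.g. NRPE for the monitoring server:
    proto tcp dport 5666 saddr $MONITORING_V4 ACCEPT;
}
```

Anything not matching an explicit ACCEPT falls through to the DROP policy, which is why the SSH and NTP checks on gallium/antimony start failing later in this log until holes are punched for the icinga server.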
[16:01:09] neither the python 3.3 stuff [16:01:11] operations/puppet$ git grep nagios|wc -l [16:01:12] 637 [16:01:24] someone who is not here so I will ping him later! [16:01:39] anyways the python 33 i tossed and it stayed gone [16:01:49] PROBLEM - SSH on gallium is CRITICAL: Connection timed out [16:02:05] uh oh [16:02:06] aaah and I was about to say that all was well [16:02:19] PROBLEM - SSH on antimony is CRITICAL: Connection timed out [16:02:19] PROBLEM - HTTPS on antimony is CRITICAL: Connection timed out [16:02:22] that is a result of 96265 [16:02:29] all of them [16:02:33] antimony and gallium [16:02:40] neon is not allowed to connect to them [16:02:44] let's see.... [16:03:08] akosiaris: ooh.. but hashar added neon to that already [16:03:14] and i even saw the ACCEPT for neon [16:03:22] in iptables -L [16:03:46] for nrpe [16:03:47] not ssh [16:04:11] I think I am gonna do an all ports whitelisting of neon anyway [16:04:11] oh, of course, yea [16:04:24] expected the NRPE checks to also timeout for some reason, but they dont [16:04:56] akosiaris: well, 22, 443, 80, 8080 .. ehm [16:09:57] @seen matanya [16:10:22] I will go with all mutante... quick and less maintenance for us in the future [16:13:15] PROBLEM - NTP on gallium is CRITICAL: NTP CRITICAL: No response from NTP server [16:13:27] ntp ? [16:13:29] wtf ? [16:14:04] PROBLEM - NTP on antimony is CRITICAL: NTP CRITICAL: No response from NTP server [16:15:02] o_O [16:15:12] That, also svn just paged me. [16:16:47] after everything is settled we still want to be able to ssh into gallium via proxycommand right? 
(cause obviously that's broken right now too) [16:17:33] NTP is on all [16:17:45] it's command_line $USER1$/check_ntp_time -H $HOSTADDRESS$ [16:18:56] mutante: akosiaris: I have added neon IP address to the ferm::rule for nrpe, that is a [16:18:58] hack though [16:19:02] should be a ferm variable I guess [16:19:15] hashar: it's not that, it's that the default is DROP now [16:19:51] 08:06 < akosiaris> that is a result of 96265 [16:19:51] ahh [16:19:53] \O/ [16:21:10] hashar: heh, but doing that in nrpe.pp itself.. less hack than doing it in some actually unrelated role as we started out:) [16:21:51] and ferm variables see the comment on https://gerrit.wikimedia.org/r/#/c/88755/ [16:27:33] will skip :-] [16:28:55] akosiaris: ntp being checked without nrpe but directly on service ? [16:29:25] (03PS1) 10Akosiaris: Punch hole for icinga servers to monitor all [operations/puppet] - 10https://gerrit.wikimedia.org/r/96511 [16:29:26] yes... for some reason [16:29:31] I understand not why [16:32:28] (03CR) 10Akosiaris: [C: 032] Punch hole for icinga servers to monitor all [operations/puppet] - 10https://gerrit.wikimedia.org/r/96511 (owner: 10Akosiaris) [16:34:58] (03PS1) 10Akosiaris: Revert "nrpe: iptables accept neon public IP address" [operations/puppet] - 10https://gerrit.wikimedia.org/r/96512 [16:38:32] (03CR) 10Akosiaris: [C: 032] Revert "nrpe: iptables accept neon public IP address" [operations/puppet] - 10https://gerrit.wikimedia.org/r/96512 (owner: 10Akosiaris) [16:39:26] :-( [16:39:52] * akosiaris :-) [16:40:06] akosiaris: ahha [16:40:12] i fixed your FIXME comment in less than 24 hours... what else do you want ? [16:40:16] you are removing the bastion-ssh hole apparently [16:40:18] https://gerrit.wikimedia.org/r/#/c/96511/1/modules/base/manifests/init.pp,unified [16:40:46] that's what I was saying [16:40:50] akosiaris: hashar , how's that https://gerrit.wikimedia.org/r/#/c/96424/ [16:41:06] we need a way to ssh in ourselves...
one way or another :-D [16:41:19] crap typo [16:45:29] I messed up sorry. fixing it [16:45:43] the good news is that ferm denied my change [16:46:30] do we have a bast in ulsfo yet? [16:46:47] yes [16:46:48] apergos: yes, 4001 [16:46:50] or rather, I guess we do-ish but [16:46:59] it doesnt include the same things bast1001 includes [16:47:00] (03PS1) 10Akosiaris: Amend "Punch hole for icinga servers to monitor all" [operations/puppet] - 10https://gerrit.wikimedia.org/r/96515 [16:47:05] 198.35.26.5 not in $BASTION_V4 [16:47:08] yet [16:47:10] which makes me think if "role::bastion" isn't really the right role [16:47:13] it is in reality [16:47:20] s/right/same [16:48:52] (03CR) 10Akosiaris: [C: 032] Amend "Punch hole for icinga servers to monitor all" [operations/puppet] - 10https://gerrit.wikimedia.org/r/96515 (owner: 10Akosiaris) [16:50:04] RECOVERY - NTP on antimony is OK: NTP OK: Offset 0.0005956888199 secs [16:50:14] RECOVERY - SSH on antimony is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [16:50:15] RECOVERY - HTTPS on antimony is OK: OK - Certificate will expire on 08/22/2015 22:23. [16:50:20] and fixed... [16:51:04] RECOVERY - SSH on gallium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [16:51:15] RECOVERY - NTP on gallium is OK: NTP OK: Offset 0.0005853176117 secs [16:51:27] !! [16:54:38] akosiaris: sweet [16:56:13] the plan is for this to be on all hosts with public facing ips? [16:57:50] more or less [16:58:03] well we could have it even on non-public hosts [16:58:11] gotta think about things like brewster I guess [16:59:16] and see if dsh is involved in deployments from tin (doesn't that use ssh?) 
if we want it on all hosts [16:59:43] hmmm deployment uses salt [16:59:52] well Ryan's deployment system uses salt [17:00:04] yeah I have a vague otion that some stuff from tin might not [17:00:07] yah [17:00:17] I know there is an old deployment system somewhere for mediawiki [17:00:18] *notion [17:00:48] and there's still some apache conf stuff on fenari (on my list to convert to ryan-git-sartoris-trebuchet-deployment) [17:01:03] heh... one thing at a time [17:01:04] (03PS1) 10Dr0ptp4kt: Point Wikpedia app for Firefox OS submodule at gerrit. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96517 [17:01:17] yeah [17:01:22] but yeah... before its deployed cluster wise it will take some time [17:01:25] so bast4001 is really ready to go right? [17:01:27] probably months [17:01:34] maybe even years ? [17:01:36] as in people should/could be using it? [17:01:54] (no worries, let's take the time needed) [17:02:10] bast4001 is kind of weird... It is a bastion host, but not really [17:02:21] it is there more like as a brewster replacement [17:02:24] :-D [17:02:37] ok well should we be using it to ssh through? that's the basic q [17:02:40] but.... I think we should call it a bastion [17:02:56] and make it a full bastion, allowing ssh to it [17:03:00] you can't deploy apache-config without ssh [17:03:22] well it's also a bastion... so we should definitely allow it in [17:03:22] sync-file , sync-common etc [17:05:57] no ipv6 for bast4001? [17:07:11] (03PS1) 10Akosiaris: Remove redundant DROP rules. [operations/puppet] - 10https://gerrit.wikimedia.org/r/96518 [17:07:13] heya RobH, [17:07:20] ? [17:07:25] is it possible to programmatically ask racktables which node a row is in? [17:07:28] (03PS1) 10ArielGlenn: bast4001 as allowed bastion in firewalls [operations/puppet] - 10https://gerrit.wikimedia.org/r/96520 [17:07:31] which row a node is in * [17:07:32] uh, node being server?
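With the default-DROP policy only letting bastions in, reaching hosts like gallium means hopping through bast1001 or bast4001. A typical client-side setup is a ProxyCommand stanza; this is a sketch, with the username and exact host patterns as assumptions:

```
# ~/.ssh/config sketch (username and host patterns are assumptions)
Host bast1001.wikimedia.org bast4001.wikimedia.org
    User yourshellname

# Hop through a bastion for hosts that now only accept bastion ssh
Host gallium.wikimedia.org antimony.wikimedia.org
    User yourshellname
    ProxyCommand ssh -W %h:%p bast1001.wikimedia.org
```

`ssh -W` forwards stdin/stdout to the target host:port over the bastion connection, so no netcat is needed on the bastion.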
[17:07:48] yes [17:07:48] cuz if so, yes, yes there is [17:07:56] really ? [17:07:56] given an IP or hostname? [17:08:01] racktables changed that much ? [17:08:04] hostname, not IP [17:08:06] wow... amazed [17:08:13] racktables tracks the rack row and U in rack [17:08:18] always has... [17:08:27] so mysql query can return that. [17:08:32] but thats about it [17:08:35] uh if by changed you mean it's not a huge horrible table with a pile of views, well, no [17:08:38] we dont store IP info in racktables [17:08:44] no, its a horrible tag view table [17:08:44] I thought you meant it had an API [17:08:48] hahahaha [17:08:49] oh, no api. [17:09:02] ah... ok back to square 1 [17:09:17] well, i'll repeat here so lots of folks see it [17:09:18] you can craft a butt-ugly query and get something out. = "api" [17:09:24] WE ALL HATE RACKTABLES BUT HAVE NO ALTERNATIVE [17:09:25] not to sound very pompous but servermon has that [17:09:29] hah [17:09:30] so switch us! [17:09:31] ok ok [17:09:35] i got the point [17:09:37] the only requirement is we need rack row graphics [17:09:45] ie: rack layout showing what U's are populated with what [17:09:52] that's fine! [17:09:54] hmm [17:09:55] welllll [17:09:58] hmmmm [17:10:11] ok ok ... I 'll start working on adding the few missing features to servermon [17:10:14] so any kind of replacement is fine, as long as we can have limited edit access and if its open read even better. [17:10:24] what language is servermon in? [17:10:25] ideally we would disclose all our rack layouts to anon [17:10:31] promise that by the end of the year we will at least be ready to switch [17:10:32] and require login for edit [17:10:33] apergos: python [17:10:35] RobH, whatcha think, since we've moved hadoop nodes to different rows [17:10:42] I want to make hadoop topology aware [17:10:50] mmm [17:10:55] i just need to give hadoop a script that returns a unique row name based on ip or hostname [17:10:59] yea...
thats gonna be super hacky with racktables [17:11:00] i could hardcode them all [17:11:04] grossss [17:11:07] as you'll have to setup a user for it to query directly [17:11:07] apergos: https://github.com/akosiaris/servermon [17:11:09] yeah, i don't really want to install mysql just to do that [17:11:17] <^demon|sick> hashar: Did something change with gallium's ssh? I'm getting connect timeouts from gerrit. [17:11:19] there is no api, so meh [17:11:21] RobH: interesting/neat re public view of layouts [17:11:27] <^demon|sick> (on replication) [17:11:28] aye [17:11:31] will check it out [17:11:32] ^demon|sick: yes [17:11:33] we used to have it on wikitech [17:11:44] racktables is only private because it lacks the ability to be public view private edit [17:11:45] uh you have to go in through an official bastion now [17:11:49] ^demon|sick: [17:11:59] ^demon|sick: that is in case you mean SSH to gallium from another host [17:12:02] oh i think we need to open that [17:12:03] RobH: cool (well, not that racktables is limited, you know what I mean) [17:12:06] I hate that its private and that folks cannot just see that part of our infrastructure, yep [17:12:06] since that's everyone's gerrit/git [17:12:08] <^demon|sick> Yes :) [17:12:09] i know what ya mean [17:12:18] :) [17:12:23] did you guys firewall off git? [17:12:25] (heh) [17:12:50] antimony and gallium are the first two hosts that have a firewall with DROP by default policy [17:12:56] <^demon|sick> I can change replication rules. [17:13:06] <^demon|sick> I wonder if I can ssh via the .eqiad.wmnet name. [17:13:17] (03CR) 10ArielGlenn: [C: 032] bast4001 as allowed bastion in firewalls [operations/puppet] - 10https://gerrit.wikimedia.org/r/96520 (owner: 10ArielGlenn) [17:13:29] can someone please just template network.pp? 
:) [17:13:39] template defs.conf from network.pp that is [17:13:39] ^demon|sick: you can not [17:13:58] ^demon|sick: tell me what you need and me will fix [17:14:05] paravoid: I will do that [17:14:12] okay, works for me [17:14:19] I just commented because of apergos' bast4001 change [17:14:34] we're only going to have more of these if we don't do a proper fix [17:15:11] <^demon|sick> akosiaris: Gerrit requires ssh access to antimony, lanthanum, gallium and github.com [17:15:24] <^demon|sick> We replicate everything to those hosts. [17:15:43] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (203940) [17:15:52] ^demon|sick: we don't do outgoing firewalls yet [17:16:16] <^demon|sick> Yeah the github part is fine. [17:17:03] ^demon|sick: ok I will add ytterbium as an exception to those 2 (gallium, antimony). lanthanum is without a firewall [17:17:19] so that ytterbium can connect to gallium and antimony via ssh [17:17:37] 127.0.1.1 ytterbium.wikimedia.org ytterbium [17:17:38] argh [17:17:51] once talked to a guy at a conference about racktables API plans.. he never got back [17:18:00] :-D [17:18:19] I looked at the code once for 5 minutes [17:18:29] after I put my gouged out eyes back in that was it [17:18:55] just out of curiosity... why does ytterbium have a second IP ? HTTPS ? [17:19:04] <^demon|sick> akosiaris: If we moved antimony to an internal IP like mark and I planned then we won't have this problem :) [17:19:09] service IP, I think [17:19:13] <^demon|sick> Yes [17:19:36] ^demon|sick: No we would still have it [17:19:39] maybe now it's a good time to think of redirecting the service ip's port 22 to 29418 [17:19:59] (03CR) 10Hashar: "notice: Finished catalog run in 3315.75 seconds" [operations/puppet] - 10https://gerrit.wikimedia.org/r/96483 (owner: 10Hashar) [17:20:20] <^demon|sick> paravoid: That would be wonderful.
And the whole reason Ryan and I moved to using a service IP ;-) [17:20:30] it's a simple ferm rule [17:21:17] huh? [17:21:24] paravoid: this looks bad again https://gdash.wikimedia.org/dashboards/reqerror/ [17:21:46] <^demon|sick> akosiaris: Using port 29418 for gerrit has confused people since day 1. It's been on our todo list to remove that. [17:22:01] <^demon|sick> Using 22 would allow people's sane git defaults to Just Work. [17:22:34] oh, that is why you added the service IP. Now that makes sense [17:22:38] <^demon|sick> Yep. [17:22:41] ok ok thanks for explaining [17:22:53] <^demon|sick> I also wonder how many characters have been wasted typing :29418 by all of us over time ;-) [17:23:12] Nemo_bis: *sigh* [17:23:18] Nemo_bis: thanks... [17:23:19] none for me... I never had the problem [17:23:34] git review with a sane git review file always [17:23:47] <^demon|sick> It's in my ~/.ssh/config [17:24:19] none for me either [17:24:23] but surely by many folks [17:24:42] <^demon|sick> (gerrit also has a non-interactive ssh daemon on port 29418, can do some fun things with it) [17:25:02] <^demon|sick> `ssh -p 29418 gerrit.wikimedia.org gerrit` [17:25:27] can someone please have a look on why our reqerrors are through the roof again? I'm way too busy for the next hour and a half [17:25:27] out *wave* [17:27:20] <^demon|sick> !log gerrit: disabled replication plugin fix, pending firewall updates. everything's timing out and flooding logs with errors. [17:27:33] (03CR) 10Akosiaris: [C: 032] Remove redundant DROP rules. [operations/puppet] - 10https://gerrit.wikimedia.org/r/96518 (owner: 10Akosiaris) [17:27:34] Logged the message, Master [17:37:46] (03PS1) 10Ottomata: Adding net-topology.py.erb to make hadoop topology aware.
[operations/puppet] - 10https://gerrit.wikimedia.org/r/96526 [17:38:42] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 200,000 [17:39:24] greg-g: zero depl ready when you are :) [17:39:45] yurik_: yessir [17:39:49] (03PS1) 10Akosiaris: Allow ytterbium access to CI and gitblit [operations/puppet] - 10https://gerrit.wikimedia.org/r/96527 [17:40:44] (03CR) 10Faidon Liambotis: [C: 031] "Awesome!" [operations/puppet] - 10https://gerrit.wikimedia.org/r/96526 (owner: 10Ottomata) [17:40:53] we wouldn't want to break the 12 wiki crashes in 12 days now, do we? [17:41:02] yurik_: shush [17:41:22] erm [17:41:27] don't worry about that [17:41:39] site's broken now anyway [17:41:42] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (200225) [17:41:43] ???! [17:41:45] http://gdash.wikimedia.org/dashboards/reqerror/ [17:42:00] I'm busy with meetings for the next 2 hours or so [17:42:03] (03PS2) 10Ottomata: Adding net-topology.py.erb to make hadoop topology aware. [operations/puppet] - 10https://gerrit.wikimedia.org/r/96526 [17:42:08] (03CR) 10Ottomata: [C: 032 V: 032] Adding net-topology.py.erb to make hadoop topology aware. [operations/puppet] - 10https://gerrit.wikimedia.org/r/96526 (owner: 10Ottomata) [17:42:20] I don't think anyone else cares enough to investigate atm [17:42:38] started at 8:20 utc? 
[17:42:42] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 200,000 [17:42:44] looks like it [17:42:56] (03CR) 10Akosiaris: [C: 032] Allow ytterbium access to CI and gitblit [operations/puppet] - 10https://gerrit.wikimedia.org/r/96527 (owner: 10Akosiaris) [17:42:57] s/cares enough/feels empowered/ maybe :) [17:43:49] ori-l: I don't want to think it was you, but the errors started appearing 40 minutes after your midnight deploy [17:43:57] 40 minutes is a long time [17:44:03] but I've got nothing else :/ [17:44:17] feels like they know enough to actually make progress [17:45:36] also, why are the 500s so regular [17:45:42] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (202879) [17:46:02] shush you jobqueue [17:46:10] ^demon|sick: I think you should be ok now [17:47:04] <^demon|sick> akosiaris: Mmk, will re-enable and see. [17:49:48] <^demon|sick> Stupid gerrit :\ [17:50:13] <^demon|sick> !log restarting gerrit since replication plugin won't reload [17:50:27] Logged the message, Master [17:51:30] <^demon|sick> !log gerrit up, forcing replication of all repos [17:51:46] Logged the message, Master [17:51:48] greg-g: so is the site stable for us to deploy: 1) firefox app to bits (shouldn't impact the site), 2) new zero code 3) minor config change [17:52:19] yurik_: can you take a look at the error logs real quick and let me know what the current issue looks like (to you)? [17:53:25] <^demon|sick> akosiaris: Things look good again, thanks. [17:53:42] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 200,000 [17:53:54] greg-g: fatalmonitor seems quiet [17:54:38] ^demon|sick: :-) [17:54:40] yeah :/ [17:54:48] hmm, i wonder what "fault (11)" means [17:54:52] how else to determine what's causing the 500s? [17:54:53] never seen this before [17:55:14] <^demon|sick> job queue check on arsenic? 
[17:55:17] <^demon|sick> that seems...wrong [17:55:20] * ^demon|sick sighs [17:55:41] * greg-g can't keep straight all of the server names/purposes [17:55:56] there is this cool editing tool called wiki... [17:56:25] how is etherpad admin? [17:56:29] *who [17:57:32] matanya: ops in general, last one to work on it was akosiaris [17:57:38] yurik_: any idea on fault (11)? [17:57:42] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (204400) [17:57:46] matanya: we dont have rules like that, if you need urgent interaction ping the one listed as "on duty", about to run to bus [17:57:54] <^demon|sick> akosiaris: This is what replication of all repositories looks like: http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&h=ytterbium.wikimedia.org&m=cpu_report&s=by+name&mc=2&g=network_report&c=Miscellaneous+eqiad [17:58:07] on duty is always ottomata :) [17:58:10] and, https://wikitech.wikimedia.org/w/index.php?search=arsenic&title=Special%3ASearch gives me nothing :( [17:58:19] (useful) [17:58:31] matanya: it's rotating per etherpad [17:58:32] greg-g: no idea - i would have to look at the exception logs which i'm not even sure where they are [17:58:32] <^demon|sick> greg-g: arsenic is a temp box that nobody should worry about but me and nik. [17:58:33] so any ops that can help a sec would be nice. need to turn on tags for etherpad [17:58:45] ^demon|sick: gotcha [17:58:52] hah [17:58:55] who is on duty!? [17:59:03] Reedy: around? [17:59:05] i can help, what's up? [17:59:08] turn on tags for etherpad [17:59:11] HMMMMM [17:59:15] i don't even know what that means, but ok! [17:59:17] thanks ottomata [17:59:27] 41 Notice: Uncommitted DB writes (transaction from DatabaseBase::query (Block::newLoad)).
in /usr/local/apache/common-local/php-1.23wmf4/includes/db/Database.php on line 4 [17:59:29] 052 [17:59:30] cirrus search (arsenic) [17:59:38] https://bugzilla.wikimedia.org/show_bug.cgi?id=30240 [17:59:39] very easy [17:59:41] another new one :) [17:59:53] but not production [17:59:53] ottomata: go to http://etherpad.wikimedia.org:9000/ep/admin/ [18:00:03] ottomata: The password can be found in /h/w/docs/etherpad or /etc/etherpad/etherpad.local.properties [18:00:19] matanya: this is likely not even talking about the same software,, etherpad vs. etherpad-lite [18:00:25] greg-g: so what's the status, can we go ahead and deploy? [18:00:29] i'd look but really need to go.. bbl [18:00:36] thanks mutante [18:00:56] yurik_: I'm not happy with the 500s, and in the future this might block a deploy, but for now, I'll let you go ahead and try to clean up after [18:01:11] :) [18:01:17] ie: in glorious future where we stop ignoring errors and block deploys until they're fixed [18:01:21] \o/ [18:01:33] for now, put on your horse blinders [18:01:34] stability is overrated anyway [18:02:14] what is being deployed? [18:02:28] so subversion is now alerting in watchmouse [18:02:32] matanya: ottomata that :9000 was likely etherpad .. for current etherpad-lite try http://etherpad.wikimedia.org/admin instead [18:02:35] was it firewalled off or taken offline? [18:02:36] off [18:02:42] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 200,000 [18:02:56] greg-g: Ya? [18:03:15] Reedy: see 500s https://gdash.wikimedia.org/dashboards/reqerror/deploys [18:03:27] the 1day graph is clearest [18:04:30] Reedy: yurik_ did some initial investigation, but couldn't find anything clearly at fault [18:04:39] matanya: where can I find htaccess pw? [18:04:43] akosiaris: knows maybe? [18:04:47] Did anyone do anything 8-9am? [18:04:51] fwiw nothing leaps out operations wise when I look at ganglia, or even observium, [18:04:53] ottomata: in a meeting.. 
[18:04:57] utc, not that I can see in SAL [18:05:12] Ditto [18:05:34] except the squid box reboot, and ori deploying 40 minutes prior, that's all I got [18:05:43] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (204735) [18:06:08] * Reedy tries to recall where the 5XX stats come from [18:06:10] can we shush the jobqueue alert until we actually care about it? [18:06:25] You should be able to get someone from ops to ack it [18:06:27] (03CR) 10Yurik: [C: 032 V: 032] Point Wikpedia app for Firefox OS submodule at gerrit. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96517 (owner: 10Dr0ptp4kt) [18:06:43] Reedy: yeah, but it just keeps coming back :) [18:06:49] Mute it? [18:07:19] !log changed default ferm policy to DROP. antimony/gallium are the first two servers to get it. [18:07:36] Logged the message, Master [18:07:53] greg-g: Simplest way to find out what is going wrong is to inspect that log and see what's coming in [18:07:56] yeah the pmtpa squid box would be irrelevant [18:08:02] Rather than (sometimes VERY accurate) guessing [18:08:20] so that's what I was trying to find and I have no idea where those are [18:08:27] :( [18:08:28] https://wikitech.wikimedia.org/wiki/Logs [18:08:35] these are all we have written down [18:08:58] didn't find ottomata [18:09:13] yeah [18:09:20] files/graphite/gdash/dashboards/reqerror/3.5xx-sum-1day.graph:title "HTTP 5xx Responses -1day" [18:09:21] files/graphite/gdash/dashboards/reqerror/3.5xx-sum-1day.graph: :data => 'cactiStyle(alias(reqstats.5xx,"5xx resp/min"))' [18:09:34] heh I was just lookin at those [18:09:36] matanya: don't have pw, also, you sure the tagging feature exists for epl?
[18:09:38] reqstats.5xx [18:10:14] All the other 5xx mentions in the puppet repo suggest to be upload related logs [18:10:24] and/or commented out [18:10:41] (03PS1) 10Ottomata: $net_topology_script_path needs to be set before hadoop-core is rendered [operations/puppet/cdh4] - 10https://gerrit.wikimedia.org/r/96531 [18:10:43] ottomata: i'm all confused with etherpad/etherpad-lite. so leave it until i find out, thanks. and remove yourself from RT duty! :) [18:10:45] And what's with all the |query.php? [18:10:47] It's long dead... [18:11:14] (03CR) 10Ottomata: [C: 032 V: 032] $net_topology_script_path needs to be set before hadoop-core is rendered [operations/puppet/cdh4] - 10https://gerrit.wikimedia.org/r/96531 (owner: 10Ottomata) [18:11:22] no good getting sidetracked on that [18:12:09] # pipe 1 <%= webrequest_filter_directory %>/5xx-filter | awk -W interactive '$9 !~ "upload.wikimedia.org|query.php"' >> <%= log_directory %>/5xx [18:12:09] .tsv.log [18:12:13] (03PS1) 10Ottomata: Updaing cdh4 to fix net-topology.py script deployment [operations/puppet] - 10https://gerrit.wikimedia.org/r/96532 [18:12:16] those are the not upload ones [18:12:49] (03PS3) 10Umherirrender: enable Echo on all beta.wmflabs.org-wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95450 [18:13:01] What replaced locke? [18:13:11] (03CR) 10Ottomata: [C: 032 V: 032] Updaing cdh4 to fix net-topology.py script deployment [operations/puppet] - 10https://gerrit.wikimedia.org/r/96532 (owner: 10Ottomata) [18:13:20] reedy: iirc erbium [18:13:20] I seem to remember asking this not so long ago [18:13:45] I has no login :( [18:13:52] emery maybe? I'm looking [18:14:00] currently active; [18:14:01] : [18:14:07] emery, erbium, oxygen [18:14:15] that's where sampled is [18:14:17] erbium (well, gadolinium originally) was the replacement for locke [18:14:22] whatcha looking for? 
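(A standalone illustration of the `$9 !~ "upload.wikimedia.org|query.php"` filter quoted from the udp2log config above — assuming, as the awk expression implies, that the request URL is whitespace-separated field 9; the surrounding sample fields are dummies.)

```shell
# Lines whose 9th field matches upload.wikimedia.org or query.php are
# dropped; everything else passes through to the 5xx log.
printf '%s\n' \
  'cp1 1 2 3 4 5 6 7 http://upload.wikimedia.org/a.png' \
  'cp2 1 2 3 4 5 6 7 http://en.m.wikipedia.org/wiki/Foo' \
  | awk '$9 !~ "upload.wikimedia.org|query.php"'
# only the en.m.wikipedia.org line survives
```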
[18:14:29] Apache 5xx logs [18:14:52] oxygen [18:15:28] /a/log/webrequest/5xx.tsv.log [18:15:35] oh apache [18:15:40] these are from varnish [18:16:08] but i guess if apache returns 5xx then varnish will too, unless there is some error at varnishside [18:16:23] yes, that's where to look all right [18:16:46] (03PS2) 10Bsitu: Remove html tag from echo email footer address [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96392 [18:17:05] the varnish returns I mean [18:19:34] (03CR) 10Amire80: [C: 031] Remove html tag from echo email footer address [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96392 (owner: 10Bsitu) [18:19:41] all .m. [18:19:46] the 503s I am looking at right now [18:19:55] m.wikipedia, m.wikimedia etc [18:20:07] ^demon|sick: online? [18:20:13] <^demon|sick> Yep. [18:20:41] sorry to bug you when you are sick, for some reason i can't get https://gerrit.wikimedia.org/r/#/c/96517/ onto the tin [18:20:50] we are doing depl right now [18:20:55] ok, here's some others... let's see if that's the rule or the excption [18:22:15] ^demon|sick: i did git pull on /a/common, and that change shows, but doing git submodule update doesn't change git remote url [18:22:44] the m.wiki* are the vast majority [18:22:49] <^demon|sick> yurik_: Git submodule is probably being silly, lemme see if I can fix. [18:22:55] thx [18:23:08] all esams [18:23:11] so that's not good [18:23:14] MaxSem, dr0ptp4kt ^demon|sick is doing git stuff now :) [18:23:38] apergos: ugh [18:24:31] <^demon|sick> url is bogus too, btw. 
[18:25:08] ^demon|sick: strange, i was able to get that module on my machine via git submodule update [18:25:26] (03PS1) 10Chad: Fix submodule url [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96533 [18:25:36] (03CR) 10Chad: [C: 032 V: 032] Fix submodule url [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96533 (owner: 10Chad) [18:25:45] <^demon|sick> /r/p/, not /r/ [18:26:16] ^demon|sick: are you doing git commands on tin? [18:26:22] <^demon|sick> Yes. [18:26:56] ^demon|sick: could you make sure the submodule of the submodule (js dir) is pulled as well? its a pinned version [18:28:42] <^demon|sick> Why won't you clone dangit? [18:29:25] ? [18:29:49] <^demon|sick> Bah! [18:29:52] <^demon|sick> I know what I did wrong. [18:30:07] ^demon|sick: adam said he is able to get to gerrit without /r/p [18:30:49] <^demon|sick> Fixxxxeddddd [18:30:52] <^demon|sick> :) [18:30:55] <^demon|sick> Silly git. [18:31:08] ^demon|sick: awesome!!!! checking... [18:31:11] thanks [18:31:12] command, or human :) [18:32:23] ottomata: back [18:32:39] ^demon|sick: awesome, thank you very much, someday you will have to teach us everything you know about it ;) [18:32:48] ottomata: you were asking something about etherpad ? [18:33:29] ja akosiaris, matanya was asking if I could do this [18:33:29] https://bugzilla.wikimedia.org/show_bug.cgi?id=30240 [18:33:36] but it might not be relevant [18:34:24] <^demon|sick> yurik_: rm -R docroot/bits/WikipediaMobileFirefoxOS; rm -R .git/modules; vim .git/config (to remove submodule section); git submodule --init --recursive update [18:34:44] ottomata: we have zero plugins installed at this point and have not enabled installation of plugins [18:34:56] we don't even have the admin panel enabled [18:35:09] nor (authenticated) users or anything like that [18:35:11] brb [18:35:25] (03PS1) 10Reedy: Remove query.php from filters.
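(The submodule reset dictated at [18:34:24], written out as a sequence, with two caveats: the flags conventionally go after `update`, i.e. `git submodule update --init --recursive`, and `rm -R .git/modules` discards every cached submodule clone, not just the broken one. `git config --remove-section` stands in for the hand-edit of .git/config.)

```shell
rm -rf docroot/bits/WikipediaMobileFirefoxOS    # wedged working-tree checkout
rm -rf .git/modules                             # cached submodule clones (all of them!)
git config --remove-section \
    'submodule.docroot/bits/WikipediaMobileFirefoxOS'
git submodule update --init --recursive         # re-clone per .gitmodules
```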
query.php died a long time ago [operations/puppet] - 10https://gerrit.wikimedia.org/r/96535 [18:36:12] (03PS1) 10Bsitu: Enable Echo and Thanks on dewiki and itwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96536 [18:36:19] akosiaris: comment on bug for matanya? [18:36:55] (03PS2) 10QChris: Backup geowiki's data-private bare repository [operations/puppet] - 10https://gerrit.wikimedia.org/r/95363 [18:36:56] (03PS1) 10QChris: Extract geowiki paramaters into separate class [operations/puppet] - 10https://gerrit.wikimedia.org/r/96538 [18:37:33] ottomata: plus you are absolutely right [18:37:46] that ticket is for etherpad not etherpad-lite [18:38:37] (03PS1) 10Dr0ptp4kt: updating submodule [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96539 [18:38:47] (03CR) 10jenkins-bot: [V: 04-1] updating submodule [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96539 (owner: 10Dr0ptp4kt) [18:40:29] ^demon|sick: we are still having fun issues - https://gerrit.wikimedia.org/r/#/c/96539/1 [18:40:43] is it just jenkins having bad day? [18:41:14] on amslvs1 which I guess has mobile right now, irqs for cpu0 are at 70% but the others are 55-60% which isn't so far off [18:41:16] (03CR) 10QChris: Backup geowiki's data-private bare repository (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/95363 (owner: 10QChris) [18:41:47] <^demon|sick> yurik_: No clue. In a meeting. [18:41:57] (03CR) 10Yurik: [C: 032 V: 032] updating submodule [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96539 (owner: 10Dr0ptp4kt) [18:44:52] (03PS1) 10Yurik: Revert "updating submodule" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96540 [18:45:18] (03CR) 10Yurik: [C: 032 V: 032] Revert "updating submodule" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96540 (owner: 10Yurik) [18:45:49] greg-g: could you remind me what is being deployed during this slot btw? [18:47:25] apergos: right now? 
zero [18:47:36] yurik_: ^^ can you elucidate more explicitly what's going out [18:47:52] apergos: by zero I don't mean nothing, I mean zero rated :) [18:48:09] greg-g: we are trying to get zero git in order on tin :( [18:48:28] so literally nothing yet [18:48:40] so affecting mobile. (just cause that's where I was seeing the 503's we've been getting today) [18:49:07] (and yeah I got what zero is, I've been following the mails more or less ) [18:49:10] ( :-P ) [18:49:15] :) [18:49:24] totally off topic, I used to work at 'Zeero Knowledge' [18:49:27] s/ee/e/ [18:49:35] yurik_: see apergos' note re mobile is where the errors are appearing [18:49:37] got a lot of people confused by that name [18:49:48] they have been going well before you started [18:49:57] just saying it's a concern for you. [18:50:20] wait, rephrase? [18:50:36] the 503 errors that have been happening today. [18:50:49] mostly mobile for the period of time I have been looking [18:50:57] which was from before yur ik started to deploy [18:51:03] right, gotcha [18:51:20] just saying, it could be a concern for you, that there are already problems with the service [18:51:50] yurik_: you're the closest person who can figure out mobile related issues, I'd like you to look into them before you finally push your code. [18:54:03] because I can't tell if something is really wrong on the directors (lvs) one option might be to move that traffic to eqiad to see how it is, just mobile...
but I won't do that now (and I likely won't do that later because I can't commit to being around for too much later after later, it's getting into the evening here already) [18:54:11] so this is something I would pass onto someone in sf tz [18:55:43] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 200,000 [18:55:52] greg-g: we won't push code changes today it seems, or might do it later, but we have to get the wikipedia mozila OS app out - its not part of the code, it just sits on bits [18:56:06] ok [18:56:30] apergos: so, other than paravoid who's interviewing right now, who else should look into this? [18:56:38] well someone in an sf tz would be good [18:56:41] he's not either [18:56:48] will review it once done, please delay next depl by a bit, having one too many fun issues with git between dr0ptp4kt & MaxSem & myself :) [18:56:51] you get to point a finger at someone and say "you, you investigate" [18:56:55] hahaha [18:57:01] (03PS1) 10MaxSem: Update FF app [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96545 [18:57:37] apergos: just like at an emergency where lots of by-standers around, you can't just say "someone call 911" no one will, you have to say "you, in the red shirt, call 911" [18:57:38] (03CR) 10Yurik: [C: 032 V: 032] Update FF app [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96545 (owner: 10MaxSem) [18:57:45] :-D [18:57:59] ('tis totally true) [18:58:35] I would tag leslie to make sure there is not a network/link issue someplace, and then move mobile traffic to eqiad or pass it to someone else to do further investigation [18:58:43] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (224931) [18:58:47] I think leslie is afk? [18:59:02] maybe in that same interview [18:59:38] so, next on the list? 
[18:59:40] :) [19:00:12] uhh [19:01:41] ottomata: you're apparently on rt duty, who with network skillz can help diagnose a problem now? [19:01:55] greg-g: tell me [19:02:13] apergos: ^^ [19:02:32] 503s, mostly mobile, all the mobile are esams, [19:02:49] not seeing obvious issues on amslvs1 but maybe I would not recognize it, [19:02:50] syncing bits [19:03:13] brb [19:03:45] can a root nuke /a/common/docroot/bits/WikipediaMobileFirefoxOS.bak2 on tin, please? [19:03:45] so someone (you?) with network chops to see if that's the issue again and/or maybe move traffic to eqiad but [19:03:48] (03PS1) 10Ottomata: Adding ganglia aggregators for elasticsearch cluster [operations/puppet] - 10https://gerrit.wikimedia.org/r/96550 [19:03:49] be able to keep an eye out [19:03:56] your day is also into the evening though [19:04:06] so keeping an eye out = not so much for you either [19:04:37] also, ^demon|sick, seems like when you check things out or set links, i can't remove them afterwards :( MaxSem said something about bitmask ;) [19:04:47] umask [19:05:04] <^demon|sick> My umask is wrong? [19:05:06] * ^demon|sick sighs [19:05:22] (03PS2) 10Ottomata: Adding ganglia aggregators for elasticsearch cluster [operations/puppet] - 10https://gerrit.wikimedia.org/r/96550 [19:05:48] but if you would be willing to check 'is this a link/interface saturation issue' that would be helpful [19:05:50] akosiaris: [19:06:08] <^demon|sick> MaxSem: 0002 [19:06:35] ^demon|sick, yet we can't delete your files [19:06:41] apergos: yes I am trying to figure out if there is a problem... not finding anything yet [19:06:55] ok. thank you for looking. [19:07:39] <^demon|sick> MaxSem: I don't set anything funky in my .bashrc or elsewhere.
[19:07:51] weird [19:08:01] * MaxSem blames Linus [19:08:03] (03PS1) 10Addshore: Start wikidata puppet module for builder [operations/puppet] - 10https://gerrit.wikimedia.org/r/96552 [19:08:05] !log yurik synchronized docroot/bits/WikipediaMobileFirefoxOS [19:08:18] Logged the message, Master [19:08:28] all looks well... there was a spike of CPU usage in esams some 10 minutes ago but not anymore [19:08:37] yeah, this has been going for hours [19:08:51] so the problem persists ? [19:08:52] hmmm [19:09:15] all right, next step will be to find someone who is willing to move traffic to eqiad *and babysit * [19:09:19] https://gdash.wikimedia.org/dashboards/reqerror/ yes, hours [19:10:41] three cheers for MaxSem, ^demon|sick, yurik_, and brion for all of the help on getting the firefox os wikipedia app working. it's out there now! [19:11:20] (with a handful of fixes, it was already quite the project with many other committers) [19:11:40] greg-g: we deployed the firefox OS thingy, so either we postpone the next depl, figure out 503s, push out zero stuff, and yield, or how do you want to proceed? [19:11:42] <^demon|sick> I'm just glad you guys aren't cloning from git://github.com anymore ;-) [19:11:58] ^demon|sick: its live! :) [19:12:04] ^demon|sick, are you sure? git submodule updates were probably easier before ;) [19:12:04] thank you for your help!!!! [19:12:14] ^demon|sick, i kid. [19:12:20] i'm glad it's in house now [19:12:22] should be cleaner [19:12:23] <^demon|sick> dr0ptp4kt: Well it wouldn't work. tin can't hit github (on purpose) [19:12:43] thank goodness! exfiltration == bad [19:12:44] <^demon|sick> Also, git protocol has no authentication whatsoever :D [19:12:59] <^demon|sick> So who knows what files you're getting sent!
[19:13:08] (03PS1) 10Jgreen: add host rhodium for ocg pipeline testing [operations/puppet] - 10https://gerrit.wikimedia.org/r/96553 [19:13:11] I would like to have the mobile 503s addressed [19:13:28] even if the 'addressed' is just 'have eqiad serve it for now' [19:13:48] (if that were to make it happier) [19:13:52] apergos, any stats what URLs are failing? [19:14:15] let me look at that [19:15:19] (03PS1) 10RobH: rhodium assigned internal ip [operations/dns] - 10https://gerrit.wikimedia.org/r/96555 [19:15:31] mostly GETS [19:15:40] greg-g: btw, our zero is ready to go, should not take long, are you sure you want to stall it because of 503s? [19:15:51] yurik_: well, I don't want to complicate things [19:16:00] something is wrong with the mobile domains [19:16:05] and zero implicates mobile, so [19:16:09] traffic at esams is at its normal level [19:16:19] nothing weird there [19:16:19] half Special:something and half not [19:17:01] so, please work with akosiaris and apergos on figuring out what's causing the mobile 503s before proceeding, they shouldn't be there [19:17:05] yurik_: ^& [19:17:06] -& [19:17:16] (03CR) 10RobH: [C: 032] rhodium assigned internal ip [operations/dns] - 10https://gerrit.wikimedia.org/r/96555 (owner: 10RobH) [19:17:19] I have to run, be back in about 30 [19:17:23] this started at 8:00 UTC... [19:17:29] right [19:17:29] yeah [19:17:31] packet loss once again? [19:17:36] niah [19:17:43] marktraceur: do you need your held window today? [19:17:46] all esams though [19:18:08] we are sure about that ? [19:18:13] Ahhh no [19:18:20] marktraceur: cool, thanks. [19:18:45] as long as I've been greeping it's almost completely esams yes [19:18:50] grepping too :-P [19:18:51] alright, thanks akosiaris for looking into it, I have to go. [19:19:03] Jeff_Green: rhodium is all set for you to use, in dns [19:19:21] ok [19:20:30] is this a normal graph?
http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=Bits+caches+esams&m=cpu_report&s=by+name&mc=2&g=network_report [19:20:40] last 100k lines in 5xx log: 88k 503's, 77k are mobile, all but about 430 of those are esams [19:20:43] so yeah [19:21:00] 430 ? [19:21:13] (03CR) 10Ottomata: [C: 032 V: 032] Adding ganglia aggregators for elasticsearch cluster [operations/puppet] - 10https://gerrit.wikimedia.org/r/96550 (owner: 10Ottomata) [19:21:15] yeah. 430 lines. out of 100k [19:21:19] are not esams [19:21:23] ok [19:22:20] cp3012, cp3011, and if we care mostly cp3012 (64k to 12k) [19:22:30] (03PS1) 10Ottomata: Setting $ganglia_aggregator to true for elastic100[17] [operations/puppet] - 10https://gerrit.wikimedia.org/r/96557 [19:22:44] looking at the bits graph now, MaxSem: [19:22:55] (03CR) 10Ottomata: [C: 032 V: 032] Setting $ganglia_aggregator to true for elastic100[17] [operations/puppet] - 10https://gerrit.wikimedia.org/r/96557 (owner: 10Ottomata) [19:22:57] (03PS2) 10Jgreen: add host rhodium for ocg pipeline testing [operations/puppet] - 10https://gerrit.wikimedia.org/r/96553 [19:23:30] is the issue with mysql on terbium known already? [19:24:17] (03PS3) 10Jgreen: add host rhodium for ocg pipeline testing [operations/puppet] - 10https://gerrit.wikimedia.org/r/96553 [19:24:18] don't know, Jamesofur, but can you hold that thought for just a little bit? [19:24:20] [issue == won't work, Could not open input file: /a/common/multiversion/MWScript.php (followed by mysql help text). /a/common looks empty for some reason ... [19:24:24] yup no rush [19:24:44] apergos: we could switch mobile to eqiad [19:24:45] ok, please ping again after we have the 503 issue handled. 
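(A tally like the one above — "last 100k lines in 5xx log: 88k 503's, 77k are mobile" and the cp3012/cp3011 split — can be produced with a short awk pipeline. The column positions are an assumption for illustration: cache host in field 1, HTTP status in field 6, URL in field 9; the real 5xx.tsv.log layout may differ, and against the real file the printf sample would be replaced by `tail -n 100000 5xx.tsv.log`.)

```shell
# Count 503s per cache host, restricted to mobile (*.m.*) URLs,
# over made-up sample lines in the assumed column layout.
printf '%s\n' \
  'cp3012 a b c d 503 f g http://en.m.wikipedia.org/wiki/A' \
  'cp3012 a b c d 503 f g http://en.m.wikipedia.org/wiki/B' \
  'cp3011 a b c d 503 f g http://de.m.wikipedia.org/wiki/C' \
  'cp1001 a b c d 200 f g http://en.wikipedia.org/wiki/D' \
  | awk '$6 == 503 && $9 ~ /\.m\./ { n[$1]++ } END { for (h in n) print h, n[h] }' \
  | sort
# cp3011 1
# cp3012 2
```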
[19:24:58] yes, that's the only thought I have, but I would want someone to babysit it after [19:24:59] but this can not be network [19:25:02] will do no worries [19:25:08] (03PS4) 10Jgreen: add host rhodium for ocg pipeline testing [operations/puppet] - 10https://gerrit.wikimedia.org/r/96553 [19:25:10] only mobile suffers [19:25:11] and I'm at hour 14.5 already so not willing to volunteer [19:25:30] if it was a network issue everything would have problems [19:25:33] not just mobile [19:25:38] greg-g: oki [19:25:57] yes, that sounds right [19:26:18] otoh why only esams mobile? [19:27:26] looking at the non mobile entries just to see if they shed any light on it [19:28:04] they are almost all posts [19:28:05] (03PS5) 10Jgreen: add host rhodium for ocg pipeline testing [operations/puppet] - 10https://gerrit.wikimedia.org/r/96553 [19:28:16] eqiad and esams so [19:28:58] apergos, akosiaris, so do we know which pages are causing 503s? [19:29:01] ahh gerrit. i love gerrit! [19:29:12] * yurik_ chokes gerrit [19:29:23] * Jeff_Green watches and cheers on [19:29:30] hey [19:29:33] back from the interview [19:29:35] what's up? [19:29:36] we had about a split between specials and regular [19:29:45] uhoh [19:29:53] oh, beating my head against 503s [19:30:01] just mobile, not regular site ? [19:30:14] well the vast majority are mobile and of those almost all esams [19:30:30] bits esams looks unhealthy [19:31:04] paravoid: what specifically ? [19:31:14] the graph is stuttery [19:31:25] 1000 . . n_wrk - N worker threads [19:31:29] exhausted of threads [19:31:49] those that are not mobile are almost all post (edit, submit) and a mix of esams/eqiad [19:32:11] hm, maybe not [19:33:16] I did not see, except that the irq is still somewhat though not tragically unbalanced for cpu0, anything on amslvs1 [19:33:32] (and besides why so lopsidedly mobile?) [19:33:59] why do you say that mobile esams has trouble?
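(The `n_wrk - N worker threads` line pasted above is varnishstat output; to test the thread-exhaustion theory one would watch the companion counters as well. Counter names below are the Varnish 3 ones current at the time and should be double-checked against the installed version.)

```shell
# one-shot dump of the worker-thread counters on a cache box
varnishstat -1 -f n_wrk,n_wrk_queued,n_wrk_drop
# a climbing n_wrk_queued / n_wrk_drop means requests are queueing for
# (or being dropped for lack of) worker threads
```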
[19:34:00] (03CR) 10Jgreen: [C: 032 V: 031] add host rhodium for ocg pipeline testing [operations/puppet] - 10https://gerrit.wikimedia.org/r/96553 (owner: 10Jgreen) [19:34:23] the 503s are mostly mobile esams [19:34:30] where do you see that? [19:34:33] (back, cancelled that other appt) [19:34:42] 5xx log on oxygen [19:35:29] note that that log filters out upload.wm.org and query.php requests [19:35:37] (03CR) 10QChris: "As discussed some time ago in a hangout, Analytics will have to" [operations/puppet] - 10https://gerrit.wikimedia.org/r/93006 (owner: 10Yurik) [19:35:38] so if that is your only source, you might not be seeing some data [19:35:50] ok, that is good to know [19:36:15] found it [19:36:22] that's the input to gdash reqerror right? [19:36:28] no, found the 50x spike [19:36:32] well, one of them anyway [19:36:37] let's have it [19:37:41] ottomata: ? [19:37:56] hm [19:37:57] dunno [19:38:04] :-D [19:38:17] not via udp2log [19:38:24] at least, not what i'm looking at on oxygen [19:38:29] but something like that does sound familiar [19:38:33] * apergos goes to look at the manifests again [19:38:45] yes [19:38:46] udp2log [19:38:48] on emery [19:38:52] ## This feeds all http related graphs in graphite / gdash.wikimedia.org [19:38:52] pipe 2 /usr/local/bin/sqstat 2 [19:38:56] sqstat [19:39:00] that isn't filtered though [19:39:03] just sampled 1/2 [19:39:27] ottomata: i thought query.php was dead ages ago [19:39:29] ok, so it's possible we don't have all the data [19:39:49] (03PS1) 10Faidon Liambotis: Varnish: filter-noise on mobile-frontend as well [operations/puppet] - 10https://gerrit.wikimedia.org/r/96567 [19:39:56] apergos: for gdash, it should be complete i think [19:39:57] that's one [19:40:12] (03CR) 10Faidon Liambotis: [C: 032] Varnish: filter-noise on mobile-frontend as well [operations/puppet] - 10https://gerrit.wikimedia.org/r/96567 (owner: 10Faidon Liambotis) [19:40:44] paravoid: is that 403 Noise or 503 noise?
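(The filter-noise patch being merged above presumably amounts to a VCL rule along these lines — answering exploit-scanner requests with a synthetic 403 before they reach the backends. Varnish 3 syntax; the URL regex is a made-up placeholder, not the pattern from the actual change.)

```vcl
sub vcl_recv {
    if (req.url ~ "^/index\.php\?option=com_") {  # placeholder scanner pattern
        error 403 "Noise";                        # synthesize the response in the cache
    }
}
```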
[19:40:49] 403 noise [19:40:51] k [19:40:54] it's a joomla exploit [19:41:02] that people are running against us [19:41:06] and mediawiki 503s for some reason [19:41:19] (03CR) 10Faidon Liambotis: [V: 032] Varnish: filter-noise on mobile-frontend as well [operations/puppet] - 10https://gerrit.wikimedia.org/r/96567 (owner: 10Faidon Liambotis) [19:41:27] oh, interesting [19:41:40] (03CR) 10Yurik: "Its only for analytics, so we decide not to go this route, we should rethink our strategy." [operations/puppet] - 10https://gerrit.wikimedia.org/r/93006 (owner: 10Yurik) [19:41:44] hence noise, hence 403, forbidden [19:42:28] PROBLEM - Host msfe1002 is DOWN: PING CRITICAL - Packet loss = 100% [19:42:37] msfe1002? still?? seriously? [19:43:00] paravoid: any idea on the proportion of 503s that was causing? ie: still more investigation needed? [19:43:10] not yet [19:43:40] k [19:46:08] RECOVERY - Host msfe1002 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [19:46:31] RobH: hey, why do we have msfe1002 still running? [19:47:03] RobH: it hasn't been used since before Ben left and I've pinged you about it at least 2-3 times [19:47:14] i dont see this in racktables [19:47:20] lemme see if there is ticket [19:47:30] last time you told me it was in your notebook and no ticket was needed [19:47:45] ok, last time i lied [19:47:49] lol [19:47:51] kill it anyway :) [19:47:52] becuase notebook has been thrown away [19:47:55] haha [19:47:57] well, im trying to see what it is [19:48:02] cuz i dunno what server its on [19:48:19] Something happened to fenari or noc.wikimedia.org for that matter? [19:48:29] paravoid: bah, still getting the spike at the same level according to gdash [19:48:34] I know [19:48:35] looking [19:48:39] sorry [19:48:42] just saying [19:49:03] I was hopeful as it was in the middle of the trough [19:49:03] no sorry needed, thanks for helping out [19:49:04] paravoid: Do you know what its ip is? 
[19:49:08] cuz i dont see it in dns [19:49:15] and i dont know what system it is [19:49:20] and its not in racktables by that name anymore [19:49:42] RobH: was 208.80.154.148 [19:49:43] RobH: sorry, no. check neon's /etc/icinga [19:49:54] mgmt 10.65.3.57 [19:50:08] (git log in the ops dns repo ftw) [19:50:26] since I'm not being otherwise productive at this point [19:50:37] apergos: thx [19:50:41] that'll do [19:50:41] ure [19:50:43] *sure [19:51:13] paravoid: its rhodium now [19:51:18] so dunno whats showing up where [19:51:27] but its an entirely different server these days [19:51:31] RobH: not cleaned up from puppet db [19:51:34] lemme check a report [19:52:38] indeed, storedconfigs [19:52:42] uhh [19:52:46] it's got nothing else but it does have that [19:52:47] i try to clean from stored and says not there. [19:52:51] so dunno. [19:53:09] try all the possibilities (wikimedia.org, eqiad.wmnet, etc) [19:53:36] oh yes, was external [19:53:37] now clean [19:53:40] so should be fine now [19:54:02] also ensuring key and salt are clear [19:54:16] they are, did that check [20:06:48] hmm, the 5xx graph has just set a record high [20:07:14] (I'm pretty clocked out at this point, folks... just fyi) [20:07:43] MaxSem: yeah, that was a combination of the already existing issue, plus a new "one" [20:08:53] ergh [20:08:58] WTF is going on?
[20:11:17] looks like not our fault [20:12:36] (03PS1) 10Jgreen: add dhcp and partman for rhodium [operations/puppet] - 10https://gerrit.wikimedia.org/r/96575 [20:12:59] (03PS1) 10Faidon Liambotis: Varnish: expand filter_noise URLs [operations/puppet] - 10https://gerrit.wikimedia.org/r/96576 [20:13:34] (03CR) 10Faidon Liambotis: [C: 032] Varnish: expand filter_noise URLs [operations/puppet] - 10https://gerrit.wikimedia.org/r/96576 (owner: 10Faidon Liambotis) [20:13:52] (03CR) 10Faidon Liambotis: [V: 032] Varnish: expand filter_noise URLs [operations/puppet] - 10https://gerrit.wikimedia.org/r/96576 (owner: 10Faidon Liambotis) [20:16:03] (03CR) 10Jgreen: [C: 032 V: 031] add dhcp and partman for rhodium [operations/puppet] - 10https://gerrit.wikimedia.org/r/96575 (owner: 10Jgreen) [20:20:13] (03PS1) 10Faidon Liambotis: Varnish: add unset Range hack for mobile too [operations/puppet] - 10https://gerrit.wikimedia.org/r/96579 [20:20:48] (03CR) 10Faidon Liambotis: [C: 032 V: 032] Varnish: add unset Range hack for mobile too [operations/puppet] - 10https://gerrit.wikimedia.org/r/96579 (owner: 10Faidon Liambotis) [20:33:46] (03PS1) 10Jgreen: ah rhodium is dual HDD, switching to raid1-1partition partman recipe [operations/puppet] - 10https://gerrit.wikimedia.org/r/96582 [20:36:32] (03CR) 10Jgreen: [C: 032 V: 031] ah rhodium is dual HDD, switching to raid1-1partition partman recipe [operations/puppet] - 10https://gerrit.wikimedia.org/r/96582 (owner: 10Jgreen) [21:11:50] Ryan_Lane: for the new PDF renderer thing I'm building -- it exists in a repo; it has an upstart conf; and it has a configuration file in /etc that can be puppetized -- what Jeff_Green and I are not sure on is how to get the stuff from git onto a newly puppetized machine.
I was hoping you had a couple minutes to talk to me about salt [21:12:40] ^^^ and packaging [21:12:53] !log deployed Parsoid 20c6afe [21:13:09] Logged the message, Master [21:14:34] well, we can deploy the code itself using trebuchet [21:14:45] puppet can install the upstart and dependencies [21:15:09] you don't need to understand salt to use trebuchet ;) [21:15:25] https://wikitech.wikimedia.org/wiki/Trebuchet#Adding_a_new_repo [21:15:49] mwalker: ^^ [21:15:53] Jeff_Green: ^^ [21:15:59] reading [21:16:49] also look at how parsoid is configured in puppet: manifests/role/deployment.pp [21:17:46] and note that parsoid has a really shitty init script and that instead you can just use service.restart in the checkout_module_calls config [21:29:16] Ryan_Lane: thanks. we'll study this and probably come back to you with more questions :-) [21:29:24] ok, cool [21:34:16] !log ran checksetup.pl on bugzilla, deploy gerrit 96479, replaces internal errors on WeeklyReport with proper error pages [21:34:29] andre__: done. example https://bugzilla.wikimedia.org/component-report.cgi?tops=15&days=1 [21:34:30] Logged the message, Master [21:34:34] now looks like the labs one [21:34:44] as opposed to really ugly internal error [21:35:31] no wait, not really the same..hmm [21:42:42] (03PS1) 10Faidon Liambotis: Revert "Enable CentralNotice CrossWiki Hiding" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96641 [21:43:45] Coren: have a spare minute? [21:43:52] (03PS2) 10Faidon Liambotis: Revert "Enable CentralNotice CrossWiki Hiding" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96641 [21:44:04] I have minutes. They're not spare, but they can be given to help. 
:-) [21:44:09] (03CR) 10Faidon Liambotis: [C: 032] Revert "Enable CentralNotice CrossWiki Hiding" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96641 (owner: 10Faidon Liambotis) [21:46:14] (03CR) 10Faidon Liambotis: [V: 032] Revert "Enable CentralNotice CrossWiki Hiding" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96641 (owner: 10Faidon Liambotis) [21:46:52] !log faidon updated /a/common to {{Gerrit|I866838bc2}}: Revert "Enable CentralNotice CrossWiki Hiding" [21:46:57] (03PS1) 10Jgreen: add $gid=500 to rhodium [operations/puppet] - 10https://gerrit.wikimedia.org/r/96642 [21:47:09] Logged the message, Master [21:47:42] !log faidon synchronized wmf-config/CommonSettings.php 'revert CentralNotice CrossWiki Hiding' [21:47:57] Logged the message, Master [21:51:07] * AaronSchulz rarrrs at bug https://bugzilla.wikimedia.org/show_bug.cgi?id=57282 [21:51:15] those files keep coming back, same file [21:51:48] paravoid: I wonder if mw1208 has some segfault entries in the syslogs of interest? [21:52:05] sec, dealing with mobile outage right now [21:55:32] (03PS1) 10Se4598: correct default Echo help page [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96647 [21:56:20] ori-l: what was the name of that file? [21:56:27] heh [22:01:38] (03Abandoned) 10Jgreen: add $gid=500 to rhodium [operations/puppet] - 10https://gerrit.wikimedia.org/r/96642 (owner: 10Jgreen) [22:01:39] TimStarling: hello [22:01:49] hello [22:01:52] paravoid: comfortable with Echo deploying now? [22:01:54] RFC review meeting starting now [22:02:10] in #wikimedia-meetbot [22:02:18] * Elitre waiting for Echo on it.wp. [22:02:37] (03PS1) 10Ottomata: Setting up JournalNode on analytics1014 [operations/puppet] - 10https://gerrit.wikimedia.org/r/96649 [22:02:54] Elitre: yeah, hoping to get you it, just don't want to confuse things for para-void. 
[22:02:58] (03CR) 10Ottomata: [C: 032 V: 032] Setting up JournalNode on analytics1014 [operations/puppet] - 10https://gerrit.wikimedia.org/r/96649 (owner: 10Ottomata) [22:03:05] I don't foresee any issue, but I wanted to make sure [22:03:17] the site is not better yet, no [22:03:26] ugh [22:03:32] not to you, to the issue [22:03:39] ? [22:03:47] bsitu: Elitre what para-void is referring to: https://gdash.wikimedia.org/dashboards/reqerror/ [22:03:58] that bumpity purple line shouldn't be doing that [22:04:26] bumpity is the technical ops term for errors that express themselves like that [22:04:44] greg-g: thanks, I will wait till when it's ready [22:05:06] bsitu: luckily your window is long, but this issue has been going all morning (sf time) :/ [22:05:11] * greg-g crosses fingers [22:05:41] (03PS1) 10Jgreen: add $gid=500 to rhodium (again) [operations/puppet] - 10https://gerrit.wikimedia.org/r/96650 [22:06:01] greg-g: it's isolated to mobile, though; if your deployment isn't affecting mobile, you can proceed [22:06:29] paravoid: ok, bsitu's shouldn't affect mobile (echo isn't on mobile, right bsitu ?) [22:06:41] GARG GERRIT [22:06:57] greg-g: I think echo is on mobile web [22:06:58] is the jenkins/review stuff disabled or something? [22:07:13] greg-g: but I am only doing configuration change [22:07:17] Echo works on mobile as well, AFAIK.
[22:07:19] no echo code deploy [22:07:27] (03CR) 10Jgreen: [C: 032 V: 032] add $gid=500 to rhodium (again) [operations/puppet] - 10https://gerrit.wikimedia.org/r/96650 (owner: 10Jgreen) [22:07:39] bsitu: right, just enabling on more wikis, german and italian [22:08:18] bsitu: go forth, but please stick around a little bit (of course) in case we need a revert to diagnose any more issues [22:08:30] greg-g: okay [22:11:14] (03CR) 10Bsitu: [C: 032] correct default Echo help page [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96647 (owner: 10Se4598) [22:11:50] (03CR) 10Bsitu: [C: 032] Remove html tag from echo email footer address [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96392 (owner: 10Bsitu) [22:13:38] (03CR) 10Bsitu: [C: 032] Enable Echo and Thanks on dewiki and itwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96536 (owner: 10Bsitu) [22:14:05] (03PS1) 10Jgreen: add groups::wikidev to rhodium [operations/puppet] - 10https://gerrit.wikimedia.org/r/96653 [22:14:30] (03CR) 10Jgreen: [C: 032 V: 032] add groups::wikidev to rhodium [operations/puppet] - 10https://gerrit.wikimedia.org/r/96653 (owner: 10Jgreen) [22:16:29] (03CR) 10Bsitu: [V: 032] correct default Echo help page [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96647 (owner: 10Se4598) [22:17:08] (03CR) 10Bsitu: [V: 032] Remove html tag from echo email footer address [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96392 (owner: 10Bsitu) [22:17:29] (03CR) 10Bsitu: [V: 032] Enable Echo and Thanks on dewiki and itwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96536 (owner: 10Bsitu) [22:19:56] !log bsitu updated /a/common to {{Gerrit|Id9e3b9a03}}: correct default Echo help page [22:20:11] Logged the message, Master [22:20:24] (03PS1) 10Yurik: Mobile m. and zero. 
landing page redirect handling by ZERO [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96654 [22:21:23] paravoid: ^ [22:22:08] greg-g: what's the server status? can we do a quicky? [22:22:26] no. [22:22:53] paravoid ? servers are down? we want to deploy other zero extension stuff [22:23:07] yurik_: why? [22:23:11] !log bsitu synchronized wmf-config/InitialiseSettings.php 'Update email footer and help page' [22:23:19] because we couldn't due to 503s this morning :) [22:23:19] yurik_: and no, not right now, see ^^ benny is deploying something [22:23:24] Logged the message, Master [22:23:37] !log bsitu synchronized echowikis.dblist 'Enable Echo and Thanks on dewiki and itwiki' [22:23:40] greg-g: no rush - just wondering if we can do it sometime today [22:23:44] hopefully [22:23:45] yurik_: mobile deploys frozen until further notice [22:23:53] Logged the message, Master [22:24:01] paravoid: ok [22:28:05] !log bsitu synchronized echowikis.dblist 'Enable Echo and Thanks on dewiki and itwiki' [22:28:17] Logged the message, Master [22:29:07] PROBLEM - SSH on amslvs1 is CRITICAL: Server answer: [22:29:47] !log bsitu synchronized wmf-config/InitialiseSettings.php 'touch' [22:29:51] (03PS1) 10Faidon Liambotis: Switch mobile-lb to eqiad [operations/dns] - 10https://gerrit.wikimedia.org/r/96655 [22:29:57] Ryan_Lane: copied your comments re trebuchet into https://www.mediawiki.org/wiki/Parsoid/Packaging [22:30:03] Logged the message, Master [22:31:07] RECOVERY - SSH on amslvs1 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [22:31:29] heh. even the part about the really shitty init script :D [22:31:31] (03CR) 10Faidon Liambotis: [C: 032] Switch mobile-lb to eqiad [operations/dns] - 10https://gerrit.wikimedia.org/r/96655 (owner: 10Faidon Liambotis) [22:31:38] wtf is up with jenkins?
[22:32:01] (03CR) 10Faidon Liambotis: [V: 032] Switch mobile-lb to eqiad [operations/dns] - 10https://gerrit.wikimedia.org/r/96655 (owner: 10Faidon Liambotis) [22:55:19] yurik_: you can now go ahead provided greg-g gives you the go-ahead [22:55:44] awesome!!! [22:55:46] thanks [22:56:56] (03PS2) 10Yurik: Mobile m. and zero. landing page redirect handling by ZERO [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96654 [22:56:59] bblack: here? [22:57:00] yurik_: one second [22:57:21] cause of the outage: zero! [22:57:23] oh, benny is gone [22:57:29] paravoid: yeah [22:57:33] Nov 20 08:25:01 cp3012 CRON[8188]: (netmap) CMD (/usr/share/varnish/netmapper_update.sh "zero.json" "http://meta.wikimedia.org/w/api.php?action=zeroconfig&type=ips") [22:57:37] Nov 20 08:25:11 cp3012 frontend[18335]: Child (8339) said varnishd: tpp.c:63: __pthread_tpp_change_priority: Assertion `new_prio == -1 || (new_prio >= __sched_fifo_min_prio && new_prio <= __sched_fifo_max_prio)' failed. [22:57:41] every ten minutes, like clockwork [22:57:41] oh no, not again :) [22:57:54] looks like an underlying glibc/nptl bug [22:57:58] yurik_: just checking in with benny [22:58:17] http://sourceware.org/bugzilla/show_bug.cgi?id=3610 specifically [22:58:38] caused this: https://graphite.wikimedia.org/render/?title=HTTP%205xx%20Responses%20-1day&from=-1%20day&width=1024&height=500&until=now&areaMode=none&hideLegend=false&lineWidth=2&lineMode=staircase&target=color(cactiStyle(alias(reqstats.5xx,%225xx%20resp/min%22)),%22blue%22)&target=color(cactiStyle(alias(reqstats.500,%22500%20resp/min%22)),%22red%22) [22:59:13] took me a while to figure out why stats were resetting every ten minutes, duh [22:59:28] hmmm [22:59:34] bblack: happens on both mobile esams boxes [22:59:43] but not elsewhere? 
[22:59:46] strangely enough, not in eqiad [22:59:52] they did have a higher concurrency though [23:00:25] started at exactly 08:25:01 UTC [23:00:58] for cp3012 [23:00:59] and [23:01:04] Nov 20 18:06:44 cp3011 frontend[24289]: Child (26056) died signal=6 [23:01:07] for cp3011 [23:02:22] yurik_: so, paravoid gave you the go ahead, even though he said the outage was zero's fault, but, nothing else is blocking you right now, so I guess I go with mixed signals and let you go ahead :) [23:02:43] greg-g: I was half-joking, it's not exactly zero's fault :) [23:02:48] haha, ok [23:02:49] hehe :) [23:02:57] (03PS1) 10Jgreen: add a role::ocg::test class for testing config [operations/puppet] - 10https://gerrit.wikimedia.org/r/96658 [23:02:59] it will be in 15 min ;) [23:03:00] well that is an interesting end to the 503 saga, I now feel much better about having not found that right away :-P :-D [23:03:04] yurik_ was just complaining to me yesterday about not being responsive enough [23:03:11] apergos: you may now sleep ;) [23:03:16] hahaha [23:03:21] actually... now eating dinner :-D [23:03:28] and the irony of me dealing with this for hours... :) [23:03:32] paravoid: you are on top of it!!! (today :-P ) [23:03:44] yurik_: dude, 14 outages in 14 days [23:03:50] we are on a streak [23:03:53] yei! [23:04:00] lets not break it now, shall we? [23:04:16] we've already had the one for today, no more needed [23:04:20] let's not get greedy or anything [23:04:22] greg-g: remind me, what broke yesterday? I remember being away and the site being broken while someone was deploying something [23:04:33] csteipp: oauth is tomorrow right? [23:04:34] so we've had it be: the links, the caches, the lvs box... what's next [23:04:35] ugh, search was monday.... 
[23:04:35] (03CR) 10Jgreen: [C: 032 V: 031] add a role::ocg::test class for testing config [operations/puppet] - 10https://gerrit.wikimedia.org/r/96658 (owner: 10Jgreen) [23:04:39] no not that [23:04:56] call [23:04:59] paravoid: my rear on that 3610 bug is it's not a glibc/nptl bug, it was a programmer bug at the app layer [23:05:02] s/rear/read/ [23:05:03] (03PS1) 10Ori.livneh: Make rsyslog forward Apache error log to fluorine [operations/puppet] - 10https://gerrit.wikimedia.org/r/96659 [23:05:18] greg-g: vectorbeta? [23:05:38] we can go 15/15 if we play our cards right [23:05:52] bblack: well, glibc should never assert under your feet; but yes, some app layer code is triggering it [23:05:59] bblack: egrep '(died|netmapper)' syslog is interesting [23:06:10] * apergos glares at AaronSchulz [23:06:35] we should enable it but forget to add the tables to meta [23:06:43] eh, there are lots of ways you can misuse glibc interface to make it assert or crash. it's not like libc interfaces triple-check for everything and try to be graceful [23:06:44] so every prefs view everywhere would be a DB error [23:06:46] boom, 15/15 [23:07:01] the app layer code, in that case, is using a mutex or mutexattr without first initializing it [23:07:06] AaronSchulz: Yep [23:07:18] uninitialized vars foreva! [23:07:20] csteipp: do you have the conf as a patch somewhere? [23:07:34] (03CR) 10Ori.livneh: [C: 032] Make rsyslog forward Apache error log to fluorine [operations/puppet] - 10https://gerrit.wikimedia.org/r/96659 (owner: 10Ori.livneh) [23:08:07] bblack: so zero.json is updated every 5 minutes, varnish is crashing every 10 [23:08:18] sometimes even a minute after [23:08:22] I wonder if it's actually related or not [23:08:29] try disabling the cron? 
[23:09:01] well, it happened during high concurrency and now I've moved the requests elsewhere :) [23:09:12] I didn't even get the chance to get a gdb trace [23:09:21] anyone want to review my redirects.conf rewrite apart from the people who are already on the reviewer list? [23:09:35] TimStarling: I added myself on the list, so I will at some point [23:09:37] there's really nothing pthread-y going on with those updates. the thread that watches for them is already spawned, and it just sits in a sleep loop checking mtime periodically. if the file changes (which wouldn't be often), even then it's just rcu sync stuff, no pthread calls [23:09:44] TimStarling: awesome work btw :) [23:09:48] was that the one with its own DSL for making apache conf? [23:09:51] thanks [23:09:57] AaronSchulz: yes [23:10:16] although calling it a DSL is overdoing it a bit [23:10:18] I wasn't sure about a /* comment in there [23:10:23] it's just a configuration file really [23:10:24] yeah, I was just trolling [23:10:53] yeah, I should at least add a mathematical expression parser if I'm going to call it a DSL, right? [23:11:15] paravoid: but there's some non-zero probability I don't understand something about when or how vmod_init() is called and then the per_vcl_fini() hook as well, which could be interfering with varnishd's pthreads stuff somehow, since that does use a mutex (an initialized one, though!) [23:11:47] funnel» sep11.wikipedia.org» http://wayback.archive.org/web/20030315000000*/http://sep11.wikipedia.org/wiki/In_Memoriam [23:12:17] $line = preg_replace( '/#.*$/', '', $line ); [23:13:08] you like my comment tokenizer? [23:13:22] I guess comments have to be #, I thought I saw a // one but that was just some protocol relative url [23:13:44] but I still don't get that sep11 entry [23:13:53] I was planning on doing a PHP script with CDB, but I cut back the project in order to get it finished in under two days [23:14:27] what about it?
[23:14:36] the * is a literal *, it was in the original [23:15:13] does that URL even work? [23:15:20] (03PS1) 10Ori.livneh: Include role::applicationserver in role::applicationserver::webserver [operations/puppet] - 10https://gerrit.wikimedia.org/r/96660 [23:15:22] It's funky URL syntax for a half open search at archive.org [23:15:26] ah, I see [23:15:28] just a sucky UI [23:15:32] yes [23:15:52] you know why it is like that, right? [23:15:56] it is a bit sad [23:16:20] (03CR) 10Ori.livneh: [C: 032] Include role::applicationserver in role::applicationserver::webserver [operations/puppet] - 10https://gerrit.wikimedia.org/r/96660 (owner: 10Ori.livneh) [23:18:17] you know, we still have the sep11wiki database [23:18:18] TimStarling: maybe you can look at https://bugzilla.wikimedia.org/show_bug.cgi?id=57282 [23:18:19] bblack: and you're not at fault [23:18:28] bblack: the cronjob doesn't do anything if the md5sum is the same :) [23:18:36] it might be PHP segfaulting, causing __destruct() to never happen [23:18:57] paravoid: oh, true, I guess that means nothing happens for the vmod at all [23:19:00] * AaronSchulz can't see any of the logs that would mention such crashes [23:19:02] yes [23:19:13] I don't think it is right to have a link to archive.org [23:19:36] every ten minutes [23:19:39] it's also odd that the same tif file keeps piling up, maybe something crashes processing it [23:20:23] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 200,000 [23:20:27] * AaronSchulz wonders why that wiki was ever created, being so wildly inconsistent with everything else [23:20:43] maybe it is that exif integer overflow [23:20:50] we haven't patched that, have we? [23:20:57] TimStarling: no we haven't [23:20:59] do you read php internals?
[23:21:02] also of note is that that server is not even a scaler [23:21:11] it might be maybeUpgrade() endlessly firing [23:21:16] I pinged php internals about that bug, said that maybe someone should look at it [23:21:28] one reply: Rasmus saying "I think you just volunteered" [23:21:33] TimStarling: I was wondering if it's exploitable... [23:21:42] but I think you told me it's not? [23:21:45] (03PS1) 10Ori.livneh: Include role::applicationserver in role::applicationserver::configuration::php [operations/puppet] - 10https://gerrit.wikimedia.org/r/96662 [23:22:03] well, you could read bits of the process address space, but it's a fairly expensive way to do it [23:22:16] since you would have to upload a 100MB file for each offset [23:22:59] (One site is rumoured to have had varnish restarting every 10 minutes and *still* provide better service than their CMS system.) [23:23:04] https://commons.wikimedia.org/wiki/File:Zentralbibliothek_Z%C3%BCrich_-_Heinrich_Bullingers_Westerhemd_-_000012135.jpg [23:23:04] from the varnish docs [23:23:13] haha [23:23:18] well, yeah, sure [23:23:28] but it's crashing every 10 minutes and it provides worse service than our "CMS" [23:23:34] AaronSchulz: it's a tortilla hat! the label is wrong. [23:24:22] brace yourselves, zero deploying ... 23-3 [23:24:35] (03CR) 10Ori.livneh: [C: 032] Include role::applicationserver in role::applicationserver::configuration::php [operations/puppet] - 10https://gerrit.wikimedia.org/r/96662 (owner: 10Ori.livneh) [23:24:51] bblack: so, I think that after yurik_ is done, I'll put traffic back to esams and attach gdb, try to get a backtrace [23:25:06] find the mutex then see its initialization [23:25:12] do we have anyone I can delegate that PHP bug to?
[23:25:13] it's a very crude way of debugging an issue [23:25:15] paravoid: I did some grepping in the varnish source, but I haven't turned up any obvious related buggy constructs [23:25:28] the only use of mutex attrs is in a small spot in jemalloc and it's sane [23:25:29] TimStarling: very funny [23:25:32] I thought we had lots of C programmers now, and I am not really that keen to do it myself [23:25:43] bblack & paravoid, can you repro it in betalabs? because the same code runs there [23:25:46] TimStarling: I'm talking with one right now [23:25:52] yurik_: nope [23:26:42] TimStarling: (and I didn't mean you :) [23:27:08] soooo, mwscript broken on terbium eh? [23:27:29] !log restarting rsyslog on application servers for I31c76fdde. [23:27:39] stupid config file doesn't notify => the service [23:27:44] Logged the message, Master [23:28:53] bblack: did you see TimStarling's request above? :) [23:28:54] could not open input file: /a/common/multiversion/MWScript.php [23:29:27] yes, but I'm looking through backscroll trying to find some original reference [23:29:39] https://bugzilla.wikimedia.org/show_bug.cgi?id=55541 [23:29:51] yurik_: are you done with the deploy? [23:30:01] (for bblack) [23:30:03] apache2 error logs from app servers aggregated on fluorine:/a/mw-log/apache2.log [23:30:11] paravoid: still copying [23:30:26] so no one knows who broke terbium?
[23:30:26] !log yurik synchronized php-1.23wmf3/extensions/ZeroRatedMobileAccess/ [23:30:26] it used to be much faster for some reason [23:30:32] AaronSchulz: I'll look, hang on [23:30:42] Logged the message, Master [23:31:21] nice work ori-l [23:31:29] sounds useful [23:31:41] paravoid: i looked at one today and found all manner of alarming things [23:31:47] that had clearly been going on for a while [23:31:59] now we can ignore these issues in aggregate [23:32:05] heh [23:32:20] well you know, we'll look at it on the next outage [23:32:32] we fix a bunch of unrelated issues with each outage anyway [23:32:43] and by the current rate of outages, lots of things will get fixe [23:32:48] *fixed [23:33:10] you jest, but that's not inaccurate [23:33:31] AaronSchulz: what was /a/common earlier? symlink maybe? [23:34:37] a regular dir [23:34:41] paravoid: first portion completed, deploying v4, should be another few min [23:34:45] at least judging from stat on tin [23:34:57] AaronSchulz: did it recently work?
[23:35:16] it worked a few days ago, in fact yesterday afaik [23:35:29] * ori-l checks puppet.log [23:35:35] https://commons.wikimedia.org/wiki/File:Zentralbibliothek_Z%C3%BCrich_-_Heinrich_Bullingers_Westerhemd_-_000012135.tif 504 great [23:35:51] tortilla hat nooooooo [23:36:20] maybe it was a symlink [23:36:47] to common-local [23:37:03] the reason I ask is that I merged this: [23:37:39] !log puppet.log on terbium: could not set file on ensure: No such file or directory - /usr/local/apache/common/php/extensions/FlaggedRevs/maintenance/wikimedia-periodic-update.sh.puppettmp_8141 at /etc/puppet/manifests/misc/maintenance.pp:155 [23:37:50] (03PS1) 10Faidon Liambotis: Revert "Switch mobile-lb to eqiad" [operations/dns] - 10https://gerrit.wikimedia.org/r/96666 [23:37:54] Logged the message, Master [23:38:16] https://gerrit.wikimedia.org/r/#/c/65254/ [23:38:57] which makes the assumption that it's a directory in production; if it was a symlink on terbium, that would explain it [23:39:03] apergos: can I drag you into helping me with something, possibly tomorrow your morning? [23:39:10] tell me what it is [23:39:23] cp3013 & cp1034 are waiting to be installed [23:39:36] ok [23:39:43] uh [23:39:50] er [23:39:52] 3014 [23:39:54] ok [23:40:08] do they need any special treatment? 
[23:40:44] !log yurik synchronized php-1.23wmf4/extensions/ZeroRatedMobileAccess/ [23:40:45] I can handle the Varnish setup [23:40:55] but be careful because they're set up as Varnish backends already [23:41:00] Logged the message, Master [23:41:04] manifests/role/cache.pp:211 [23:41:14] so they'll immediately get traffic if Varnish runs [23:41:26] apergos: I don't see it in the puppet log, tho [23:41:32] might be wiser to remove them from cache.pp first and force-run puppet on 3011/3012 [23:41:33] paravoid: done [23:41:45] (03CR) 10Faidon Liambotis: [C: 032] Revert "Switch mobile-lb to eqiad" [operations/dns] - 10https://gerrit.wikimedia.org/r/96666 (owner: 10Faidon Liambotis) [23:41:50] sure [23:42:04] I will certainly not even think about doing this tonight [23:42:10] a symlink would make more sense [23:42:18] no, but I'll probably be dead sleeping tomorrow our morning :) [23:42:20] it looks like puppet is asserting an empty directory? [23:42:25] which is all I see on terbium [23:42:35] well guess who else will be :-D [23:42:47] however it will be on my early part of tomorrow's work day queue [23:43:01] they're set up as mobile boxes [23:43:12] might have helped with today's load, it's a bit insane that we have just two boxes [23:43:20] do we still have wikidata crons on terbium?
[23:43:22] I'm pretty sure if we lost one the site wouldn't work [23:43:26] I can't imagine that would be working now [23:43:54] yes, every time I look at that pool of two I wonder [23:44:18] one of the mgmt switches was dead when mark was setting them up, I think [23:44:22] but they're reachable now [23:44:47] actually [23:45:12] cp3013-4 - both in menu 'continue with no disk/login to iSCSI target' on mgmt console [23:45:19] these are on my 'what's going on with these boxes' list [23:45:44] so that's what I know about them at this moment [23:45:44] apergos: also, since we're out of EU peak hours, there's a fair chance varnish won't crash now and will tomorrow morning; monitor gdash's reqerror & oxygen 5xx and revert that DNS change above if they're at fault [23:45:54] yep [23:45:59] if I'm asleep, who knows [23:47:34] well I need to finish dinner, then wait one hour, then I can sleep [23:47:42] I will sleep til I wake, and then we shall see [23:48:58] apergos: I'm going to disable puppet for now and replace the /a/common with a symlink to /usr/local/apache/common [23:49:05] feel free [23:49:18] log please though so I/we remember [23:49:38] I'll point hashar at that too [23:52:08] info: /Stage[main]/Misc::Deployment::Vars/File[/a/common]: Recursively backing up to filebucket [23:52:10] notice: /Stage[main]/Misc::Deployment::Vars/File[/a/common]/ensure: ensure changed 'link' to 'directory' [23:52:17] indeed [23:52:26] apergos: yeah, that change definitely clobbered that link [23:52:28] I should check to see where else that might have happened [23:52:29] so I guess it was a symlink before [23:53:09] the easy (but not really better) fix is to remove the creation of that directory from the stanza, the problem though is that some places need it as a dir, some as a link, some not at all [23:53:23] I don't feel equipped to solve that at the moment [23:53:43] which is why I will likely pass the buck to the writer of the changeset :-P [23:55:16] replace
=> false [23:55:18] on the file resource [23:55:29] should make puppet respect the symlink if it exists [23:56:00] i'll patch it [23:56:45] irritating that it isn't recorded in the puppet log though [23:56:55] it is, i missed it [23:56:58] aaron pasted it above [23:57:13] oh, it is there [23:57:35] sorry, I thought that was from a local test [23:58:00] nope, from terbium