[00:00:01] what's wrong with forever? just point it at the main cluster once its own cluster is killed [00:00:20] (i.e. at the same IP as commons.wikimedia.org) [00:00:29] PROBLEM - Check status of defined EventLogging jobs on vanadium is CRITICAL: CRITICAL: Stopped EventLogging jobs: consumer/vanadium consumer/server-side-events-log consumer/mysql-db1047 consumer/client-side-events-log consumer/all-events-log multiplexer/all-events processor/server-side-events processor/client-side-events forwarder/8422 forwarder/8421 [00:00:37] jeremyb: I don't see a need for forever [00:00:49] gwicke: but do you see a downside? [00:01:06] sure: complexity, inefficiency etc [00:01:18] for no real gain [00:02:29] RECOVERY - Check status of defined EventLogging jobs on vanadium is OK: OK: All defined EventLogging jobs are runnning. [00:03:23] ok, well here's yet another point: if we go with the current hostname then any future redirection will be more complex than if we pick a more sensible hostname now. we can decide later (in a month or whatever) whether to actually do the redirection [00:05:45] jeremyb: afaik the current hostname can be mapped to whatever we want as well [00:07:29] PROBLEM - Check status of defined EventLogging jobs on vanadium is CRITICAL: CRITICAL: Stopped EventLogging jobs: consumer/vanadium [00:12:29] RECOVERY - Check status of defined EventLogging jobs on vanadium is OK: OK: All defined EventLogging jobs are runnning. [00:16:50] (03PS2) 10Dzahn: tabbing, quoting & aligning [operations/puppet] - 10https://gerrit.wikimedia.org/r/96354 [00:17:37] gwicke: could be AFAIK too. you were talking about complexity/efficiency. if it were redirected with current hostname it would stick out vs. other redirects/hostnames. for reference, my slightly out of date copy of the wikimedia.org dns zone has *no* CNAMEs at all for eqiad. so i guess no special-cased redirects for eqiad. 
[00:17:44] btw, was going to mention earlier: its somewhat mitigated by the fact that you say "alpha" at the top of the page. (actually, i would personally make that stand out more. the whole line is bold so the alpha doesn't stand out so much) OTOH, i don't really buy the we're in contact with everyone bit. you advertised it on a public list. i bet that list has over 200 subscribers. (wow, good guess, I checked and there's 225 subscribers) people ar [00:20:01] i wonder how gwicke ended up sending mail with 2 URLs both at mediawiki.org and one is HTTP and one is HTTPS. [00:21:24] https://www.mediawiki.org/wiki/Parsoid/Todo says no longer in use. http://parsoid-lb.eqiad.wikimedia.org/ says to report issues at :mw:Talk:Parsoid/Todo. [00:22:39] jeremyb: one of these days we'll get around to either remove that entry page or update it / make it prettier [00:24:46] ori-l: Let me know when it's safe to deploy, and I'm not going to get in your way [00:28:12] anyway, i guess parsoid.wikimedia.org is my default but maybe someone has a better idea for a name [00:28:22] !log ori synchronized php-1.23wmf3/extensions/MobileFrontend 'Updating MobileFrontend for I3efc1fa64' [00:28:37] Logged the message, Master [00:28:52] (03PS1) 10Dzahn: role and module structure for ishmael [operations/puppet] - 10https://gerrit.wikimedia.org/r/96403 [00:29:54] (03CR) 10Dzahn: "yea, really calling it role::ishmael" [operations/puppet] - 10https://gerrit.wikimedia.org/r/96403 (owner: 10Dzahn) [00:33:08] !log ori synchronized php-1.23wmf3/resources/startup.js 'touch' [00:33:19] Logged the message, Master [00:37:21] (03PS3) 10Dzahn: retab, quoting, linting of ishmael.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/96362 [00:37:22] (03PS2) 10Dzahn: role and module structure for ishmael [operations/puppet] - 10https://gerrit.wikimedia.org/r/96403 [00:38:46] fatal: Unable to read current working directory .. 
duh, yea, "git review" from within the directory i just created in this patch:) keep doing it [00:43:48] !log csteipp synchronized php-1.23wmf4/extensions/OAuth 'update OAuth to master for last blocker fix' [00:44:03] Logged the message, Master [01:26:28] (03CR) 10Dzahn: "works:" [operations/puppet] - 10https://gerrit.wikimedia.org/r/96267 (owner: 10Hashar) [01:27:38] (03CR) 10Dzahn: "also re: adding all networks to $INTERNAL rather than just 10.0.0.0/8, see an older attempt in https://gerrit.wikimedia.org/r/#/c/88755/ " [operations/puppet] - 10https://gerrit.wikimedia.org/r/96267 (owner: 10Hashar) [01:33:10] !log ori synchronized php-1.23wmf3/extensions/WikimediaEvents 'Update WikimediaEvents to I5b8cfe592' [01:33:24] Logged the message, Master [01:35:41] !log ori synchronized php-1.23wmf3/resources/startup.js 'touch' [01:35:55] Logged the message, Master [01:37:24] !log ori synchronized php-1.23wmf4/extensions/WikimediaEvents 'Update WikimediaEvents to I5b8cfe592' [01:37:37] Logged the message, Master [01:38:11] !log rebooting ms-be1001, i/o stuck kernel bug [01:38:26] Logged the message, Master [01:40:49] PROBLEM - Host ms-be1001 is DOWN: PING CRITICAL - Packet loss = 100% [01:41:40] RECOVERY - DPKG on ms-be1001 is OK: All packages OK [01:41:40] RECOVERY - swift-object-updater on ms-be1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [01:41:40] RECOVERY - swift-account-auditor on ms-be1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [01:41:40] RECOVERY - swift-account-reaper on ms-be1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [01:41:40] RECOVERY - swift-object-server on ms-be1001 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [01:41:40] RECOVERY - swift-account-replicator on ms-be1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator 
[01:41:40] RECOVERY - swift-container-server on ms-be1001 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [01:41:41] RECOVERY - swift-account-server on ms-be1001 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [01:41:41] RECOVERY - RAID on ms-be1001 is OK: OK: optimal, 14 logical, 14 physical [01:41:49] RECOVERY - swift-container-updater on ms-be1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [01:41:49] RECOVERY - swift-container-replicator on ms-be1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [01:41:49] RECOVERY - Host ms-be1001 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [01:41:59] RECOVERY - swift-object-auditor on ms-be1001 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [01:41:59] RECOVERY - swift-container-auditor on ms-be1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [01:42:10] RECOVERY - Disk space on ms-be1001 is OK: DISK OK [01:42:29] RECOVERY - puppet disabled on ms-be1001 is OK: OK [01:42:29] RECOVERY - swift-object-replicator on ms-be1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [01:45:11] (03PS1) 10Dzahn: move dsh to module [operations/puppet] - 10https://gerrit.wikimedia.org/r/96413 [02:02:19] (03CR) 10Faidon Liambotis: "I don't mind much where that file would be, as long as it'd be a separate database. 
But just to be clear: the separate database part & Tru" [operations/puppet] - 10https://gerrit.wikimedia.org/r/88261 (owner: 10Dr0ptp4kt) [02:15:35] !log LocalisationUpdate completed (1.23wmf4) at Wed Nov 20 02:15:34 UTC 2013 [02:15:51] Logged the message, Master [02:29:31] (03PS1) 10Dzahn: role classes for download servers [operations/puppet] - 10https://gerrit.wikimedia.org/r/96415 [02:31:27] !log LocalisationUpdate completed (1.23wmf3) at Wed Nov 20 02:31:26 UTC 2013 [02:31:42] Logged the message, Master [02:31:57] (03CR) 10Dzahn: "follow-up in https://gerrit.wikimedia.org/r/#/c/96415/" [operations/puppet] - 10https://gerrit.wikimedia.org/r/94408 (owner: 10Dzahn) [02:32:16] (03CR) 10Dzahn: "follow-up to https://gerrit.wikimedia.org/r/#/c/94408/" [operations/puppet] - 10https://gerrit.wikimedia.org/r/96415 (owner: 10Dzahn) [02:36:03] (03Abandoned) 10Dzahn: download server module and cleanup - WIP [operations/puppet] - 10https://gerrit.wikimedia.org/r/94408 (owner: 10Dzahn) [03:17:15] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed Nov 20 03:17:15 UTC 2013 [03:17:29] Logged the message, Master [03:42:33] (03CR) 10Amire80: [C: 04-1] "This will restore the RTL problem, which the HTML tag tried to fix. As I noted in the bug report, removing the parentheses from USA will r" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96392 (owner: 10Bsitu) [04:09:30] (03PS1) 10Dzahn: let bastion hosts have base::firewall [operations/puppet] - 10https://gerrit.wikimedia.org/r/96424 [04:48:41] PROBLEM - Disk space on mw1197 is CRITICAL: DISK CRITICAL - free space: /tmp 526 MB (2% inode=92%): [05:01:53] (03PS1) 10Springle: db74 to S6 pmtpa master. pull db50 for decom [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96428 [05:02:55] (03PS2) 10Springle: db74 to S6 pmtpa master. pull db50 for decom [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96428 [05:03:17] (03CR) 10Springle: [C: 032] db74 to S6 pmtpa master. 
pull db50 for decom [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96428 (owner: 10Springle)
[05:04:09] !log springle synchronized wmf-config/db-pmtpa.php
[05:04:22] Logged the message, Master
[05:20:18] PROBLEM - Disk space on mw1194 is CRITICAL: DISK CRITICAL - free space: /tmp 529 MB (2% inode=92%):
[05:21:57] PROBLEM - Disk space on mw1207 is CRITICAL: DISK CRITICAL - free space: /tmp 539 MB (3% inode=92%):
[05:47:56] I'll check out the disk space
[05:48:24] good lord, are there always that many .png files in /tmp, or is that new?
[05:48:49] "this many" == 11,817 on mw1194
[05:48:50] timeline?
[05:49:00] there were a bunch that were never cleaned up
[05:49:10] due to some bug
[05:49:45] I opened one at random, looks math-y
[05:50:29] there are timeline ones too
[05:51:39] 607 timeline-*, 9,748 non-timeline, just 32 hex chars
[05:52:56] yeah, definitely math.
[05:53:38] maths has been timing out a lot lately (apparently)
[05:55:32] One day we'll just use MathJax or something.
[05:55:33] !log /tmp on Apaches filling up with math .pngs; moving some of the oldest away as a stopgap
[05:55:49] Logged the message, Master
[05:55:50] ori-l: are you doing that on all boxes?
[05:56:07] eek
[05:56:13] nah, just the critical ones
[05:56:51] I suppose I should have !logged that; I'll amend later.
[05:59:03] this isn't a new problem; it just reached a watershed
[05:59:07] those 9fd4e54cf03dbeb69ac6273f60d24e1b.png type paths are not coming from TempFSFile
[05:59:16] * Aaron|home wonders
[05:59:26] the files go back to february
[06:04:51] what group of servers are these?
[06:04:59] I know we clear out on the scalers
[06:05:52] app servers. hm
[06:06:00] (03PS6) 10TTO: Clean up wgSiteName in InitialiseSettings [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/86418
[06:06:09] Aaron|home: are the timeline ones definitively safe to delete?
[06:09:05] the old ones yes
[06:09:16] same with math
[06:09:25] which are the sha1.png ones
[06:09:57] well, md5
[06:10:56] how old is old, older than a day let's say?
[06:11:36] for sure
[06:11:53] the files should live for the duration of web requests
[06:12:05] what are these: localcopy_6700be3f23ee-1.tif ?
[06:12:36] copies of original files
[06:12:49] they can be pruned likewise
[06:12:52] great
[06:13:17] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000
[06:13:17] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 200,000
[06:13:46] dsh "${MW_DSH_ARGS[@]}" -- " find /tmp -maxdepth 1 -iname '*.png' -mtime +200 -exec /bin/rm {} \; "
[06:14:03] {{done}}
[06:15:34] oh, older than a day
[06:15:44] i was conservative :P
[06:16:18] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (225121)
[06:16:18] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (224856)
[06:16:30] I mean really older than an hour is prolly fine
[06:16:42] what web request lasts an hour :-P
[06:17:02] don't make Elsie answer that
[06:17:23] a web request lasting an hour is a web request that needs to be shot
[06:20:13] maybe I'll do that (the tifs) on all the mw hosts
[06:22:03] yesh
[06:22:14] yeesh, actually, is what I meant
[06:23:09] the tifs are 300+ megs
[06:24:01] How large is /tmp?
[06:24:34] did them
[06:24:37] RECOVERY - Disk space on mw1197 is OK: DISK OK
[06:24:41] salt ftw
[06:24:57] RECOVERY - Disk space on mw1207 is OK: DISK OK
[06:25:08] salt 'mw*' cmd.run 'find /tmp -name \*tif -mmin +60 -exec rm {} \; ' from salt master
[06:25:17] 19G on random apache
[06:25:18] RECOVERY - Disk space on mw1194 is OK: DISK OK
[06:25:32] do we want the pngs too? I guess they are cruft
[06:26:07] well,
[06:26:13] should we set up a cron job?
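The ad-hoc `dsh`/`salt cmd.run` one-liners in this exchange can be folded into a single helper. A minimal sketch, illustrative only: the function name, the 60-minute default, and the pattern list are assumptions mirroring the cleanup discussed here, not any deployed script.

```shell
#!/bin/sh
# Illustrative sketch (not a deployed script): prune stale temp files
# matching the leak patterns discussed in the log (math/timeline pngs,
# localcopy tifs, map/err files), generalizing the one-liners above.
# Usage: prune_tmp <dir> <max-age-minutes>
prune_tmp() {
    dir=$1
    max_age_min=$2
    # -maxdepth 1 stays out of subdirectories such as lost+found;
    # each pattern matches one known producer of stale files.
    for pat in '*.png' '*.tif' '*.map' '*.err'; do
        find "$dir" -maxdepth 1 -type f -iname "$pat" \
            -mmin "+$max_age_min" -exec rm -f {} +
    done
}
```

Taking a directory argument instead of hard-coding `/tmp` also makes the helper safe to exercise in a sandbox before pointing it at a live host.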
[06:26:17] oh sure
[06:26:23] this can't be the first time in the history of Wikimedia that /tmp has filled up
[06:26:23] this is just 'clean em out for now'
[06:26:32] well the scalers like I say have a cron already
[06:26:50] maybe steal from that
[06:26:55] * ori-l looks
[06:27:09] pngs gone
[06:28:20] map and err also pruned on all mws
[06:28:21] manifests/imagescaler.pp:9: cron { removetmpfiles:
[06:29:35] what are all these mw-cache-1.22wmf8 etc in here
[06:29:53] it's where reedy hides 0-day warez
[06:29:57] we have them as far back as feb
[06:30:09] probably localization cache?
[06:30:21] they are plenty big when you add up 20 of them
[06:30:38] at 50m a pop...
[06:31:45] so I removed map, err, png, tif which seems to do us pretty well on these
[06:31:46] I guess the way to enforce some discipline on the usage of /tmp is to have a blanket policy of deleting anything that hasn't been modified in $DAYS
[06:31:53] 7 days seems pretty generous
[06:32:05] is there anything that *shouldn't* be deleted after seven days?
[06:32:33] well *cough* lost+found of course
[06:32:40] since we're going to be deleting directories
[06:33:33] anything still leaving floods of files should have bug reports
[06:36:15] blargh
[06:36:16] the tifs are the worst offenders for space, followed closely by the mw caches
[06:38:02] mw-caches are configuration caches, created in CommonSettings.php
[06:38:06] i'll file a bug for that
[06:38:55] is it bad UNIX manners to not clean up after yourself in /tmp?
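The blanket "delete anything not modified in $DAYS" policy floated here, with the lost+found exemption, could look roughly like the following. This is a sketch under those assumptions: `sweep_tmp` is a hypothetical name, and the scalers' actual cleanup is the `removetmpfiles` cron defined in manifests/imagescaler.pp, not this.

```shell
#!/bin/sh
# Sketch of the blanket policy discussed above: delete anything in a
# tmp directory (files *or* directories, e.g. the mw-cache-* dirs)
# untouched for N days, leaving lost+found alone. Illustrative only.
# Usage: sweep_tmp <dir> <days>
sweep_tmp() {
    dir=$1
    days=$2
    # -mindepth 1 protects the directory itself; -prune skips
    # lost+found entirely so it is neither descended nor removed.
    find "$dir" -mindepth 1 -maxdepth 1 \
        -name 'lost+found' -prune -o \
        -mtime "+$days" -exec rm -rf {} +
}
```

Wired up from a daily cron with `sweep_tmp /tmp 7`, this would enforce the 7-day suggestion from the conversation; anything that still floods /tmp after that would surface as a bug report rather than a full filesystem.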
[06:39:50] (03PS1) 10Tim Starling: Generate redirects.conf [operations/apache-config] - 10https://gerrit.wikimedia.org/r/96438
[06:42:38] yes, it is
[06:42:59] always
[06:43:27] otherwise you are relying on someone else writing a cron job, or regular reboots, neither of which is nice
[06:44:02] * Aaron|home likes how "naive" has the diacritic
[06:44:31] (03CR) 10Matanya: [C: 031] etherpad - tabbing, quoting & aligning [operations/puppet] - 10https://gerrit.wikimedia.org/r/96354 (owner: 10Dzahn)
[06:44:40] (03CR) 10Dr0ptp4kt: ""I don't mind much where that file would be, as long as it'd be a separate database. But just to be clear: the separate database part & Tr" [operations/puppet] - 10https://gerrit.wikimedia.org/r/88261 (owner: 10Dr0ptp4kt)
[06:45:12] heh
[06:46:13] I guess I should log the cleanup, woops
[06:47:01] !log removed on mw* hosts from /tmp all *png/map/err/tif older than an hour, as some tmpfs were full
[06:47:15] Logged the message, Master
[06:47:33] (03CR) 10Matanya: [C: 031] role classes for download servers [operations/puppet] - 10https://gerrit.wikimedia.org/r/96415 (owner: 10Dzahn)
[06:48:36] aw, Tim's patch is nifty
[06:48:52] what did we say the tifs come from? I mean, a local copy of an original file but what produces it?
[06:50:15] Aaron|home: ?
[06:50:18] includes/filebackend/SwiftFileBackend.php ?
[06:50:28] really?
[06:50:30] 1169: $tmpFile = TempFSFile::factory( 'localcopy_', $ext );
[06:50:39] blah
[06:50:58] you filing that or shall I?
[06:51:06] (03CR) 10Dr0ptp4kt: "The 50,000-60,000 cache hit/200 objects to which I refer are those under /wiki/ on mobile Wikipedia (mdot & zerodot for W0, mdot for non-W" [operations/puppet] - 10https://gerrit.wikimedia.org/r/88261 (owner: 10Dr0ptp4kt)
[06:51:41] go ahead if you're up for it
[06:51:44] doing
[06:51:47] thank you
[06:52:00] (03CR) 10Matanya: role and module structure for ishmael (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/96403 (owner: 10Dzahn)
[06:52:18] apergos: how many tiffs were there?
[06:52:27] ori-l: are you working on a logstash module?
[06:52:37] matanya: nope
[06:52:46] ugh, dunno now, cause I cleared all the old ones out
[06:52:47] you would have legions of adoring fans if you wrote one tho
[06:52:48] could be anything that calls getLocalReference()
[06:53:07] ori-l: are you sure no one is about to? :)
[06:53:15] I see two in there that are less than an hour though (on the one host I'm camped on)
[06:53:16] matanya: bd808 might; I'd ping him
[06:53:54] ori-l: do we have an elasticsearch or redis somewhere already installed?
[06:53:57] could be fatal errors too, which mean any tmp files get left around
[06:54:20] well it would be easy to look for fatals that match up with these two since they are recent;
[06:54:33] mw1208
[06:54:40] -rw-r--r-- 1 apache apache 403688452 Nov 20 05:39 localcopy_a711e74b7da8-1.tif
[06:54:41] -rw-r--r-- 1 apache apache 403688452 Nov 20 05:38 localcopy_acdd87adf505-1.tif
[06:54:44] if that helps
[06:55:42] matanya: plenty o' both
[06:55:49] elasticsearch powers cirrussearch
[06:56:26] so I can rely on them. that is god
[06:56:34] *good, less typing :)
[06:57:36] Aaron|home: mw1194 pre- "-mmin +60" purge: https://dpaste.de/jfCn/raw/
[06:57:43] ori-l: issue with protocol relative redirects that's annoying to fix? rewrite redirects system.
[06:57:57] seem pretty regularly produced, as I look at all the hosts we're getting new ones
[06:58:03] greg-g: heh
[06:58:57] https://bugzilla.wikimedia.org/show_bug.cgi?id=57282
[07:01:48] no fatal corresponding to mw1138 -rw-r--r-- 1 apache apache 403688452 Nov 20 05:57 /tmp/localcopy_72e3d84d3a36-1.tif
[07:02:40] nothing in exception log either
[07:03:01] exceptions wouldn't matter
[07:04:33] tried swift-backend log, nothing there, outa places to look
[07:04:38] can I leave it in your hands?
[07:05:21] sure, i'll poke
[07:05:41] thanks (gonna get to my dailies... downed hosts, broken puppet, etc)
[07:11:35] i converted one to jpg and scp'd it over so i can open it, just out of curiosity
[07:11:56] it looks like a hat made out of tortillas
[07:12:01] :-D
[07:13:01] high calorie fashion
[07:14:22] https://commons.wikimedia.org/wiki/File:Zentralbibliothek_Z%C3%BCrich_-_Heinrich_Bullingers_Westerhemd_-_000012135.jpg
[07:14:41] you have to admit my description was pretty accurate
[07:16:31] apergos: doing <<$be->getLocalReference( array( 'src' => $path ) );>> and exiting work fine...I have the file stat() and path dumped out
[07:16:41] when I leave eval.php, ls -l can't find any file
[07:17:13] I don't think it's SwiftFileBackend
[07:17:21] hrm
[07:17:27] maybe it's some circular references with TempFSFile objects
[07:17:41] ok, maybe I jumped the gun in pointing the finger
[07:19:37] apergos: we have zend.enable_gc on right?
[07:19:51] I have no idea
[07:20:14] gc_enabled() returns true
[07:20:15] (03CR) 10Matanya: "Although dsh_groups seems to be no used, please push the change dsh_groups --> dsh::groups for the sake of completeness." [operations/puppet] - 10https://gerrit.wikimedia.org/r/96413 (owner: 10Dzahn)
[07:21:13] ok
[07:24:40] * Aaron|home wonders why it is just tifs
[07:25:21] 32287_5bef8ef1d55235f6957bd1d449af4eea.dvi
[07:25:23] wtf is that for?
[07:25:36] man this stuff needs prefixes
[07:25:42] I was just going to say
[07:30:19] i deployed and then accidentally un-deployed a two-char fix to a js file earlier
[07:30:33] it's screwing up a live data collection job so i'm going to re-sync it
[07:31:02] there's the pngs, the map and err files, but they are not the big spenders so I didn't list them
[07:31:14] and they aren't local_ * something either
[07:36:25] apergos: so 1138 is not even a scaler
[07:36:50] other servers might get local copies, though it's rarer
[07:37:39] 1142, 1143, 1144 some others that have them
[07:38:51] !log ori synchronized php-1.23wmf4/extensions/WikimediaEvents/modules/ext.wikimediaEvents.moduleStorage.js 'Re-syncing I51d2d6495'
[07:39:07] Logged the message, Master
[07:40:36] !log ori synchronized php-1.23wmf3/extensions/WikimediaEvents/modules/ext.wikimediaEvents.moduleStorage.js 'Re-syncing I51d2d6495'
[07:40:51] Logged the message, Master
[07:41:12] done, sorry for unscheduled sync.
[07:44:19] apergos: no unlink() errors in apache.log
[07:44:46] hmmmm
[07:44:48] gah, wfSuppressWarnings() is used
[07:44:51] so you wouldn't know
[07:44:52] ahhhh
[07:44:54] :-D
[07:50:00] (03CR) 10Akosiaris: "Patch LGTM but let's not merge this before we change the default policy to DROP. It is in" [operations/puppet] - 10https://gerrit.wikimedia.org/r/96424 (owner: 10Dzahn)
[07:51:42] !log rebooting sq37, we had /dev//sdc gone and reappeared as /dev/sdc killing back-end squid
[07:51:58] Logged the message, Master
[07:54:39] that might have cleared it
[08:06:13] (03CR) 10Akosiaris: [C: 04-1] "Very minor nitpicks. Feel free to merge after fixing them."
(032 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/96354 (owner: 10Dzahn) [08:11:21] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [08:11:21] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 200,000 [08:21:20] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (200635) [08:21:21] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (200181) [08:22:21] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [08:22:22] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 200,000 [08:30:21] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (204724) [08:30:21] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (204717) [08:32:21] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 200,000 [08:33:20] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [08:33:47] (03CR) 10Akosiaris: [C: 04-2] "This will be solved better when we change the default policy to DROP" [operations/puppet] - 10https://gerrit.wikimedia.org/r/96040 (owner: 10Hashar) [08:46:21] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (206122) [08:46:21] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (206121) [08:52:22] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [08:53:20] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 200,000 [08:57:42] hello [09:01:07] akosiaris: 
hello :-] [09:01:58] hashar: hello [09:02:28] that week is crazy [09:03:19] yes [09:04:14] I am super happy we managed to get ferm applied for contint [09:04:18] that is a huge improvement [09:04:48] I am going to upgrade Zuul , could use a merge of https://gerrit.wikimedia.org/r/#/c/93457/ [09:04:58] which configure Zuul for gearman [09:05:14] !log stopping Zuul on gallium for upgrading purposes. [09:05:36] Logged the message, Master [09:05:51] ok merging now [09:06:06] (03PS5) 10Akosiaris: zuul: configuration for gearman [operations/puppet] - 10https://gerrit.wikimedia.org/r/93457 (owner: 10Hashar) [09:07:00] huh you stopped zuul .... i need to give verified +2 [09:07:08] I forgot that for a sec [09:07:25] (03CR) 10Akosiaris: [C: 032 V: 032] zuul: configuration for gearman [operations/puppet] - 10https://gerrit.wikimedia.org/r/93457 (owner: 10Hashar) [09:07:40] (03PS2) 10Akosiaris: zuul: refer to puppet variables with a @ [operations/puppet] - 10https://gerrit.wikimedia.org/r/95359 (owner: 10Hashar) [09:07:58] PROBLEM - zuul_service_running on gallium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/local/bin/zuul-server [09:08:22] (03CR) 10Akosiaris: [C: 032 V: 032] zuul: refer to puppet variables with a @ [operations/puppet] - 10https://gerrit.wikimedia.org/r/95359 (owner: 10Hashar) [09:08:57] hashar: done [09:09:02] thanks! [09:11:21] akosiaris: I think that is all what is needed for now :] [09:15:05] !log Zuul: bumped source code in integration/zuul.git to Gearman based version: 1e3adfd...6241272 labs -> master [09:15:20] Logged the message, Master [09:16:10] (03CR) 10Akosiaris: "See comments inline." 
(031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/95363 (owner: 10QChris) [09:16:58] (03PS1) 10ArielGlenn: remove virt1 from dhcp, decommed rt #5645 [operations/puppet] - 10https://gerrit.wikimedia.org/r/96450 [09:19:48] having a list of steps to do is really helpful https://www.mediawiki.org/wiki/Continuous_integration/Zuul/gearman_upgrade#Upgrading :D [09:19:54] I don't even have to think about the upgrade [09:20:23] bah [09:20:36] !log gallium temp install of python-pip to be able to ugprade Zuul [09:20:51] Logged the message, Master [09:21:08] ah no jenkins right now eh? [09:21:15] yeah upgrading it right now [09:21:16] sorry :( [09:21:25] should I wait? it's not a rush [09:22:57] should be finished soon ™ [09:23:06] ok [09:27:37] !log removing all old versions of zuul: rm -fR /usr/local/lib/python2.7/dist-packages/zuul* [09:27:53] Logged the message, Master [09:29:29] snif ? /usr/local/ ???? [09:29:44] I will pretend I did not see that [09:30:14] yeah that is installed via pip / setup.py [09:30:16] * akosiaris will salt '*' cmd.run 'rm -rf /usr/local/*' one day [09:30:27] though with null http/https proxies [09:30:38] stop stop stop I do not want to know [09:30:48] next step is to have it packaged [09:30:51] I want to be ignorant please!!!! [09:30:57] hey [09:31:00] sure :-] [09:31:05] now that is what is like... packages [09:31:18] things in /usr and not /usr/local.... dpkg not pip :-) [09:32:58] PROBLEM - SSH on amslvs1 is CRITICAL: Server answer: [09:33:55] akosiaris: do we use hiera? [09:33:57] !log restarted Zuul with Gearman version [09:33:58] RECOVERY - SSH on amslvs1 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [09:34:11] Logged the message, Master [09:34:41] matanya: nope [09:34:43] !log enabling Gearman in Jenkins, making it register with Zuul new version [09:34:59] Logged the message, Master [09:35:21] thanks akosiaris. 
when you give up, do : salt '*' cmd.run 'rm -rf / [09:35:28] ' [09:35:52] we might need to take his access away on palladium and sockpuppet (saltmasters) :-P [09:35:59] but there's always dsh.. dang :-D [09:36:15] that will be a cron script triggered to run 1 year after I leave [09:36:27] just to see if you will catch it in time :P [09:36:30] and odds are no one will find it before then ;-D [09:36:34] yep [09:36:35] lol [09:37:13] (03Restored) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [operations/puppet] - 10https://gerrit.wikimedia.org/r/89002 (owner: 10Hashar) [09:37:19] (03PS5) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [operations/puppet] - 10https://gerrit.wikimedia.org/r/89002 [09:38:09] !log Zuul upgraded with Gearman support to trigger build in Jenkins. Will monitor for the rest of the morning. [09:38:16] akosiaris: success :-]  THANK YOU !! [09:38:23] Logged the message, Master [09:38:26] so since the salt master is necessary for deployments and other things, I'd like to have some other host than palladium be a second saltmaster (and get off of sockpuppet completely )... where if the host gets slow due to puppetmaster we don't also have salt issues [09:38:43] hashar: wow you are fast [09:38:47] nice !!! 
[09:39:21] apergos: seems reasonable
[09:39:34] since it allows for multimaster setup I am happy with it
[09:39:38] yep
[09:39:49] it's slightly annoying to have to accept keys on both masters but it's not so often
[09:40:09] hashar: potential problem: https://integration.wikimedia.org/ci/job/mwext-VisualEditor-doc-test/6203/console
[09:40:10] I need a good candidate host, obviously salt doesn't need to be the only thing on it
[09:40:20] hashar: It is executing on deployment-parsoid2
[09:40:58] akosiaris: I did the upgrade countless times on labs
[09:40:58] apergos: well there is a patch upstream to have salt use puppet's CA infra
[09:41:03] Krinkle: aarghh
[09:41:10] https://integration.wikimedia.org/ci/job/mwext-VisualEditor-doc-test/6205/console
[09:41:12] deployment-bastion
[09:41:15] ryan had opened it
[09:41:19] so have it be also on strontium?
[09:41:41] of course then if the puppet backends are overloaded our saltmasters go too :-D
[09:41:46] apergos: https://github.com/saltstack/salt/issues/5752
[09:42:08] there you go... if this is fixed we will no longer have to maintain two CAs
[09:42:30] that would be really nice
[09:45:27] (03PS2) 10Akosiaris: Change default ferm policy to DROP [operations/puppet] - 10https://gerrit.wikimedia.org/r/96265
[09:45:57] Krinkle: forgot to restart jenkins ... damn me
[09:46:04] !log restarting Jenkins to update gearman plugin
[09:46:19] Logged the message, Master
[09:48:29] what about carbon as the other multimaster?
[09:48:56] currently: backup::client, misc::install-server::tftp-server
[09:49:12] (03CR) 10Akosiaris: "Moved main-input-default-drop.conf file in-module and using it now, however I am puzzled by" [operations/puppet] - 10https://gerrit.wikimedia.org/r/96265 (owner: 10Akosiaris)
[09:49:19] ugh public ip nm
[09:50:20] also, whatever 'client-side' is in ganglia misc eqiad, it should probably be gone...
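For reference, the multimaster setup being discussed here is driven from the minion side. A sketch of what /etc/salt/minion could carry, assuming hypothetical host names for the second master; each listed master must still separately accept the minion's key, which is the mild annoyance mentioned above.

```yaml
# /etc/salt/minion -- illustrative fragment only; the second host name
# is a placeholder, not an actual Wikimedia saltmaster.
# With a list of masters the minion connects to all of them, so a
# command issued from either master reaches it, and losing one
# master (e.g. an overloaded puppetmaster host) does not take salt down.
master:
  - palladium.eqiad.wmnet
  - second-saltmaster.example.wmnet
```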
[09:55:26] sigh no good candidates, need it to be a roots only host w/o public ip and without critical services like eg dns
[09:58:17] !log shutting down Zuul and reverting upgrade :-(
[09:58:26] ouch
[09:58:29] wtf?
[09:58:32] oh dear
[09:58:33] Logged the message, Master
[09:59:15] I forgot to properly test out labels to tie jobs on specific node
[09:59:40] turns out the default behavior is to run jobs on any slave available, even if the slave is configured to only run jobs that are explicitly assigned to it
[09:59:42] upstream bug :(
[10:02:01] !log reverting Zuul to last known version: wmf-deploy-20131023 + 6241272...1e3adfd wmf-deploy-20131023 -> master (forced update)
[10:02:16] Logged the message, Master
[10:02:38] :-(
[10:05:02] !log restarted Zuul with non gearman version.
[10:05:17] Logged the message, Master
[10:05:57] RECOVERY - zuul_service_running on gallium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/local/bin/zuul-server
[10:06:30] the good thing is that upstream documented their code so I know where the issue is \O/
[10:06:38] will write a retrospective somewhere
[10:12:37] !log restarted again stalled Jenkins
[10:12:53] Logged the message, Master
[10:14:29] !log Jenkins: disabled gearman plugin, Zuul is no more a Gearman server.
[10:14:44] Logged the message, Master
[10:15:17] (03PS6) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [operations/puppet] - 10https://gerrit.wikimedia.org/r/89002
[10:16:47] Jenkins should be back in action now. Sorry. Had to revert my Zuul upgrade :-(
[10:18:00] how can I resubmit a job (patchset went up after the upgrade was in process)?
[10:18:11] recheck
[10:18:50] how?
[10:19:21] just add a comment with the word 'recheck'
[10:19:28] oh a comment
[10:19:31] I see
[10:19:43] i think though it is going to give you a +1 on verified
[10:19:49] meh
[10:19:51] not a +2 ...
for some reason [10:19:57] I'll trivially edit the commit message then [10:20:05] that will work too :-) [10:20:45] (03PS2) 10ArielGlenn: remove virt1 from dhcp, decommed rt #5645 [operations/puppet] - 10https://gerrit.wikimedia.org/r/96450 [10:21:49] (03CR) 10ArielGlenn: [C: 032] remove virt1 from dhcp, decommed rt #5645 [operations/puppet] - 10https://gerrit.wikimedia.org/r/96450 (owner: 10ArielGlenn) [10:25:28] (03PS1) 10ArielGlenn: removing virt1, wiped/decommed in rt #5645 [operations/dns] - 10https://gerrit.wikimedia.org/r/96455 [10:27:25] (03CR) 10ArielGlenn: [C: 032] removing virt1, wiped/decommed in rt #5645 [operations/dns] - 10https://gerrit.wikimedia.org/r/96455 (owner: 10ArielGlenn) [10:56:57] PROBLEM - SSH on amslvs1 is CRITICAL: Server answer: [10:57:57] RECOVERY - SSH on amslvs1 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [11:00:16] (03PS1) 10ArielGlenn: remove mobile1001-3 mgmt, leave asset tag names [operations/dns] - 10https://gerrit.wikimedia.org/r/96464 [11:01:02] (03CR) 10ArielGlenn: [C: 032] remove mobile1001-3 mgmt, leave asset tag names [operations/dns] - 10https://gerrit.wikimedia.org/r/96464 (owner: 10ArielGlenn) [11:01:57] PROBLEM - SSH on amslvs1 is CRITICAL: Server answer: [11:03:02] RECOVERY - SSH on amslvs1 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [11:11:44] (03PS1) 10ArielGlenn: remove caesium from dhcp, dsh files, it's spare [operations/puppet] - 10https://gerrit.wikimedia.org/r/96466 [11:13:34] (03CR) 10ArielGlenn: [C: 032] remove caesium from dhcp, dsh files, it's spare [operations/puppet] - 10https://gerrit.wikimedia.org/r/96466 (owner: 10ArielGlenn) [11:18:24] (03PS1) 10ArielGlenn: remove nonmgmt entries for caesium, it's spare [operations/dns] - 10https://gerrit.wikimedia.org/r/96468 [11:19:10] (03CR) 10ArielGlenn: [C: 032] remove nonmgmt entries for caesium, it's spare [operations/dns] - 10https://gerrit.wikimedia.org/r/96468 (owner: 10ArielGlenn) [11:35:49] (03PS1) 10ArielGlenn: 
remove molybdenum, renamed in rt #2291 [operations/dns] - 10https://gerrit.wikimedia.org/r/96470 [11:37:55] (03CR) 10ArielGlenn: [C: 032] remove molybdenum, renamed in rt #2291 [operations/dns] - 10https://gerrit.wikimedia.org/r/96470 (owner: 10ArielGlenn) [11:38:12] (03PS1) 10Siebrand: Remove underscore from class names LBFactory_* [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96472 [11:43:00] (03CR) 10Siebrand: [C: 04-1] "Sticking a -1 on this for now until it's clear when the core patch can be merged." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96472 (owner: 10Siebrand) [11:48:54] (03PS1) 10Jforrester: Deploy VisualEditor as default to "phase 3" Wikipedias [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96473 [11:49:05] (03CR) 10jenkins-bot: [V: 04-1] Deploy VisualEditor as default to "phase 3" Wikipedias [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96473 (owner: 10Jforrester) [11:51:48] (03PS2) 10Jforrester: Deploy VisualEditor as default to "phase 3" Wikipedias [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96473 [11:52:21] (03PS1) 10ArielGlenn: remove voip, we actually use voip.corp.wm.o handled elsewhere [operations/dns] - 10https://gerrit.wikimedia.org/r/96474 [11:54:45] (03PS3) 10Catrope: Deploy VisualEditor as default to "phase 3" Wikipedias [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96473 (owner: 10Jforrester) [11:54:59] (03CR) 10Catrope: [C: 04-2] Deploy VisualEditor as default to "phase 3" Wikipedias [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96473 (owner: 10Jforrester) [11:55:03] (03CR) 10ArielGlenn: [C: 032] remove voip, we actually use voip.corp.wm.o handled elsewhere [operations/dns] - 10https://gerrit.wikimedia.org/r/96474 (owner: 10ArielGlenn) [12:31:32] !log end of an era: shut down ms7 webserver and opening ticket for decom [12:31:48] Logged the message, Master [12:37:00] (03PS1) 10ArielGlenn: remove last 
entries for ms7, going away at last [operations/puppet] - 10https://gerrit.wikimedia.org/r/96476 [12:39:30] (03CR) 10ArielGlenn: [C: 032] remove last entries for ms7, going away at last [operations/puppet] - 10https://gerrit.wikimedia.org/r/96476 (owner: 10ArielGlenn) [12:51:28] (03PS1) 10ArielGlenn: remoe last entries for ms8, to be decommed [operations/puppet] - 10https://gerrit.wikimedia.org/r/96477 [13:06:48] (03PS1) 10Petrb: inserted QT libraries to resolve b57241 [operations/puppet] - 10https://gerrit.wikimedia.org/r/96478 [13:45:12] apergos: no more solaris? [13:45:31] as far as I know, nope [13:45:52] those boxes have been unused for awhile but this is the official end [13:59:36] !log jenkins: added slave integration-slave01 with label hasNpm. That is a slave running in labs ( integration-slave01.pmtpa.wmflabs ) [13:59:51] Logged the message, Master [14:00:13] (03PS1) 10Hashar: role::ci::slave::labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/96483 [14:00:18] (03PS1) 10Petrb: installed socat to resolve !b 57005 [operations/puppet] - 10https://gerrit.wikimedia.org/r/96484 [14:00:57] apergos: got any spare minutes to review a CI change https://gerrit.wikimedia.org/r/96483 ? 
[14:01:15] that adds a new role to install slaves in labs, would come with some specific packages (pip, npm) we don't want in production [14:01:18] anomie ^^^ [14:01:20] in a few minutes yes indeed [14:01:37] going out for a haircut anyway, so take your time [14:05:29] (03PS1) 10Petrb: ufraw-batch to fix !b 57008 [operations/puppet] - 10https://gerrit.wikimedia.org/r/96486 [14:08:09] (03PS1) 10Petrb: installing package to fix !b 57004 [operations/puppet] - 10https://gerrit.wikimedia.org/r/96487 [14:09:26] off for some haircut, be back in an hour or so [14:12:11] (03PS2) 10coren: Tool Labe: install socat to exec environment [operations/puppet] - 10https://gerrit.wikimedia.org/r/96484 (owner: 10Petrb) [14:12:18] (03CR) 10jenkins-bot: [V: 04-1] Tool Labe: install socat to exec environment [operations/puppet] - 10https://gerrit.wikimedia.org/r/96484 (owner: 10Petrb) [14:12:50] ohhhh a coren, maybe you know about virt13 and 14? by that I mean... [14:13:19] Wait, that was already merged? [14:13:23] * Coren grumbles. [14:13:24] https://rt.wikimedia.org/Ticket/Display.html?id=5673 this says they were supposed to be deployed [14:13:54] and apparently that never happened, do we still want it (tampa... going away someday... etc) [14:13:57] ? [14:14:35] apergos: Probably not, from what I see there were two sets of two nodes put aside for the same use; that ticket was mooted when the other two got deployed. [14:15:27] can we reclaim these as spares... hmm to be either donated or shipped as rob/chris see fit? [14:19:01] (03CR) 10Yuvipanda: [C: 04-1] "No X or related desktop things (dbus?) on toollabs please :)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/96478 (owner: 10Petrb) [14:19:38] Coren: I -1'd it only because I don't have rights to -2, but I guess you know better anyway :) [14:19:40] Coren: ? or should I ask ryan to be 100% sure? [14:22:23] apergos: I expect Ryan or Rob would be authoritative. [14:23:23] (03CR) 10coren: [C: 04-2] "No X11 toolkits on grid nodes."
[operations/puppet] - 10https://gerrit.wikimedia.org/r/96478 (owner: 10Petrb) [14:23:32] ok, I'll check with ryan, thanks! [14:27:43] and hashar is gone [14:28:11] I'll merge it when he's back [14:30:14] (03PS3) 10coren: Tool Labs: install socat to exec environment [operations/puppet] - 10https://gerrit.wikimedia.org/r/96484 (owner: 10Petrb) [14:33:36] (03PS1) 10Dzahn: remove outdated subnets from dhcpd [operations/puppet] - 10https://gerrit.wikimedia.org/r/96489 [14:34:00] apergos: / Coren: ^ dhcpd.conf option domain-name "tesla.wikimedia.org"; :) [14:34:17] I was already looking [14:34:24] apergos: re RT #3801, heh [14:34:26] yep [14:34:33] I opened a few today [14:34:39] getting closer! [14:41:19] (03CR) 10coren: [C: 032] Tool Labs: install socat to exec environment [operations/puppet] - 10https://gerrit.wikimedia.org/r/96484 (owner: 10Petrb) [14:43:48] (03PS2) 10coren: Tool Labs: install rrdtool [operations/puppet] - 10https://gerrit.wikimedia.org/r/96487 (owner: 10Petrb) [14:46:28] (03CR) 10coren: [C: 032] Tool Labs: install rrdtool [operations/puppet] - 10https://gerrit.wikimedia.org/r/96487 (owner: 10Petrb) [14:58:05] (03PS2) 10coren: Tool Labs: install ufraw-batch [operations/puppet] - 10https://gerrit.wikimedia.org/r/96486 (owner: 10Petrb) [14:58:55] (03PS1) 10Ottomata: Adding $net_topology_script_template parameter to make Hadoop rack/row aware [operations/puppet/cdh4] - 10https://gerrit.wikimedia.org/r/96490 [14:58:58] (03CR) 10ArielGlenn: [C: 031] "I'm ok with this in light of the explanation given on the bug." 
[operations/apache-config] - 10https://gerrit.wikimedia.org/r/92925 (owner: 10Dzahn) [14:59:43] (03CR) 10Ottomata: [C: 032 V: 032] Adding $net_topology_script_template parameter to make Hadoop rack/row aware [operations/puppet/cdh4] - 10https://gerrit.wikimedia.org/r/96490 (owner: 10Ottomata) [15:00:15] (03CR) 10coren: [C: 032] Tool Labs: install ufraw-batch [operations/puppet] - 10https://gerrit.wikimedia.org/r/96486 (owner: 10Petrb) [15:01:09] (03CR) 10Nemo bis: "Worth fixing the commit message here too given it's -2'ed... Ahem, I got an internal server error and the -2 disappeared, that shouldn't h" [operations/puppet] - 10https://gerrit.wikimedia.org/r/96478 (owner: 10Petrb) [15:01:46] (03CR) 10coren: [C: 032] "Just putting the -2 back." [operations/puppet] - 10https://gerrit.wikimedia.org/r/96478 (owner: 10Petrb) [15:01:48] (03CR) 10Jeremyb: [C: 04-1] "fix commit msg to match format in Icf5c2be75d6442aed81bfadea72be27d583cc86c" [operations/puppet] - 10https://gerrit.wikimedia.org/r/96478 (owner: 10Petrb) [15:01:55] (03CR) 10coren: [C: 04-2] Inserted QT libraries [operations/puppet] - 10https://gerrit.wikimedia.org/r/96478 (owner: 10Petrb) [15:02:59] (03PS1) 10Petrb: inserted lynx and links to fix !b 56997 [operations/puppet] - 10https://gerrit.wikimedia.org/r/96491 [15:03:53] (03PS9) 10ArielGlenn: beta: symlink /a/common [operations/puppet] - 10https://gerrit.wikimedia.org/r/65254 (owner: 10Hashar) [15:04:03] eww [15:04:08] (/a/common) [15:04:41] not going to make him fix production in order to get his change in though [15:05:19] (03CR) 10Aude: [C: 031] "looks good to me :)" [operations/apache-config] - 10https://gerrit.wikimedia.org/r/92925 (owner: 10Dzahn) [15:05:28] ^ +4 :D [15:06:16] (03CR) 10ArielGlenn: [C: 032] beta: symlink /a/common [operations/puppet] - 10https://gerrit.wikimedia.org/r/65254 (owner: 10Hashar) [15:06:50] no, just fix betalabs to not depend on it [15:06:56] what toollabs change didn't get merged on paladium? 
[15:07:14] production doesn't have /a anymore [15:07:14] class toollabs::exec_environ [15:07:16] fwiw [15:07:18] * apergos looks at Coren [15:07:40] tin has it [15:08:00] yes, as a workaround for some things that were broken [15:08:12] That also makes puppet ensure /a/common exists in production. [15:08:13] jesus [15:08:20] on a change labeled "beta" [15:08:55] apergos: Hmm? [15:09:23] I merged your ufraw-batch change on palladium [15:11:37] (03PS2) 10coren: Tool Labs: install links and lynx to dev_environ [operations/puppet] - 10https://gerrit.wikimedia.org/r/96491 (owner: 10Petrb) [15:18:31] so which hosts should have /a/common and which should not? [15:18:35] paravoid: [15:18:41] why are you asking me? :) [15:19:24] because you know at least that the/some production hosts shouldn't have it [15:19:36] I know they don't [15:19:40] I don't know if they shouldn't :) [15:19:51] (some production hosts = appservers) [15:20:22] ok, well that's 'some information', so good :-D [15:20:28] paravoid: mind if i take https://rt.wikimedia.org/Ticket/Display.html?id=6344 ? [15:20:57] matanya: I don't, but I know it's going to be complicated [15:21:15] it's a new major release where they replaced their javascript engine [15:21:31] not nodejs anymore? [15:21:58] (03CR) 10Faidon Liambotis: "INVALID is more than just non-SYNs; for example, it applies to ICMP & UDP traffic as well, e.g. unsolicited UDP packets." [operations/puppet] - 10https://gerrit.wikimedia.org/r/96265 (owner: 10Akosiaris) [15:22:31] (03CR) 10coren: [C: 032] Tool Labs: install links and lynx to dev_environ [operations/puppet] - 10https://gerrit.wikimedia.org/r/96491 (owner: 10Petrb) [15:22:32] Replace Esprima.js with pure-ruby JS parser RKelly (for now we're using our own fork of RKelly). [15:23:51] so, yeah, complicated [15:24:04] probably [15:24:52] it's actually called RKelly? oh my.. 
[15:24:53] I'd be more than happy to not have it on my plate, I'm just warning you/giving you all the data before you start working on it :-) [15:27:52] thanks paravoid if i see gem2deb will fail me, i'll bug you :) [15:27:58] it will [15:28:02] don't start from gem2deb [15:28:09] start from the previous version we have in the repo [15:28:11] I did that [15:28:30] you see it is good to ask before? [15:29:03] how would you recommend to do it then? by hand? [15:29:18] we have the 4.8.something version in the repo already [15:29:24] (apt.wikimedia.org) [15:29:38] (so if you do apt-get source ruby-jsduck from a Labs instance, you'll get that) [15:30:09] then bump the version number in the changelog, fetch the orig.tar.gz, check/bump/replace the dependencies, build, test [15:30:52] so paravoid you bet import-orig --pristine-tar --uscan and the link won't work? [15:31:03] *like [15:31:15] we have no git repository for it iirc [15:31:20] and that's fine [15:31:41] ok. the hard way. i'll try to find time for it this weekend [15:31:50] wish me luck :) [15:32:04] good luck :) [15:37:08] hashar: there's no chance that the contint stuff and the beta autoupdater will be on the same instance is there? (referring to https://gerrit.wikimedia.org/r/#/c/96483/ ) [15:38:03] apergos: checking, they should be different kind of slaves [15:38:19] the beta auto updater run on the deployment-prep project bastion [15:38:38] I ask cause they both declare npm [15:38:40] ahh [15:39:39] role::ci::slave::labs is never going to be used on the deployment-prep project. 
so that is fine [15:39:52] ok [15:41:24] ok I'll merge this now [15:41:34] thankkks [15:41:40] after rebase [15:41:40] will run puppet on labs [15:42:11] I don't think the subdir issue you raise will be a problem [15:43:09] (03CR) 10ArielGlenn: [C: 032] role::ci::slave::labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/96483 (owner: 10Hashar) [15:43:48] done [15:43:51] your turn [15:45:39] (03CR) 10Akosiaris: "I was aware that INVALID meant a lot more things, I guess I did not make that clear. Anyway that is why I suggested it." [operations/puppet] - 10https://gerrit.wikimedia.org/r/96265 (owner: 10Akosiaris) [15:45:55] (03PS3) 10Akosiaris: Change default ferm policy to DROP [operations/puppet] - 10https://gerrit.wikimedia.org/r/96265 [15:46:20] notice: /Stage[main]/Contint::Packages::Labs/Package[python-pip]/ensure: ensure changed 'purged' to 'present' [15:46:24] apergos: thank you very much :-] [15:46:32] yw [15:46:32] pip ? [15:46:35] i did not see that [15:46:36] beta [15:46:37] don't ask [15:46:43] er I mean labs, don't ask [15:46:45] anyways [15:46:57] as I said... I did NOT see that [15:46:59] nope [15:47:58] akosiaris: on labs [15:48:12] akosiaris: need it to be able to HTTP_PROXY=. HTTPS_PROXY=. python setup.py [15:49:09] and to fetch some dependencies for javascript jobs using npm :-( [15:50:30] I have added some if $::realm == 'production' {  fail("can't be applied in production") } [15:51:02] hashar: please stop mentioning language-derived distribution utilities. My brain hurts [15:51:40] I hate it as well and would love us to package every single dependency around [15:52:09] in case of npm, that is not that trivial though since direct dependencies can themselves depend on the same module albeit with different version [15:52:50] well aware of that [15:53:06] my feelings the first time i discovered what npm does [15:53:18] were "Brilliant and stupid at the same time" [15:53:31] hey, that's more charitable than I was...
[15:53:59] I was not forced to deal with it, that is why [15:54:11] I could afford to be charitable [15:54:23] :-D [15:54:25] the more however I have to deal with it, the less I like it [15:54:49] so this is why I merge and grit my teeth and you put your headphones on and work on more production stuff :-D [15:55:04] exactly :-) [15:55:07] ah which reminds me [15:55:25] and I get my node modules dependencies deployed and focus on writing mooaar jenkins jobs :] [15:55:43] then we are all being granted badges for being pragmatic [15:55:49] (03CR) 10Akosiaris: [C: 032] "Let's see what this breaks" [operations/puppet] - 10https://gerrit.wikimedia.org/r/96265 (owner: 10Akosiaris) [15:56:00] hashar: that bell tolls for you :P ^ [15:56:08] :] [15:56:20] no damnit, I will have to restart a bunch of things on vanadium to see if they are broken [15:56:22] can't check it out right now though [15:56:24] well I am going to, fed up [15:56:46] apergos: I haven't merged yet [15:56:51] Want me to revert ? [15:56:53] no no [15:57:08] talking about removing the, well some of, /usr/local/lib on vanadium [15:57:15] ok ok [15:57:17] thanks [15:57:23] I can't remove it all because puppet puts some of it back apparently >_< [15:59:20] /usr/local/lib/nagios/plugins... really? and used, too [15:59:33] that one is on purpose [15:59:35] my doing [15:59:45] * apergos raises an eyebrow [15:59:59] but you will have the guilty conscience at night so.. :-D [16:00:00] the idea is to not ship our own home made plugins [16:00:12] in the same directory as the system ones [16:00:17] do not rebuild nagios-plugins just to make it icinga-plugins .. come on:) [16:00:28] mostly filesystem hygiene [16:00:35] mutante: I am not that crazy [16:00:39] hehe, ok [16:00:39] we already removed the nagios.wikimedia.org CNAME might as well rename the plugin :D [16:00:40] and the ipython stuff? :-P [16:00:55] I have nothing to do with that [16:01:01] I was hoping not! [16:01:06] ok, well that's likely....
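For context on the "Change default ferm policy to DROP" change (96265) merged above: the shape of such a policy in ferm is roughly the following. This is an illustrative sketch, not the actual main-input-default-drop.conf; the monitoring-host variable and its address are placeholders:

```
# ferm sketch (illustrative, not the real Wikimedia config)
@def $MONITORING_V4 = (192.0.2.10);   # placeholder, would hold the icinga host's IP

domain ip table filter chain INPUT {
    policy DROP;
    mod state state INVALID DROP;              # INVALID also catches unsolicited UDP/ICMP
    mod state state (ESTABLISHED RELATED) ACCEPT;
    interface lo ACCEPT;
    # per-service holes, e.g. NRPE for the monitoring server:
    proto tcp dport 5666 saddr $MONITORING_V4 ACCEPT;
}
```

Anything not matching an explicit ACCEPT falls through to the DROP policy, which is why the SSH and NTP checks on gallium/antimony start failing later in this log until holes are punched for the icinga server.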
[16:01:09] neither the python 3.3 stuff [16:01:11] operations/puppet$ git grep nagios|wc -l [16:01:12] 637 [16:01:24] someone who is not here so I will ping him later! [16:01:39] anyways the python 33 i tossed and it stayed gone [16:01:49] PROBLEM - SSH on gallium is CRITICAL: Connection timed out [16:02:05] uh oh [16:02:06] aaah and I was about to say that all was well [16:02:19] PROBLEM - SSH on antimony is CRITICAL: Connection timed out [16:02:19] PROBLEM - HTTPS on antimony is CRITICAL: Connection timed out [16:02:22] that is a result of 96265 [16:02:29] all of them [16:02:33] antimony and gallium [16:02:40] neon is not allowed to connect to them [16:02:44] let's see.... [16:03:08] akosiaris: ooh.. but hashar added neon to that already [16:03:14] and i even saw the ACCEPT for neon [16:03:22] in iptables -L [16:03:46] for nrpe [16:03:47] not ssh [16:04:11] I think I am gonna do an all ports whitelisting of neon anyway [16:04:11] oh, of course, yea [16:04:24] expected the NRPE checks to also timeout for some reason, but they dont [16:04:56] akosiaris: well, 22, 443, 80, 8080 .. ehm [16:09:57] @seen matanya [16:10:22] I will go with all mutante... quick and less maintenance for us in the future [16:13:15] PROBLEM - NTP on gallium is CRITICAL: NTP CRITICAL: No response from NTP server [16:13:27] ntp ? [16:13:29] wtf ? [16:14:04] PROBLEM - NTP on antimony is CRITICAL: NTP CRITICAL: No response from NTP server [16:15:02] o_O [16:15:12] That, also svn just paged me. [16:16:47] after everything is settled we still want to be able to ssh into gallium via proxycommand right? 
(cause obviously that's broken right now too) [16:17:33] NTP is on all [16:17:45] it's command_line $USER1$/check_ntp_time -H $HOSTADDRESS$ [16:18:56] mutante: akosiaris: I have added neon IP address to the ferm::rule for nrpe, that is a [16:18:58] hack though [16:19:02] should be a ferm variable I guess [16:19:15] hashar: it's not that, it's that the default is DROP now [16:19:51] 08:06 < akosiaris> that is a result of 96265 [16:19:51] ahh [16:19:53] \O/ [16:21:10] hashar: heh, but doing that in nrpe.pp itself.. less hack than doing it in some actually unrelated role as we started out:) [16:21:51] and ferm variables see the comment on https://gerrit.wikimedia.org/r/#/c/88755/ [16:27:33] will skip :-] [16:28:55] akosiaris: ntp being checked without nrpe but directly on service ? [16:29:25] (03PS1) 10Akosiaris: Punch hole for icinga servers to monitor all [operations/puppet] - 10https://gerrit.wikimedia.org/r/96511 [16:29:26] yes... for some reason [16:29:31] I understand not why [16:32:28] (03CR) 10Akosiaris: [C: 032] Punch hole for icinga servers to monitor all [operations/puppet] - 10https://gerrit.wikimedia.org/r/96511 (owner: 10Akosiaris) [16:34:58] (03PS1) 10Akosiaris: Revert "nrpe: iptables accept neon public IP address" [operations/puppet] - 10https://gerrit.wikimedia.org/r/96512 [16:38:32] (03CR) 10Akosiaris: [C: 032] Revert "nrpe: iptables accept neon public IP address" [operations/puppet] - 10https://gerrit.wikimedia.org/r/96512 (owner: 10Akosiaris) [16:39:26] :-( [16:39:52] * akosiaris :-) [16:40:06] akosiaris: ahha [16:40:12] i fixed your FIXME comment in less than 24 hours... what else do you want ? [16:40:16] you are removing the bastion-ssh hole apparently [16:40:18] https://gerrit.wikimedia.org/r/#/c/96511/1/modules/base/manifests/init.pp,unified [16:40:46] that's what I was saying [16:40:50] akosiaris: hashar , how's that https://gerrit.wikimedia.org/r/#/c/96424/ [16:41:06] we need a way to ssh in ourselves...
one way or another :-D [16:41:19] crap typo [16:45:29] I messed up sorry. fixing it [16:45:43] the good news is that ferm denied my change [16:46:30] do we have a bast in ulsfo yet? [16:46:47] yes [16:46:48] apergos: yes, 4001 [16:46:50] or rather, I guess we do-ish but [16:46:59] it doesnt include the same things bast1001 includes [16:47:00] (03PS1) 10Akosiaris: Amend "Punch hole for icinga servers to monitor all" [operations/puppet] - 10https://gerrit.wikimedia.org/r/96515 [16:47:05] 198.35.26.5 not in $BASTION_V4 [16:47:08] yet [16:47:10] which makes me think if "role::bastion" isn't really the right role [16:47:13] it is in reality [16:47:20] s/right/same [16:48:52] (03CR) 10Akosiaris: [C: 032] Amend "Punch hole for icinga servers to monitor all" [operations/puppet] - 10https://gerrit.wikimedia.org/r/96515 (owner: 10Akosiaris) [16:50:04] RECOVERY - NTP on antimony is OK: NTP OK: Offset 0.0005956888199 secs [16:50:14] RECOVERY - SSH on antimony is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [16:50:15] RECOVERY - HTTPS on antimony is OK: OK - Certificate will expire on 08/22/2015 22:23. [16:50:20] and fixed... [16:51:04] RECOVERY - SSH on gallium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [16:51:15] RECOVERY - NTP on gallium is OK: NTP OK: Offset 0.0005853176117 secs [16:51:27] !! [16:54:38] akosiaris: sweet [16:56:13] the plan is for this to be on all hosts with public facing ips? [16:57:50] more or less [16:58:03] well we could have it even on non-public hosts [16:58:11] gotta think about things like brewster I guess [16:59:16] and see if dsh is involved in deployments from tin (doesn't that use ssh?) 
if we want it on all hosts [16:59:43] hmmm deployment uses salt [16:59:52] well Ryan's deployment system uses salt [17:00:04] yeah I have a vague otion that some stuff from tin might not [17:00:07] yah [17:00:17] I know there is an old deployment system somewhere for mediawiki [17:00:18] *notion [17:00:48] and there's still some apache conf stuff on fenari (on my list to convert to ryan-git-sartoris-trebuchet-deployment) [17:01:03] heh... one thing at a time [17:01:04] (03PS1) 10Dr0ptp4kt: Point Wikpedia app for Firefox OS submodule at gerrit. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96517 [17:01:17] yeah [17:01:22] but yeah... before its deployed cluster wise it will take some time [17:01:25] so bast4001 is really ready to go right? [17:01:27] probably months [17:01:34] maybe even years ? [17:01:36] as in people should/could be using it? [17:01:54] (no worries, let's take the time needed) [17:02:10] bast4001 is kind of weird... It is a bastion host, but not really [17:02:21] it is there more like as a brewster replacement [17:02:24] :-D [17:02:37] ok well should we be using it to ssh through? that's the basic q [17:02:40] but.... I think we should call it a bastion [17:02:56] and make it a full bastion, allowing ssh to it [17:03:00] you can't deploy apache-config without ssh [17:03:22] well it's also a bastion... so we should definitely allow it in [17:03:22] sync-file , sync-common etc [17:05:57] no ipv6 for bast4001? [17:07:11] (03PS1) 10Akosiaris: Remove redundant DROP rules. [operations/puppet] - 10https://gerrit.wikimedia.org/r/96518 [17:07:13] heya RobH, [17:07:20] ? [17:07:25] is it possible to programmatically ask racktables which node a row is in? [17:07:28] (03PS1) 10ArielGlenn: bast4001 as allowed bastion in firewalls [operations/puppet] - 10https://gerrit.wikimedia.org/r/96520 [17:07:31] which row a node is in * [17:07:32] uh, node being server?
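With the default-DROP policy only letting bastions in, reaching hosts like gallium means hopping through bast1001 or bast4001. A typical client-side setup is a ProxyCommand stanza; this is a sketch, with the username and exact host patterns as assumptions:

```
# ~/.ssh/config sketch (username and host patterns are assumptions)
Host bast1001.wikimedia.org bast4001.wikimedia.org
    User yourshellname

# Hop through a bastion for hosts that now only accept bastion ssh
Host gallium.wikimedia.org antimony.wikimedia.org
    User yourshellname
    ProxyCommand ssh -W %h:%p bast1001.wikimedia.org
```

`ssh -W` forwards stdin/stdout to the target host:port over the bastion connection, so no netcat is needed on the bastion.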
[17:07:48] yes [17:07:48] cuz if so, yes, yes there is [17:07:56] really ? [17:07:56] given an IP or hostname? [17:08:01] racktables changed that much ? [17:08:04] hostname, not IP [17:08:06] wow... amazed [17:08:13] racktables tracks the rack row and U in rack [17:08:18] always has... [17:08:27] so mysql query can return that. [17:08:32] but thats about it [17:08:35] uh if by changed you mean it's not a huge horrible table with a pile of views, well, no [17:08:38] we dont store IP info in racktables [17:08:44] no, its a horrible tag view table [17:08:44] I thought you meant it had an API [17:08:48] hahahaha [17:08:49] oh, no api. [17:09:02] ah... ok back to square 1 [17:09:17] well, i'll repeat here so lots of folks see it [17:09:18] you can craft a butt-ugly query and get something out. = "api" [17:09:24] WE ALL HATE RACKTABLES BUT HAVE NO ALTERNATIVE [17:09:25] not to sound very pompous but servermon has that [17:09:29] hah [17:09:30] so switch us! [17:09:31] ok ok [17:09:35] i got the point [17:09:37] the only requirement is we need rack row graphics [17:09:45] ie: rack layout showing what U's are populated with what [17:09:52] that's fine! [17:09:54] hmm [17:09:55] welllll [17:09:58] hmmmm [17:10:11] ok ok ... I 'll start working on adding the few missing features to servermon [17:10:14] so any kind of replacement is fine, as long as we can have limited edit access and if its open read even better. [17:10:24] what language is servermon in? [17:10:25] ideally we would disclose all our rack layouts to anon [17:10:31] promise that by the end of the year we will at least be ready to switch [17:10:32] and require login for edit [17:10:33] apergos: python [17:10:35] RobH, whatcha think, since we've moved hadoop nodes to different rows [17:10:42] I want to make hadoop topology aware [17:10:50] mmm [17:10:55] i just need to give hadoop a script that returns a unique row name based on ip or hostname [17:10:59] yea...
thats gonna be super hacky with racktables [17:11:00] i could hardcode them all [17:11:04] grossss [17:11:07] as you'll have to setup a user for it to query directly [17:11:07] apergos: https://github.com/akosiaris/servermon [17:11:09] yeah, i don't really want to install mysql just to do that [17:11:17] <^demon|sick> hashar: Did something change with gallium's ssh? I'm getting connect timeouts from gerrit. [17:11:19] there is no api, so meh [17:11:21] RobH: interesting/neat re public view of layouts [17:11:27] <^demon|sick> (on replication) [17:11:28] aye [17:11:31] will check it out [17:11:32] ^demon|sick: yes [17:11:33] we used to have it on wikitech [17:11:44] racktables is only private because it lacks the ability to be public view private edit [17:11:45] uh you have to go in through an official bastion now [17:11:49] ^demon|sick: [17:11:59] ^demon|sick: that is in case you mean SSH to gallium from another host [17:12:02] oh i think we need to open that [17:12:03] RobH: cool (well, not that racktables is limited, you know what I mean) [17:12:06] I hate that its private and that folks cannot just see that part of our infrastructure, yep [17:12:06] since that's everyone's gerrit/git [17:12:08] <^demon|sick> Yes :) [17:12:09] i know what ya mean [17:12:18] :) [17:12:23] did you guys firewall off git? [17:12:25] (heh) [17:12:50] antimony and gallium are the first two hosts that have a firewall with DROP by default policy [17:12:56] <^demon|sick> I can change replication rules. [17:13:06] <^demon|sick> I wonder if I can ssh via the .eqiad.wmnet name. [17:13:17] (03CR) 10ArielGlenn: [C: 032] bast4001 as allowed bastion in firewalls [operations/puppet] - 10https://gerrit.wikimedia.org/r/96520 (owner: 10ArielGlenn) [17:13:29] can someone please just template network.pp? 
:) [17:13:39] template defs.conf from network.pp that is [17:13:39] ^demon|sick: you can not [17:13:58] ^demon|sick: tell me what you need and me will fix [17:14:05] paravoid: I will do that [17:14:12] okay, works for me [17:14:19] I just commented because of apergos' bast4001 change [17:14:34] we're only going to have more of these if we don't do a proper fix [17:15:11] <^demon|sick> akosiaris: Gerrit requires ssh access to antimony, lanthanum, gallium and github.com [17:15:24] <^demon|sick> We replicate everything to those hosts. [17:15:43] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (203940) [17:15:52] ^demon|sick: we don't do outgoing firewalls yet [17:16:16] <^demon|sick> Yeah the github part is fine. [17:17:03] ^demon|sick: ok I will add ytterbium as an exception to those 2 (gallium, antimony). lanthanum is without a firewall [17:17:19] so that ytterbium can connect to gallium and antimony via ssh [17:17:37] 127.0.1.1 ytterbium.wikimedia.org ytterbium [17:17:38] argh [17:17:51] once talked to a guy at a conference about racktables API plans.. he never got back [17:18:00] :-D [17:18:19] I looked at the code once for 5 minutes [17:18:29] after I put my gouged out eyes back in that was it [17:18:55] just out of curiosity... why does ytterbium have a second IP ? HTTPS ? [17:19:04] <^demon|sick> akosiaris: If we moved antimony to an internal IP like mark and I planned then we won't have this problem :) [17:19:09] service IP, I think [17:19:13] <^demon|sick> Yes [17:19:36] ^demon|sick: No we would still have it [17:19:39] maybe now it's a good time to think of redirecting the service ip's port 22 to 29418 [17:19:59] (03CR) 10Hashar: "notice: Finished catalog run in 3315.75 seconds" [operations/puppet] - 10https://gerrit.wikimedia.org/r/96483 (owner: 10Hashar) [17:20:20] <^demon|sick> paravoid: That would be wonderful.
And the whole reason Ryan and I moved to using a service IP ;-) [17:20:30] it's a simple ferm rule [17:21:17] huh? [17:21:24] paravoid: this looks bad again https://gdash.wikimedia.org/dashboards/reqerror/ [17:21:46] <^demon|sick> akosiaris: Using port 29418 for gerrit has confused people since day 1. It's been on our todo list to remove that. [17:22:01] <^demon|sick> Using 22 would allow people's sane git defaults to Just Work. [17:22:34] oh, that is why you added the service IP. Now that makes sense [17:22:38] <^demon|sick> Yep. [17:22:41] ok ok thanks for explaining [17:22:53] <^demon|sick> I also wonder how many characters have been wasted typing :29418 by all of us over time ;-) [17:23:12] Nemo_bis: *sigh* [17:23:18] Nemo_bis: thanks... [17:23:19] none for me... I never had the problem [17:23:34] git review with a sane git review file always [17:23:47] <^demon|sick> It's in my ~/.ssh/config [17:24:19] none for me either [17:24:23] but surely by many folks [17:24:42] <^demon|sick> (gerrit also has a non-interactive ssh daemon on port 29418, can do some fun things with it) [17:25:02] <^demon|sick> `ssh -p 29418 gerrit.wikimedia.org gerrit` [17:25:27] can someone please have a look on why our reqerrors are through the roof again? I'm way too busy for the next hour and a half [17:25:27] out *wave* [17:27:20] <^demon|sick> !log gerrit: disabled replication plugin fix, pending firewall updates. everything's timing out and flooding logs with errors. [17:27:33] (03CR) 10Akosiaris: [C: 032] Remove redundant DROP rules. [operations/puppet] - 10https://gerrit.wikimedia.org/r/96518 (owner: 10Akosiaris) [17:27:34] Logged the message, Master [17:37:46] (03PS1) 10Ottomata: Adding net-topology.py.erb to make hadoop topology aware.
[operations/puppet] - 10https://gerrit.wikimedia.org/r/96526 [17:38:42] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 200,000 [17:39:24] greg-g: zero depl ready when you are :) [17:39:45] yurik_: yessir [17:39:49] (03PS1) 10Akosiaris: Allow ytterbium access to CI and gitblit [operations/puppet] - 10https://gerrit.wikimedia.org/r/96527 [17:40:44] (03CR) 10Faidon Liambotis: [C: 031] "Awesome!" [operations/puppet] - 10https://gerrit.wikimedia.org/r/96526 (owner: 10Ottomata) [17:40:53] we wouldn't want to break the 12 wiki crashes in 12 days now, do we? [17:41:02] yurik_: shush [17:41:22] erm [17:41:27] don't worry about that [17:41:39] site's broken now anyway [17:41:42] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (200225) [17:41:43] ???! [17:41:45] http://gdash.wikimedia.org/dashboards/reqerror/ [17:42:00] I'm busy with meetings for the next 2 hours or so [17:42:03] (03PS2) 10Ottomata: Adding net-topology.py.erb to make hadoop topology aware. [operations/puppet] - 10https://gerrit.wikimedia.org/r/96526 [17:42:08] (03CR) 10Ottomata: [C: 032 V: 032] Adding net-topology.py.erb to make hadoop topology aware. [operations/puppet] - 10https://gerrit.wikimedia.org/r/96526 (owner: 10Ottomata) [17:42:20] I don't think anyone else cares enough to investigate atm [17:42:38] started at 8:20 utc? 
[17:42:42] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 200,000 [17:42:44] looks like it [17:42:56] (03CR) 10Akosiaris: [C: 032] Allow ytterbium access to CI and gitblit [operations/puppet] - 10https://gerrit.wikimedia.org/r/96527 (owner: 10Akosiaris) [17:42:57] s/cares enough/feels empowered/ maybe :) [17:43:49] ori-l: I don't want to think it was you, but the errors started appearing 40 minutes after your midnight deploy [17:43:57] 40 minutes is a long time [17:44:03] but I've got nothing else :/ [17:44:17] feels like they know enough to actually make progress [17:45:36] also, why are the 500s so regular [17:45:42] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (202879) [17:46:02] shush you jobqueue [17:46:10] ^demon|sick: I think you should be ok now [17:47:04] <^demon|sick> akosiaris: Mmk, will re-enable and see. [17:49:48] <^demon|sick> Stupid gerrit :\ [17:50:13] <^demon|sick> !log restarting gerrit since replication plugin won't reload [17:50:27] Logged the message, Master [17:51:30] <^demon|sick> !log gerrit up, forcing replication of all repos [17:51:46] Logged the message, Master [17:51:48] greg-g: so is the site stable for us to deploy: 1) firefox app to bits (shouldn't impact the site), 2) new zero code 3) minor config change [17:52:19] yurik_: can you take a look at the error logs real quick and let me know what the current issue looks like (to you)? [17:53:25] <^demon|sick> akosiaris: Things look good again, thanks. [17:53:42] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 200,000 [17:53:54] greg-g: fatalmonitor seems quiet [17:54:38] ^demon|sick: :-) [17:54:40] yeah :/ [17:54:48] hmm, i wonder what "fault (11)" means [17:54:52] how else to determine what's causing the 500s? [17:54:53] never seen this before [17:55:14] <^demon|sick> job queue check on arsenic? 
[17:55:17] <^demon|sick> that seems...wrong [17:55:20] * ^demon|sick sighs [17:55:41] * greg-g can't keep straight all of the server names/purposes [17:55:56] there is this cool editing tool called wiki... [17:56:25] how is etherpad admin? [17:56:29] *who [17:57:32] matanya: ops in general, last one to work on it was akosiaris [17:57:38] yurik_: any idea on fault (11)? [17:57:42] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (204400) [17:57:46] matanya: we dont have rules like that, if you need urgent interaction ping the one listed as "on duty", about to run to bus [17:57:54] <^demon|sick> akosiaris: This is what replication of all repositories looks like: http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&h=ytterbium.wikimedia.org&m=cpu_report&s=by+name&mc=2&g=network_report&c=Miscellaneous+eqiad [17:58:07] on duty is always ottomata :) [17:58:10] and, https://wikitech.wikimedia.org/w/index.php?search=arsenic&title=Special%3ASearch gives me nothing :( [17:58:19] (useful) [17:58:31] matanya: it's rotating per etherpad [17:58:32] greg-g: no idea - i would have to look at the exception logs which i'm not even sure where they are [17:58:32] <^demon|sick> greg-g: arsenic is a temp box that nobody should worry about but me and nik. [17:58:33] so any ops that can help a sec would be nice. need to turn on tags for etherpad [17:58:45] ^demon|sick: gotcha [17:58:52] hah [17:58:55] who is on duty!? [17:59:03] Reedy: around? [17:59:05] i can help, what's up? [17:59:08] turn on tags for etherpad [17:59:11] HMMMMM [17:59:15] i don't even know what that means, but ok! [17:59:17] thanks ottomata [17:59:27] 41 Notice: Uncommitted DB writes (transaction from DatabaseBase::query (Block::newLoad)).
in /usr/local/apache/common-local/php-1.23wmf4/includes/db/Database.php on line 4 [17:59:29] 052 [17:59:30] cirrus search (arsenic) [17:59:38] https://bugzilla.wikimedia.org/show_bug.cgi?id=30240 [17:59:39] very easy [17:59:41] another new one :) [17:59:53] but not production [17:59:53] ottomata: go to http://etherpad.wikimedia.org:9000/ep/admin/ [18:00:03] ottomata: The password can be found in /h/w/docs/etherpad or /etc/etherpad/etherpad.local.properties [18:00:19] matanya: this is likely not even talking about the same software,, etherpad vs. etherpad-lite [18:00:25] greg-g: so what's the status, can we go ahead and deploy? [18:00:29] i'd look but really need to go.. bbl [18:00:36] thanks mutante [18:00:56] yurik_: I'm not happy with the 500s, and in the future this might block a deploy, but for now, I'll let you go ahead and try to clean up after [18:01:11] :) [18:01:17] ie: in glorious future where we stop ignoring errors and block deploys until they're fixed [18:01:21] \o/ [18:01:33] for now, put on your horse blinders [18:01:34] stability is overrated anyway [18:02:14] what is being deployed? [18:02:28] so subversion is now alerting in watchmouse [18:02:32] matanya: ottomata that :9000 was likely etherpad .. for current etherpad-lite try http://etherpad.wikimedia.org/admin instead [18:02:35] was it firewalled off or taken offline? [18:02:36] off [18:02:42] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 200,000 [18:02:56] greg-g: Ya? [18:03:15] Reedy: see 500s https://gdash.wikimedia.org/dashboards/reqerror/deploys [18:03:27] the 1day graph is clearest [18:04:30] Reedy: yurik_ did some initial investigation, but couldn't find anything clearly at fault [18:04:39] matanya: where can I find htaccess pw? [18:04:43] akosiaris: knows maybe? [18:04:47] Did anyone do anything 8-9am? [18:04:51] fwiw nothing leaps out operations wise when I look at ganglia, or even observium, [18:04:53] ottomata: in a meeting.. 
[18:04:57] utc, not that I can see in SAL [18:05:12] Ditto [18:05:34] except the squid box reboot, and ori deploying 40 minutes prior, that's all I got [18:05:43] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (204735) [18:06:08] * Reedy tries to recall where the 5XX stats come from [18:06:10] can we shush the jobqueue alert until we actually care about it? [18:06:25] You should be able to get someone from ops to ack it [18:06:27] (03CR) 10Yurik: [C: 032 V: 032] Point Wikpedia app for Firefox OS submodule at gerrit. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96517 (owner: 10Dr0ptp4kt) [18:06:43] Reedy: yeah, but it just keeps coming back :) [18:06:49] Mute it? [18:07:19] !log changed default ferm policy to DROP. antimony/gallium are the first two servers to get it. [18:07:36] Logged the message, Master [18:07:53] greg-g: Simplest way to find out what is going wrong is to inspect that log and see what's coming in [18:07:56] yeah the pmtpa squid box would be irrelevant [18:08:02] Rather than (sometimes VERY accurate) guessing [18:08:20] so that's what I was trying to find and I have no idea where those are [18:08:27] :( [18:08:28] https://wikitech.wikimedia.org/wiki/Logs [18:08:35] these are all we have written down [18:08:58] didn't find ottomata [18:09:13] yeah [18:09:20] files/graphite/gdash/dashboards/reqerror/3.5xx-sum-1day.graph:title "HTTP 5xx Responses -1day" [18:09:21] files/graphite/gdash/dashboards/reqerror/3.5xx-sum-1day.graph: :data => 'cactiStyle(alias(reqstats.5xx,"5xx resp/min"))' [18:09:34] heh I was just lookin at those [18:09:36] matanya: don't have pw, also, you sure the tagging feature exists for epl?
[18:09:38] reqstats.5xx [18:10:14] All the other 5xx mentions in the puppet repo suggest to be upload related logs [18:10:24] and/or commented out [18:10:41] (03PS1) 10Ottomata: $net_topology_script_path needs to be set before hadoop-core is rendered [operations/puppet/cdh4] - 10https://gerrit.wikimedia.org/r/96531 [18:10:43] ottomata: i'm all confused with etherpad/etherpad-lite. so leave it until i find out, thanks. and remove yourself from RT duty! :) [18:10:45] And what's with all the |query.php? [18:10:47] It's long dead... [18:11:14] (03CR) 10Ottomata: [C: 032 V: 032] $net_topology_script_path needs to be set before hadoop-core is rendered [operations/puppet/cdh4] - 10https://gerrit.wikimedia.org/r/96531 (owner: 10Ottomata) [18:11:22] no good getting sidetracked on that [18:12:09] # pipe 1 <%= webrequest_filter_directory %>/5xx-filter | awk -W interactive '$9 !~ "upload.wikimedia.org|query.php"' >> <%= log_directory %>/5xx [18:12:09] .tsv.log [18:12:13] (03PS1) 10Ottomata: Updaing cdh4 to fix net-topology.py script deployment [operations/puppet] - 10https://gerrit.wikimedia.org/r/96532 [18:12:16] those are the not upload ones [18:12:49] (03PS3) 10Umherirrender: enable Echo on all beta.wmflabs.org-wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95450 [18:13:01] What replaced locke? [18:13:11] (03CR) 10Ottomata: [C: 032 V: 032] Updaing cdh4 to fix net-topology.py script deployment [operations/puppet] - 10https://gerrit.wikimedia.org/r/96532 (owner: 10Ottomata) [18:13:20] reedy: iirc erbium [18:13:20] I seem to remember asking this not so long ago [18:13:45] I has no login :( [18:13:52] emery maybe? I'm looking [18:14:00] currently active; [18:14:01] : [18:14:07] emery, erbium, oxygen [18:14:15] that's where sampled is [18:14:17] erbium (well, gadolinium originally) was the replacement for locke [18:14:22] whatcha looking for? 
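(A standalone illustration of the `$9 !~ "upload.wikimedia.org|query.php"` filter quoted from the udp2log config above — assuming, as the awk expression implies, that the request URL is whitespace-separated field 9; the surrounding sample fields are dummies.)

```shell
# Lines whose 9th field matches upload.wikimedia.org or query.php are
# dropped; everything else passes through to the 5xx log.
printf '%s\n' \
  'cp1 1 2 3 4 5 6 7 http://upload.wikimedia.org/a.png' \
  'cp2 1 2 3 4 5 6 7 http://en.m.wikipedia.org/wiki/Foo' \
  | awk '$9 !~ "upload.wikimedia.org|query.php"'
# only the en.m.wikipedia.org line survives
```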
[18:14:29] Apache 5xx logs [18:14:52] oxygen [18:15:28] /a/log/webrequest/5xx.tsv.log [18:15:35] oh apache [18:15:40] these are from varnish [18:16:08] but i guess if apache returns 5xx then varnish will too, unless there is some error at varnishside [18:16:23] yes, that's where to look all right [18:16:46] (03PS2) 10Bsitu: Remove html tag from echo email footer address [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96392 [18:17:05] the varnish returns I mean [18:19:34] (03CR) 10Amire80: [C: 031] Remove html tag from echo email footer address [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96392 (owner: 10Bsitu) [18:19:41] all .m. [18:19:46] the 503s I am looking at right now [18:19:55] m.wikipedia, m.wikimedia etc [18:20:07] ^demon|sick: online? [18:20:13] <^demon|sick> Yep. [18:20:41] sorry to bug you when you are sick, for some reason i can't get https://gerrit.wikimedia.org/r/#/c/96517/ onto the tin [18:20:50] we are doing depl right now [18:20:55] ok, here's some others... let's see if that's the rule or the excption [18:22:15] ^demon|sick: i did git pull on /a/common, and that change shows, but doing git submodule update doesn't change git remote url [18:22:44] the m.wiki* are the vast majority [18:22:49] <^demon|sick> yurik_: Git submodule is probably being silly, lemme see if I can fix. [18:22:55] thx [18:23:08] all esams [18:23:11] so that's not good [18:23:14] MaxSem, dr0ptp4kt ^demon|sick is doing git stuff now :) [18:23:38] apergos: ugh [18:24:31] <^demon|sick> url is bogus too, btw. 
[18:25:08] ^demon|sick: strange, i was able to get that module on my machine via git submodule update [18:25:26] (03PS1) 10Chad: Fix submodule url [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96533 [18:25:36] (03CR) 10Chad: [C: 032 V: 032] Fix submodule url [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96533 (owner: 10Chad) [18:25:45] <^demon|sick> /r/p/, not /r/ [18:26:16] ^demon|sick: are you doing git commands on tin? [18:26:22] <^demon|sick> Yes. [18:26:56] ^demon|sick: could you make sure the submodule of the submodule (js dir) is pulled as well? its a pinned version [18:28:42] <^demon|sick> Why won't you clone dangit? [18:29:25] ? [18:29:49] <^demon|sick> Bah! [18:29:52] <^demon|sick> I know what I did wrong. [18:30:07] ^demon|sick: adam said he is able to get to gerrit without /r/p [18:30:49] <^demon|sick> Fixxxxeddddd [18:30:52] <^demon|sick> :) [18:30:55] <^demon|sick> Silly git. [18:31:08] ^demon|sick: awesome!!!! checking... [18:31:11] thanks [18:31:12] command, or human :) [18:32:23] ottomata: back [18:32:39] ^demon|sick: awesome, thank you very much, someday you will have to teach us everything you know about it ;) [18:32:48] ottomata: you were asking something about etherpad ? [18:33:29] ja akosiaris, matanya was asking if I could do this [18:33:29] https://bugzilla.wikimedia.org/show_bug.cgi?id=30240 [18:33:36] but it might not be relevant [18:34:24] <^demon|sick> yurik_: rm -R docroot/bits/WikipediaMobileFirefoxOS; rm -R .git/modules; vim .git/config (to remove submodule section); git submodule --init --recursive update [18:34:44] ottomata: we have zero plugins installed at this point and have not enabled installation of plugins [18:34:56] we don't even have the admin panel enabled [18:35:09] nor (authenticated) users or anything like that [18:35:11] brb [18:35:25] (03PS1) 10Reedy: Remove query.php from filters.
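(The submodule reset dictated at [18:34:24], written out as a sequence, with two caveats: the flags conventionally go after `update`, i.e. `git submodule update --init --recursive`, and `rm -R .git/modules` discards every cached submodule clone, not just the broken one. `git config --remove-section` stands in for the hand-edit of .git/config.)

```shell
rm -rf docroot/bits/WikipediaMobileFirefoxOS    # wedged working-tree checkout
rm -rf .git/modules                             # cached submodule clones (all of them!)
git config --remove-section \
    'submodule.docroot/bits/WikipediaMobileFirefoxOS'
git submodule update --init --recursive         # re-clone per .gitmodules
```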
query.php died a long time ago [operations/puppet] - 10https://gerrit.wikimedia.org/r/96535 [18:36:12] (03PS1) 10Bsitu: Enable Echo and Thanks on dewiki and itwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96536 [18:36:19] akosiaris: comment on bug for matanya? [18:36:55] (03PS2) 10QChris: Backup geowiki's data-private bare repository [operations/puppet] - 10https://gerrit.wikimedia.org/r/95363 [18:36:56] (03PS1) 10QChris: Extract geowiki paramaters into separate class [operations/puppet] - 10https://gerrit.wikimedia.org/r/96538 [18:37:33] ottomata: plus you are absolutely right [18:37:46] that ticket is for etherpad not etherpad-lite [18:38:37] (03PS1) 10Dr0ptp4kt: updating submodule [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96539 [18:38:47] (03CR) 10jenkins-bot: [V: 04-1] updating submodule [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96539 (owner: 10Dr0ptp4kt) [18:40:29] ^demon|sick: we are still having fun issues - https://gerrit.wikimedia.org/r/#/c/96539/1 [18:40:43] is it just jenkins having bad day? [18:41:14] on amslvs1 which I guess has mobile right now, irqs for cpu0 are at 70% but the others are 55-60% which isn't so far off [18:41:16] (03CR) 10QChris: Backup geowiki's data-private bare repository (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/95363 (owner: 10QChris) [18:41:47] <^demon|sick> yurik_: No clue. In a meeting. [18:41:57] (03CR) 10Yurik: [C: 032 V: 032] updating submodule [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96539 (owner: 10Dr0ptp4kt) [18:44:52] (03PS1) 10Yurik: Revert "updating submodule" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96540 [18:45:18] (03CR) 10Yurik: [C: 032 V: 032] Revert "updating submodule" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96540 (owner: 10Yurik) [18:45:49] greg-g: could you remind me what is being deployed during this slot btw? [18:47:25] apergos: right now? 
zero [18:47:36] yurik_: ^^ can you elucidate more explicitly what's going out [18:47:52] apergos: by zero I don't mean nothing, I mean zero rated :) [18:48:09] greg-g: we are trying to get zero git in order on tin :( [18:48:28] so literally nothing yet [18:48:40] so affecting mobile. (just cause that's where I was seeing the 503's we've been getting today) [18:49:07] (and yeah I got what zero is, I've been following the mails more or less ) [18:49:10] ( :-P ) [18:49:15] :) [18:49:24] totally off topic, I used to work at 'Zeero Knowledge' [18:49:27] s/ee/e/ [18:49:35] yurik_: see apergos' note re mobile is where the errors are appearing [18:49:37] got a lot of people confused by that name [18:49:48] they have been going well before you started [18:49:57] just saying it's a concern for you. [18:50:20] wait, rephrase? [18:50:36] the 503 errors that have been happening today. [18:50:49] mostly mobile for the period of time I have been looking [18:50:57] which was from before yur ik started to deploy [18:51:03] right, gotcha [18:51:20] just saying, it could be a concern for you, that there are already problems with the service [18:51:50] yurik_: you're the closest person who can figure out mobile related issues, I'd like you to look into them before you finally push your code. [18:54:03] because I can't tell if something is really wrong on the directors (lvs) one option might be to move that traffic to eqiad to see how it is, just mobile...
but I won't do that now (and I likely won't do that later because I can't commit to being around for too much later after later, it's getting into the evening here already) [18:54:11] so this is something I would pass onto someone in sf tz [18:55:43] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 200,000 [18:55:52] greg-g: we won't push code changes today it seems, or might do it later, but we have to get the wikipedia mozila OS app out - its not part of the code, it just sits on bits [18:56:06] ok [18:56:30] apergos: so, other than paravoid who's interviewing right now, who else should look into this? [18:56:38] well someone in an sf tz would be good [18:56:41] he's not either [18:56:48] will review it once done, please delay next depl by a bit, having one too many fun issues with git between dr0ptp4kt & MaxSem & myself :) [18:56:51] you get to point a finger at someone and say "you, you investigate" [18:56:55] hahaha [18:57:01] (03PS1) 10MaxSem: Update FF app [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96545 [18:57:37] apergos: just like at an emergency where lots of by-standers around, you can't just say "someone call 911" no one will, you have to say "you, in the red shirt, call 911" [18:57:38] (03CR) 10Yurik: [C: 032 V: 032] Update FF app [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96545 (owner: 10MaxSem) [18:57:45] :-D [18:57:59] ('tis totally true) [18:58:35] I would tag leslie to make sure there is not a network/link issue someplace, and then move mobile traffic to eqiad or pass it to someone else to do further investigation [18:58:43] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (224931) [18:58:47] I think leslie is afk? [18:59:02] maybe in that same interview [18:59:38] so, next on the list? 
[18:59:40] :) [19:00:12] uhh [19:01:41] ottomata: you're apparently on rt duty, who with network skillz can help diagnose a problem now? [19:01:55] greg-g: tell me [19:02:13] apergos: ^^ [19:02:32] 503s, mostly mobile, all the mobile are esams, [19:02:49] not seeing obvious issues on amslvs1 but maybe I would not recognize it, [19:02:50] syncing bits [19:03:13] brb [19:03:45] can a root nuke /a/common/docroot/bits/WikipediaMobileFirefoxOS.bak2 on tin, please? [19:03:45] so someone (you?) with network chops to see if that's the issue again and/or maybe move traffic to eqiad but [19:03:48] (03PS1) 10Ottomata: Adding ganglia aggregators for elasticsearch cluster [operations/puppet] - 10https://gerrit.wikimedia.org/r/96550 [19:03:49] be able to keep an eye out [19:03:56] your day is also into the evening though [19:04:06] so keeping an eye out = not so much for you either [19:04:37] also, ^demon|sick, seems like when you check things out or set links, i can't remove them afterwards :( MaxSem said something about bitmask ;) [19:04:47] umask [19:05:04] <^demon|sick> My umask is wrong? [19:05:06] * ^demon|sick sighs [19:05:22] (03PS2) 10Ottomata: Adding ganglia aggregators for elasticsearch cluster [operations/puppet] - 10https://gerrit.wikimedia.org/r/96550 [19:05:48] but if you would be willing to check 'is this a link/interface saturation issue' that would be helpful [19:05:50] akosiaris: [19:06:08] <^demon|sick> MaxSem: 0002 [19:06:35] ^demon|sick, yet we can't delete your files [19:06:41] apergos: yes I am trying to figure out if there is a problem... not finding anything yet [19:06:55] ok. thank you for looking. [19:07:39] <^demon|sick> MaxSem: I don't set anything funky in my .bashrc or elsewhere.
[19:07:51] weird [19:08:01] * MaxSem blames Linus [19:08:03] (03PS1) 10Addshore: Start wikidata puppet module for builder [operations/puppet] - 10https://gerrit.wikimedia.org/r/96552 [19:08:05] !log yurik synchronized docroot/bits/WikipediaMobileFirefoxOS [19:08:18] Logged the message, Master [19:08:28] all looks well... there was a spike of CPU usage in esams some 10 minutes ago but not anymore [19:08:37] yeah, this has been going for hours [19:08:51] so the problem persists ? [19:08:52] hmmm [19:09:15] all right, next step will be to find someone who is willing to move traffic to eqiad *and babysit * [19:09:19] https://gdash.wikimedia.org/dashboards/reqerror/ yes, hours [19:10:41] three cheers for MaxSem, ^demon|sick, yurik_, and brion for all of the help on getting the firefox os wikipedia app working. it's out there now! [19:11:20] (with a handful of fixes, it was already quite the project with many other committers) [19:11:40] greg-g: we deployed the firefox OS thingy, so either we postpone the next depl, figure out 503s, push out zero stuff, and yield, or how do you want to proceed? [19:11:42] <^demon|sick> I'm just glad you guys aren't cloning from git://github.com anymore ;-) [19:11:58] ^demon|sick: its live! :) [19:12:04] ^demon|sick, are you sure? git submodule updates were probably easier before ;) [19:12:04] thank you for your help!!!! [19:12:14] ^demon|sick, i kid. [19:12:20] i'm glad it's in house now [19:12:22] should be cleaner [19:12:23] <^demon|sick> dr0ptp4kt: Well it wouldn't work. tin can't hit github (on purpose) [19:12:43] thank goodness! exfiltration == bad [19:12:44] <^demon|sick> Also, git protocol has no authentication whatsoever :D [19:12:59] <^demon|sick> So who knows what files you're getting sent!
[19:13:08] (03PS1) 10Jgreen: add host rhodium for ocg pipeline testing [operations/puppet] - 10https://gerrit.wikimedia.org/r/96553 [19:13:11] I would like to have the mobile 503s addressed [19:13:28] even if the 'addressed' is just 'have eqiad serve it for now' [19:13:48] (if that were to make it happier) [19:13:52] apergos, any stats what URLs are failing? [19:14:15] let me look at that [19:15:19] (03PS1) 10RobH: rhodium assigned internal ip [operations/dns] - 10https://gerrit.wikimedia.org/r/96555 [19:15:31] mostly GETS [19:15:40] greg-g: btw, our zero is ready to go, should not take long, are you sure you want to stall it because of 503s? [19:15:51] yurik_: well, I don't want to complicate things [19:16:00] something is wrong with the mobile domains [19:16:05] and zero implicates mobile, so [19:16:09] traffic at esams is at its normal level [19:16:19] nothing weird there [19:16:19] half Special:something and half not [19:17:01] so, please work with akosiaris and apergos on figuring out what's causing the mobile 503s before proceeding, they shouldn't be there [19:17:05] yurik_: ^& [19:17:06] -& [19:17:16] (03CR) 10RobH: [C: 032] rhodium assigned internal ip [operations/dns] - 10https://gerrit.wikimedia.org/r/96555 (owner: 10RobH) [19:17:19] I have to run, be back in about 30 [19:17:23] this started at 8:00 UTC... [19:17:29] right [19:17:29] yeah [19:17:31] packet loss once again? [19:17:36] niah [19:17:43] marktraceur: do you need your held window today? [19:17:46] all esams though [19:18:08] we are sure about that ? [19:18:13] Ahhh no [19:18:20] marktraceur: cool, thanks. [19:18:45] as long as I've been greeping it's almost completely esams yes [19:18:50] grepping too :-P [19:18:51] alright, thanks akosiaris for looking into it, I have to go. [19:19:03] Jeff_Green: rhodium is all set for you to use, in dns [19:19:21] ok [19:20:30] is this a normal graph?
http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=Bits+caches+esams&m=cpu_report&s=by+name&mc=2&g=network_report [19:20:40] last 100k lines in 5xx log: 88k 503's, 77k are mobile, all but about 430 of those are esams [19:20:43] so yeah [19:21:00] 430 ? [19:21:13] (03CR) 10Ottomata: [C: 032 V: 032] Adding ganglia aggregators for elasticsearch cluster [operations/puppet] - 10https://gerrit.wikimedia.org/r/96550 (owner: 10Ottomata) [19:21:15] yeah. 430 lines. out of 100k [19:21:19] are not esams [19:21:23] ok [19:22:20] cp3012, cp3011, and if we care mostly cp3012 (64k to 12k) [19:22:30] (03PS1) 10Ottomata: Setting $ganglia_aggregator to true for elastic100[17] [operations/puppet] - 10https://gerrit.wikimedia.org/r/96557 [19:22:44] looking at the bits graph now, MaxSem: [19:22:55] (03CR) 10Ottomata: [C: 032 V: 032] Setting $ganglia_aggregator to true for elastic100[17] [operations/puppet] - 10https://gerrit.wikimedia.org/r/96557 (owner: 10Ottomata) [19:22:57] (03PS2) 10Jgreen: add host rhodium for ocg pipeline testing [operations/puppet] - 10https://gerrit.wikimedia.org/r/96553 [19:23:30] is the issue with mysql on terbium known already? [19:24:17] (03PS3) 10Jgreen: add host rhodium for ocg pipeline testing [operations/puppet] - 10https://gerrit.wikimedia.org/r/96553 [19:24:18] don't know, Jamesofur, but can you hold that thought for just a little bit? [19:24:20] [issue == won't work, Could not open input file: /a/common/multiversion/MWScript.php (followed by mysql help text). /a/common looks empty for some reason ... [19:24:24] yup no rush [19:24:44] apergos: we could switch mobile to eqiad [19:24:45] ok, please ping again after we have the 503 issue handled. 
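(A tally like the one above — "last 100k lines in 5xx log: 88k 503's, 77k are mobile" and the cp3012/cp3011 split — can be produced with a short awk pipeline. The column positions are an assumption for illustration: cache host in field 1, HTTP status in field 6, URL in field 9; the real 5xx.tsv.log layout may differ, and against the real file the printf sample would be replaced by `tail -n 100000 5xx.tsv.log`.)

```shell
# Count 503s per cache host, restricted to mobile (*.m.*) URLs,
# over made-up sample lines in the assumed column layout.
printf '%s\n' \
  'cp3012 a b c d 503 f g http://en.m.wikipedia.org/wiki/A' \
  'cp3012 a b c d 503 f g http://en.m.wikipedia.org/wiki/B' \
  'cp3011 a b c d 503 f g http://de.m.wikipedia.org/wiki/C' \
  'cp1001 a b c d 200 f g http://en.wikipedia.org/wiki/D' \
  | awk '$6 == 503 && $9 ~ /\.m\./ { n[$1]++ } END { for (h in n) print h, n[h] }' \
  | sort
# cp3011 1
# cp3012 2
```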
[19:24:58] yes, that's the only thought I have, but I would want someone to babysit it after [19:24:59] but this can not be network [19:25:02] will do no worries [19:25:08] (03PS4) 10Jgreen: add host rhodium for ocg pipeline testing [operations/puppet] - 10https://gerrit.wikimedia.org/r/96553 [19:25:10] only mobile suffers [19:25:11] and I'm at hour 14.5 already so not willing to volunteer [19:25:30] if it was a network issue everything would have problems [19:25:33] not just mobile [19:25:38] greg-g: oki [19:25:57] yes, that sounds right [19:26:18] otoh why only esams mobile? [19:27:26] looking at the non mobile entries just to see if they shed any light on it [19:28:04] they are almost all posts [19:28:05] (03PS5) 10Jgreen: add host rhodium for ocg pipeline testing [operations/puppet] - 10https://gerrit.wikimedia.org/r/96553 [19:28:16] eqiad and esams so [19:28:58] apergos, akosiaris, so do we know which pages are causing 503s? [19:29:01] ahh gerrit. i love gerrit! [19:29:12] * yurik_ chokes gerrit [19:29:23] * Jeff_Green watches and cheers on [19:29:30] hey [19:29:33] back from the interview [19:29:35] what's up? [19:29:36] we had about a split between specials and regular [19:29:45] uhoh [19:29:53] oh, beating my head against 503s [19:30:01] just mobile, not regular site ? [19:30:14] well the vast majority are mobile and of those almost all esams [19:30:30] bits esams looks unhealthy [19:31:04] paravoid: what specifically ? [19:31:14] the graph is stuttery [19:31:25] 1000 . . n_wrk - N worker threads [19:31:29] exhausted of threads [19:31:49] those that are not mobile are almost all post (edit, submit) and a mix of esams/eqiad [19:32:11] hm, maybe not [19:33:16] I did not see, except that the irq is still somewhat though not tragically unbalanced for cpu0, anything on amslvs1 [19:33:32] (and besides why so lopsidedly mobile?) [19:33:59] why do you say that mobile esams has trouble?
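(The `n_wrk - N worker threads` line pasted above is varnishstat output; to test the thread-exhaustion theory one would watch the companion counters as well. Counter names below are the Varnish 3 ones current at the time and should be double-checked against the installed version.)

```shell
# one-shot dump of the worker-thread counters on a cache box
varnishstat -1 -f n_wrk,n_wrk_queued,n_wrk_drop
# a climbing n_wrk_queued / n_wrk_drop means requests are queueing for
# (or being dropped for lack of) worker threads
```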
[19:34:00] (03CR) 10Jgreen: [C: 032 V: 031] add host rhodium for ocg pipeline testing [operations/puppet] - 10https://gerrit.wikimedia.org/r/96553 (owner: 10Jgreen) [19:34:23] the 503s are mostly mobile esams [19:34:30] where do you see that? [19:34:33] (back, cancelled that other appt) [19:34:42] 5xx log on oxygen [19:35:29] note that that log filters out upload.wm.org and query.php requests [19:35:37] (03CR) 10QChris: "As discussed some time ago in a hangout, Analytics will have to" [operations/puppet] - 10https://gerrit.wikimedia.org/r/93006 (owner: 10Yurik) [19:35:38] so if that is your only source, you might not be seeing some data [19:35:50] ok, that is good to know [19:36:15] found it [19:36:22] that's the input to gdash reqerror right? [19:36:28] no, found the 50x spike [19:36:32] well, one of them anyway [19:36:37] let's have it [19:37:41] ottomata: ? [19:37:56] hm [19:37:57] dunno [19:38:04] :-D [19:38:17] not via udp2log [19:38:24] at least, not what i'm looking at on oxygen [19:38:29] but something like that does sound familiar [19:38:33] * apergos goes to look at the manifests again [19:38:45] yes [19:38:46] udp2log [19:38:48] on emery [19:38:52] ## This feeds all http related graphs in graphite / gdash.wikimedia.org [19:38:52] pipe 2 /usr/local/bin/sqstat 2 [19:38:56] sqstat [19:39:00] that isn't filtered though [19:39:03] just sampled 1/2 [19:39:27] ottomata: i thought query.php was dead ages ago [19:39:29] ok, so it's possible we don't have all the data [19:39:49] (03PS1) 10Faidon Liambotis: Varnish: filter-noise on mobile-frontend as well [operations/puppet] - 10https://gerrit.wikimedia.org/r/96567 [19:39:56] apergos: for gdash, it should be complete i think [19:39:57] that's one [19:40:12] (03CR) 10Faidon Liambotis: [C: 032] Varnish: filter-noise on mobile-frontend as well [operations/puppet] - 10https://gerrit.wikimedia.org/r/96567 (owner: 10Faidon Liambotis) [19:40:44] paravoid: is that 403 Noise or 503 noise?
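(The filter-noise patch being merged above presumably amounts to a VCL rule along these lines — answering exploit-scanner requests with a synthetic 403 before they reach the backends. Varnish 3 syntax; the URL regex is a made-up placeholder, not the pattern from the actual change.)

```vcl
sub vcl_recv {
    if (req.url ~ "^/index\.php\?option=com_") {  # placeholder scanner pattern
        error 403 "Noise";                        # synthesize the response in the cache
    }
}
```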
[19:40:49] 403 noise [19:40:51] k [19:40:54] it's a joomla exploit [19:41:02] that people are running against us [19:41:06] and mediawiki 503s for some reason [19:41:19] (03CR) 10Faidon Liambotis: [V: 032] Varnish: filter-noise on mobile-frontend as well [operations/puppet] - 10https://gerrit.wikimedia.org/r/96567 (owner: 10Faidon Liambotis) [19:41:27] oh, interesting [19:41:40] (03CR) 10Yurik: "Its only for analytics, so we decide not to go this route, we should rethink our strategy." [operations/puppet] - 10https://gerrit.wikimedia.org/r/93006 (owner: 10Yurik) [19:41:44] hence noise, hence 403, forbidden [19:42:28] PROBLEM - Host msfe1002 is DOWN: PING CRITICAL - Packet loss = 100% [19:42:37] msfe1002? still?? seriously? [19:43:00] paravoid: any idea on the proportion of 503s that was causing? ie: still more investigation needed? [19:43:10] not yet [19:43:40] k [19:46:08] RECOVERY - Host msfe1002 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [19:46:31] RobH: hey, why do we have msfe1002 still running? [19:47:03] RobH: it hasn't been used since before Ben left and I've pinged you about it at least 2-3 times [19:47:14] i dont see this in racktables [19:47:20] lemme see if there is ticket [19:47:30] last time you told me it was in your notebook and no ticket was needed [19:47:45] ok, last time i lied [19:47:49] lol [19:47:51] kill it anyway :) [19:47:52] becuase notebook has been thrown away [19:47:55] haha [19:47:57] well, im trying to see what it is [19:48:02] cuz i dunno what server its on [19:48:19] Something happened to fenari or noc.wikimedia.org for that matter? [19:48:29] paravoid: bah, still getting the spike at the same level according to gdash [19:48:34] I know [19:48:35] looking [19:48:39] sorry [19:48:42] just saying [19:49:03] I was hopeful as it was in the middle of the trough [19:49:03] no sorry needed, thanks for helping out [19:49:04] paravoid: Do you know what its ip is? 
[19:49:08] cuz i dont see it in dns [19:49:15] and i dont know what system it is [19:49:20] and its not in racktables by that name anymore [19:49:42] RobH: was 208.80.154.148 [19:49:43] RobH: sorry, no. check neon's /etc/icinga [19:49:54] mgmt 10.65.3.57 [19:50:08] (git log in the ops dns repo ftw) [19:50:26] since I'm not being otherwise productive at this point [19:50:37] apergos: thx [19:50:41] that'll do [19:50:41] ure [19:50:43] *sure [19:51:13] paravoid: its rhodium now [19:51:18] so dunno whats showing up where [19:51:27] but its an entirely different server these days [19:51:31] RobH: not cleaned up from puppet db [19:51:34] lemme check a report [19:52:38] indeed, storedconfigs [19:52:42] uhh [19:52:46] it's got nothing else but it does have that [19:52:47] i try to clean from stored and says not there. [19:52:51] so dunno. [19:53:09] try all the possibilities (wikimedia.org, eqiad.wmnet, etc) [19:53:36] oh yes, was external [19:53:37] now clean [19:53:40] so should be fine now [19:54:02] also ensuring key and salt are clear [19:54:16] they are, did that check [20:06:48] hmm, the 5xx graph has just set a record high [20:07:14] (I'm pretty clocked out at this point, folks... just fyi) [20:07:43] MaxSem: yeah, that was a combination of the already existing issue, plus a new "one" [20:08:53] ergh [20:08:58] WTF is going on?
[20:11:17] looks like not our fault [20:12:36] (03PS1) 10Jgreen: add dhcp and partman for rhodium [operations/puppet] - 10https://gerrit.wikimedia.org/r/96575 [20:12:59] (03PS1) 10Faidon Liambotis: Varnish: expand filter_noise URLs [operations/puppet] - 10https://gerrit.wikimedia.org/r/96576 [20:13:34] (03CR) 10Faidon Liambotis: [C: 032] Varnish: expand filter_noise URLs [operations/puppet] - 10https://gerrit.wikimedia.org/r/96576 (owner: 10Faidon Liambotis) [20:13:52] (03CR) 10Faidon Liambotis: [V: 032] Varnish: expand filter_noise URLs [operations/puppet] - 10https://gerrit.wikimedia.org/r/96576 (owner: 10Faidon Liambotis) [20:16:03] (03CR) 10Jgreen: [C: 032 V: 031] add dhcp and partman for rhodium [operations/puppet] - 10https://gerrit.wikimedia.org/r/96575 (owner: 10Jgreen) [20:20:13] (03PS1) 10Faidon Liambotis: Varnish: add unset Range hack for mobile too [operations/puppet] - 10https://gerrit.wikimedia.org/r/96579 [20:20:48] (03CR) 10Faidon Liambotis: [C: 032 V: 032] Varnish: add unset Range hack for mobile too [operations/puppet] - 10https://gerrit.wikimedia.org/r/96579 (owner: 10Faidon Liambotis) [20:33:46] (03PS1) 10Jgreen: ah rhodium is dual HDD, switching to raid1-1partition partman recipe [operations/puppet] - 10https://gerrit.wikimedia.org/r/96582 [20:36:32] (03CR) 10Jgreen: [C: 032 V: 031] ah rhodium is dual HDD, switching to raid1-1partition partman recipe [operations/puppet] - 10https://gerrit.wikimedia.org/r/96582 (owner: 10Jgreen) [21:11:50] Ryan_Lane: for the new PDF renderer thing I'm building -- it exists in a repo; it has an upstart conf; and it has a configuration file in /etc that can be puppetized -- what Jeff_Green and I are not sure on is how to get the stuff from git onto a newly puppetized machine.
I was hoping you had a couple minutes to talk to me about salt [21:12:40] ^^^ and packaging [21:12:53] !log deployed Parsoid 20c6afe [21:13:09] Logged the message, Master [21:14:34] well, we can deploy the code itself using trebuchet [21:14:45] puppet can install the upstart and dependencies [21:15:09] you don't need to understand salt to use trebuchet ;) [21:15:25] https://wikitech.wikimedia.org/wiki/Trebuchet#Adding_a_new_repo [21:15:49] mwalker: ^^ [21:15:53] Jeff_Green: ^^ [21:15:59] reading [21:16:49] also look at how parsoid is configured in puppet: manifests/role/deployment.pp [21:17:46] and note that parsoid has a really shitty init script and that instead you can just use service.restart in the checkout_module_calls config [21:29:16] Ryan_Lane: thanks. we'll study this and probably come back to you with more questions :-) [21:29:24] ok, cool [21:34:16] !log ran checksetup.pl on bugzilla, deploy gerrit 96479, replaces internal errors on WeeklyReport with proper error pages [21:34:29] andre__: done. example https://bugzilla.wikimedia.org/component-report.cgi?tops=15&days=1 [21:34:30] Logged the message, Master [21:34:34] now looks like the labs one [21:34:44] as opposed to really ugly internal error [21:35:31] no wait, not really the same..hmm [21:42:42] (03PS1) 10Faidon Liambotis: Revert "Enable CentralNotice CrossWiki Hiding" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96641 [21:43:45] Coren: have a spare minute? [21:43:52] (03PS2) 10Faidon Liambotis: Revert "Enable CentralNotice CrossWiki Hiding" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96641 [21:44:04] I have minutes. They're not spare, but they can be given to help. 
:-) [21:44:09] (03CR) 10Faidon Liambotis: [C: 032] Revert "Enable CentralNotice CrossWiki Hiding" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96641 (owner: 10Faidon Liambotis) [21:46:14] (03CR) 10Faidon Liambotis: [V: 032] Revert "Enable CentralNotice CrossWiki Hiding" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96641 (owner: 10Faidon Liambotis) [21:46:52] !log faidon updated /a/common to {{Gerrit|I866838bc2}}: Revert "Enable CentralNotice CrossWiki Hiding" [21:46:57] (03PS1) 10Jgreen: add $gid=500 to rhodium [operations/puppet] - 10https://gerrit.wikimedia.org/r/96642 [21:47:09] Logged the message, Master [21:47:42] !log faidon synchronized wmf-config/CommonSettings.php 'revert CentralNotice CrossWiki Hiding' [21:47:57] Logged the message, Master [21:51:07] * AaronSchulz rarrrs at bug https://bugzilla.wikimedia.org/show_bug.cgi?id=57282 [21:51:15] those files keep coming back, same file [21:51:48] paravoid: I wonder if mw1208 has some segfault entries in the syslogs of interest? [21:52:05] sec, dealing with mobile outage right now [21:55:32] (03PS1) 10Se4598: correct default Echo help page [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96647 [21:56:20] ori-l: what was the name of that file? [21:56:27] heh [22:01:38] (03Abandoned) 10Jgreen: add $gid=500 to rhodium [operations/puppet] - 10https://gerrit.wikimedia.org/r/96642 (owner: 10Jgreen) [22:01:39] TimStarling: hello [22:01:49] hello [22:01:52] paravoid: comfortable with Echo deploying now? [22:01:54] RFC review meeting starting now [22:02:10] in #wikimedia-meetbot [22:02:18] * Elitre waiting for Echo on it.wp. [22:02:37] (03PS1) 10Ottomata: Setting up JournalNode on analytics1014 [operations/puppet] - 10https://gerrit.wikimedia.org/r/96649 [22:02:54] Elitre: yeah, hoping to get you it, just don't want to confuse things for para-void. 
[22:02:58] (03CR) 10Ottomata: [C: 032 V: 032] Setting up JournalNode on analytics1014 [operations/puppet] - 10https://gerrit.wikimedia.org/r/96649 (owner: 10Ottomata) [22:03:05] I don't foresee any issue, but I wanted to make sure [22:03:17] the site is not better yet, no [22:03:26] ugh [22:03:32] not to you, to the issue [22:03:39] ? [22:03:47] bsitu: Elitre what para-void is referring to: https://gdash.wikimedia.org/dashboards/reqerror/ [22:03:58] that bumpity purple line shouldn't be doing that [22:04:26] bumpity is the technical ops term for errors that express themselves like that [22:04:44] greg-g: thanks, I will wait till when it's ready [22:05:06] bsitu: luckily your window is long, but this issue has been going all morning (sf time) :/ [22:05:11] * greg-g crosses fingers [22:05:41] (03PS1) 10Jgreen: add $gid=500 to rhodium (again) [operations/puppet] - 10https://gerrit.wikimedia.org/r/96650 [22:06:01] greg-g: it's isolated to mobile, though; if your deployment isn't affecting mobile, you can proceed [22:06:29] paravoid: ok, bsitu's shouldn't affect mobile (echo isn't on mobile, right bsitu ?) [22:06:41] GARG GERRIT [22:06:57] greg-g: I think echo is on mobile web [22:06:58] is the jenkins/review stuff disabled or something? [22:07:13] greg-g: but I am only doing configuration change [22:07:17] Echo works on mobile as well, AFAIK.
[22:07:19] no echo code deploy [22:07:27] (03CR) 10Jgreen: [C: 032 V: 032] add $gid=500 to rhodium (again) [operations/puppet] - 10https://gerrit.wikimedia.org/r/96650 (owner: 10Jgreen) [22:07:39] bsitu: right, just enabling on more wikis, german and italian [22:08:18] bsitu: go forth, but please stick around a little bit (of course) in case we need a revert to diagnose any more issues [22:08:30] greg-g: okay [22:11:14] (03CR) 10Bsitu: [C: 032] correct default Echo help page [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96647 (owner: 10Se4598) [22:11:50] (03CR) 10Bsitu: [C: 032] Remove html tag from echo email footer address [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96392 (owner: 10Bsitu) [22:13:38] (03CR) 10Bsitu: [C: 032] Enable Echo and Thanks on dewiki and itwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96536 (owner: 10Bsitu) [22:14:05] (03PS1) 10Jgreen: add groups::wikidev to rhodium [operations/puppet] - 10https://gerrit.wikimedia.org/r/96653 [22:14:30] (03CR) 10Jgreen: [C: 032 V: 032] add groups::wikidev to rhodium [operations/puppet] - 10https://gerrit.wikimedia.org/r/96653 (owner: 10Jgreen) [22:16:29] (03CR) 10Bsitu: [V: 032] correct default Echo help page [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96647 (owner: 10Se4598) [22:17:08] (03CR) 10Bsitu: [V: 032] Remove html tag from echo email footer address [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96392 (owner: 10Bsitu) [22:17:29] (03CR) 10Bsitu: [V: 032] Enable Echo and Thanks on dewiki and itwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96536 (owner: 10Bsitu) [22:19:56] !log bsitu updated /a/common to {{Gerrit|Id9e3b9a03}}: correct default Echo help page [22:20:11] Logged the message, Master [22:20:24] (03PS1) 10Yurik: Mobile m. and zero. 
landing page redirect handling by ZERO [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96654 [22:21:23] paravoid: ^ [22:22:08] greg-g: what's the server status? can we do a quicky? [22:22:26] no. [22:22:53] paravoid ? servers are down? we want to deploy other zero extension stuff [22:23:07] yurik_: why? [22:23:11] !log bsitu synchronized wmf-config/InitialiseSettings.php 'Update email footer and help page' [22:23:19] because we couldn't due to 503s this morning :) [22:23:19] yurik_: and no, not right now, see ^^ benny is deploying something [22:23:24] Logged the message, Master [22:23:37] !log bsitu synchronized echowikis.dblist 'Enable Echo and Thanks on dewiki and itwiki' [22:23:40] greg-g: no rush - just wondering if we can do it sometime today [22:23:44] hopefully [22:23:45] yurik_: mobile deploys frozen until further notice [22:23:53] Logged the message, Master [22:24:01] paravoid: ok [22:28:05] !log bsitu synchronized echowikis.dblist 'Enable Echo and Thanks on dewiki and itwiki' [22:28:17] Logged the message, Master [22:29:07] PROBLEM - SSH on amslvs1 is CRITICAL: Server answer: [22:29:47] !log bsitu synchronized wmf-config/InitialiseSettings.php 'touch' [22:29:51] (03PS1) 10Faidon Liambotis: Switch mobile-lb to eqiad [operations/dns] - 10https://gerrit.wikimedia.org/r/96655 [22:29:57] Ryan_Lane: copied your comments re trebuchet into https://www.mediawiki.org/wiki/Parsoid/Packaging [22:30:03] Logged the message, Master [22:31:07] RECOVERY - SSH on amslvs1 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [22:31:29] heh. even the part about the really shitty init script :D [22:31:31] (03CR) 10Faidon Liambotis: [C: 032] Switch mobile-lb to eqiad [operations/dns] - 10https://gerrit.wikimedia.org/r/96655 (owner: 10Faidon Liambotis) [22:31:38] wtf is up with jenkins?
[22:32:01] (03CR) 10Faidon Liambotis: [V: 032] Switch mobile-lb to eqiad [operations/dns] - 10https://gerrit.wikimedia.org/r/96655 (owner: 10Faidon Liambotis) [22:55:19] yurik_: you can now go ahead provided greg-g gives you the go-ahead [22:55:44] awesome!!! [22:55:46] thanks [22:56:56] (03PS2) 10Yurik: Mobile m. and zero. landing page redirect handling by ZERO [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96654 [22:56:59] bblack: here? [22:57:00] yurik_: one second [22:57:21] cause of the outage: zero! [22:57:23] oh, benny is gone [22:57:29] paravoid: yeah [22:57:33] Nov 20 08:25:01 cp3012 CRON[8188]: (netmap) CMD (/usr/share/varnish/netmapper_update.sh "zero.json" "http://meta.wikimedia.org/w/api.php?action=zeroconfig&type=ips") [22:57:37] Nov 20 08:25:11 cp3012 frontend[18335]: Child (8339) said varnishd: tpp.c:63: __pthread_tpp_change_priority: Assertion `new_prio == -1 || (new_prio >= __sched_fifo_min_prio && new_prio <= __sched_fifo_max_prio)' failed. [22:57:41] every ten minutes, like clockwork [22:57:41] oh no, not again :) [22:57:54] looks like an underlying glibc/nptl bug [22:57:58] yurik_: just checking in with benny [22:58:17] http://sourceware.org/bugzilla/show_bug.cgi?id=3610 specifically [22:58:38] caused this: https://graphite.wikimedia.org/render/?title=HTTP%205xx%20Responses%20-1day&from=-1%20day&width=1024&height=500&until=now&areaMode=none&hideLegend=false&lineWidth=2&lineMode=staircase&target=color(cactiStyle(alias(reqstats.5xx,%225xx%20resp/min%22)),%22blue%22)&target=color(cactiStyle(alias(reqstats.500,%22500%20resp/min%22)),%22red%22) [22:59:13] took me a while to figure out why stats were resetting every ten minutes, duh [22:59:28] hmmm [22:59:34] bblack: happens on both mobile esams boxes [22:59:43] but not elsewhere? 
[22:59:46] strangely enough, not in eqiad [22:59:52] they did have a higher concurrency though [23:00:25] started at exactly 08:25:01 UTC [23:00:58] for cp3012 [23:00:59] and [23:01:04] Nov 20 18:06:44 cp3011 frontend[24289]: Child (26056) died signal=6 [23:01:07] for cp3011 [23:02:22] yurik_: so, paravoid gave you the go ahead, even though he said the outage was zero's fault, but, nothing else is blocking you right now, so I guess I go with mixed signals and let you go ahead :) [23:02:43] greg-g: I was half-joking, it's not exactly zero's fault :) [23:02:48] haha, ok [23:02:49] hehe :) [23:02:57] (03PS1) 10Jgreen: add a role::ocg::test class for testing config [operations/puppet] - 10https://gerrit.wikimedia.org/r/96658 [23:02:59] it will be in 15 min ;) [23:03:00] well that is an interesting end to the 503 saga, I now feel much better about having not found that right away :-P :-D [23:03:04] yurik_ was just complaining to me yesterday about not being responsive enough [23:03:11] apergos: you may now sleep ;) [23:03:16] hahaha [23:03:21] actually... now eating dinner :-D [23:03:28] and the irony of me dealing with this for hours... :) [23:03:32] paravoid: you are on top of it!!! (today :-P ) [23:03:44] yurik_: dude, 14 outages in 14 days [23:03:50] we are on a streak [23:03:53] yei! [23:04:00] lets not break it now, shall we? [23:04:16] we've already had the one for today, no more needed [23:04:20] let's not get greedy or anything [23:04:22] greg-g: remind me, what broke yesterday? I remember being away and the site being broken while someone was deploying something [23:04:33] csteipp: oauth is tomorrow right? [23:04:34] so we've had it be: the links, the caches, the lvs box... what's next [23:04:35] ugh, search was monday.... 
[23:04:35] (03CR) 10Jgreen: [C: 032 V: 031] add a role::ocg::test class for testing config [operations/puppet] - 10https://gerrit.wikimedia.org/r/96658 (owner: 10Jgreen) [23:04:39] no not that [23:04:56] call [23:04:59] paravoid: my rear on that 3610 bug is it's not a glibc/nptl bug, it was a programmer bug at the app layer [23:05:02] s/rear/read/ [23:05:03] (03PS1) 10Ori.livneh: Make rsyslog forward Apache error log to fluorine [operations/puppet] - 10https://gerrit.wikimedia.org/r/96659 [23:05:18] greg-g: vectorbeta? [23:05:38] we can go 15/15 if we play our cards right [23:05:52] bblack: well, glibc should never assert under your feet; but yes, some app layer code is triggering it [23:05:59] bblack: egrep '(died|netmapper)' syslog is interesting [23:06:10] * apergos glares at AaronSchulz [23:06:35] we should enable it but forget to add the tables to meta [23:06:43] eh, there are lots of ways you can misuse glibc interface to make it assert or crash. it's not like libc interfaces triple-check for everything and try to be graceful [23:06:44] so every prefs view everywhere would be a DB error [23:06:46] boom, 15/15 [23:07:01] the app layer code, in that case, is using a mutex or mutexattr without first initializing it [23:07:06] AaronSchulz: Yep [23:07:18] uninitialized vars foreva! [23:07:20] csteipp: do you have the conf as a patch somewhere? [23:07:34] (03CR) 10Ori.livneh: [C: 032] Make rsyslog forward Apache error log to fluorine [operations/puppet] - 10https://gerrit.wikimedia.org/r/96659 (owner: 10Ori.livneh) [23:08:07] bblack: so zero.json is updated every 5 minutes, varnish is crashing every 10 [23:08:18] sometimes even a minute after [23:08:22] I wonder if it's actually related or not [23:08:29] try disabling the cron? 
[23:09:01] well, it happened during high concurrency and now I've moved the requests elsewhere :) [23:09:12] I didn't even get the chance to get a gdb trace [23:09:21] anyone want to review my redirects.conf rewrite apart from the people who are already on the reviewer list? [23:09:35] TimStarling: I added myself on the list, so I will at some point [23:09:37] there's really nothing pthread-y going on with those updates. the thread that watches for them is already spawned, and it just sits in a sleep loop checking mtime periodically. if the file changes (which wouldn't be often), even then it's just rcu sync stuff, no pthread calls [23:09:44] TimStarling: awesome work btw :) [23:09:48] was that the one with its own DSL for making apache conf? [23:09:51] thanks [23:09:57] AaronSchulz: yes [23:10:16] although calling it a DSL is overdoing it a bit [23:10:18] I wasn't sure about a /* comment in there [23:10:23] it's just a configuration file really [23:10:24] yeah, I was just trolling [23:10:53] yeah, I should at least add a mathematical expression parser if I'm going to call it a DSL, right? [23:11:15] paravoid: but there's some non-zero probability I don't understand something about when or how vmod_init() is called and then the per_vcl_fini() hook as well, which could be interfering with varnishd's pthreads stuff somehow, since that does use a mutex (an initialized one, though!) [23:11:47] funnel» sep11.wikipedia.org» http://wayback.archive.org/web/20030315000000*/http://sep11.wikipedia.org/wiki/In_Memoriam [23:12:17] $line = preg_replace( '/#.*$/', '', $line ); [23:13:08] you like my comment tokenizer? [23:13:22] I guess comments have to be #, I thought I saw a // one but that was just some protocol relative url [23:13:44] but I still don't get that sep11 entry [23:13:53] I was planning on doing a PHP script with CDB, but I cut back the project in order to get it finished in under two days [23:14:27] what about it?
[23:14:36] the * is a literal *, it was in the original [23:15:13] does that URL even work? [23:15:20] (03PS1) 10Ori.livneh: Include role::applicationserver in role::applicationserver::webserver [operations/puppet] - 10https://gerrit.wikimedia.org/r/96660 [23:15:22] It's funky URL syntax for a half open search at archive.org [23:15:26] ah, I see [23:15:28] just a sucky UI [23:15:32] yes [23:15:52] you know why it is like that, right? [23:15:56] it is a bit sad [23:16:20] (03CR) 10Ori.livneh: [C: 032] Include role::applicationserver in role::applicationserver::webserver [operations/puppet] - 10https://gerrit.wikimedia.org/r/96660 (owner: 10Ori.livneh) [23:18:17] you know, we still have the sep11wiki database [23:18:18] TimStarling: maybe you can look at https://bugzilla.wikimedia.org/show_bug.cgi?id=57282 [23:18:19] bblack: and you're not at fault [23:18:28] bblack: the cronjob doesn't do anything if the md5sum is the same :) [23:18:36] it might be PHP segfaulting, causing __destruct() to never happen [23:18:57] paravoid: oh, true, I guess that means nothing happens for the vmod at all [23:19:00] * AaronSchulz can't see any of the logs that would mention such crashes [23:19:02] yes [23:19:13] I don't think it is right to have a link to archive.org [23:19:36] every ten minutes [23:19:39] it's also odd that the same tif file keeps piling up, maybe something crashes processing it [23:20:23] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 200,000 [23:20:27] * AaronSchulz wonders why that wiki was ever created, being so wildly inconsistent with everything else [23:20:43] maybe it is that exif integer overflow [23:20:50] we haven't patched that, have we? [23:20:57] TimStarling: no we haven't [23:20:59] do you read php internals?
[23:21:02] also of note is that that server is not even a scaler [23:21:11] it might be maybeUpgrade() endlessly firing [23:21:16] I pinged php internals about that bug, said that maybe someone should look at it [23:21:28] one reply: Rasmus saying "I think you just volunteered" [23:21:33] TimStarling: I was wondering if it's exploitable... [23:21:42] but I think you told me it's not? [23:21:45] (03PS1) 10Ori.livneh: Include role::applicationserver in role::applicationserver::configuration::php [operations/puppet] - 10https://gerrit.wikimedia.org/r/96662 [23:22:03] well, you could read bits of the process address space, but it's a fairly expensive way to do it [23:22:16] since you would have to upload a 100MB file for each offset [23:22:59] (One site is rumoured to have had varnish restarting every 10 minutes and *still* provide better service than their CMS system.) [23:23:04] https://commons.wikimedia.org/wiki/File:Zentralbibliothek_Z%C3%BCrich_-_Heinrich_Bullingers_Westerhemd_-_000012135.jpg [23:23:04] from the varnish docs [23:23:13] haha [23:23:18] well, yeah, sure [23:23:28] but it's crashing every 10 minutes and it provides worse service than our "CMS" [23:23:34] AaronSchulz: it's a tortilla hat! the label is wrong. [23:24:22] brace yourselves, zero deploying ... 23-3 [23:24:35] (03CR) 10Ori.livneh: [C: 032] Include role::applicationserver in role::applicationserver::configuration::php [operations/puppet] - 10https://gerrit.wikimedia.org/r/96662 (owner: 10Ori.livneh) [23:24:51] bblack: so, I think that after yurik_ is done, I'll put traffic back to esams and attach gdb, try to get a backtrace [23:25:06] find the mutex then see its initialization [23:25:12] do we have anyone I can delegate that PHP bug to?
[23:25:13] it's a very crude way of debugging an issue [23:25:15] paravoid: I did some grepping in the varnish source, but I haven't turned up any obvious related buggy constructs [23:25:28] the only use of mutex attrs is in a small spot in jemalloc and it's sane [23:25:29] TimStarling: very funny [23:25:32] I thought we had lots of C programmers now, and I am not really that keen to do it myself [23:25:43] bblack & paravoid, can you repro it in betalabs? because the same code runs there [23:25:46] TimStarling: I'm talking with one right now [23:25:52] yurik_: nope [23:26:42] TimStarling: (and I didn't mean you :) [23:27:08] soooo, mwscript broken on terbium eh? [23:27:29] !log restarting rsyslog on application servers for I31c76fdde. [23:27:39] stupid config file doesn't notify => the service [23:27:44] Logged the message, Master [23:28:53] bblack: did you see TimStarling's request above? :) [23:28:54] could not open input file: /a/common/multiversion/MWScript.php [23:29:27] yes, but I'm looking through backscroll trying to find some original reference [23:29:39] https://bugzilla.wikimedia.org/show_bug.cgi?id=55541 [23:29:51] yurik_: are you done with the deploy? [23:30:01] (for bblack) [23:30:03] apache2 error logs from app servers aggregated on fluorine:/a/mw-log/apache2.log [23:30:11] paravoid: still copying [23:30:26] so no one knows who broke terbium?
[23:30:26] !log yurik synchronized php-1.23wmf3/extensions/ZeroRatedMobileAccess/ [23:30:26] it used to be much faster for some reason [23:30:32] AaronSchulz: I'll look, hang on [23:30:42] Logged the message, Master [23:31:21] nice work ori-l [23:31:29] sounds useful [23:31:41] paravoid: i looked at one today and found all manner of alarming things [23:31:47] that had clearly been going on for a while [23:31:59] now we can ignore these issues in aggregate [23:32:05] heh [23:32:20] well you know, we'll look at it on the next outage [23:32:32] we fix a bunch of unrelated issues with each outage anyway [23:32:43] and by the current rate of outages, lots of things will get fixe [23:32:48] *fixed [23:33:10] you jest, but that's not inaccurate [23:33:31] AaronSchulz: what was /a/common earlier? symlink maybe? [23:34:37] a regular dir [23:34:41] paravoid: first portion completed, deploying v4, should be another few min [23:34:45] at least judging from stat on tin [23:34:57] AaronSchulz: did it recently work?
[23:35:16] it worked a few days ago, in fact yesterday afaik [23:35:29] * ori-l checks puppet.log [23:35:35] https://commons.wikimedia.org/wiki/File:Zentralbibliothek_Z%C3%BCrich_-_Heinrich_Bullingers_Westerhemd_-_000012135.tif 504 great [23:35:51] tortilla hat nooooooo [23:36:20] maybe it was a symlink [23:36:47] to common-local [23:37:03] the reason I ask is that I merged this: [23:37:39] !log puppet.log on terbium: could not set file on ensure: No such file or directory - /usr/local/apache/common/php/extensions/FlaggedRevs/maintenance/wikimedia-periodic-update.sh.puppettmp_8141 at /etc/puppet/manifests/misc/maintenance.pp:155 [23:37:50] (03PS1) 10Faidon Liambotis: Revert "Switch mobile-lb to eqiad" [operations/dns] - 10https://gerrit.wikimedia.org/r/96666 [23:37:54] Logged the message, Master [23:38:16] https://gerrit.wikimedia.org/r/#/c/65254/ [23:38:57] which makes the assumption that it's a directory in production; if it was a symlink on terbium, that would explain it [23:39:03] apergos: can I drag you into helping me with something, possibly tomorrow your morning? [23:39:10] tell me what it is [23:39:23] cp3013 & cp1034 are waiting to be installed [23:39:36] ok [23:39:43] uh [23:39:50] er [23:39:52] 3014 [23:39:54] ok [23:40:08] do they need any special treatment? 
[23:40:44] !log yurik synchronized php-1.23wmf4/extensions/ZeroRatedMobileAccess/ [23:40:45] I can handle the Varnish setup [23:40:55] but be careful because they're set up as Varnish backends already [23:41:00] Logged the message, Master [23:41:04] manifests/role/cache.pp:211 [23:41:14] so they'll immediately get traffic if Varnish runs [23:41:26] apergos: I don't see it in the puppet log, tho [23:41:32] might be wiser to remove them from cache.pp first and force-run puppet on 3011/3012 [23:41:33] paravoid: done [23:41:45] (03CR) 10Faidon Liambotis: [C: 032] Revert "Switch mobile-lb to eqiad" [operations/dns] - 10https://gerrit.wikimedia.org/r/96666 (owner: 10Faidon Liambotis) [23:41:50] sure [23:42:04] I will certainly not even think about doing this tonight [23:42:10] a symlink would make more sense [23:42:18] no, but I'll probably be dead sleeping tomorrow our morning :) [23:42:20] it looks like puppet is asserting an empty directory? [23:42:25] which is all I see on terbium [23:42:35] well guess who else will be :-D [23:42:47] however it will be on my early part of tomorrow's work day queue [23:43:01] they're set up as mobile boxes [23:43:12] might have helped with today's load, it's a bit insane that we have just two boxes [23:43:20] do we still have wikidata crons on terbium?
[23:43:22] I'm pretty sure if we lost one the site wouldn't work [23:43:26] I can't imagine that would be working now [23:43:54] yes, every time I look at that pool of two I wonder [23:44:18] one of the mgmt switches was dead when mark was setting them up, I think [23:44:22] but they're reachable now [23:44:47] actually [23:45:12] cp3013-4 - both in menu 'continue with no disk/login to iSCSI target' on mgmt console [23:45:19] these are on my 'what's going on with these boxes' list [23:45:44] so that's what I know about them at this moment [23:45:44] apergos: also, since we're out of EU peak hours, there's a fair chance varnish won't crash now and will tomorrow morning; monitor gdash's reqerror & oxygen 5xx and revert that DNS change above if they're at fault [23:45:54] yep [23:45:59] if I'm asleep, who knows [23:47:34] well I need to finish dinner, then wait one hour, then I can sleep [23:47:42] I will sleep til I wake, and then we shall see [23:48:58] apergos: I'm going to disable puppet for now and replace the /a/common with a symlink to /usr/local/apache/common [23:49:05] feel free [23:49:18] log please though so I/we remember [23:49:38] I'll point hashar at that too [23:52:08] info: /Stage[main]/Misc::Deployment::Vars/File[/a/common]: Recursively backing up to filebucket [23:52:10] notice: /Stage[main]/Misc::Deployment::Vars/File[/a/common]/ensure: ensure changed 'link' to 'directory' [23:52:17] indeed [23:52:26] apergos: yeah, that change definitely clobbered that link [23:52:28] I should check to see where else that might have happened [23:52:29] so I guess it was a symlink before [23:53:09] the easy (but not really better) fix is to remove the creation of that directory from the stanza, the problem though is that some places need it as a dir, some as a link, some not at all [23:53:23] I don't feel equipped to solve that at the moment [23:53:43] which is why I will likely pass the buck to the writer of the changeset :-P [23:55:16] replace
=> false [23:55:18] on the file resource [23:55:29] should make puppet respect the symlink if it exists [23:56:00] i'll patch it [23:56:45] irritating that it isn't recorded in the puppet log though [23:56:55] it is, i missed it [23:56:58] aaron pasted it above [23:57:13] oh, it is there [23:57:35] sorry, I thought that was from a local test [23:58:00] nope, from terbium