[00:00:00] ori-l: although another reason, was because we arn't sure how to measure our memcache usage if its mixed in with the standard servers. With dedicated memcache instances we can use the stats generated by them to extrapolate what kind of memory usage we would have on a larger deployment [00:00:15] mem/cpu/etc. [00:00:17] ah, that makes sense [00:00:54] PiRSquared: I'm not really sure what the protocol is for resolving such issues -- presumably you need to pick a suffix for dupes [00:00:57] <^demon|away> ori-l: Did we ever figure out what magic gnomes were increasing memc usage last time around? [00:01:11] ^demon|away: yeah, it was a wikibase config object [00:01:18] ori-l: how about "/old" or "broken" or something? [00:01:20] <^demon|away> Ah, that sites thing, yes. [00:01:30] It doesn't really matter, as long as you can restore the content of those pages [00:01:36] <^demon|away> werdna: Oh, and while we're sorta on the subject...what kind of unholy global-scoped hell is this container.php? :D [00:02:15] ^demon|away: Well, it's not included globally scoped [00:02:16] ask ebernhardson [00:02:30] <^demon|away> Yeah, that's completely unclear unless you start grepping where it's used. [00:02:34] <^demon|away> That's scary as hell. [00:02:37] <^demon|away> :D [00:02:47] but basically the idea is that you initialise Flow by reading from the container [00:02:53] get all the objects you need [00:03:02] namespaces! [00:03:06] and then pass them up the stack to get somewhere [00:03:32] ebernhardson: in that case, too, you probably explicitly don't want an existing MC host [00:03:48] ori-l: is there another channel for requests like mine (angwiki)? I'm sorry for bothering you... [00:04:51] PiRSquared: Ok, I used '-old' as the suffix [00:04:58] okay [00:04:59] let me paste the output, so you know which pages were affected [00:05:14] Thanks. [00:05:23] PiRSquared: https://dpaste.de/WOD0/raw [00:06:08] ebernhardson: let me poke around for a sec to see if there's an existing node that would be an obvious fit for this purpose [00:06:17] Thanks so much, ori-l. [00:06:17] if not, you'll probably want to file an RT ticket [00:08:31] ebernhardson: actually, you should just get a dedicated host for this; I think there are a bunch of spares, so there's no reason to pile this on a host that is already doing something else. That way system metrics will correspond exclusively to your usage [00:08:44] so file an RT ticket and ask for one [00:09:22] and poke RobH about it when he's around :P [00:09:35] <^demon|away> ori-l: Bunch is relative :) https://wikitech.wikimedia.org/wiki/Server_Spares [00:09:46] <^demon|away> But yeah, RT is the way to go. [00:09:55] well, there arent a tton of spares [00:09:59] there are spares to go around yes [00:10:05] ebernhardson: I see the conversation moved on a bit, but I was going to suggest either an email to ops@ or an RT ticket outlining your needs and we can follow up from there. [00:10:08] but with the tampa migration, we have to be reasonable. [00:10:15] rt procurement ticket is best yes [00:10:39] ^demon|away: ah, I forgot about that page, thanks [00:10:46] <^demon|away> yw [00:13:13] (03CR) 10Chad: "This should be able to go in now now that the exception issue is resolved." 
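On the dedicated-memcache reasoning at the top of this log: the whole point of separate instances is that memcached's own stats counters then describe only the workload being measured and can be scaled up for capacity planning. A minimal sketch of pulling those counters, assuming a hypothetical test host and that netcat is installed:

    # "stats" dumps the counters, "quit" closes the connection so nc exits;
    # keep only the fields relevant to memory/traffic extrapolation.
    printf 'stats\nquit\n' | nc mc-test.example.wmnet 11211 | tr -d '\r' \
        | awk '$2 ~ /^(bytes|curr_items|cmd_get|cmd_set|get_hits|get_misses)$/ {print $2, $3}'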
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93622 (owner: 10Chad) [00:14:04] ebernhardson: what you missed: https://dpaste.de/iE5t/raw [00:32:46] !log stopping replication on sanitarium db1054:3308 and labsdb1002:3308 while restoring dewiki to labs [00:33:01] Logged the message, Master [00:39:52] (03CR) 10Aklapper: "Just FYI, keeping audit (for admins like me), metrics (for Quim Gil and the Tech Community Metrics folks) and bugzilla_report (for everybo" [operations/puppet] - 10https://gerrit.wikimedia.org/r/94075 (owner: 10Dzahn) [00:49:31] keen: what exactly is the issue? [00:49:58] jasper_deng: http://paste.lopsa.org/133 [00:50:17] keen: I see that, but I see no real issue w/ that [00:50:23] just passing it along really. [00:50:42] randy seems to think it was enough of a problem to chase down..but I dont know what the original issue was. [00:51:03] (03CR) 10Bsitu: Enable Flow discussions on a few test wiki pages (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94106 (owner: 10Spage) [00:52:19] looks like matt walker has since responded though, so I'll leave it alone now. ;) [00:53:01] werdna: Security and ops review are incomplete. December 4 seems unlikely. [00:53:46] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (218292) [00:56:26] * keen . o O ( my guess would be that he clicked a donate link that landed him on the "anti spam and abuse statement" instead of a page where he could actually donate... if I follow the his "success" at the bottom, I can arrive at an actual donation form.. ) [00:58:20] ori-l, did you add an ability to do "vagrant dbupdate" or something like that by any chance? easier than to do vagrant ssh / php update.php :) [00:59:59] yurik-road: mediawiki::extension types take a boolean needs_update parameter that will automatically run update.php when the extension is enabled [01:00:17] but there is no facility for generic, unscheduled db updates [01:01:08] ori-l,the most common case i have is doing various "git pull" in core and extensions, and having to vagrant ssh to do the php update [01:01:42] it would be great if we have something like "vagrant pull-and-update" :) [01:02:43] yurik-road: easy to do; 'vagrant run-tests' runs core's phpunit tests; it's implemented in lib/mediawiki-vagrant/run-tests.rb [01:02:56] you could follow that example to implement pull-and-update [01:03:16] which would go through all mediawiki core & extensions, do git pull for master branch (not sure if git pull master is the right command), and do php update [01:03:23] note that it invokes /usr/local/bin/run-mediawiki-tests [01:03:36] oki, might have a crack at it :) [01:03:41] which is just: [01:04:02] . /etc/profile.d/puppet-managed/set_mw_install_path.sh ; cd "$MW_INSTALL_PATH" ; php tests/phpunit/phpunit.php --testdox "$@" [01:05:17] yurik-road: sorry, that's lib/mediawiki-vagrant/run-tests.rb ; i had an old branch checked out [01:07:47] ori-l, forgot to tell you btw - what do you think about renaming all the *-role into role-* ? this would allow quick autocomplete :) [01:08:04] more importantly it would group them together in the help list [01:08:30] sure, especially if we retained the older syntax as a fallback [01:21:47] (03CR) 10Spage: "Why I'm guessing that officewiki should be a separate database." 
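On the "vagrant pull-and-update" idea above, which yurik goes off to write: a rough sketch of the helper script such a plugin command could shell out to, modeled on the run-mediawiki-tests one-liner quoted just above. The script name, git options and the update.php --quick flag are assumptions about how one might do it, not what was actually committed:

    #!/bin/bash
    # Hypothetical /usr/local/bin/run-mediawiki-update: pull master in core and
    # in every checked-out extension, then run the schema updater once.
    . /etc/profile.d/puppet-managed/set_mw_install_path.sh
    cd "$MW_INSTALL_PATH" || exit 1
    git pull --ff-only origin master
    for ext in extensions/*/; do
        [ -d "$ext/.git" ] && ( cd "$ext" && git pull --ff-only origin master )
    done
    php maintenance/update.php --quick

A pull-and-update.rb under lib/mediawiki-vagrant/ would then invoke it the same way run-tests.rb invokes run-mediawiki-tests.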
(031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94106 (owner: 10Spage) [01:28:07] (03CR) 10Aude: Enable Flow discussions on a few test wiki pages (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94106 (owner: 10Spage) [02:08:29] !log LocalisationUpdate completed (1.23wmf4) at Wed Nov 27 02:08:28 UTC 2013 [02:08:46] Logged the message, Master [02:15:19] !log LocalisationUpdate completed (1.23wmf5) at Wed Nov 27 02:15:19 UTC 2013 [02:15:34] Logged the message, Master [02:24:51] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [02:28:51] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (203586) [02:29:51] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [02:39:53] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (207734) [02:40:52] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed Nov 27 02:40:52 UTC 2013 [02:41:07] Logged the message, Master [02:43:53] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [02:47:53] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (223879) [02:55:03] PROBLEM - Puppet freshness on professor is CRITICAL: No successful Puppet run for 0d 18h 3m 27s [03:01:23] PROBLEM - Disk space on db1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:21:54] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [03:27:54] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (202245) [03:30:54] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [03:35:00] ori-l, almost done with the script (had to step away for a bit) - but it fails because ssh doesn't pass in my credentials. Is there an option to pipe my creds to vagrant? 
[03:50:52] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (202285) [03:51:52] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [04:26:25] (03PS2) 10MZMcBride: Create "Draft" namespace on the English Wikipedia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97675 [04:37:35] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (202729) [04:47:35] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [04:54:35] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (203247) [04:56:36] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [04:59:36] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (205451) [05:15:33] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [05:18:33] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (201108) [05:20:33] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [05:25:33] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (210614) [05:55:25] PROBLEM - Puppet freshness on professor is CRITICAL: No successful Puppet run for 0d 21h 3m 49s [06:53:01] (03PS1) 10ArielGlenn: depool db1019 (s3) temporarily for lvm resize [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97888 [06:59:08] (03CR) 10ArielGlenn: [C: 032] depool db1019 (s3) temporarily for lvm resize [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97888 (owner: 10ArielGlenn) [07:01:15] !log ariel updated /a/common to {{Gerrit|I4372bb602}}: depool db1019 (s3) temporarily for lvm resize [07:01:32] Logged the message, Master [07:02:26] !log ariel synchronized wmf-config/db-eqiad.php 'depool db1019 (s3) temporarily for lvm resize' [07:02:41] Logged the message, Master [07:10:36] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [07:13:36] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (202900) [07:33:36] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [07:36:33] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (211383) [07:38:26] !log rebooting db1019 after kernel upgrade, fix for broken xfs_growfs [07:38:41] Logged the message, Master [07:39:43] PROBLEM - Host db1019 is DOWN: PING CRITICAL - Packet loss = 100% [07:41:13] RECOVERY - Host db1019 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms [07:53:34] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [07:58:08] (03PS1) 10ArielGlenn: warm up db1019 (s3) after lvm resize [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97890 [07:59:01] (03CR) 10ArielGlenn: [C: 032] warm up db1019 (s3) after lvm resize [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97890 (owner: 10ArielGlenn) [07:59:21] !log ariel updated /a/common to {{Gerrit|I50354e622}}: warm up db1019 (s3) after lvm resize 
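For context on the depool/reboot/repool sequence above: the underlying maintenance is an online LVM extend followed by an XFS grow, which only works once the fixed kernel is running. A minimal sketch, with the volume group, logical volume and mount point names made up for illustration:

    # Grow the logical volume holding the MySQL data, then grow the
    # filesystem into the new space; xfs_growfs operates on the mounted fs.
    lvextend -L +100G /dev/tank/mysqldata
    xfs_growfs /srv/sqldata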
[07:59:36] Logged the message, Master [08:00:03] !log ariel synchronized wmf-config/db-eqiad.php 'warm up db1019 (s3) aftr lvm resize' [08:00:19] Logged the message, Master [08:05:42] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [08:15:42] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (201108) [08:16:32] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [08:46:50] (03CR) 10Ori.livneh: "Ottomata: OK, no objection then." [operations/puppet] - 10https://gerrit.wikimedia.org/r/94169 (owner: 10Ottomata) [08:47:36] (03CR) 10Ori.livneh: "Yay! Thanks!" [operations/debs/python-kafka] (debian) - 10https://gerrit.wikimedia.org/r/97848 (owner: 10Ottomata) [08:48:16] PROBLEM - Disk space on labstore4 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=98%): /exp/public/datasets 0 MB (0% inode=98%): /exp/public/keys 0 MB (0% inode=98%): /exp/public/repo 0 MB (0% inode=98%): [08:49:29] ahhh [08:51:06] apergos: paravoid: labstore4 out of disk space :/ /ext/public/datasets filled :/ [08:51:19] not sure what that machine is, my labs project has labstore1 [08:51:26] uugghhh [08:51:48] guess we need to clear some things [08:51:52] lemme look [09:17:46] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [09:29:57] found it, now how to fix... [09:34:16] RECOVERY - Disk space on labstore4 is OK: DISK OK [09:48:30] PROBLEM - RAID on db9 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:48:57] !log there was no mount /srv/pagecounts on labstore4, so rsync to /exp/pagecounts wrote to and filled /; did the mkdir and now things seem ok [09:49:11] Logged the message, Master [09:49:50] PROBLEM - DPKG on db9 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:49:50] PROBLEM - Disk space on db9 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:50:31] PROBLEM - mysqld processes on db9 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:51:30] PROBLEM - RAID on db9 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:53:20] PROBLEM - SSH on db9 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:53:21] PROBLEM - MySQL disk space on db9 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:53:30] PROBLEM - puppet disabled on db9 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:53:30] RECOVERY - mysqld processes on db9 is OK: PROCS OK: 1 process with command name mysqld [09:53:40] RECOVERY - DPKG on db9 is OK: All packages OK [09:56:31] PROBLEM - mysqld processes on db9 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:56:40] RECOVERY - Disk space on db9 is OK: DISK OK [09:59:20] RECOVERY - SSH on db9 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0) [09:59:50] PROBLEM - DPKG on db9 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:01:15] apergos: oooh, pagecounts got done? 
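The labstore4 incident above (an rsync destination that was not actually mounted, so the copy filled /) is the classic case for guarding batch copies with a mountpoint check. A sketch, with the destination path taken from the log and the source left as a placeholder:

    # Refuse to sync unless the destination really is a mounted filesystem,
    # so a missing mount can't silently fill the root disk again.
    DEST=/exp/pagecounts
    SRC=/path/to/pagecounts/    # placeholder; the real source isn't shown in the log
    mountpoint -q "$DEST" || { echo "$DEST is not mounted, aborting" >&2; exit 1; }
    rsync -a "$SRC" "$DEST/"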
[10:01:16] nice [10:01:37] I think they've been there for some time [10:01:48] there was just a hiccup on labstore4 this time, not sure why [10:01:50] RECOVERY - DPKG on db9 is OK: All packages OK [10:01:51] hmm, I remember my patch got reverted [10:02:10] RECOVERY - MySQL disk space on db9 is OK: DISK OK [10:02:20] RECOVERY - puppet disabled on db9 is OK: OK [10:02:21] RECOVERY - mysqld processes on db9 is OK: PROCS OK: 1 process with command name mysqld [10:08:17] !log shot some old puppet processes hogging memory on db9 (from march and earlier) [10:08:30] Logged the message, Master [10:17:32] (03PS1) 10ArielGlenn: db1019 (s3) back to full weight in pool [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97896 [10:17:59] YuviPanda: that was due to using nfs instead of rsyncing to the remote server [10:18:05] right [10:18:07] that was fixed the same or the next day [10:18:10] aaah [10:18:15] i didn't know that :) [10:18:19] thanks! [10:18:25] I didn't know you didn't know :-) [10:19:01] (03CR) 10ArielGlenn: [C: 032] db1019 (s3) back to full weight in pool [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97896 (owner: 10ArielGlenn) [10:19:27] !log ariel updated /a/common to {{Gerrit|If5ebd6194}}: db1019 (s3) back to full weight in pool [10:19:43] Logged the message, Master [10:20:14] !log ariel synchronized wmf-config/db-eqiad.php 'db1019 (s3) back to full weight in the pool' [10:20:27] Logged the message, Master [10:50:18] (03CR) 10Faidon Liambotis: [C: 031] "FWIW, this is good to go, I'm just waiting for the dependency." [operations/puppet] - 10https://gerrit.wikimedia.org/r/97004 (owner: 10Dr0ptp4kt) [10:53:57] PROBLEM - Host ssl1 is DOWN: PING CRITICAL - Packet loss = 100% [10:55:07] (03CR) 10Faidon Liambotis: [C: 032] Zero: Changed 470-01 to whitelist all languages [operations/puppet] - 10https://gerrit.wikimedia.org/r/97860 (owner: 10Yurik) [10:55:42] (03CR) 10Faidon Liambotis: [C: 032] Serve gdash.wikimedia.org on misc varnish [operations/puppet] - 10https://gerrit.wikimedia.org/r/97693 (owner: 10Ori.livneh) [10:55:56] thanks [10:56:07] RECOVERY - Host ssl1 is UP: PING OK - Packet loss = 0%, RTA = 35.44 ms [11:00:07] works, too [11:00:16] mind if i do the dns change? [11:01:22] (03CR) 10Faidon Liambotis: [C: 032] gdash.wm.o: noc -> misc varnish [operations/dns] - 10https://gerrit.wikimedia.org/r/97698 (owner: 10Ori.livneh) [11:01:30] faidon speaks in gerrit [11:03:30] sorry, I wasn't watching irc [11:03:43] and you didn't use my name/nick, so it didn't notify me [11:04:04] what, you don't get pinged by 'dns'? :P [11:04:07] :P [11:06:03] ori-l / paravoid mind merging: https://gerrit.wikimedia.org/r/#/c/97697/1 ? [11:06:42] the arrow alignment fix is welcome, the line break isn't [11:07:04] nod [11:07:12] languages that don't have multi-line strings lose the right to complain about line length in my book :P [11:07:28] i usually break long lines, but sometimes it's clearer [11:07:45] if you want to do the line break, you should do [\n File[...],\n File[...]\n, ] [11:08:05] i.e. keep the [ & ] in their own lines and indent the File resources [11:08:11] maybe I missed an indentation level there [11:08:46] https://dpaste.de/Boc3/raw [11:09:39] right [11:09:43] that's what I meant, thanks :-) [11:11:29] !log ssl1 rebooted itself about 15 mins ago, no idea why [11:11:43] Logged the message, Master [11:11:45] because I rebooted it? 
[11:11:58] it was kinda of obvious if you looked at auth.log :) [11:11:59] I didn't see it in the logs (I did look) [11:12:05] SAL that is [11:12:51] well.. wanna log that then? :-P [11:12:57] no [11:13:21] rebooting a random depooled ssl server in tampa, who cares [11:14:48] as long as icinga is still notifying in here, it's worth logging for that reason [11:23:13] (03PS2) 10Matanya: webperf :lint-clean [operations/puppet] - 10https://gerrit.wikimedia.org/r/97697 [11:23:32] paravoid: ^ [11:23:57] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (206939) [11:24:06] (03CR) 10Ori.livneh: [C: 032 V: 032] "Thank you very much" [operations/puppet] - 10https://gerrit.wikimedia.org/r/97697 (owner: 10Matanya) [11:24:32] thanks ori-l :) [11:24:37] (03PS1) 10Ori.livneh: decom gdash on professor [operations/puppet] - 10https://gerrit.wikimedia.org/r/97900 [11:24:37] damn [11:24:40] he's faster [11:24:55] i think i'm going on a lint project [11:25:06] it drives me made the tabs/spaces mess [11:25:09] ori-l: don't merge yet [11:25:18] cached dns entries? [11:25:21] yes [11:25:22] ttl is 1h [11:25:31] *mad [11:25:42] I deployed the change at 11:00 UTC [11:25:45] that leaves 35' more minutes [11:25:57] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [11:26:04] not that I expect gdash to have many users, at least at this hour [11:26:08] but you never kno [11:26:15] yeah, still nice to do the right thing [11:28:36] !log faidon switched gdash.wm.o from professor.pmtpa -> tungsten.eqiad behind misc-varnish & rebooted ssl1 in tampa [11:28:45] lol [11:28:50] Logged the message, Master [11:28:55] polite nudge? [11:29:09] I'll take it :) [11:29:14] i was alarmed by the alert too [11:29:44] fair enough [11:29:57] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (205308) [11:31:49] ok, going to get some sleep for real now, thanks for the merges / puppet runs / dns change [11:31:57] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [11:33:34] and for udpprofiler! :D [11:33:51] matanya: and the lint! [11:37:43] (03PS1) 10Aude: Fix Wikibase noc symlink [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97903 [11:39:48] (03PS6) 10Aude: Enable Wikidata build on beta labs [WIP] [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95996 [11:40:34] (03CR) 10Aude: [C: 04-1] "needs more testing and scrutiny to ensure this doesn't break localisation cache stuff on beta" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95996 (owner: 10Aude) [11:50:41] (03CR) 10ArielGlenn: [C: 031] "As a first step this is ok. It desperately needs instance-specific stuff separated out into a role module in later steps." (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/96403 (owner: 10Dzahn) [12:42:02] (03PS5) 10Addshore: Start wikidata puppet module for builder [operations/puppet] - 10https://gerrit.wikimedia.org/r/96552 [12:42:57] (03CR) 10Addshore: "Tested on Labs and so far this all seems to work as expected, Now to try and implement the next stage ontop of this!" 
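Back on the gdash DNS switch above: the "35 more minutes" comes from the old record's 1-hour TTL still counting down in resolver caches. How long a particular cache will keep serving the stale answer can be read directly; a sketch, with the resolver address standing in for whichever cache matters:

    # The second column of the answer is the remaining TTL in seconds on that
    # resolver's cached copy; once it reaches zero the new record takes over.
    dig @127.0.0.1 +noall +answer gdash.wikimedia.org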
[operations/puppet] - 10https://gerrit.wikimedia.org/r/96552 (owner: 10Addshore) [13:29:50] (03PS1) 10Matanya: varnish : lint clean up [operations/puppet] - 10https://gerrit.wikimedia.org/r/97910 [13:30:43] (03CR) 10jenkins-bot: [V: 04-1] varnish : lint clean up [operations/puppet] - 10https://gerrit.wikimedia.org/r/97910 (owner: 10Matanya) [13:31:47] (03PS2) 10Matanya: varnish : lint clean up [operations/puppet] - 10https://gerrit.wikimedia.org/r/97910 [13:39:29] paravoid: now ori-l isn't here, you can be faster :) ^ [14:02:17] (03CR) 10Faidon Liambotis: [C: 032] decom gdash on professor [operations/puppet] - 10https://gerrit.wikimedia.org/r/97900 (owner: 10Ori.livneh) [14:45:50] (03CR) 10ArielGlenn: "What changed in the apache configs that would make this work now?" [operations/apache-config] - 10https://gerrit.wikimedia.org/r/91209 (owner: 10Reedy) [15:06:59] hello, is meta having issues? i just got: Our servers are currently experiencing a technical problem. This is probably temporary and should be fixed soon. Please try again in a few minutes. If you report this error to the Wikimedia System Administrators, please include the details below. [15:07:00] Request: POST http://meta.wikimedia.org/wiki/Special:CentralNoticeBanners/edit/B13_1125_nmsrt_enYY, from 208.80.154.134 via cp1066 frontend ([10.2.2.25]:80), Varnish XID 1146839306 [15:07:01] Forwarded for: 199.83.222.237, 208.80.154.134 [15:07:01] Error: 503, Service Unavailable at Wed, 27 Nov 2013 15:03:23 GMT [15:09:55] i am trying to clone a banner and just got: Internal error - Meta [15:09:56] [d16b359b] 2013-11-27 15:08:49: Fatal exception of type BannerExistenceException [15:11:16] oh i guess it actually already cloned [15:12:01] Jeff_Green: have you seen anything weird? [15:14:00] meganhernandez: looking [15:17:56] meganhernandez: can you try it again? [15:18:07] hi akosiaris! I have a couple of .deb reviews that need some love, whenever you have a few minutes [15:18:37] the big one is the varnishkafka .deb, I'd like to get final approval from either you or faidon before we tag and merge [15:18:38] https://gerrit.wikimedia.org/r/#/c/78782/ [15:18:40] ottomata: gime a couple of minutes and I 'll be there [15:18:45] suuure thanks [15:19:16] it seems ok now Jeff_Green [15:21:59] https://gdash.wikimedia.org/dashboards/reqerror/ so there's that [15:26:18] apergos: I haven't really been tuned in on the 503 issue, been heads-down on fundraising cluster prep [15:26:47] ottomata: so [15:27:08] varnishkafka is number 1. What other deb ? [15:27:21] lkoigster [15:27:23] logster [15:27:23] https://gerrit.wikimedia.org/r/#/c/95556/ [15:27:29] and python-kafka [15:27:29] https://gerrit.wikimedia.org/r/#/c/97848/ [15:27:43] both of those are pretty simple —with-python2 packages [15:28:00] apergos: I've seen and skimmed the various email, is that a decent summary? [15:28:10] 20 patch sets ? [15:28:12] man... [15:28:14] haha, yeah [15:28:23] well, faidon originally just committed a skeleton [15:28:34] and we added logging, and logrotate, and a postinst, etc. etc. 
[15:29:23] (03PS2) 10ArielGlenn: mark stuff in decomm.pp with their rt tickets for easier tracking [operations/puppet] - 10https://gerrit.wikimedia.org/r/93930 [15:29:35] oo, actually, i think I just thought of something I might need to add….not sure, [15:29:41] it's been a variety of issues, mostly covered in the emails [15:29:45] Jeff_Green: [15:29:52] so, akosiaris, varnishkafka will log periodic json stats to /var/cache/varnishkafka.stats.json [15:29:53] (sorry, tuned out for a minute in another window) [15:30:05] varnishkafka starts up as varnishlog user [15:30:31] i *think* that it doesn't have permissions to write to /var/cache at first [15:31:04] hmm [15:31:14] maybe I should make varnishkafka itself just write to /tmp/varnishkafka.stats.json by default [15:31:20] and have puppet move it to /var/cache or something? [15:31:23] with proper permissions? [15:31:45] that is the fast way out. You can do that... but what are those json stats? [15:31:49] hello i was going to put up a 100% test. should i wait? i got that error about 30 mins ago but haven't got it again since then [15:32:09] (03PS3) 10ArielGlenn: mark stuff in decomm.pp with their rt tickets for easier tracking [operations/puppet] - 10https://gerrit.wikimedia.org/r/93930 [15:32:46] meganhernandez: should be fine, go for it [15:32:54] heyhey meganhernandez [15:33:26] akosiaris: https://gist.github.com/ottomata/7677672 [15:33:35] just stats about running varnishkafka and librdkafka stuff [15:33:40] hi there werdna [15:33:42] want to parse that to send to ganglia [15:33:46] ok will do Jeff_Green [15:33:49] things like txbytes, errors,e tc. [15:35:04] (03CR) 10ArielGlenn: [C: 032] mark stuff in decomm.pp with their rt tickets for easier tracking [operations/puppet] - 10https://gerrit.wikimedia.org/r/93930 (owner: 10ArielGlenn) [15:35:11] ottomata: and this gets rewritten often right ? [15:35:27] its appended to [15:35:32] every 60 seconds by default [15:35:56] so not a cache [15:36:05] suppose not? faidon told me to put it there :p [15:36:05] right ? [15:36:19] but maybe he didn't realize that it was a log? [15:36:19] can it be deleted without data loss ? [15:36:22] yes [15:36:23] no problem [15:36:26] well i mean [15:36:31] its just running stats [15:36:39] if you restart varnishkafka they will all reset [15:36:47] you'll lose history if you delete it, but no biggy [15:37:01] its only stats, has nothing to do with real operation of varnishkafka [15:37:09] as always with kafka a cornercase... [15:37:18] ha, eh? [15:37:25] ok cache it is [15:37:34] since it can be deleted without data loss [15:37:36] yeah [15:37:44] but i mean, you can delete /var/log/syslog without dataloss [15:37:46] otherwise i 'd advise spool [15:37:59] of course that is dataloss [15:38:13] but you said it gets recreated on restart [15:38:18] hmm [15:38:20] on restart [15:38:24] oh, no it will still be appended to [15:38:25] truncated ? [15:38:27] just the counters will reset [15:38:56] just out of curiosity [15:39:04] the program want to append data [15:39:07] wants* [15:39:15] how does it append ? [15:39:21] it json we are talking about [15:39:22] fopen "a" [15:39:45] https://gist.github.com/ottomata/7677672 [15:39:58] how many appends have happened here ? 
[15:39:59] https://github.com/wikimedia/varnishkafka/blob/master/varnishkafka.c#L1773 [15:40:07] oh that's 2 [15:40:13] each line is a full json object [15:40:19] varnishkafka will log its own stats [15:40:24] and librdkafka logs its own as well [15:40:48] gistfile2.json is just 2 lines from that file [15:40:51] it logs one line at a time [15:41:12] but those 2 are completely different from each other [15:41:42] yes [15:42:04] those are the only 2 different objects that are (currently) logged [15:42:12] but true, librdkafka keeps its own stats [15:42:19] and varnishkafka keeps its own as well [15:42:32] librdkafka ? [15:42:34] librdkafka takes a callback from which it will perodically log [15:42:37] yeah, the kafka C library [15:42:41] varnishkafka just uses it [15:42:43] wait [15:42:55] varnishkafka writes that file [15:42:58] haha, oh man [15:43:01] yes. [15:43:05] and periodically appends to it [15:43:14] what has librdkafka to do with that file ? [15:43:23] sorry, ok, librdkafka is just the C kafka library [15:43:33] varnishkafka uses it to produce varnish log messages to kafka [15:44:12] librdkafka maintains its own stats about its runtime usage [15:44:24] please tell me in a different file [15:44:31] not in a file [15:44:32] just in memory [15:44:36] :-) [15:44:38] if you want to get that data out of it, you pass it a callback that it will periodically get called [15:44:47] we pass it this vk_log_stats function [15:44:50] to get the json stats into this file [15:44:54] pull the data and write it yourself [15:45:01] https://github.com/wikimedia/varnishkafka/blob/master/varnishkafka.c#L1361 [15:45:27]                 rd_kafka_conf_set_stats_cb(conf.rk_conf, kafka_stats_cb); [15:45:36] (line 1975) [15:45:49] we are pulling the data and writing it ourself [15:45:55] rdkafka calls OUR callback [15:45:57] and we write it [15:46:19] so in that gist [15:46:21] s/we/varnishkafka/g :) [15:46:25] there are two json documents [15:46:38] which is which ? [15:46:57] the kafka is from rdkafka [15:46:59] kafka == librdkafka (we can change that top level key to rdkafka or something, been kinda wanting to do that) [15:47:02] and varnishkafka == varnishkafka [15:47:03] yes [15:47:05] okok [15:47:13] so it will append those two types of documents [15:47:17] periodically [15:47:21] yup [15:47:21] correct ? [15:47:31] finally I understood :-) [15:47:36] haha :) [15:48:00] so /var/cache sounds ok with it [15:48:12] ok, cool [15:48:19] and you want to create the directory with the correct user ? [15:48:32] either way, right now we were just creating the file directly in /var/cache [15:48:37] /var/cache/vanrishkafka.stats.json [15:48:56] directory or whatever, varnishkafka should be able to write there on install [15:49:20] so have the package create the directory [15:49:27] .dirs file [15:49:47] k, and how to chown it to varnishlog user? postinst? [15:50:51] yes [15:51:22] hm, should this be done in the Makefile as part of make install? since that is the default location? [15:52:04] it should be done after you create the user [15:52:19] and make install does not create users :-) [15:52:33] postinst is your friend [15:52:38] …this package doesn't create the user :/ the user is varnishlog which uhhhhh, hm maybe it shouldn't be? [15:52:43] varnishlog i think comes with varnish package? [15:52:44] not sure [15:53:13] ah.. hmm [15:53:31] varnishkafka needs to run as varnishlog I suppose ? [15:53:44] hm, i don't know actually [15:53:47] Snaps? 
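Since the plan a bit further up is to parse this appended-JSON stats file and feed the counters to ganglia, here is a rough sketch of that step. The grep filter, jq path and metric name are assumptions (the actual layout of the "kafka" object isn't shown in the log), and jq plus gmetric are assumed to be installed:

    # Take the newest librdkafka stats line from the append-only file and
    # push one counter to ganglia via gmetric.
    STATS=/var/cache/varnishkafka.stats.json
    last=$(grep '"kafka":' "$STATS" | tail -n 1)
    txbytes=$(printf '%s\n' "$last" | jq '.kafka.txbytes // 0')
    gmetric --name varnishkafka_txbytes --value "$txbytes" --type double --units bytes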
^ [15:53:51] i don't think it does [15:57:33] akosiaris: let's assume it doesn't need to be varnishlog [15:57:37] should it have its own user? [15:57:43] yes [15:58:00] hm [15:58:05] varnishncsa runs as varnishlog [15:58:17] maybe for a reason ? [15:58:24] doesn't seem to [15:58:24] I am unfortunately unaware... [15:58:27] you can run that as any user [15:58:31] i just ran it as ganglia and it works [15:58:59] i think that's why we (faidon? snaps?) made varnishkafka run as varnishlog [15:59:02] because varnishncsa does [15:59:06] and this is a replacement for varnisihncsa [15:59:37] I have the suspicion that it *needs* to run as varnishncsa [15:59:39] ok, alternatively, if it does need to run as varnishlog, you think I should just use .dirs and .postinstall to chown? [15:59:46] I was in a meeting [15:59:50] I'm here now [15:59:56] hiiiii! :0 [16:00:11] if you need me, backlog's TL;DR please :) [16:00:14] how you doing, sleepybusy paravoid? [16:00:41] ok, paravoid, do you know, does varnishkafka NEED to run as varnishlog user [16:00:46] varnishncsa runs as varnishlog user [16:00:50] but doesn't seem to need to [16:01:05] and, if vanrishkafka does not need to run as varnishlog user, should we create a user for it? [16:01:29] I don't think there's any such requirement [16:01:35] but why would you create a new user? [16:01:46] ha, i don't know, because akosiaris thinks we shoudl? [16:02:17] IMHO to keep them separate... why should a daemon share another daemons UID ? [16:02:30] it not wise from a security POV [16:02:30] we won't run both [16:02:51] varnishkafka is a heavily modified fork of varnishncsa [16:03:15] well, we'll run both for a little bit, buuuut, what creates teh varnishlog user? [16:03:21] but yeah, they /can/ coexist [16:03:21] is that from varnish package? [16:03:30] dunno, probably? [16:03:36] I don't see a harm in a separate user, though [16:03:40] so if it makes things easier, by all means [16:04:01] it makes things a wee more difficult [16:04:04] (03CR) 10ArielGlenn: role and module structure for ishmael (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/96403 (owner: 10Dzahn) [16:04:05] i'd rather it just be varnishlog [16:04:21] it does not i think.. [16:04:32] mostly because you don't have a dependency on another package [16:04:38] that creates the user and blah blah [16:04:39] if it depends on varnish, that's fine [16:04:50] it keeps things slighlty cleaner [16:04:51] Depends: varnish, ${shlibs:Depends}, ${misc:Depends} [16:06:33] yes, varnishlog is created by varnish package [16:06:57] https://gist.github.com/ottomata/7678234 [16:07:36] yes it does.... in that case [16:07:39] ah dpkg-statoverride! [16:07:44] and since you depend on it [16:07:44] didn't know about that one [16:07:59] its fine [16:08:16] still, my pedantic self would be happier with a different user as long as it works [16:08:51] haha, well, i'd rather keep it since it depends on it anyway, varnishncsa uses it, and paravoid likes it :) [16:10:22] ok [16:11:07] btw.. 
don't use dpkg-statoverride in this case [16:11:22] ok [16:11:25] it's meant as a tool for the sysadmin not the package builder [16:11:28] oh ok [16:11:28] hm [16:11:42] (03PS21) 10Ottomata: (WIP) Initial Debian version [operations/software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/78782 (owner: 10Faidon Liambotis) [16:12:11] (03PS22) 10Ottomata: (WIP) Initial Debian version [operations/software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/78782 (owner: 10Faidon Liambotis) [16:12:15] oop, forgot dirs, there we go [16:12:22] (03CR) 10Faidon Liambotis: [C: 04-2] "Let's just fix it in the extension IMHO. We can wait." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97540 (owner: 10Yurik) [16:12:32] ottomata: I should add you a debian builder job :D [16:12:43] ottomata: that is https://github.com/mumrah/kafka-python ? [16:12:50] yes [16:13:09] missing debian/copyright :P [16:13:28] yeah, i never know what to put for that [16:14:24] shoudl I just put the LICENSE file there? [16:15:04] nope... it has a format. [16:15:10] c/p kafka's one [16:15:18] and modify it accordingly [16:15:27] http://www.debian.org/doc/packaging-manuals/copyright-format/1.0/ [16:15:27] ok [16:18:08] (03CR) 10Akosiaris: [C: 04-1] "And missing debian/copyright" (034 comments) [operations/debs/python-kafka] (debian) - 10https://gerrit.wikimedia.org/r/97848 (owner: 10Ottomata) [16:20:49] (03PS3) 10Ottomata: Initial deb packaging [operations/debs/python-kafka] (debian) - 10https://gerrit.wikimedia.org/r/97848 [16:20:56] (03CR) 10Ottomata: Initial deb packaging (034 comments) [operations/debs/python-kafka] (debian) - 10https://gerrit.wikimedia.org/r/97848 (owner: 10Ottomata) [16:23:36] (03PS4) 10Ottomata: Initial deb packaging [operations/debs/python-kafka] (debian) - 10https://gerrit.wikimedia.org/r/97848 [16:24:49] ok, akosiaris. there we go [16:26:04] mutante, are you working already? [16:26:43] andrewbogott: was reading backlog , kind of [16:26:47] whats up [16:27:00] I lost my backscroll and need you to talk me through nagios/puppet-freshness again :( [16:27:11] If you're not working yet I can check in again later in the day :) [16:27:43] andrewbogott: hold on, i'll find an old list mail and forward it to you [16:29:34] mutante: Thank you! [16:30:01] andrewbogott: http://lists.wikimedia.org/pipermail/labs-l/2012-March/000128.html and forwarded [16:30:40] check if base.pp still has [16:30:45] exec { "puppet snmp trap": [16:30:59] then you got the answer to _how_ the puppet runs trigger it [16:31:29] yep, it's still there. [16:31:55] I'm pretty confused by ".1.3.6.1.4.1.33298" but I guess I can just ignore that part :) [16:32:24] check this part "See how `hostname` is being used in there. This simply works in [16:32:28] production because production hosts return the same string for [16:32:30] hostname that Nagios uses to define the hosts it knows about. On labs [16:32:34] though, hostname returns the resource name (f.e. i-000000f8)," [16:33:16] yeah that is weird [16:33:22] andrewbogott: yes, that's the OID for {iso(1) identified-organization(3) dod(6) internet(1) private(4) enterprise(1) 33298} [16:34:09] and that should work fine [16:34:23] First Registration Authority Mark Bergsma [16:34:40] http://oid-info.com/get/1.3.6.1.4.1.33298 [16:35:02] that's not gonna be the issue, but see my mail, i went through several issues one by one back then [16:35:31] mutante: ok, looks like this email has everything I need! 
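Back on the varnishkafka packaging thread, to make the ".dirs file plus postinst" plan concrete: debian/varnishkafka.dirs would carry a single var/cache/varnishkafka line (the exact directory name is an assumption, since only the flat /var/cache/varnishkafka.stats.json path is named in the log), and the postinst would hand that directory to the varnishlog user created by the varnish dependency. A sketch of such a postinst:

    #!/bin/sh
    # debian/varnishkafka.postinst (sketch): the directory itself is shipped
    # via the .dirs file; on configure we only fix its ownership.
    set -e
    case "$1" in
        configure)
            chown varnishlog:varnishlog /var/cache/varnishkafka
            ;;
    esac
    #DEBHELPER#
    exit 0

As akosiaris notes above, dpkg-statoverride stays a sysadmin tool; the package just does a plain chown.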
Now I just have to figure out how to convey that info to wikitech... [16:35:46] it could be the hostname mismatch [16:35:47] which, I guess I could just make wikitech the nagios host, but that seems like overkill. [16:35:49] or firewalling [16:36:33] andrewbogott: haha, saw that yet? [16:36:35] So when trying to find out if it is even possible for an instance to [16:36:38] know it's own instance name with a local command, Andrew Bogott [16:36:41] pointed me to this (thanks!:): [16:36:43] http://aws.amazon.com/code/1825 (EC2 Instance Metadata Query Tool ), [16:37:25] yeah -- from wikitech's perspective I have all that info already. So solving /my/ problem may be easier than solving the actual nagios alert issue [16:37:37] (I'm trying to generate a report page on wikitech that includes puppet freshness info) [16:37:48] aah [16:38:00] thought you just wanted to fix the current icinga [16:38:28] (03CR) 10Akosiaris: [C: 04-1] "Minor stuff plus the debian/copyright file missing" (032 comments) [operations/debs/logster] (debian) - 10https://gerrit.wikimedia.org/r/95556 (owner: 10Ottomata) [16:38:29] They may or may not turn out to be the same problem [16:39:49] ah, also see replies on that thread , like when petan said he'd fix it and "I changed the address in nagios to that ec weird string." [16:39:57] yes [16:41:20] so for your report, you could also just parse the icinga log to see the incoming passive checks there (if they work) or listen to the packets yourself (snmptrapd) [16:44:21] oh a report page, hmmm [16:44:40] if you had report output what would you do with it? [16:45:38] I ask because I have thought of having various status reports automatically show up on wikitech but no idea how to go about it [16:46:02] sounds like trying to make puppet dashboard homebrew version? [16:46:29] well for output that puppet doesn't have [16:47:32] i would like (the functionality of) this https://www.datadoghq.com/ [16:48:00] https://www.datadoghq.com/integrations/ [16:48:28] correlate jenkins event with ganglia metrics..bla [16:50:15] (03PS2) 10Reedy: Switch www portals to using one docroot [operations/apache-config] - 10https://gerrit.wikimedia.org/r/91209 [16:50:17] that would be nice [16:50:21] (03CR) 10Akosiaris: Initial deb packaging (031 comment) [operations/debs/python-kafka] (debian) - 10https://gerrit.wikimedia.org/r/97848 (owner: 10Ottomata) [16:50:36] ah there is a Reedy [16:50:50] perhaps with an answer to my question [16:53:14] (03CR) 10Akosiaris: Initial deb packaging (031 comment) [operations/debs/python-kafka] (debian) - 10https://gerrit.wikimedia.org/r/97848 (owner: 10Ottomata) [16:54:27] (03PS2) 10Ottomata: Initial .deb packaging [operations/debs/logster] (debian) - 10https://gerrit.wikimedia.org/r/95556 [16:58:56] (03PS5) 10Ottomata: Initial deb packaging [operations/debs/python-kafka] (debian) - 10https://gerrit.wikimedia.org/r/97848 [16:59:18] (03CR) 10Ottomata: Initial deb packaging (031 comment) [operations/debs/python-kafka] (debian) - 10https://gerrit.wikimedia.org/r/97848 (owner: 10Ottomata) [17:00:30] ok akosiaris, commetns on all 3 .debs responded to and patchsets up [17:01:38] ottomata: why 0.8.1 [17:01:57] the guy says it will be 0.8.0-1 [17:02:19] (at least I hope it will be true) [17:04:36] (03CR) 10Reedy: "The problem last time around was that I moved the www.wikimedia.org "portal" to the wwwportals config." 
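Back on the labs instance-name question from the puppet-freshness discussion above: labs instances sit behind an EC2-compatible metadata service, so an instance can look up its own identifiers with a local command. A sketch, assuming the standard link-local metadata address is reachable from the instance:

    # Prints the EC2-style id for this instance (e.g. i-000000f8).
    curl -s http://169.254.169.254/latest/meta-data/instance-id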
[operations/apache-config] - 10https://gerrit.wikimedia.org/r/91209 (owner: 10Reedy) [17:06:56] hmmmmMMM [17:07:02] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (208326) [17:07:24] dunno, good catch [17:07:43] (03PS6) 10Ottomata: Initial deb packaging [operations/debs/python-kafka] (debian) - 10https://gerrit.wikimedia.org/r/97848 [17:10:20] (03CR) 10Ottomata: Initial .deb packaging (031 comment) [operations/debs/logster] (debian) - 10https://gerrit.wikimedia.org/r/95556 (owner: 10Ottomata) [17:12:02] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [17:16:03] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (202908) [17:20:02] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [17:39:01] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (201723) [17:39:34] akosiaris: i am hoping to get a +1 from you today on at least the varnishkafka deb today so I can finalize that, ehhhhhh? :) [17:39:59] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [17:41:58] ottomata: having a meeting really soon unfortunately. Can't promise anything :-( [17:42:27] sooook, maybe paravoid will help :D? [17:42:35] I'm on the same meeting [17:43:14] rats! [17:43:15] k [17:44:59] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (201917) [17:48:59] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [17:54:59] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (207218) [17:57:59] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [18:07:04] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (211654) [18:08:34] PROBLEM - Varnish HTTP mobile-frontend on cp3012 is CRITICAL: Connection timed out [18:10:05] PROBLEM - LVS HTTP IPv4 on mobile-lb.esams.wikimedia.org is CRITICAL: Connection timed out [18:11:34] RECOVERY - Varnish HTTP mobile-frontend on cp3012 is OK: HTTP OK: HTTP/1.1 200 OK - 261 bytes in 0.191 second response time [18:11:55] RECOVERY - LVS HTTP IPv4 on mobile-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 23050 bytes in 0.291 second response time [18:12:02] !log kill -9 gdb on cp3012, attached to varnish frontend [18:12:16] Logged the message, Master [18:13:08] paravoid: I was still there! 
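On the version question raised earlier in this stretch (why 0.8.1 when upstream says the next python-kafka release will be 0.8.0): if that release does land, the package would normally track it with a first Debian revision in the changelog. Purely illustrative:

    # Record the new upstream version plus the -1 Debian revision (devscripts' dch).
    dch --newversion 0.8.0-1 --distribution unstable "Track upstream 0.8.0 release"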
[18:13:26] in any case, it seems it was 3011 this time, I should have set up both [18:13:57] (at least, 3011 restarted its child) [18:14:01] oh, sorry [18:14:16] I never got any kind of stop on the gdb on 3012 [18:18:06] (03PS1) 10Matthias Mullie: Fix config [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97933 [18:20:39] (03PS2) 10Chad: Fix config [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97933 (owner: 10Matthias Mullie) [18:26:33] (03PS3) 10Jeremyb: Fix Flow config [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97933 (owner: 10Matthias Mullie) [18:36:48] (03CR) 10Cmcmahon: [C: 031] "merge at will please, this is required for Flow on beta labs" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97933 (owner: 10Matthias Mullie) [18:41:12] http://translatewiki.net/ is down?! [18:41:25] are we hosting it? [18:41:38] 1. it's up, 2. no. [18:42:11] twkozlowski, i am getting domain not configured wikimedia message [18:42:50] yurik-road: strange, it works fine for me [18:43:49] hmm, it just came back up [18:51:33] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:53:24] RECOVERY - RAID on searchidx1001 is OK: OK: optimal, 1 logical, 4 physical [18:54:38] (03PS1) 10Manybubbles: Show "using new search engine" when using Cirrus [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97939 [18:55:37] (03CR) 10Manybubbles: "Might be worth deploying when we re-enable Cirrus on Monday. Not before." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97939 (owner: 10Manybubbles) [18:56:05] <^d> That can be deployed any time without harm :p [18:56:33] (03PS2) 10Manybubbles: Show "using new search engine" when using Cirrus [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97939 [18:56:58] yeah [18:57:04] I suppose just not today [19:00:13] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [19:01:46] <^d> manybubbles: I merged the other one about betafeatures. [19:01:54] <^d> I had reviewed it before, and Lego made the change I wanted. [19:02:21] sweet! [19:02:25] wonderful, really! [19:02:28] I'm excited about that [19:03:51] oh no, do I really have to submit a bug with a penis image as an example :-( [19:04:13] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (213298) [19:04:39] bblack:so, anything useful? [19:05:01] or did my kill mess with it too much [19:08:33] paravoid: I didn't get anything, I didn't even see a stop in gdb before the kill, so that doesn't even seem right [19:08:46] :/ [19:09:24] where did you pick up the original tpp.c assert from? [19:09:45] syslog [19:11:26] yeah I guess it was only on cp3011 this time [19:11:32] cp3012 died, but not with the assert [19:12:06] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [19:15:06] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (203480) [19:17:15] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [19:18:15] hashar, Reedy : https://gerrit.wikimedia.org/r/#/c/97933 is a fix for labs that should have no effect in production. But my understanding is it's bad to have config changes in master that aren't deployed since it confuses someone trying to push a fix. So do we do? 
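On the concern just raised about config changes merged in master but not deployed: the next person deploying can spot that state from the deployment checkout before syncing anything. A sketch, assuming the default remote and branch names:

    git fetch origin
    git log --oneline HEAD..origin/master   # merged in Gerrit but not yet pulled/synced here
    git status --short                      # live hacks that were never committed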
[19:18:27] s/do we/what do/ [19:19:26] (03Abandoned) 10Yuvipanda: tools: Remove redundant declration [operations/puppet] - 10https://gerrit.wikimedia.org/r/97629 (owner: 10Yuvipanda) [19:19:59] (03CR) 10Spage: [C: 031] "Looks good, I'd like to +2 but will undeployed config changes confuse people working on fixing production?" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97933 (owner: 10Matthias Mullie) [19:21:51] (03CR) 10Chad: "Yes. Please don't merge unless you plan to deploy :)" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97933 (owner: 10Matthias Mullie) [19:22:06] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (202254) [19:23:08] meh, job_queue, limit 199,999 and we're always just slightly above [19:23:31] but that limit keeps going up over time :P [19:24:08] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [19:27:44] ^d: about merging config changes ^^ [19:28:34] ^d (you're Chad, right?) Thanks for responding. My understanding is mediawiki-config changes get automatically deployed to beta. So we would wait for an LD window to +2 and deploy the config changes to production? [19:29:06] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (202185) [19:29:11] <^d> Yeah, because then if someone else needs to deploy to production (ie: site's crashing), you're in their way. [19:30:17] <^d> Config changes that span prod/beta files are fun that way :) [19:31:30] ^d if the change was only to mediawiki-config/*-labs.php files would that be OK to +2? Or chrismcmahon we could put betalabs on a branch. [19:31:49] <^d> No branches here, that'll make them diverge :) [19:32:03] <^d> And yes, if we were only touching *-labs* it'd be fine :) [19:32:38] <^d> The production impact here is basically zero, so I have no real opposition to deploying, just saying that merging without deploying is bad. [19:33:04] <^d> (Ask Ryan_Lane to tell you how awesome the production/test branches of puppet were :p) [19:33:30] * Ryan_Lane pukes [19:33:34] haha [19:33:54] +1 merging without deploy is bad [19:34:16] the next one deploying will be surprised and track you down [19:35:02] <^d> In my case, if I'm the next one it involves a fair bit of verbal assault after tracking you down :p [19:35:21] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [19:35:47] <^d> And if you make livehacks, I commit them locally with embarrassing commit summaries like "NOT COMMITTING YOUR WORK IS BAD, YO" ;-) [19:36:44] and if you see remnants of the 'test' branch, remove it :p [19:40:08] we should just use svn [19:41:08] ori-l, around? [19:41:13] YuviPanda: get a rack in Tampa for that ;p [19:41:19] YuviPanda, VSS is the best [19:41:37] we should just have index.php, index2.php, index.final.php and so on [19:41:42] SUPER EFFICIENT! [19:42:01] nah, index.php is not compressed [19:42:14] anyway, where's ori? not having him around is like... the end of the world! [19:42:23] yurik: too early in the day, probably [19:42:31] 11am? 
[19:42:34] yeah, you are right [19:43:10] ^d so the net effect on production should be zero if that change is deployed, but I don't have +2 or deploy rights either [19:43:20] even oribot needs sleep [19:43:23] he was around until something like 3-4am his time [19:43:28] ^d: but it's got us stopped cold in beta labs right now [19:43:33] give the man some time to *gasp* sleep [19:43:56] <^d> You're one to talk, you non-sleeping robot man :p [19:45:24] <^d> chrismcmahon: Gonna merge + deploy now. [19:45:31] (03CR) 10Chad: [C: 032] Fix Flow config [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97933 (owner: 10Matthias Mullie) [19:45:32] thanks ^d! [19:46:34] * ^d patiently waits on jenkinsbot. [19:47:28] <^d> Any time you feel like merging, jenkins, just dandy with me. [19:47:34] ^d "HELPING OUT beta-labs IS GOOD, YO" ;-) [19:48:18] <^d> "I SAVED YOUR CODE FOR YOU" <- Next one I'll use ;-) [19:48:29] (03Merged) 10jenkins-bot: Fix Flow config [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97933 (owner: 10Matthias Mullie) [19:49:20] hassle it may be but it is in the service of a smooth first deployment Real Soon Now [19:49:32] ^d, are you deploying flow? i thought it was next week :) can't wait to crit^H^H^H praise it! [19:49:37] !log demon synchronized wmf-config/CommonSettings.php 'Fixes for Flow config, no-op in prod' [19:49:51] ^d: 'IT IS DANGEROUS TO GO ALONE, HERE TAKE THIS: git commit'? :) [19:49:51] Logged the message, Master [19:50:09] yurik: Flow is out next week but it's nice to work out config issues before it goes live, eh? [19:50:13] !log demon synchronized wmf-config/InitialiseSettings.php 'Fixes for Flow config, no-op in prod' [19:50:29] Logged the message, Master [19:50:32] hehe, true that. is it available anywhere public yet? [19:50:32] <^d> yurik: Yeah, I decided to deploy it to all pages on all wikis the day before Thanksgiving. Living life on the edge like that ;-) [19:50:53] ^d: also enable AFT on all of them while you are at it [19:51:02] and LQT at the same time... [19:51:12] so where can i see it? [19:51:24] <^d> Let's put CodeReview everywhere. What wiki couldn't make use of SVN-based code review systems? [19:51:34] yurik: http://ee-flow.wmflabs.org/wiki/Main_Page [19:52:19] yei, thx YuviPanda [19:52:31] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (205353) [20:00:25] kinda confusing - why is the "flag" icon and "pencil" icon both bring up the editor when i point to my edits, but for other posts flag shows "hide". Also my name everywhere is kinda confusing ... I guess it copies after facebook in that [20:00:36] but overall - damn! I like!!! :) [20:02:31] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [20:04:02] hmm, paravoid, did we know that default-jdk/jre is java 6? [20:04:07] openjdk6? [20:04:08] yes [20:04:12] i thought it was 7 on precise [20:05:10] ottomata, next week? any ideas of which place we should try next? btw, is there a spreadsheet of places we went to already? 
[20:05:27] naw i don't have any other locations, thought you had some [20:05:31] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (201836) [20:05:32] i'm sure there are plenty more [20:05:57] there are - i saw something niceish in the village on google maps, will check [20:06:07] but we need a spreadsheet to keep track [20:06:32] ottomata: default-jdk is openjdk-6 because openjdk-7 didn't work in some of the more niche architectures [20:06:35] i think there was one, don't remember [20:06:36] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [20:06:42] at least that was the case for Debian, not sure if Ubuntu is just following the lead [20:07:04] oof, i'm looking at this because I did a ps recently on a hadoop node, and noticed that /usr/lib/jvm/j2sdk1.6-oracle/bin/java is running :/ [20:07:11] thought we uninstalled this [20:07:13] looking at it now [20:07:28] i think we are supposed to run ahdoop on openjdk 7, not 6 [20:07:45] yes [20:07:58] j2sdk1.6 isn't openjdk 6 [20:08:03] yeah i know [20:08:04] it's sun/oracle javas [20:08:33] right, so I saw oracle running, was all like 'whaaa', and then started looking at update-java-alternatives and default-jdk [20:08:39] was confused by the default being 6 but ok [20:15:36] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (212242) [20:38:32] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [20:44:16] yurik: what's up? [20:45:29] ori-l: I merged the gdash professor patchset but didn't clean up manually [20:45:33] I'm not sure you should bother [20:45:40] we're going to kill the box soon [20:45:41] hey ori-l ! you are back :) I tried to do git-update, but it fails because it can't pass through the creds from the host [20:47:38] when i enable the passthrough, it complains because only the insecure pubkey is being added. When i enable all keys to be used, it complains that i need to change the host. What do you think would be the best way to do a git pull under my host machine credentials? [20:49:14] paravoid: danke danke [20:49:31] in the spirit of yurik: udpprofile? [20:49:46] yurik: hmm [20:50:20] you could execute the git-pull on the host, but then you don't have the benefit of a homogenous environment [20:50:45] why would it need to pass through creds? anonymous fetch? [20:51:03] good question [20:51:13] it can do the fetch the first time without issue [20:51:41] gwicke: RT 6408,6409,6410 have been created by cmjohnson because he saw disk failures. i addd you and info that we just talked about them the other day when /tmp was full etc.. and you confirmed they are constantly used for tests, not prod.. so fyi [20:53:27] that is cerium,xenon,praseodymium.. software raid [20:54:11] is it legal to have a CNAME for both foo.bar.quux.tld and bar.quux.tld ? I thought CNAME would delegate all the children? [20:54:27] (hint: WMF does that now) [20:56:04] ori-l & YuviPanda, yes, but usually i have the git remote set to ssh+username, so are you saying there is a way to git pull anonymously? [20:56:06] ^d: who was 3 and 4 from? [20:56:25] <^d> I don't remember. Would have to look at logs. [20:56:32] (03CR) 10Edenhill: (WIP) Initial Debian version (031 comment) [operations/software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/78782 (owner: 10Faidon Liambotis) [20:57:08] (03PS1) 10John F. 
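For the Oracle-JDK surprise above: the alternatives system shows what the java symlink chain currently resolves to and can repoint it at OpenJDK 7. The set name below is the usual one on precise, so treat it as an assumption to verify against the -l output:

    readlink -f "$(which java)"        # where the java symlink chain ends up
    update-java-alternatives -l        # JVM sets the machine knows about
    sudo update-java-alternatives -s java-1.7.0-openjdk-amd64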
Lewis: Enable AbuseFilter block option on Wikidata [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98002 [20:57:10] yurik: ori-l hmm, I have origin set to https and a remote 'gerrit' set to ssh. [20:57:19] I think vagrant by default sets origin to https [20:57:21] (03CR) 10Faidon Liambotis: [C: 04-1] varnish : lint clean up (0317 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/97910 (owner: 10Matanya) [20:58:32] YuviPanda, i only have one remote - gerrit [20:59:25] yurik: git pull takes either the name of the remote ('origin' by default) or an explicit URL [20:59:38] yurik: in a pinch, you could get the remote URL and munge it to be anonymous [20:59:59] ottomata: I dont know, guess it depends on the varnish shm log file permissions [21:00:11] hmm, i guess that could be done :) [21:00:25] now i need to learn how to do string manipulations in bash :) [21:00:29] bleh [21:00:37] but thanks for the idea ): [21:01:14] yurik: git pull $(git config --get remote.origin.url | sed 's|//.*@|//|g') [21:01:32] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (200522) [21:01:35] (03CR) 10Ottomata: (WIP) Initial Debian version (031 comment) [operations/software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/78782 (owner: 10Faidon Liambotis) [21:03:06] yurik: another possibility is to have an interactive prompt on provisioning [21:03:21] yurik: "Would you like me to generate a key and upload it to Gerrit?" [21:03:40] yurik: gerrit has an http API that could plausibly be used [21:03:41] ori-l, yes, but i will leave that for you :) [21:03:43] anyways, needs a bit of thought [21:04:32] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [21:05:43] (03CR) 10Odder: [C: 04-1] Enable AbuseFilter block option on Wikidata (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98002 (owner: 10John F. Lewis) [21:06:38] (03CR) 10Vogone: [C: 04-1] "In the discussion finite block durations were proposed as the block targets will be IPs." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98002 (owner: 10John F. Lewis) [21:06:50] ori-l: would need to double check with ^d, but afaik keys should never be touched in gerrit, only uploaded into the thing on wikitech [21:07:03] <^d> {{cn}} [21:07:12] <^d> You do have to upload to gerrit, because it doesn't use the ldap keys. [21:07:14] <^d> (Which is stupid) [21:09:33] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (205760) [21:09:36] Gerrit is stupid, news at 11 [21:10:33] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [21:11:54] (03CR) 10Odder: "What's the acceptable block duration? One month, half a year?" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98002 (owner: 10John F. Lewis) [21:17:01] (03PS1) 10Ori.livneh: serve graphite.wikimedia.org via misc-varnish [operations/puppet] - 10https://gerrit.wikimedia.org/r/98003 [21:17:33] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (204450) [21:18:33] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [21:19:26] (03CR) 10Vogone: "There does not seem to be any consensus regarding this. 
I proposed this to be discussed within the community before 1-2 people here start " [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98002 (owner: 10John F. Lewis) [21:21:33] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (206829) [21:26:02] RECOVERY - Disk space on analytics1023 is OK: DISK OK [21:26:12] RECOVERY - SSH on analytics1023 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [21:26:13] RECOVERY - puppet disabled on analytics1023 is OK: OK [21:26:13] RECOVERY - Host analytics1023 is UP: PING OK - Packet loss = 0%, RTA = 0.60 ms [21:26:22] RECOVERY - DPKG on analytics1023 is OK: All packages OK [21:26:32] RECOVERY - RAID on analytics1023 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [21:28:32] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [21:40:43] (03PS4) 10Milimetric: [not ready for review] Productionizing Wikimetrics [operations/puppet] - 10https://gerrit.wikimedia.org/r/96042 [21:41:15] (03CR) 10jenkins-bot: [V: 04-1] [not ready for review] Productionizing Wikimetrics [operations/puppet] - 10https://gerrit.wikimedia.org/r/96042 (owner: 10Milimetric) [22:03:07] (03PS1) 10Hashar: (WIP) beta: autoupdate should restart parsoid (WIP) [operations/puppet] - 10https://gerrit.wikimedia.org/r/98007 [22:06:30] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (200821) [22:07:30] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [22:18:30] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (207638) [22:24:42] (03PS1) 10Hashar: parsoid: role class for beta and factor out common code [operations/puppet] - 10https://gerrit.wikimedia.org/r/98014 [22:26:30] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [22:30:30] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (202288) [22:31:30] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [22:32:26] (03PS1) 10Andrew Bogott: Change fflorin's uid... [operations/puppet] - 10https://gerrit.wikimedia.org/r/98016 [22:32:32] (03CR) 10jenkins-bot: [V: 04-1] Change fflorin's uid... [operations/puppet] - 10https://gerrit.wikimedia.org/r/98016 (owner: 10Andrew Bogott) [22:32:36] apergos: so… ^ ? [22:32:38] oops [22:32:47] heh [22:33:21] I rebased seconds ago! Does this mean someone else is fixing the same thing? [22:34:16] not touching it [22:34:19] the heck? Git tells me I'm fully up to date, jenkins tells me it can't rebase [22:35:09] uh [22:35:27] (03CR) 10Mattflaschen: [C: 04-1] "There are several issues with this patch:" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97675 (owner: 10MZMcBride) [22:35:27] my box thinks it's october, that could be an issue. [22:36:06] well your parent is right, I checked that [22:39:52] (03PS2) 10Andrew Bogott: Change fflorin's uid. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/98016 [22:41:16] Well, change some punctuation in the commit log and Jenkins loves me again [22:41:35] andrewbogott: happened to me yesterday too, btw [22:41:41] hahaha [22:42:05] * apergos wonders how many gratuitous commit message edits jenkins has caused [22:42:54] lgtm [22:43:04] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (204534) [22:45:04] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [22:49:04] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (207111) [22:51:05] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [22:54:05] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (206355) [22:55:16] (03PS1) 10Ori.livneh: Add .gitreview [operations/software/mwprof] - 10https://gerrit.wikimedia.org/r/98019 [22:55:27] (03CR) 10Ori.livneh: [C: 032 V: 032] Add .gitreview [operations/software/mwprof] - 10https://gerrit.wikimedia.org/r/98019 (owner: 10Ori.livneh) [22:59:28] (03CR) 10Dzahn: [C: 031] "+1 for avoiding duplicate UID, merge when the existing files are also fixed with find -exec" [operations/puppet] - 10https://gerrit.wikimedia.org/r/98016 (owner: 10Andrew Bogott) [22:59:45] bah, I have Andrew on notify [22:59:50] awww [22:59:53] (03CR) 10Andrew Bogott: [C: 032] Change fflorin's uid. [operations/puppet] - 10https://gerrit.wikimedia.org/r/98016 (owner: 10Andrew Bogott) [22:59:55] that's gonna get annoying [23:00:02] also hi werdna, how are things? [23:00:09] heyhey apergos, things are okay [23:00:17] except I am hung over today and somebody started using a LEAF BLOWER at 8am [23:00:23] eeewwwww [23:00:37] it's a weekday, they shouldn't even be out in their yard [23:00:50] they should be drinking coffe or on their morning commute already [23:01:09] <^demon|meeting> Maybe it's not their own lawn? [23:01:14] * YuviPanda drinks from a bottle of single malt scotch [23:01:24] except I've been using it as a water bottle for a few months now [23:01:29] hahaha [23:01:31] nice.... [23:01:52] ^demon|meeting: in that casse you get a kid to come do the lawn for 75 cents, after school [23:01:54] (03PS1) 10Ori.livneh: Add mwprof to git-deploy; apply to tungsten as role::mwprof [operations/puppet] - 10https://gerrit.wikimedia.org/r/98022 [23:01:56] still not at 8am [23:02:26] you pay kids to come to your lawn so you can tell them to get off it? [23:02:54] (03CR) 10Ori.livneh: [C: 032] Add mwprof to git-deploy; apply to tungsten as role::mwprof [operations/puppet] - 10https://gerrit.wikimedia.org/r/98022 (owner: 10Ori.livneh) [23:03:34] no, not unless they hang around on the lawn after they are done raking it [23:03:41] <^demon|meeting> YuviPanda: No, but I will pay kids to come to my lawn so I don't have to do yard work myself ;-) [23:03:45] <^demon|meeting> Yard work. Bleh. [23:03:57] solution is to not have lawns [23:04:08] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [23:04:16] <^demon|meeting> Indeed. After I moved out of home I swore I'd never do yard work unless I could afford to pay someone to do it for me. [23:04:23] <^demon|meeting> I *refuse* to ever mow a damn lawn again. [23:04:41] :D [23:04:42] ok. so wait. 
you now have a decent income and can afford to pay someone to do it [23:04:50] so you will now do it yourself? :-P [23:05:05] he also lives in SF, so I bet he doesn't have a lawn :P [23:05:12] <^demon|meeting> Touché, both of you. [23:05:44] I have plants on the balcony (and inside now, winter), no lawn. worksforme [23:06:16] <^demon|meeting> I liked having plants on my balcony in Richmond. [23:06:21] <^demon|meeting> I *do* miss my grape vines. [23:07:42] ^demon|meeting: http://www.vineviewer.co/?search=%23grape vines that have #grape in them [23:08:01] <^demon|meeting> YuviPanda: I've yet to watch a vine. Don't think I'll start now :) [23:08:01] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (201957) [23:08:31] ^demon|meeting: me neither. I've so far only been using it to make similarly themed jokes [23:08:50] vine, whine, wine... [23:09:01] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [23:12:37] Ryan_Lane: what's the proper workaround for fatal: Unknown commit none/master for a new git-deploy setup? [23:15:20] (03PS1) 10Ori.livneh: Remove a line of commented-out code to touch repo for git-deploy [operations/software/mwprof] - 10https://gerrit.wikimedia.org/r/98026 [23:15:33] (03CR) 10Ori.livneh: [C: 032 V: 032] Remove a line of commented-out code to touch repo for git-deploy [operations/software/mwprof] - 10https://gerrit.wikimedia.org/r/98026 (owner: 10Ori.livneh) [23:31:31] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [23:39:31] (03CR) 10Tim Starling: "> Can we move the www.wikimedia.org portal to the wwwportals file (as a later thing), and still have some sort of catch all and redirect t" [operations/apache-config] - 10https://gerrit.wikimedia.org/r/91209 (owner: 10Reedy) [23:43:07] (03PS1) 10Andrew Bogott: Remove user accounts from the labstore boxes. [operations/puppet] - 10https://gerrit.wikimedia.org/r/98030 [23:43:12] apergos: ^ [23:43:19] heh [23:44:22] (03PS2) 10Andrew Bogott: Remove user accounts from the labstore boxes. [operations/puppet] - 10https://gerrit.wikimedia.org/r/98030 [23:44:27] (03CR) 10ArielGlenn: [C: 031] "yep, and let's find out from Coren how/if we get users onto those boxes, given the existing ldap accounts (shudder)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/98030 (owner: 10Andrew Bogott) [23:45:31] (03CR) 10Andrew Bogott: [C: 032] Remove user accounts from the labstore boxes. [operations/puppet] - 10https://gerrit.wikimedia.org/r/98030 (owner: 10Andrew Bogott) [23:51:17] andrewbogott: do you know how to debug git-deploy issues? [23:51:28] nope! Never used it. [23:51:35] Someday I hope to know how, though :) [23:52:05] who was it that was picking up some git-deploy maintenance responsibilities from ryan? [23:55:14] I've been looking at the code some but no testing yet [23:55:34] I think hashar and... forget who, were using it recently [23:55:55] also if I keep typing in here I will never go to bed (2 am) so.. [23:56:01] signing off, have a good rest of the day [23:56:59] ori-l: I announced that I would like to pick up those responsibilities, but I so far have not, and am also rankly unqualified for the role (so far). [23:58:14] If you want to poke at it together at some point, let me know. I know a bit about it by know, and maybe it's better than nothing.
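A note on the git-deploy question left hanging above ("fatal: Unknown commit none/master" on a new setup): with the Trebuchet/git-deploy tooling of this period, one plausible reading is that the repo has never been through a deployment, so there is no previous deploy commit for the tool to compare against. A minimal sketch of the usual first sync, assuming the repo (here mwprof, per the patches above) is already defined in puppet and checked out on the deployment host; the path and the idea that an initial start/sync clears the error are assumptions, not confirmed in this log:

    cd /srv/deployment/mwprof/mwprof   # deployment checkout; exact path is an assumption
    git deploy start                   # begin a deployment (takes the deploy lock)
    git deploy sync                    # publish the current commit to the deploy targets
    # git deploy abort                 # bail out instead if the sync goes wrong

The log itself ends without a confirmed fix, so comparing a fresh repo's state against one that already deploys cleanly is probably the safer way to debug it.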