[00:00:13] the major related fix in that history is the switch to an updated/backported jemalloc implementation, the allocator in -wm16 has definite bugs
[00:00:52] so way would cp1068 act so nicely compared to cp1065/1066?
[00:01:02] didn't get the right triggers?
[00:01:08] so, right now is when the Lightning Deploy window starts, we have 3 people lined up for it, what say you, bblack, on timing?
[00:01:09] voodoo
[00:01:21] as a rule all complex software always has bugs, it's just a question of whether you've managed to trigger them yet or not
[00:01:58] ok, please reboot cp1066 when you feel like :)
[00:02:03] greg-g: varnish restarts are fast and this should be an independent issue given lb
[00:02:33] should, yeah, and we have plenty of head room, it seems, so you feel comfortable with that going on while you diagnose/upgrade?
[00:03:12] person doing current diagnosis on production has say over LD, in my book ;)
[00:03:19] much better now : https://ganglia.wikimedia.org/latest/graph_all_periods.php?h=cp1066.eqiad.wmnet&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2&st=1389830552&g=cpu_report&z=large&c=Text%20caches%20eqiad
[00:03:24] yes
[00:03:40] superm401: ok, with that, you're up
[00:03:44] superm401: fire when ready
[00:03:57] mutante: because varnish isn't running :)
[00:04:10] i guessed that
[00:04:30] greg-g, thanks, will do.
[00:04:42] TimStarling sounds like he wants to gather more data before reboot, I assume he'll issue the reboot when he's ready
[00:05:00] yeah, just quickly checking the syslog
[00:05:06] bblack, okay, should I wait?
[00:05:07] then I'll do a shutdown -r +1
[00:05:19] it shouldn't affect deployments should it?
[00:05:23] I think paravoid has the best handle on the XFS/kernel/alloc related issues currently, he's already got some data and theories on it
[00:05:36] superm401: I don't think you need to wait
[00:06:02] syslog is full of fail
[00:06:06] :)
[00:06:11] Jan 15 23:34:15 cp1066 kernel: [15148842.004005] XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
[00:06:11] PROBLEM - Puppet freshness on rhodium is CRITICAL: Last successful Puppet run was Wed 15 Jan 2014 06:04:59 PM UTC
[00:06:50] TimStarling: dec 2: 16:33 paravoid: rebooting cp1065, usual XFS deadlock
[00:06:55] bblack, TimStarling, alright, will go.
[00:06:59] it was logging that once every 2 seconds
[00:07:44] !log rebooting cp1066 following XFS deadlock
[00:07:50] Logged the message, Master
[00:07:50] makes for a good log message at least
[00:09:53] (CR) Mattflaschen: [C: 2] Enable GuidedTour on translated languages [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/105997 (owner: Reza)
[00:10:02] (Merged) jenkins-bot: Enable GuidedTour on translated languages [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/105997 (owner: Reza)
[00:10:21] PROBLEM - Host cp1066 is DOWN: PING CRITICAL - Packet loss = 100%
[00:11:41] RECOVERY - Host cp1066 is UP: PING OK - Packet loss = 0%, RTA = 4.99 ms
[00:12:41] RECOVERY - Varnish HTTP text-frontend on cp1066 is OK: HTTP OK: HTTP/1.1 200 OK - 197 bytes in 0.000 second response time
[00:15:07] the backend one was bitten by the mmap-address thing and didn't start on its own, it's started now
[00:15:21] RECOVERY - Varnish HTTP text-backend on cp1066 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.001 second response time
[00:15:35] peace is back
[00:16:16] !log mflaschen synchronized wmf-config/InitialiseSettings.php 'Deploy GuidedTour to astwiki, fawiki, and ruwiki'
[00:16:23] Logged the message, Master
[00:17:53] greg-g: Does that ---^^ mean LDs are a go now?
[00:18:00] RoanKattouw: yeah
[00:18:06] Cool
[00:18:11] RoanKattouw: was goign to let you know when superm401 was done
[00:18:18] Let me start prepping now because Gerrit is super slow today
[00:18:29] RoanKattouw, I have a couple more things.
[00:18:30] * ^d already has commands waiting for merges & pulls
[00:18:32] The hotfix
[00:18:37] :)
[00:18:38] * ^d is waiting for his turn
[00:18:45] so many people itching at their turn ;)
[00:20:06] <^d> Better people waiting than us all saying that we started scap at the same time.
[00:20:28] ^d++
[00:20:55] (PS1) Mwalker: Qualified Lookup of OCG Class [operations/puppet] - https://gerrit.wikimedia.org/r/107742
[00:21:33] (CR) Ori.livneh: [C: 2 V: 2] Qualified Lookup of OCG Class [operations/puppet] - https://gerrit.wikimedia.org/r/107742 (owner: Mwalker)
[00:26:56] (PS1) Mwalker: Introducing the role:ocg:production [operations/puppet] - https://gerrit.wikimedia.org/r/107744
[00:27:29] (CR) Ori.livneh: [C: 2 V: 2] Introducing the role:ocg:production [operations/puppet] - https://gerrit.wikimedia.org/r/107744 (owner: Mwalker)
[00:29:04] !log mflaschen synchronized php-1.23wmf9/extensions/GettingStarted/ 'Sync GettingStarted wmf9 for hotfix'
[00:29:10] Logged the message, Master
[00:29:23] done?
[00:29:27] !log mflaschen synchronized php-1.23wmf10/extensions/GettingStarted/ 'Sync GettingStarted wmf10 for hotfix'
[00:29:33] Logged the message, Master
[00:29:37] RoanKattouw, ^d, greg-g, done
[00:30:03] superm401: confirmed all's good?
[00:30:17] greg-g, no, I'll test now.
[00:30:31] (PS6) Physikerwelt: added basic hbase support [operations/puppet/cdh4] - https://gerrit.wikimedia.org/r/99381
[00:31:06] (PS1) Mwalker: Apparently Cscott does not have shell [operations/puppet] - https://gerrit.wikimedia.org/r/107745
[00:31:51] * RoanKattouw runs git review and goes to grab a drink
[00:31:54] (CR) Ori.livneh: [C: 2 V: 2] Apparently Cscott does not have shell [operations/puppet] - https://gerrit.wikimedia.org/r/107745 (owner: Mwalker)
[00:33:06] greg-g, deployment looks good.
[00:33:48] cool
[00:37:12] <^d> RoanKattouw: I made repos faster.
[00:37:13] <^d> gc <3
[00:38:12] (PS1) Mwalker: Add standard role to rhodium [operations/puppet] - https://gerrit.wikimedia.org/r/107748
[00:38:38] (CR) Ori.livneh: [C: 2 V: 2] Add standard role to rhodium [operations/puppet] - https://gerrit.wikimedia.org/r/107748 (owner: Mwalker)
[00:39:06] Maybe less slow?
[00:39:13] git review on those two wmf/* commits still took a while
[00:39:18] And the VE repos are still slow as molasses
[00:39:38] Alright, my turn to go
[00:40:11] RECOVERY - Puppet freshness on rhodium is OK: puppet ran at Thu Jan 16 00:40:01 UTC 2014
[00:40:21] (Seriously though I have the feeling there's more going on here than just gc, fetches are suddenly much slower today than they were on Monday)
[00:41:03] Maybe it's something on the office network
[00:41:07] Cause it's fast from tin
[00:42:15] RoanKattouw: are you using ssh or https? are you possibly tunneling through another host without realizing it?
[00:42:24] ssh
[00:42:28] And I don't think so
[00:43:21] !log catrope synchronized php-1.23wmf9/extensions/VisualEditor 'Update VE for cherry-picks'
[00:43:26] Logged the message, Master
[00:43:32] ssh -v confirms it's not tunneling
[00:43:38] !log catrope synchronized php-1.23wmf10/extensions/VisualEditor 'Update VE for cherry-picks'
[00:43:44] Logged the message, Master
[00:43:56] RoanKattouw: maybe tcpdump to see where it hangs?
[00:44:02] * greg-g has to run, enjoy the deploys
[00:44:10] debug1: Connecting to gerrit.wikimedia.org [2620:0:861:3:208:80:154:81] port 29418 is where ssh -v hangs
[00:44:13] * RoanKattouw fires up wireshark
[00:44:49] Oh! Maybe it's because it's IPv6
[00:45:16] Yup
[00:45:17] It's IPv6
[00:45:39] ^d: The Gerrit slowness is not your fault, it's the office's IPv6 being broken
[00:45:57] huh, I was going to jokingly blame it, but wow
[00:45:59] <^d> +1 for wfh!
[00:47:04] <^d> RoanKattouw: You all done?
[00:47:36] ^d: Yes, sorry, go for it
[00:47:40] <^d> No worries :)
[00:47:41] Got distracted writing an email about the IPv6 breakage
[00:47:48] (CR) Chad: [C: 2] Add new CirrusSearch logs [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/107503 (owner: Chad)
[00:47:55] (Merged) jenkins-bot: Add new CirrusSearch logs [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/107503 (owner: Chad)
[00:48:59] !log demon synchronized wmf-config/InitialiseSettings.php 'Config new Cirrus logs'
[00:49:05] Logged the message, Master
[00:49:06] (PS1) Mwalker: Better (12.04) packages for OCG [operations/puppet] - https://gerrit.wikimedia.org/r/107751
[00:50:56] (PS2) Mwalker: Better (12.04) packages for OCG [operations/puppet] - https://gerrit.wikimedia.org/r/107751
[00:51:12] (CR) Ori.livneh: [C: 2 V: 2] Better (12.04) packages for OCG [operations/puppet] - https://gerrit.wikimedia.org/r/107751 (owner: Mwalker)
[00:53:11] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000
[00:53:13] ^d: so is gerrit broken?
[00:53:20] <^d> No.
[00:53:21] it still takes years to push anything
[00:53:26] (PS6) BryanDavis: Add logstash config for udp2log [operations/puppet] - https://gerrit.wikimedia.org/r/106154
[00:53:32] <^d> See what Roan said about ipv6?
[00:54:26] AaronSchulz: gerrit.wm.o times out over IPv6 from the office network
[00:54:26] So every time, it tries IPv6, times out after 127 seconds, then falls back to IPv4
[00:54:26] Breakage seems recent because it worked for me just last night
[00:54:43] And Gerrit IPv6 isn't broken because I used it from the conference wifi in Australia last week and it was fine
[00:55:43] <^d> Gerrit's had ipv6 for several months now.
[00:56:54] * RoanKattouw edits /etc/hosts to get gerrit IP to IPv4
[00:58:38] WFM at home on ipv6..
[00:59:06] RoanKattouw: 'AddressFamily inet' in the relevant ~/.ssh/config entry
[01:00:34] !log aaron synchronized php-1.23wmf10/includes/filebackend/SwiftFileBackend.php '1218cc7e9cba8342d0184523f1d7c1fec608e656'
[01:00:41] Logged the message, Master
[01:01:37] ori: https://gerrit.wikimedia.org/r/#/c/107752/ silly fix
[01:02:22] AaronSchulz: silly merge
[01:09:34] ori: Nice, thanks
[01:26:53] !log catrope synchronized php-1.23wmf9/extensions/VisualEditor/modules/ve-mw/dm/nodes/ve.dm.MWTransclusionNode.js 'touch'
[01:27:00] Logged the message, Master
[01:27:48] !log catrope synchronized php-1.23wmf10/extensions/VisualEditor/modules/ve-mw/dm/nodes/ve.dm.MWTransclusionNode.js 'touch'
[01:27:54] Logged the message, Master
[01:28:10] !log catrope synchronized php-1.23wmf10/resources/startup.js 'touch'
[01:28:16] Logged the message, Master
[01:30:21] !log catrope synchronized php-1.23wmf10/extensions/VisualEditor 'Forgot to run git submodule update earlier'
[01:30:28] Logged the message, Master
[02:19:10] Hi, I'm looking to help out with operations. Do I need to talk to someone and get permissions or is the best way to just jump into bugzilla etc?
[02:21:42] Hey there Noodles` :)
[02:21:51] Hi :)
[02:21:56] There are probably more people around here a bit earlier in the day
[02:22:22] Ok, are most people on EST?
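Note on the IPv6 fallback described at 00:54:26 above: the client tries the IPv6 address first, waits for that attempt to time out, and only then retries over IPv4. A minimal Python sketch of that try-in-order behaviour, with a short per-attempt timeout, could look like the following. It is an illustration only, not how ssh or Gerrit implement it; the names `resolve`, `connect_with_fallback`, the `connector` hook and the 3-second default are assumptions made for this example.

```python
import socket


def resolve(host, port):
    """Return [(family, sockaddr), ...] in the order the resolver gives them."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    return [(family, sockaddr) for family, _type, _proto, _canon, sockaddr in infos]


def _open_socket(family, sockaddr, timeout):
    """Default connector: one TCP connection attempt with its own timeout."""
    sock = socket.socket(family, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect(sockaddr)
    except OSError:
        sock.close()
        raise
    return sock


def connect_with_fallback(candidates, timeout=3.0, connector=_open_socket):
    """Try each candidate in order; return (family, connection) for the first success."""
    failures = []
    for family, sockaddr in candidates:
        try:
            return family, connector(family, sockaddr, timeout)
        except OSError as exc:  # covers timeouts and unreachable networks
            failures.append((sockaddr, exc))
    raise OSError(f"all {len(failures)} connection attempts failed: {failures}")
```

Filtering the candidate list down to `socket.AF_INET` entries before calling `connect_with_fallback` would mirror the effect of the `AddressFamily inet` setting suggested at 00:59:06.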
[02:22:26] One thing you can easily do without having to get permissions assigned to you is to submit changes to our puppet config in Gerrit
[02:22:40] There's documentation at https://www.mediawiki.org/wiki/Gerrit about how you can set up an account there
[02:23:47] We always welcome puppet cleanup; other than that I don't really know what's been going on in ops land recently, actual ops people should have a better idea of what tasks are open
[02:23:58] Unfortunately most of them aren't in Bugzilla, and RT requires login
[02:24:20] Ops people are ... all over really
[02:24:21] ok great, thanks. Who are the best people to talk to when they're online?
[02:24:36] matanya (not on at the moment, but frequently on) is a volunteer that has successfully seen several puppet patches through
[02:24:43] There's a bunch on PST here in SF, and a few on EST, CET and EET
[02:24:51] Ahm, that's a great question
[02:24:54] if you ping him i'm sure he could share his experiences
[02:25:07] Yeah matanya is a good person to talk to because he's already doing what you're trying to do
[02:25:33] Jeff_Green is on RT duty this week which means he's triaging stuff all the time, so he'll have a good up-to-date overview of outstanding tasks
[02:25:43] Thanks, I'll send him a message
[02:25:46] Noodles`: you can also look at and find modules / manifests that do not conform to that guide (there are many)
[02:26:05] paravoid and mark are also people that tend to know what's going on and what needs to be done
[02:26:17] (Jeff is in EST, Mark is in CET, Faidon (paravoid) is in EET)
[02:26:38] And matanya is, what, EETish?
[02:26:54] UTC+2
[02:27:01] !log LocalisationUpdate completed (1.23wmf10) at 2014-01-16 02:27:00+00:00
[02:27:07] Right, yeah that's EET
[02:27:14] I forget where he is but I think it's Israel?
[02:27:17] Logged the message, Master
[02:28:39] Thanks guys. May I ask what your roles are? Are you both volunteers?
[02:29:06] We're both WMF staff
[02:29:20] I am a developer on the VisualEditor project, and we've met before at a few Beowulfs
[02:29:40] Roan: ipv6 happy for you?>
[02:29:55] Ori has walked off into a meeting room but I can tell you he's a performance engineer
[02:30:03] cajoel: Yeah working fine now
[02:30:08] Roan: for future troubleshooting a quick trace route is gold. (mtr is a nice one..)
[02:30:15] Right, yeah, will do
[02:30:20] Is this why Ori was restarting gerrit the last few days? (possible)
[02:30:26] It took me a while to even figure out IPv6 was the problem
[02:30:33] I doubt it because it was nice and snappy just last night
[02:30:39] ok
[02:30:51] Cool, I'm CTO and run devops at Drugs.com, just looking to learn a bit more and help out
[02:30:58] Chad did GC a bunch of repos today to fight the slowness when we still thought it was server-side
[02:31:01] another fallback in urgent situations would be to disable your ipv6 stack on your machine.
[02:31:24] Noodles`: Hmm, I think I might have you confused with Noodles without the ` then
[02:31:27] want me to try ipv6 from here, I have native or was a problem with a specific location?
[02:31:28] Him I have met in person
[02:31:35] It was a problem with the office network
[02:31:38] well, I'm really sorry if it was due to the ipv6 network layer.. I'll put in some ipv6 specific performance metrics in the office.
[02:31:38] oh ok, that could get confusing
[02:32:00] to track if/when this happens again..
[02:32:12] Noodles`: You wouldn't happen to be from Northern Ireland, would you?
[02:32:16] I'll also probably be briefly breaking it again late tonight to troubleshoot with our isp
[02:33:11] no, from NZ
[02:33:29] Aha :)
[02:35:03] I'm sure there's a few "Noodles" around the place :)
[02:35:08] Yeah :)
[02:36:40] Thanks again, I'll idle here, and hopefully I'll catch someone later
[02:51:30] !log LocalisationUpdate completed (1.23wmf9) at 2014-01-16 02:51:29+00:00
[02:51:36] Logged the message, Master
[02:56:52] New format is live.
[03:00:41] !log Deleted all non-current pacct on neon; activity seems to have increased tenfold and /var/log is OOS
[03:00:47] Logged the message, Master
[03:03:11] PROBLEM - Puppet freshness on neon is CRITICAL: Last successful Puppet run was Thu 16 Jan 2014 12:02:23 AM UTC
[03:28:59] !log LocalisationUpdate ResourceLoader cache refresh completed at 2014-01-16 03:28:58+00:00
[03:29:05] Logged the message, Master
[03:55:36] (PS1) Reedy: Enable Thanks and Echo on zhwikivoyage [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/107800
[05:14:31] PROBLEM - mysqld processes on db1042 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[05:15:33] o:
[05:17:15] * Reedy blames domas
[05:19:55] (PS1) Springle: depool db1042 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/107801
[05:20:29] (CR) Springle: [C: 2] depool db1042 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/107801 (owner: Springle)
[05:20:35] (Merged) jenkins-bot: depool db1042 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/107801 (owner: Springle)
[05:21:38] !log springle synchronized wmf-config/db-eqiad.php 'depool db1042'
[05:21:44] Logged the message, Master
[05:33:31] RECOVERY - mysqld processes on db1042 is OK: PROCS OK: 1 process with command name mysqld
[05:49:26] (PS1) Springle: depool db1033 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/107803
[05:56:08] (PS10) Physikerwelt: Add Mathoid module (TeX -> MathML / SVG conversion web service) [operations/puppet] - https://gerrit.wikimedia.org/r/90733
[05:57:42] (CR) Springle: [C: 2] depool db1033 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/107803 (owner: Springle)
[05:58:52] !log springle synchronized wmf-config/db-eqiad.php 'depool db1033'
[05:58:58] Logged the message, Master
[06:00:18] (PS1) Springle: repool db1042 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/107804
[06:00:41] (CR) Springle: [C: 2] repool db1042 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/107804 (owner: Springle)
[06:00:47] (Merged) jenkins-bot: repool db1042 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/107804 (owner: Springle)
[06:01:28] !log springle synchronized wmf-config/db-eqiad.php 'repool db1042'
[06:01:34] Logged the message, Master
[06:04:11] PROBLEM - Puppet freshness on neon is CRITICAL: Last successful Puppet run was Thu 16 Jan 2014 12:02:23 AM UTC
[06:14:37] (PS1) Springle: remove db29 for decom [operations/puppet] - https://gerrit.wikimedia.org/r/107805
[06:16:10] (CR) Springle: [C: 2] remove db29 for decom [operations/puppet] - https://gerrit.wikimedia.org/r/107805 (owner: Springle)
[06:23:04] morning
[06:23:47] hi paravoid
[06:52:31] RECOVERY - LibreNMS HTTPS on netmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 6603 bytes in 0.050 second response time
[07:00:15] paravoid: Is there a protocol for list admins to reset or reminder owner password?
[07:00:35] If this requires ops, could you reset cvn@ and cvn-private@?
[07:00:48] and send to -owner (or just me)
[07:01:24] I don't do mailman, do you mind filing an RT so someone else can have a look?
[07:01:53] throwing a question to the room -- in trebuchet; why did I create a deploy repo? wouldn't it just be easier to have a deploy branch?
[07:02:45] mwalker: yes
[07:03:11] There were multiple questions.
[07:03:37] I tend to do that
[07:03:40] it's a bad habit
[07:05:28] Krinkle: afaik, its just been "file a bz"
[07:05:34] in the past
[07:05:37] filed rt
[07:06:12] Yeah, Bugzilla would do it too. I think it's just a shared password.
[07:06:14] does thehelpfulone or whoever does the MLs these days have access to whichever RT queue you put it?
[07:08:01] (PS1) Faidon Liambotis: tcpircbot: fix multiple unreferenced var errors [operations/puppet] - https://gerrit.wikimedia.org/r/107807
[07:08:04] ori: wanna have a look? :)
[07:09:02] oh fuck
[07:09:18] yeah i merged that with the intent of carefully rolling it out and then got distracted
[07:09:27] i'll have a look
[07:10:18] (CR) Ori.livneh: [C: 2] tcpircbot: fix multiple unreferenced var errors [operations/puppet] - https://gerrit.wikimedia.org/r/107807 (owner: Faidon Liambotis)
[07:10:28] paravoid: much appreciated
[07:10:42] oh yeah, i remember why i got distracted
[07:10:50] a puppet run was already in progress on neon
[07:10:57] because a puppet run is always in progress in neon
[07:11:03] *on
[07:11:07] yes, it's because naggen is broken with 2.7 :(
[07:11:27] naggen is the hack I've done for not making multiple thousand exported resources
[07:11:34] it dumps directly from the database into configs
[07:11:40] it used to do one sql query
[07:11:54] now it does thousands, due to puppet internals having changed
[07:12:09] (which is the price we're paying for using something that ties into puppet internals)
[07:12:35] I was hoping for puppetdb, but I don't see this happening soon
[07:12:40] so maybe I should just rewrite it
[07:12:42] maybe in python
[07:18:21] RECOVERY - Puppet freshness on neon is OK: puppet ran at Thu Jan 16 07:18:10 UTC 2014
[07:25:59] Krinkle: replied in RT
[07:27:43] there's another logmsgbot bug i introduced
[07:27:46] fixing
[07:28:40] it's because i snubbed Gloria
[07:28:48] now I'm paying the price
[07:32:33] (PS1) Ori.livneh: tcpircbot: Correctly JSON-serialize array values [operations/puppet] - https://gerrit.wikimedia.org/r/107809
[07:33:04] (CR) Ori.livneh: [C: 2 V: 2] tcpircbot: Correctly JSON-serialize array values [operations/puppet] - https://gerrit.wikimedia.org/r/107809 (owner: Ori.livneh)
[07:33:47] (CR) Faidon Liambotis: [C: -2] "Packaging single nagios/icinga check sounds like a huge overkill to me..." [operations/debs/check_ganglia] (debian) - https://gerrit.wikimedia.org/r/107723 (owner: Ottomata)
[07:38:38] (PS1) Mwalker: OCG Trebuchet Config and Removal of Node [operations/puppet] - https://gerrit.wikimedia.org/r/107810
[07:39:24] ^ If anyone wants to merge and push that -- it would be nifty :D
[07:41:23] (PS2) Faidon Liambotis: OCG Trebuchet Config and Removal of Node [operations/puppet] - https://gerrit.wikimedia.org/r/107810 (owner: Mwalker)
[07:41:33] (PS3) Faidon Liambotis: OCG: update trebuchet config and remove Node [operations/puppet] - https://gerrit.wikimedia.org/r/107810 (owner: Mwalker)
[07:41:40] bad ori
[07:41:50] (CR) Faidon Liambotis: [C: 2] OCG: update trebuchet config and remove Node [operations/puppet] - https://gerrit.wikimedia.org/r/107810 (owner: Mwalker)
[07:42:00] ugh
[07:42:02] fixing
[07:43:23] mwalker: merged
[07:43:31] thanks!
[07:44:33] also bad ottomata for spamming icinga so much for so long
[07:44:39] oh he's not here
[07:44:52] also paravoid, I'm in the process of testing my pristine backport of texlive 2012 -- if it works (and it doesn't break the math extension); what's the process for importing my PPAs into our local stuff?
[07:45:03] 2012?
[07:45:05] not 2013?
[07:45:42] 2012 is what shipped with stuff up until 13.10 and it looked like it had some interesting dependencies that I didn't want to touch
[07:45:53] I can try with 2013; but we've been running under 2012
[07:51:09] arsenic bast1001 fenari gallium hume kaulen lanthanum neon rhodium searchidx1001 searchidx2.pmtpa.wmnet vanadium
[07:51:20] are the boxes that have packages named "texlive", excluding all mediawiki appservers
[07:51:41] anyway
[07:51:53] I really want someone from ops to help you with all that
[07:51:56] search... interesting
[07:52:38] I thought this was Jeff, but Jeff is pushing back due to time constraints, so maybe someone else?
[07:52:52] I'm happy to work with anyone that has time
[07:53:09] (another option that was suggested on the mailing list would be to use a local FS repo of the debs that I created)
[07:53:14] no :)
[07:53:19] that's a terrible idea
[07:54:20] (PS1) Ori.livneh: tcpircbot: avoid shadowing parent class's 'channels' attribute [operations/puppet] - https://gerrit.wikimedia.org/r/107811
[07:54:49] (PS2) Ori.livneh: tcpircbot: avoid shadowing parent class's 'channels' attribute [operations/puppet] - https://gerrit.wikimedia.org/r/107811
[07:54:55] (CR) Ori.livneh: [C: 2 V: 2] tcpircbot: avoid shadowing parent class's 'channels' attribute [operations/puppet] - https://gerrit.wikimedia.org/r/107811 (owner: Ori.livneh)
[07:55:25] paravoid: -b plz
[07:55:59] thanks
[08:02:14] i need to restart it one last time
[08:03:13] I have the most terrible headache, and my clothes are missing.
[08:04:06] all done.
[08:07:32] (CR) Matanya: added basic hbase support (1 comment) [operations/puppet/cdh4] - https://gerrit.wikimedia.org/r/99381 (owner: Physikerwelt)
[08:09:59] (CR) Physikerwelt: added basic hbase support (1 comment) [operations/puppet/cdh4] - https://gerrit.wikimedia.org/r/99381 (owner: Physikerwelt)
[08:10:29] ori: thank you for yet another ops task ori :)
[08:10:55] matanya: are you looking for more TODOs? :)
[08:11:27] paravoid: i got some yesterday from mark, but i can add more to my growing list
[08:13:41] so the answer is yes, paravoid :)
[08:15:04] paravoid, matanya: there was another person on earlier today looking to get involved, Noodles`, so keep an eye out
[08:20:08] (PS7) Physikerwelt: added basic hbase support [operations/puppet/cdh4] - https://gerrit.wikimedia.org/r/99381
[08:43:02] matanya: oh? what was mark's TODOs?
[08:44:20] paravoid: make url-downloader a module, and convert all webserver::* to use the new webserver::apache and on stage 2 convert it into a module, and get rid of the other methods for calling webserver
[08:44:36] ok
[08:44:52] so the first one is done: https://gerrit.wikimedia.org/r/#/c/107590/
[08:45:03] mine was to write puppet 3 compatibility fixes
[08:45:05] mostly scoping issues
[08:45:19] now trying to understand the second atm
[08:45:58] i'm glad my company uses 3 for some time now, with no issues
[08:48:23] we're not very far away
[08:48:36] someone just needs to spend 2-3 days fixing a few things
[08:48:58] paravoid: wanna do a sprint on that ?
[08:51:00] sure, but I was hoping matanya would help first :P
[08:52:01] ok. What was the plan ?
[08:52:13] fix all the warnings that 2.x is throwing
[08:52:13] shoot paravoid
[08:52:38] paravoid: mind throwing them on etherpad?
[08:55:12] https://etherpad.wikimedia.org/p/Puppet3
[08:55:28] a lot of them are admin.pp ones, so it's really one error
[08:56:06] and most of them are trivial
[08:57:03] i'll ignore the private for obvious reason
[08:57:17] yeah :)
[08:57:52] we also have to do the same with labs for manifests that are potentially only applied in labs, but let's start with those first
[08:58:09] (PS4) Matanya: etherpad: convert into a module [operations/puppet] - https://gerrit.wikimedia.org/r/107567
[08:58:53] i hope this is what mark meant
[09:01:22] matanya: so, what do you think?
[09:02:14] paravoid: i'll do one now, and see if that is what you meant? if it is, you will merge it and if not, comment and i'll see what you did mean, works for you? :)
[09:02:37] yes!
[09:03:57] (PS1) Matanya: webserver.pp : puppet 3 compatibility fix [operations/puppet] - https://gerrit.wikimedia.org/r/107814
[09:04:06] here you go paravoid ^
[09:04:59] s/webserver.pp : /webserver:/
[09:05:04] no space before colon, no .pp
[09:05:50] also while puppet 3 compatibility fix is right, I'd personally prefer something like "fully qualify variable" or something like that
[09:05:57] (PS2) Matanya: webserver: puppet 3 compatibility fix [operations/puppet] - https://gerrit.wikimedia.org/r/107814
[09:06:10] next time
[09:07:37] ok, so if the patch itself is ok with you, you may merge :)
[09:08:26] (CR) Faidon Liambotis: [C: 2] webserver: puppet 3 compatibility fix [operations/puppet] - https://gerrit.wikimedia.org/r/107814 (owner: Matanya)
[09:08:49] so, that kind of thing
[09:08:53] some are a bit more complicated
[09:08:56] paravoid: i'll start from bottom, you from top
[09:10:09] do admins.pp next
[09:10:24] i can barely look at the file
[09:10:28] heh
[09:10:30] my eyes heart
[09:10:50] we must add a defined type there one day
[09:11:00] yeah...
[09:11:22] oh, hrm, admins.pp will be less trivial
[09:15:03] why?
[09:16:03] morning nuria
[09:16:13] holaaa
[09:18:58] (CR) Ori.livneh: [C: 2] Call updateBitsBranchPointers in multiversion/switchAllMediaWikis [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/107579 (owner: Reedy)
[09:19:22] !log ori updated /a/common to {{Gerrit|Id13e614e5}}: repool db1042
[09:19:28] Logged the message, Master
[09:30:08] paravoid: would should be done with variables not in classes?
[09:30:16] what?
[09:30:35] e.g $ganglia_url in manifests/nagios.pp
[09:31:25] that can't be used as "${classname::ganglia_url}"
[09:32:14] good question :)
[09:32:26] that would be $::ganglia_url
[09:32:32] but i dislike that
[09:32:42] well yeah, but this top-scoped variables are ugly
[09:32:48] but this could do it for puppet 3, indeed
[09:32:52] then we can do hiera maybe
[09:33:30] well if the goal is to stop the warnings and see how much compatible we are with puppet3
[09:33:42] that 'd be ok for now
[09:33:52] that is what i did at first, but i found it ugly and risky
[09:34:05] ugly yes. risky ?
[09:34:37] if someone ever uses this locally it might have scoping issues, unlikely, but might happen
[09:35:33] anyway, i'll do it now, but this just encourages me to move more stuff to normal modules.
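Note on the `$::ganglia_url` exchange at 09:30–09:32 above: the change being discussed is mechanical, rewriting an unqualified reference to a top-scope variable into its explicitly qualified form. The Python sketch below shows that rewrite on manifest text, purely as an illustration. The helper name `qualify_top_scope` and the idea of passing in a known set of top-scope names are assumptions of this example, not a tool used in the channel, and a regex pass like this does not understand Puppet syntax (comments, single-quoted strings), so its output would still need review.

```python
import re


def qualify_top_scope(manifest_text, top_scope_names):
    """Rewrite $name / ${name} to $::name / ${::name} for the given names.

    References that are already qualified (e.g. $::name or $foo::name) and
    longer variable names that merely start with one of the given names are
    left untouched.
    """
    if not top_scope_names:
        return manifest_text
    alternation = "|".join(re.escape(name) for name in sorted(top_scope_names))
    pattern = re.compile(r"\$(\{?)(" + alternation + r")\b(?!::)")
    return pattern.sub(lambda m: "$" + m.group(1) + "::" + m.group(2), manifest_text)
```

For instance, `qualify_top_scope('url => "${ganglia_url}/x"', {"ganglia_url"})` rewrites the interpolation to `${::ganglia_url}`.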
[09:35:41] :-)
[09:35:57] yes
[09:45:12] (PS1) Matanya: nagios: puppet 3 compatibility fix: fully qualify variables [operations/puppet] - https://gerrit.wikimedia.org/r/107819
[09:46:26] (PS1) Mwalker: Would help if the OCG config was in the right spot [operations/puppet] - https://gerrit.wikimedia.org/r/107820
[09:46:48] MaxSem: good morning
[09:47:00] paravoid, ori: could I get one of you two to review ^ and puppet-merge
[09:47:07] hey
[09:47:19] morning max
[09:51:00] (CR) Alexandros Kosiaris: [C: 2] "LGTM but this will need a rebase (and a manual one at that i am afraid)" [operations/puppet] - https://gerrit.wikimedia.org/r/98307 (owner: Faidon Liambotis)
[09:56:33] (PS1) Matanya: hive: puppet 3 compatibility fix: fully qualify variables [operations/puppet] - https://gerrit.wikimedia.org/r/107821
[10:04:24] (PS1) Matanya: ldap: puppet 3 compatibility fix: fully qualify variable [operations/puppet] - https://gerrit.wikimedia.org/r/107823
[10:08:25] (Abandoned) Hashar: beta: pull VisualEditor individually [operations/puppet] - https://gerrit.wikimedia.org/r/107575 (owner: Hashar)
[10:13:25] (PS2) Alexandros Kosiaris: decom hooper,eiximenis [operations/puppet] - https://gerrit.wikimedia.org/r/107159
[10:13:35] gah, always think BZ is all done and then more requests to modify https://gerrit.wikimedia.org/r/#/q/status:open+project:wikimedia/bugzilla/modifications,n,z :p
[10:14:33] mutante: ahaha. Those are quite a few...
[10:14:34] re: backlog sumanah already got her RT pass reset by me, no need to do it again
[10:15:42] akosiaris: also without that my gerrit queue has never been so full :) (lintint campaigns etc)
[10:15:48] linting
[10:16:29] Ah yes, I have started using jon robson's gerrit.py thingy. I find it more useful that the gerrit interface
[10:16:30] but should process them by age and get the ancient ones out first:)
[10:16:49] akosiaris: ah, didn't know, might take a look
[10:17:15] it does exactly what you said... lists them by score and then age
[10:17:23] cool
[10:17:26] all in all 25 in operations/puppet
[10:17:58] better than before the end-of-year cleanup:)
[10:18:11] akosiaris: how many of them are mine? :)
[10:19:11] matanya: 11
[10:19:28] ok, i'll you a few more :)
[10:19:32] matanya: ever checked stats on https://www.ohloh.net/ for wmf projects?
[10:19:33] *add
[10:20:55] (CR) Guido.iaquinti: [C: 1] decom hooper,eiximenis [operations/puppet] - https://gerrit.wikimedia.org/r/107159 (owner: Alexandros Kosiaris)
[10:21:10] do you think it could work to try and be smart about detecting who should be added as a reviewer automatically?
[10:21:22] maybe based on who edited the touched files in the past the most
[10:21:45] or rules list for module->likely_reviewers
[10:22:05] good luck not adding the people who have left
[10:22:24] heh, yea
[10:22:31] mutante: rank 5
[10:22:40] matanya: :)
[10:22:50] whatever that means
[10:22:58] it might mess up stats that gerrit username changes
[10:23:05] (CR) Alexandros Kosiaris: [C: 2] decom hooper,eiximenis [operations/puppet] - https://gerrit.wikimedia.org/r/107159 (owner: Alexandros Kosiaris)
[10:31:20] (PS1) Matanya: smokeping: puppet 3 compatibility fix: fully qualify variable + minor lint [operations/puppet] - https://gerrit.wikimedia.org/r/107825
[10:36:06] akosiaris: do you want to be on the puppet3 patches?
[10:38:36] meaning ? as a reviewer ?
[10:38:47] (PS3) Dzahn: add salt grains for applicationserver roles [operations/puppet] - https://gerrit.wikimedia.org/r/83768
[10:39:09] feel free to add me.
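Note on the reviewer-suggestion idea floated at 10:21:10–10:22:05 above (rank people by how often they edited the touched files, without adding people who have left): a minimal Python sketch of that ranking is below. It is only an illustration of the idea as stated in the channel. The function name `suggest_reviewers`, the shape of `file_history` (path mapped to a list of past commit authors), and the `active` set used to drop departed contributors are all assumptions of this example, and nothing here talks to Gerrit or git.

```python
from collections import Counter


def suggest_reviewers(touched_files, file_history, active, owner=None, limit=3):
    """Rank likely reviewers by how often they edited the touched files.

    touched_files: paths changed by the patch
    file_history:  dict mapping path -> list of past commit authors
    active:        set of people still around (everyone else is skipped)
    owner:         patch owner, never suggested as their own reviewer
    """
    counts = Counter()
    for path in touched_files:
        for author in file_history.get(path, []):
            if author in active and author != owner:
                counts[author] += 1
    return [name for name, _count in counts.most_common(limit)]
```

The other option mentioned at 10:21:45, a static module-to-reviewers rules list, would replace `file_history` with a hand-maintained mapping and needs the same filtering against people who have left.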
[10:39:24] still has tabs as well, geee [10:41:06] (PS1) Matanya: geoip: puppet 3 compatibility fix: fully qualify variable [operations/puppet] - https://gerrit.wikimedia.org/r/107826 [10:41:14] (PS4) Dzahn: add salt grains for applicationserver roles [operations/puppet] - https://gerrit.wikimedia.org/r/83768 [10:44:25] matanya: aha, thanks for the citation [10:44:28] Separate file resources (puppetlabs' style guide 9.4) [10:44:40] so that's why you split them all [10:45:55] heh, yea, the guides, you could probably find reviews where people say the opposite, "why don't you merge this" :) [10:46:02] yeah, i remembered it exist, but didn't remmeber where, so i used {{citation needed}}, and an editor added it :) [10:57:44] (PS1) Matanya: udp2log: puppet 3 compatibility fix: fully qualify variable [operations/puppet] - https://gerrit.wikimedia.org/r/107828 [11:06:57] (CR) Faidon Liambotis: [C: 2] Would help if the OCG config was in the right spot [operations/puppet] - https://gerrit.wikimedia.org/r/107820 (owner: Mwalker) [11:09:08] matanya: not all of your changes are about fully qualifying variables [11:09:45] right, some are modulepath things [11:09:54] wrong commit message [11:10:11] i'm not sure if I want to follow that style guide thing about separating all the file resources or not ..
[11:10:25] thanks paravoid :) [11:11:41] i'm all in favor of using it mutante [11:11:45] *for [11:16:35] (PS1) Mwalker: More OCG path fun [operations/puppet] - https://gerrit.wikimedia.org/r/107830 [11:17:09] ^ I think I'm going to just let these build up [11:17:58] (CR) Faidon Liambotis: [C: 2] More OCG path fun [operations/puppet] - https://gerrit.wikimedia.org/r/107830 (owner: Mwalker) [11:22:30] (PS6) Dzahn: turn wikistats into module [operations/puppet] - https://gerrit.wikimedia.org/r/94409 [11:30:13] (PS1) Dzahn: add salt grains automatically in system::role [operations/puppet] - https://gerrit.wikimedia.org/r/107831 [11:31:12] (CR) Dzahn: "this one back to the original suggestion and rebased. to reply to paravoid's comment, see https://gerrit.wikimedia.org/r/#/c/107831/1 as w" [operations/puppet] - https://gerrit.wikimedia.org/r/83768 (owner: Dzahn) [11:33:38] (CR) Alexandros Kosiaris: [C: 2] Remove hooper, eiximenis [operations/dns] - https://gerrit.wikimedia.org/r/106223 (owner: Alexandros Kosiaris) [11:43:22] (CR) Dzahn: [C: 1] "lgtm,++Ariel" [operations/puppet] - https://gerrit.wikimedia.org/r/107341 (owner: Matanya) [11:51:31] (CR) Dzahn: [C: -1] gitblit: convert into a module (3 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/107555 (owner: Matanya) [11:53:47] (CR) Dzahn: gitblit: convert into a module (2 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/107555 (owner: Matanya) [11:55:09] !log fixed the cp1065 puppet freshness check constantly bugging us and reenabled notifications after that [11:55:16] Logged the message, Master [11:56:07] paravoid: before I go to bed; Erik was wanting some sort of status update from me tomorrow on how much further work it's going to be on op's front to get ocg deployed -- I was hoping you might be able to join us briefly for our standup at 1730 UTC (0930 PST)?
[11:56:31] (CR) Dzahn: [C: -1] "setting -1 for now because it shouldn't be merged in this form, but see comments above" [operations/puppet] - https://gerrit.wikimedia.org/r/107424 (owner: Gage) [11:57:57] *and by tomorrow I mean... in 5 hours [12:00:32] (CR) Dzahn: "could you just mv/rename copy-by-url-proxy.conf instead of deleting it in old place and re-adding it?" [operations/puppet] - https://gerrit.wikimedia.org/r/107590 (owner: Matanya) [12:02:53] (PS4) Matanya: gitblit: convert into a module [operations/puppet] - https://gerrit.wikimedia.org/r/107555 [12:04:36] (CR) Dzahn: [C: 1] apt: lint clean [operations/puppet] - https://gerrit.wikimedia.org/r/107358 (owner: Matanya) [12:04:45] (CR) Matanya: "it is not the same file, it was retabed, and cleared a bit, so git decided it was a new file." [operations/puppet] - https://gerrit.wikimedia.org/r/107590 (owner: Matanya) [12:05:29] (CR) Alexandros Kosiaris: [C: 2] download: lint clean [operations/puppet] - https://gerrit.wikimedia.org/r/107341 (owner: Matanya) [12:05:48] (CR) Alexandros Kosiaris: [V: 2] download: lint clean [operations/puppet] - https://gerrit.wikimedia.org/r/107341 (owner: Matanya) [12:11:48] (CR) Dzahn: "uhm, looks ok without checking every single comma though and only if we actually agree on that style recommendation to separate all the re" [operations/puppet] - https://gerrit.wikimedia.org/r/107355 (owner: Matanya) [12:13:59] (CR) Dzahn: "hmm, in that case a separate patch that cleans that up may be appropriate so people don't have to read 4k lines" [operations/puppet] - https://gerrit.wikimedia.org/r/107590 (owner: Matanya) [12:14:20] (CR) Matanya: "ok, i'll revert that."
[operations/puppet] - https://gerrit.wikimedia.org/r/107590 (owner: Matanya) [12:17:37] (CR) Dzahn: "this should be tested in labs" [operations/puppet] - https://gerrit.wikimedia.org/r/107567 (owner: Matanya) [12:20:32] (CR) Dzahn: [C: 2] cpufrequtils: lint clean [operations/puppet] - https://gerrit.wikimedia.org/r/107346 (owner: Matanya) [12:21:49] (PS1) Alexandros Kosiaris: Replace hooper's mgmt entries with asset tag [operations/dns] - https://gerrit.wikimedia.org/r/107838 [12:29:01] (PS2) Matanya: url-downloader: convert into a module [operations/puppet] - https://gerrit.wikimedia.org/r/107590 [12:31:23] (CR) Dzahn: [C: -1] svn: convert into a module (2 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/100760 (owner: Matanya) [12:31:31] (CR) Dzahn: svn: convert into a module (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/100760 (owner: Matanya) [12:37:58] (CR) Dzahn: [C: 1] "is this going to stay "ganglia_new" or be renamed after it replaced ganglia(old)?" [operations/puppet] - https://gerrit.wikimedia.org/r/107128 (owner: Matanya) [12:38:39] (CR) Alexandros Kosiaris: "We could extend system::role by adding an extra optional parameter (saltgrain_value ???) and if it is undef to generate the value from rol" [operations/puppet] - https://gerrit.wikimedia.org/r/107831 (owner: Dzahn) [12:39:27] (CR) Alexandros Kosiaris: [C: 2] "So unless https://gerrit.wikimedia.org/r/#/c/107831/ moves on fast with a nice proposal, I 'd say merge."
[operations/puppet] - https://gerrit.wikimedia.org/r/83768 (owner: Dzahn) [12:40:32] (PS2) Alexandros Kosiaris: Minor scap comment tweak [operations/puppet] - https://gerrit.wikimedia.org/r/107738 (owner: Aaron Schulz) [12:40:48] (CR) Dzahn: [C: 1] "thanks, much easier to review now" [operations/puppet] - https://gerrit.wikimedia.org/r/107590 (owner: Matanya) [12:40:53] (CR) Alexandros Kosiaris: [C: 2 V: 2] Minor scap comment tweak [operations/puppet] - https://gerrit.wikimedia.org/r/107738 (owner: Aaron Schulz) [12:49:53] (CR) Alexandros Kosiaris: [C: -1] smokeping: puppet 3 compatibility fix: fully qualify variable + minor lint (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/107825 (owner: Matanya) [12:52:12] (CR) Matanya: smokeping: puppet 3 compatibility fix: fully qualify variable + minor lint (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/107825 (owner: Matanya) [12:56:11] (CR) Alexandros Kosiaris: [C: 2] Replace hooper's mgmt entries with asset tag [operations/dns] - https://gerrit.wikimedia.org/r/107838 (owner: Alexandros Kosiaris) [12:58:53] (PS10) Matanya: svn: convert into a module [operations/puppet] - https://gerrit.wikimedia.org/r/100760 [13:01:41] (PS2) Matanya: smokeping: puppet 3 compatibility fix: module path [operations/puppet] - https://gerrit.wikimedia.org/r/107825 [13:02:00] yay, a review party :) [13:04:23] (PS2) Matanya: geoip: puppet 3 compatibility fix: module path [operations/puppet] - https://gerrit.wikimedia.org/r/107826 [13:05:49] PROBLEM - Host hooper is DOWN: PING CRITICAL - Packet loss = 100% [13:06:19] (CR) Alexandros Kosiaris: [C: 1] smokeping: puppet 3 compatibility fix: module path [operations/puppet] - https://gerrit.wikimedia.org/r/107825 (owner: Matanya) [13:06:36] $module_name should work [13:06:46] so just do modules/$module_name/ [13:07:10] btw [13:07:20] you prefer ${module_name} or not ?
[13:07:26] mark does [13:07:30] I'm indifferent [13:07:39] that answers it [13:07:45] so module_name it is :-) [13:07:53] ${module_name} it is :) [13:10:33] i don't like it, but you ask for it, i'll do it [13:10:38] do you ask for it? [13:12:40] if we have consensus on stuff that is WMF-style but differs from the style guided you quoted, it should be added to our own puppet wikitech page [13:13:44] forking from offical style is that overhead though [13:16:39] i don't think that is stated in the style guide [13:16:47] this specific case i mean [13:18:16] eh, maybe yea, but we just another one earlier [13:18:19] it is not, but splitting resources is [13:18:26] about splitting up all the file{} resources [13:18:36] just one per file and never combining them.. shrug [13:18:44] which i supprt and mutante doesn't :) [13:18:56] matanya: no, i said i can't decide :p [13:18:59] anyone who has a full checkout of mediawiki/extensions and can tell me how bit it is? [13:19:11] 1 bit [13:19:27] that was useful [13:19:34] can someone tell me how big it is? :) [13:19:57] YuviPanda: yes I can [13:20:12] Nemo_bis: sweet! can you tell me the size of it? [13:20:24] YuviPanda: du running [13:20:40] but I also download ref/notes or whatever the name and I never compressed it or whatever the name [13:20:55] it's on toolserver, /mnt/user-store/git and world-readable [13:21:21] Nemo_bis: ah, hmm [13:21:26] I should probably put one on toollabs [13:21:37] but I want it to be fast, and NFS... [13:21:40] I think it was about 4 GB in October [13:21:56] hmm, that's not too bad [13:22:03] du will take a bit [13:22:12] 'tis fine, thanks for running it, Nemo_bis [13:22:39] YuviPanda: if you gc it once a week, it will be ok [13:24:39] hmm [13:27:29] hmm, is there a list of all deployed extensions somewhere? 
[13:27:43] * YuviPanda looks around http://noc.wikimedia.org/conf/ [13:29:12] hmm, there's no such thing [13:29:20] other than evaluating CommonSettings.php [13:38:39] YuviPanda: it's in mediawiki/tools [13:38:45] oh [13:38:48] * YuviPanda looks [13:39:06] hmm, https://github.com/wikimedia/mediawiki-tools [13:39:08] iks empty [13:39:20] ori: do you happen to know why https://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Miscellaneous%20pmtpa&h=hume.wikimedia.org&v=823574&m=Global_JobQueue_length&r=hour&z=default&jr=&js=&st=1365625056&z=large has been missing data for the last 3 days [13:39:23] * YuviPanda looks for things under it [13:40:00] Nemo_bis: ah, thank you! [13:40:51] https://git.wikimedia.org/blob/mediawiki%2Ftools%2Frelease.git/HEAD/make-wmf-branch%2Fdefault.conf [13:40:56] https://git.wikimedia.org/blob/operations%2Fmediawiki-config.git/HEAD/wmf-config%2Fextension-list [13:41:01] https://www.mediawiki.org/wiki/Category:Extensions_used_on_Wikimedia [13:41:12] hmm, second one's useful [13:41:24] Nemo_bis: still, I guess there's no way for me to figure out which extensions are deployed where [13:41:37] UploadWizard, for example, is only on a few wikis [13:41:40] same with ShortUrl [13:41:54] I guess I'll have to spam Special:Version? [13:41:58] * YuviPanda looks for an API call [13:42:24] woo that works! [13:51:53] YuviPanda: mediawiki/tools is just a container, and check on gerrit not github:) [13:52:04] mutante: yeah, did :) [13:52:05] start typing mediawiki/tools into project filter [13:52:14] and it autocompletes all the ones inside [13:52:18] mutante: still doesn't have what I want, which is list of extensions per wiki [13:52:32] mutante: is there a way to get mapping of dbname to wiki URL? [13:52:45] I'm starting to explore building greg-g's deployment dashboard [13:53:26] YuviPanda: extensions per wiki? you want to know which wikis a specific extensions is installed on? 
[13:53:42] mutante: no, I want list of extensions installed on a wiki for every wiki [13:53:50] just extensions on a single wiki is Special:Version [13:54:11] YuviPanda: you can ask the API [13:54:16] mutante: yeah, I am going to do that [13:54:23] mutante: but then I also need to find list of all wikis... [13:54:23] i tried that once in wikistats [13:54:25] of all projects [13:54:27] oh? [13:54:27] and? [13:54:35] like when i get number of pages, users etc, also get list of extensions [13:54:40] then store it in db [13:54:51] it was used on mw.org on extension pages [13:55:02] as an external link marked experimental [13:55:14] and it broke and the guy who wrote that script isnt active anymore [13:55:40] though i could still find the source somewhere [13:55:51] hmm [13:55:59] that was not just WMF [13:56:02] but ALL mw :p [13:56:30] hahaha [13:56:34] that... seems extensive [13:56:46] well we have thousands of URLS of mw installs on the net [13:56:50] in a database [13:57:12] we do? [13:57:24] yea, wikistats.wmflabs.org [13:57:32] not to be confused with ezachte stats.wm.org [13:57:47] heh [13:57:55] one of the sources is also that team of archive.org wiki people [13:58:28] and it ran on non-wmf server before, it has been given to wmf [13:58:48] Nemo_bis: ^^ [13:58:57] yea :) [13:59:21] i think there is an open bug where i can import another 2k [13:59:34] Yes. [13:59:35] nice! 
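(Editor's note: the "ask the API" approach discussed above can be sketched roughly as follows. This assumes the standard MediaWiki siteinfo query with siprop=extensions; the helper names, the example api.php URL, and the sample response are illustrative only and were not taken from the conversation or from a live wiki.)

```python
import json
from urllib.parse import urlencode


def siteinfo_extensions_url(api_base):
    """Build a siteinfo query URL asking one wiki for its installed extensions."""
    params = {
        "action": "query",
        "meta": "siteinfo",
        "siprop": "extensions",
        "format": "json",
    }
    return api_base + "?" + urlencode(params)


def extension_names(response_text):
    """Return the sorted extension names found in a siteinfo JSON response."""
    data = json.loads(response_text)
    return sorted(ext["name"] for ext in data["query"]["extensions"])


# Hand-written stand-in for a real response (hypothetical data).
SAMPLE = json.dumps({"query": {"extensions": [
    {"type": "other", "name": "UploadWizard"},
    {"type": "specialpage", "name": "ShortUrl"},
]}})

if __name__ == "__main__":
    print(siteinfo_extensions_url("https://en.wikipedia.org/w/api.php"))
    print(extension_names(SAMPLE))
```

(Running the same query against each wiki's api.php would give a per-wiki extension list; the list of wikis itself still has to come from somewhere else, as noted in the discussion.)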
[14:00:43] https://gerrit.wikimedia.org/r/#/c/94409/ [14:00:53] that's that thing i was just working on [14:01:02] nowadays that's done by wikiapiary [14:01:24] https://wikiapiary.com/wiki/Extension:Main_Page [14:01:59] Nemo_bis: is that used in the Extenseion pages on mediawiki.org [14:02:06] (instead of the old s23.org link) [14:02:29] the one that always had the "experimental" label and was a link on every single extension page, from template [14:03:02] because it should be either that or if we do it again as well, it should link to labs, but not to the old server [14:03:50] mutante: the old wikistats link was disabled years ago because it stopped working, some time ago it's been reused/replaced to link wikiapiary [14:03:52] ah, it is, see the link labeled "Check usage and version matrix" [14:03:57] yes [14:03:58] on any Extension page [14:04:05] gotcha [14:04:29] yea, i met the guy who wrote the original link 2 weeks ago, but first time after many years [14:04:32] heh [14:05:04] YuviPanda: http://lists.thingelstad.com/pipermail/wikiapiary-l/2014-January/000066.html [14:05:20] mutante: looking into it [14:05:37] thanks:) [14:06:16] YuviPanda: and probably you can contribute to it directly at http://lists.thingelstad.com/pipermail/wikiapiary-l/2013-November/000046.html [14:07:02] Nemo_bis: hmm, what I'm doing is something distinctly different [14:07:03] I'm still affectionate to wikistats and I think it's more suitable for huge sets of wikis and a few other things, but extensions tracking is something wikiapiary does well so it's probably better to concentrate efforts there [14:07:11] oh [14:07:21] the idea being 'I have a change merged in gerrit, now when will this get deployed to which wikis?' [14:07:37] and "this change was merged, where all is this deployed?" 
[14:07:39] it's great to hear but also kind of said for me if wikiapiary really should replace it completely some time, but just because so much time went into it over the years [14:07:40] and things of that sort [14:07:45] s/said/sad [14:07:45] very WMF centric [14:08:09] mutante> well we have thousands of URLS of mw installs on the net <-- well, so few actually, only a handful thousands independent mediawikis when there are tens; and nobody helps running my scraper even [14:08:11] and the idea was to give it to wmf labs vs. 3rd party [14:08:18] license-wise it was [14:08:41] Nemo_bis: where is it running? [14:08:46] mutante: linode [14:09:01] or do you mean the scraper? [14:09:07] Nemo_bis: if it was labs it'd be easier to get people to help [14:09:09] ? [14:09:13] yea [14:09:21] I doubt it's legal for labs [14:09:38] Nemo_bis: you know, i have been thinking about another approach to find them [14:09:46] I run it from home, it's not hard; I just need more IPs aka home computers [14:09:59] mutante: share please :) [14:10:15] Nemo_bis: suggest to add a feature in the installer, that people can opt-in to report their new installation, kind of like Debian popularity contest [14:10:28] so it asks if you want to be in stats or not on install once [14:10:39] and then tells you about a new install out there on the net some way [14:11:01] just once, getting the stats would still be pulled like before [14:11:11] mutante: there's already an RfC open for that [14:12:14] mutante: https://www.mediawiki.org/wiki/Requests_for_comment/Opt-in_site_registration_during_installation , please watchlist :) [14:12:15] Nemo_bis: cool! 
maybe i already created it as BZ enhancement but i forget ..:p [14:12:25] didn't know about RFC, nice [14:12:30] will do [14:12:46] there are a lot of bugs filed already [14:12:49] oh from hexmode, then that's because i talked to him about this before i think:) [14:13:18] possible, I also asked such a thing a year ago or more [14:13:34] I doubt it will ever happen though :) better start working on your own solutions [14:13:41] i never expected it to be so detailed already [14:13:58] I just need some people to run a simple ruby script which sleeps 99 % of the time [14:14:05] wow, pingserver etc [14:14:12] YuviPanda: 5.2 GB so far [14:14:20] oh hmm [14:14:36] Nemo_bis: interesting. I cloned it on toollabs, the clone finished, and du tells me 1.6G [14:14:48] YuviPanda: I told you, there's a lot of additional junk in it [14:14:53] right [14:14:54] right [14:16:41] Nemo_bis: is the scraper script somewhere on git? [14:16:59] what is 5.2 GB so far? [14:17:25] mutante: 5.2 GB is the clone of extensions [14:17:33] ah, i see [14:18:15] mutante: and the scraper is very stupid: script is https://gist.github.com/nemobis/7718061 and the command a simple loop http://p.defau.lt/?nM_NqaATaDXRW8y53bu7tQ [14:19:21] then the list of (mostly junk) URLs was checked with a slightly less stupid script (by emijrp) http://code.google.com/p/wikiteam/source/browse/trunk/listsofwikis/checkalive.py [14:19:32] Nemo_bis: thanks, first thing when i see the user_agent, i remember changing that back and forth because there were always more or less special cases that actually cared about it before they wanted to talk [14:19:44] which is useless for all the wikis below 1.17 which are the vast majority [14:20:15] mutante: in fact it doesn't work if you change it :) or not always [14:20:24] puts "We got a 503, party is over" *g* [14:20:41] that script only talks to Google, the checkalive script checks the wikis [14:20:51] Google had plenty of wikis indexed which are long dead [14:21:01] yea, i realize 
it gets google result, not actually pinging the internet [14:21:40] i wonder if it really is against any labs rules [14:21:56] the first part shouldn't , right [14:23:06] scraping google is probably against google tos [14:23:15] didn't even bother to check actually [14:23:34] (CR) Alexandros Kosiaris: [C: -1] "I am kind of puzzled by the mariadb/mysql talk. Is it or not required?" (6 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/94409 (owner: Dzahn) [14:23:38] because the API they provide only gives 100 results per query, useless [14:23:49] so I have no alternatives to scraping, not my fault [14:24:18] so you search for Brion to find the wikis, also nice idea :) [14:24:45] under "# Other queries to use:" [14:25:10] yes, though that one doesn't work much for whatever reason [14:25:33] I suspect Google skips Special:version because it sees it as a duplicate page copied by/from thousands wikis [14:25:39] * thousands sites [14:29:17] Nemo_bis: or it skips it because it list the MediaWiki version which would let people find vulnerable setups easily :d [14:29:28] or maybe we instruct bots to not crawl it [14:29:51] hashar: it is indexed, just not often [14:30:07] and no there's no nofollow or robots.txt I could find [14:30:50] hashar: we already have sortable tables somewhere, cough, where you can sort by oldest first .. [14:31:10] yep, there are some nice 1.5 wikis out there [14:31:41] it's hard to find an answer what the best way is [14:31:53] we even had a small campaign of mailing admins with the oldest wikis [14:32:28] that was me :P [14:32:34] I sent few thousands emails [14:32:43] you need volunteers who walk them through an upgrade of a dozen major versions :p [14:33:03] they turn into long #mediawiki support talk then [14:33:41] Nemo_bis: thousands?
wow, then there even multiple campaigns, they cant complain [14:33:48] i know hexmode did as well [14:35:54] he sent a few dozens AFAIK [14:36:13] I just took the full list of known mediawikis and emailed them all [14:38:00] !jenkins mwext-VisualEditor-sync-gerrit [14:38:00] https://integration.wikimedia.org/ci/job/ [14:38:04] come on [14:38:11] !jenkins mwext-VisualEditor-sync-gerrit [14:38:11] https://integration.wikimedia.org/ci/job/mwext-VisualEditor-sync-gerrit [14:40:39] (PS7) Dzahn: turn wikistats into module [operations/puppet] - https://gerrit.wikimedia.org/r/94409 [14:41:03] (CR) Dzahn: turn wikistats into module (6 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/94409 (owner: Dzahn) [14:44:07] (CR) Dzahn: "Alex, re: mariadb/mysql. it's complicated (tm):). once, in the early days of labs, i wrote a MariaDB class, it was back in the "test" bran" [operations/puppet] - https://gerrit.wikimedia.org/r/94409 (owner: Dzahn) [14:44:53] (CR) Ottomata: "It is a python module with multiple files. Should I add them all to puppet?" [operations/debs/check_ganglia] (debian) - https://gerrit.wikimedia.org/r/107723 (owner: Ottomata) [14:45:05] paravoid ^ [14:45:10] Nemo_bis: how was the reply rate? [15:04:29] (PS1) Faidon Liambotis: (WIP) Refactor admins.pp into an admin module [operations/puppet] - https://gerrit.wikimedia.org/r/107848 [15:05:08] (CR) jenkins-bot: [V: -1] (WIP) Refactor admins.pp into an admin module [operations/puppet] - https://gerrit.wikimedia.org/r/107848 (owner: Faidon Liambotis) [15:06:27] (PS2) Faidon Liambotis: (WIP) Completely overhaul admins.pp & modularize [operations/puppet] - https://gerrit.wikimedia.org/r/107848 [15:06:40] paravoid: git review -D [15:06:51] what's that? [15:06:53] I don't use git review [15:07:16] really ? you push straight to refs/for/production ? [15:07:19] yes [15:07:41] I prefer git review.
it rebases and submits on its own [15:07:56] mutante: I got a few dozens replies, can't remember exactly [15:08:10] enough to keep me busy for several days replying to emails [15:08:14] you do git review -d number and get the changeset. and can also publish Draft changesets (hence -D) [15:08:44] Nemo_bis: i see, thanks, yea, even if the percentage is low with that amount it scales of course :p nice work! [15:09:03] anyway you can push to refs/drafts/production [15:09:41] others wont ever see the change until you push it to refs/for/production or press the publish button [15:10:03] why would I do that? [15:10:07] I want others to see the change [15:10:30] well in case it is WIP and not ready for review ? [15:10:40] mutante: btw, given that user who asked Wikia stats to be updated, IMHO you should just restore them and tell wikia you did so at their users' request [15:10:47] ottomata: you're spamming nagios way too much, and I actively look at it every day [15:10:50] (CR) Jeremyb: "rebump. this status quo actively redirecting people to broken URLs." [operations/apache-config] - https://gerrit.wikimedia.org/r/106107 (owner: Jeremyb) [15:10:58] ottomata: can you do your experiments on only one host? it's been weeks now [15:11:34] i'm spamming? [15:11:43] sorry in standup, will discuss in a few mins [15:12:03] https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?hosts=all&style=hostservicedetail&hoststatustypes=12&hostprops=2097162&servicestatustypes=28&serviceprops=2097162&nostatusheader [15:12:31] Nemo_bis: that's a Bugzilla where Robert Hanke asks for that i suppose? that same comment there would be appreciated, i really need to clean those up some time [15:12:39] oof, i know, this is ganglios, i should just turn it off [15:12:47] 16/24 alerts is you [15:14:01] mutante: no, I mean https://bugzilla.wikimedia.org/show_bug.cgi?id=59943 [15:15:34] (CR) Matanya: "one minor comment. A layout question: what is the benefit from splitting init.pp and files.pp ?
i think it would be better to put them all" (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/96413 (owner: Dzahn) [15:16:37] Nemo_bis: thanks! i didnt even read that yet..umpf, just overload. yea, you're reply is correct, it was just stopped, and we also never had a good way to sync the working wikia wikis, and the plan was always to ask Wikia if they could theoretically just provide ALL the stats from their DB without making us do 40k HTTP requests [15:17:09] it seemed to extreme to update them even once in 24hrs [15:18:22] RECOVERY - LDAPS on virt1000 is OK: TCP OK - 0.002 second response time on port 636 [15:18:40] paravoid: The opendj error log said 'JE Database Environment corresponding to backend id userRoot is corrupt. Restart the Directory Server to reopen the Environment' [15:18:42] So I did. [15:18:52] RECOVERY - LDAP on virt1000 is OK: TCP OK - 0.000 second response time on port 389 [15:19:34] (CR) Faidon Liambotis: [C: 2] clean up chapters redirects [operations/apache-config] - https://gerrit.wikimedia.org/r/106107 (owner: Jeremyb) [15:19:47] andrewbogott: lol [15:20:01] andrewbogott: I asked before about it and I got the reply that you and mike are working on it so it's expected [15:20:12] andrewbogott: so I didn't even try fixing it [15:20:23] obviously some kind of misunderstanding, sorry [15:20:48] Maybe my fault. I don't think of virt1000 as being currently in use... [15:20:59] But I guess it's different from the rest of the eqiad virt cluster that was just sitting idle. [15:21:16] At least where ldap/dns are concerned. [15:21:30] yeah [15:21:39] how's the labs migration going btw?
[15:21:40] (PS1) Cmjohnson: removing remenants of db29 from puppet, adding to decom.pp [operations/puppet] - https://gerrit.wikimedia.org/r/107852 [15:21:48] mutante: I think API exists for a reason, if you still have to ship data by snail mail you could as well shut down any API; updating once a month is ok, but not using API is just begging for problems (in fact, nothing happened for 2 years) [15:22:14] paravoid: Some parts of it are coming along well, but mhoover was sick w/the flu for about 10 days so… no real action on the puppet/OpenStack front. [15:22:30] (PS1) Hashar: contint: invoke gerrit-sync-ve-push.sh as jenkins [operations/puppet] - https://gerrit.wikimedia.org/r/107853 [15:23:12] andrewbogott: hello :-]  Mind merging above change for contint ? Grant a sudo right to let slaves run a script with elevated privileges (jenkins user, not root hehe) [15:23:21] Nemo_bis: fair, but still needs a meta api that tells you the list of existing wikis, new ones added, closed ones removed [15:23:29] hashar: reading... [15:23:42] Nemo_bis:if each "farm" had that, would be awesome [15:23:59] mutante: there is an API to list existing wikis; that can be updated once a year IMHO [15:24:11] Nemo_bis: on wikia specifically? [15:24:23] yes, I had already given you the link IIRC [15:24:37] just check api.php on the community.wikia.com [15:24:48] Nemo_bis: thank you [15:25:14] (CR) Faidon Liambotis: [C: -1] "This doesn't work. You need to do scope.lookupvar to lookup variables outside the current scope in .erbs." [operations/puppet] - https://gerrit.wikimedia.org/r/107828 (owner: Matanya) [15:25:17] i keep forgetting all this over regular work [15:25:24] normal :) [15:25:28] hashar, I don't immediately know how to read this sudo_user class. What does the 'ALL' mean? [15:25:40] …since you also define a set of commands and a user, ALL what?
[15:25:40] it's a pity that moving the service to labs made it so much more complex :/ [15:26:13] andrewbogott: btw, I just added you as a reviewer to the admins.pp refactor, I thought you'd be interested [15:26:15] andrewbogott: all hosts maybe? I am not sure, I basically copy pasted the line for postgres [15:26:28] paravoid: I am! But won't get to it tonight. [15:26:32] Nemo_bis: hmm, yea, but "hard" still > "impossible" (to do social coding on it) [15:26:33] no worries [15:27:15] andrewbogott: I think ALL refers to the hostname [15:27:51] hashar: OK. It's consistent, at least, with how it's handled elsewhere. [15:27:55] hashar: any idea why jenkins try to so while runnning gerrit-sync-ve-push ??? [15:28:05] to sudo* [15:28:23] ah yeah the context [15:28:30] so the jobs are running with the user 'jenkins-slave' [15:28:46] (CR) Andrew Bogott: [C: 2] contint: invoke gerrit-sync-ve-push.sh as jenkins [operations/puppet] - https://gerrit.wikimedia.org/r/107853 (owner: Hashar) [15:28:47] I have a job that needs to be able to push a reference to Gerrit using the Gerrit 'jenkins-bot' user [15:29:05] the credentials (SSH private key) for that user is under the jenkins user (not jenkins-slave) [15:29:13] (PS2) Cmjohnson: removing remenants of db29 from puppet, adding to decom.pp [operations/puppet] - https://gerrit.wikimedia.org/r/107852 [15:29:15] so I have to run the ssh commands as jenkins to get the credentials [15:29:27] which is all crazy [15:31:34] mutante: well, I men hard for you and Robert; previusly it worked smoothly enough, though with not many additions [15:31:44] (CR) Dzahn: "unrelated to actual code: the topic branch is "aude"? isn't that a nickname?" [operations/puppet] - https://gerrit.wikimedia.org/r/98700 (owner: Andrew Bogott) [15:33:11] (CR) Andrew Bogott: "Yeah, I just happened to develop this while in the process of giving her access."
[operations/puppet] - https://gerrit.wikimedia.org/r/98700 (owner: Andrew Bogott) [15:33:32] ok paravoid [15:33:35] sorry about the spamming [15:33:39] :) [15:33:45] i shoudl have turned of ganglios when i realized it wasn't going to work at all yesterday [15:33:53] so [15:34:03] it has been over a week [15:34:10] (CR) Dzahn: "paravoid, are you aware of this one? does it conflict with your new admin.pp WIP ?" [operations/puppet] - https://gerrit.wikimedia.org/r/98700 (owner: Andrew Bogott) [15:34:10] because ganglios didn't work with non standard gmond ports [15:34:20] so previously the warning were just from amsterdam hosts [15:34:28] andrewbogott: thanks [15:34:37] gage: was going to work on the ganglios stuff as a way to get him involved with analytics cluster [15:34:49] but he's been busy getting dump from Leslie since she is leaving soon [15:34:59] (CR) Andrew Bogott: "It conflicts only in the sense that Faidon's refactor is much more comprehensive and should render this patch moot." [operations/puppet] - https://gerrit.wikimedia.org/r/98700 (owner: Andrew Bogott) [15:34:59] so, i started fixing ganglios yesterday [15:35:11] I fixed what I thought were the main bugs, deployed a new version [15:35:28] and those bugs are indeed fixed, so more hosts were now being checked by ganglios [15:35:29] but [15:35:40] then i discovered that ganglios is basically broken from the bottom up [15:35:41] (PS3) Cmjohnson: removing remenants of db29 from puppet, adding to decom.pp [operations/puppet] - https://gerrit.wikimedia.org/r/107852 [15:35:57] it strips the xml of tags, and then later tries to parse it as valid xml [15:36:17] hence the check_ganglia review.
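(Editor's note: as a contrast to stripping tags and then re-parsing, here is a minimal sketch of reading gmond-style XML with a real XML parser. It is not ganglios or check_ganglia code; the sample document, the host name, and the helper metrics_by_host are hypothetical and only assume the usual GANGLIA_XML / CLUSTER / HOST / METRIC layout with NAME and VAL attributes.)

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the general shape of gmond output (made-up values).
SAMPLE_XML = """<GANGLIA_XML VERSION="3.3.5" SOURCE="gmond">
  <CLUSTER NAME="example">
    <HOST NAME="host1.example.org" IP="10.0.0.1">
      <METRIC NAME="load_one" VAL="0.42" TYPE="float" UNITS=""/>
      <METRIC NAME="disk_free" VAL="120.5" TYPE="double" UNITS="GB"/>
    </HOST>
  </CLUSTER>
</GANGLIA_XML>"""


def metrics_by_host(xml_text):
    """Map each host name to a dict of metric name -> raw string value."""
    root = ET.fromstring(xml_text)
    result = {}
    for host in root.iter("HOST"):
        # Keep the document intact and let the parser walk the tree.
        result[host.get("NAME")] = {
            metric.get("NAME"): metric.get("VAL")
            for metric in host.iter("METRIC")
        }
    return result


if __name__ == "__main__":
    print(metrics_by_host(SAMPLE_XML))
```

(Values are left as strings here; a real check would convert them according to each metric's TYPE before comparing against thresholds.)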
[15:36:31] (03CR) 10Cmjohnson: [C: 032] removing remenants of db29 from puppet, adding to decom.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/107852 (owner: 10Cmjohnson) [15:37:14] (03CR) 10Dzahn: "also see https://gerrit.wikimedia.org/r/#/c/98700/6" [operations/puppet] - 10https://gerrit.wikimedia.org/r/107848 (owner: 10Faidon Liambotis) [15:37:20] cmjohnson1: while at it... wanna update this too? https://rt.wikimedia.org/Ticket/Display.html?id=4644. Only a wipe and shutting the switch port down is left (it is a brocade and i don't have access) [15:38:04] akosiaris ok [15:38:48] thanks [15:40:26] (03PS1) 10Jeremyb: add rdns for iodine [operations/dns] - 10https://gerrit.wikimedia.org/r/107854 [15:40:51] paravoid, can we talk about why you don't thikn we should have a package for check_ganglia? [15:40:55] we have a package for ganglios [15:41:01] and this is a python module with a bunch of files [15:41:05] (03PS2) 10Jeremyb: add rdns for iodine [operations/dns] - 10https://gerrit.wikimedia.org/r/107854 [15:41:57] (03CR) 10Jeremyb: "rdns is in I7de50c0fcdbc095315b7d0f37d89c877041ff231" [operations/puppet] - 10https://gerrit.wikimedia.org/r/94111 (owner: 10Jeremyb) [15:42:42] (03CR) 10Faidon Liambotis: [C: 04-1] "iodine is in public1-b, not -a. Its static IP is 2620:0:861:2:208:80:154:146." [operations/dns] - 10https://gerrit.wikimedia.org/r/107854 (owner: 10Jeremyb) [15:42:51] grrrr [15:42:54] :) [15:42:55] i changed it! [15:43:02] did i not commit it? [15:43:14] * jeremyb did actually check that [15:43:48] (03CR) 10Andrew Bogott: "Rebase requires reconciling with https://gerrit.wikimedia.org/r/#/c/102052/1" [operations/puppet] - 10https://gerrit.wikimedia.org/r/98307 (owner: 10Faidon Liambotis) [15:45:33] (03CR) 10Dzahn: "bump, is this still going on? 
"Ryan and Mike are aggressively refactoring these classes right now"" [operations/puppet] - 10https://gerrit.wikimedia.org/r/97007 (owner: 10ArielGlenn) [15:45:55] (03PS3) 10Jeremyb: add rdns for iodine [operations/dns] - 10https://gerrit.wikimedia.org/r/107854 [15:46:08] (03CR) 10Andrew Bogott: "Yes, still going on." [operations/puppet] - 10https://gerrit.wikimedia.org/r/97007 (owner: 10ArielGlenn) [15:46:33] (03CR) 10Jeremyb: "right, I had actually checked and fixed that and then didn't commit! :(" [operations/dns] - 10https://gerrit.wikimedia.org/r/107854 (owner: 10Jeremyb) [15:47:05] (03CR) 10Dzahn: "thanks, i'll leave this to others involved more in labs migration" [operations/puppet] - 10https://gerrit.wikimedia.org/r/97007 (owner: 10ArielGlenn) [15:47:22] (03CR) 10Faidon Liambotis: [C: 032] add rdns for iodine [operations/dns] - 10https://gerrit.wikimedia.org/r/107854 (owner: 10Jeremyb) [15:47:32] * jeremyb runs away. thanks faidon for (several) recent merges [15:48:07] jeremyb: will you also add the forward? [15:48:10] any reason not to? [15:48:33] (03CR) 10Ottomata: [C: 032] hive: puppet 3 compatibility fix: fully qualify variables [operations/puppet] - 10https://gerrit.wikimedia.org/r/107821 (owner: 10Matanya) [15:48:45] !log disabling puppet on db29 [15:48:51] Logged the message, Master [15:49:32] PROBLEM - Host db29 is DOWN: PING CRITICAL - Packet loss = 100% [15:49:32] PROBLEM - Varnish HTTP text-backend on cp1055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:49:35] paravoid: hrmmm... i was originally doing it for mail headers but for mail headers you maybe need the forward too... [15:49:42] PROBLEM - Varnish HTCP daemon on cp1055 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:49:42] PROBLEM - Varnish traffic logger on cp1055 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[15:49:44] (03CR) 10Dzahn: "hashar, is that a -0.5 ?:) your vote counts on contint i'd say" [operations/puppet] - 10https://gerrit.wikimedia.org/r/107347 (owner: 10Matanya) [15:49:47] paravoid: forward would also effect HTTPS clients though [15:50:00] paravoid: it's not going through lvs or anything [15:50:20] i guess we already use v6 other places and web clients survived... [15:50:29] would have to make sure apache's actually listening on v6 [15:50:49] paravoid: anyway, really running away [15:53:32] RECOVERY - Varnish HTCP daemon on cp1055 is OK: PROCS OK: 1 process with UID = 111 (vhtcpd), args vhtcpd [15:53:33] RECOVERY - Varnish traffic logger on cp1055 is OK: PROCS OK: 2 processes with command name varnishncsa [15:54:22] RECOVERY - Varnish HTTP text-backend on cp1055 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.000 second response time [15:56:25] (03CR) 10Dzahn: gitblit: convert into a module (033 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/107555 (owner: 10Matanya) [15:57:44] (03CR) 10Dzahn: "now that you're setting $host in the role and then use it in the module to setup the apache site (cool, that was the suggestion), you shou" [operations/puppet] - 10https://gerrit.wikimedia.org/r/107555 (owner: 10Matanya) [15:59:05] ok, i'm also gonna use an away nick for a change [15:59:09] cya later [16:14:19] !log rebooting cp1055 to clear out mess from XFS/kmem_alloc bug [16:14:25] Logged the message, Master [16:14:41] PROBLEM - Varnish HTTP text-frontend on cp1055 is CRITICAL: Connection timed out [16:15:00] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [16:16:10] PROBLEM - Host cp1055 is DOWN: PING CRITICAL - Packet loss = 100% [16:17:30] RECOVERY - Host cp1055 is UP: PING OK - Packet loss = 0%, RTA = 0.51 ms [16:21:54] (03PS1) 10Cmjohnson: removing dns for db29 [operations/dns] - 10https://gerrit.wikimedia.org/r/107858 [16:22:57] (03CR) 10Ottomata: [C: 032 V: 032] "I got an 'ok ok if you must' 
from Faidon in IRC. Merging :p" [operations/debs/check_ganglia] (debian) - 10https://gerrit.wikimedia.org/r/107723 (owner: 10Ottomata) [16:23:12] heh [16:23:17] (03PS1) 10Hashar: lower TTL for contint websites [operations/dns] - 10https://gerrit.wikimedia.org/r/107859 [16:24:57] so we have a merged changeset with a +2 and a -2 [16:25:18] i wonder what the historians of the future will say ... [16:27:12] hgaha [16:30:27] akosiaris: we will rewrite history before our kingdom fall appart [16:32:32] and since history is written by the victors that means we will be victorious. And yet our kingdom will fall apart. Weird.... [16:34:38] (03CR) 10Cmjohnson: [C: 032] removing dns for db29 [operations/dns] - 10https://gerrit.wikimedia.org/r/107858 (owner: 10Cmjohnson) [16:34:42] aaah i got it. All your bases are belong to us. Resistance is futile, you will be assimilated :-) [16:43:45] !log reedy synchronized php-1.23wmf11 [16:43:52] Logged the message, Master [16:47:18] !log db1004 replacing failing disk slot 9 [16:47:24] Logged the message, Master [16:47:32] !log tungsten replacing failing disk slot 3 [16:47:37] Logged the message, Master [16:48:00] !log es1007 replacing disk slot 6 [16:48:07] Logged the message, Master [16:51:50] PROBLEM - RAID on tungsten is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [16:55:40] PROBLEM - RAID on db1004 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [16:58:50] PROBLEM - RAID on es1007 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [17:08:34] !log reedy started scap: testwiki to 1.23wmf11 and build l10n cache [17:08:43] Logged the message, Master [17:13:00] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [17:13:14] Reedy: No. [17:13:35] I like the new logmsgbot already [17:18:28] Did somebody write a bot to taunt Reedy? 
[17:18:51] bd808: apparently [17:19:06] bd808: it already does other fun things, like calling andrewbogott_afk dummy everytime he does something [17:19:30] bd808: I think it just has interesting start messages. tools was just restarted, so this might be something from there [17:19:41] I saw the headache and naked join message earlier [17:20:24] yeah, I saw that too [17:20:33] quick, let's take down tools to see what it says next when it comes back up [17:28:39] (03PS1) 10Ottomata: Style fixes for nagios.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/107864 [17:29:15] !log reedy finished scap: testwiki to 1.23wmf11 and build l10n cache (duration: 23m 24s) [17:29:22] Logged the message, Master [17:29:40] (03PS2) 10Ottomata: Style fixes for nagios.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/107864 [17:33:50] RECOVERY - RAID on tungsten is OK: OK: optimal, 1 logical, 2 physical [17:43:40] RECOVERY - RAID on db1004 is OK: OK: optimal, 1 logical, 2 physical [17:44:20] (03PS1) 10Reedy: Add symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/107870 [17:44:22] (03PS1) 10Reedy: Wikipedias to 1.23wmf10 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/107871 [17:44:24] (03PS1) 10Reedy: Undeploy AssertEdit completely [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/107872 [17:44:26] (03PS1) 10Reedy: phase1 wikis to 1.23wmf11 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/107873 [17:45:19] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: testwiki back to 1.23wmf9 till window [17:45:26] Logged the message, Master [17:52:59] (03CR) 10Reedy: [C: 032] Add symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/107870 (owner: 10Reedy) [17:53:06] (03Merged) 10jenkins-bot: Add symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/107870 (owner: 10Reedy) [17:53:56] !log reedy synchronized docroot and w [17:54:03] Logged the 
message, Master [17:59:21] (03CR) 10Ori.livneh: gitblit: convert into a module (033 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/107555 (owner: 10Matanya) [18:01:25] (03PS3) 10Faidon Liambotis: (WIP) Completely overhaul admins.pp & modularize [operations/puppet] - 10https://gerrit.wikimedia.org/r/107848 [18:01:33] !log reedy synchronized php-1.23wmf11/includes/specials/SpecialWatchlist.php [18:01:39] Logged the message, Master [18:04:46] paravoid: an hero! [18:05:00] an hiero! [18:05:23] for encountering puppet bugs all afternoon? [18:05:32] puppet is a bug [18:05:39] like { $foo => ... } is invalid [18:05:48] but { "${foo}" => ... } is okay [18:05:54] yes, really [18:06:24] or how I was declaring a virtual resource multiple times and puppet was complaining that it's redeclared [18:06:36] but it's virtual!? [18:07:13] my theory is that virtual resources inside other resources that have been instantiated with create_resources() are being treated as regular resources, not virtual [18:07:16] or something [18:17:00] <^d> We abandoned the Ehcache thing, right? [18:17:15] hey greg-g. let me know when you have a few moments to talk about the dashboard [18:24:50] RECOVERY - RAID on es1007 is OK: OK: optimal, 1 logical, 2 physical [18:37:00] ^d: I think so [18:37:15] <^d> Thought so. [18:37:16] I'm sure Tim was making progress, but to "fix" issues needed the non free version [18:37:23] Which obv was a no go [18:38:00] except when dealing with codecs [18:38:04] :P [18:40:06] greg-g: any objections to me starting the elasticsearch upgrade now? [18:40:32] manybubbles|away: should be ok [18:40:54] no longer away! [18:43:05] ^d: I see you fixed an error in master. thanks for that [18:44:31] !log starting Elasticsearch upgrade from 0.90.9 to 0.90.10 [18:44:37] Logged the message, Master [18:45:13] paravoid, you there? you know if mr. bergsma is around?
[18:45:28] I am here, I don't know about mark [18:46:01] (03CR) 10Ottomata: [C: 032 V: 032] Style fixes for nagios.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/107864 (owner: 10Ottomata) [18:46:50] <^d> manybubbles: Was running some unit tests, came across it :) [18:47:27] paravoid, thx. just wanted to verify you'll be able to make the meeting in 45 minutes with bblack or bblack will be handling alone. my preference is to have all the brains together. [18:48:42] it's too late for me to catch it :( [18:48:55] but I'll be in the office next week [18:49:29] ok, i guess we can get on the same page with bblack, then solve next week. yurik will be here then, too. [18:49:50] PROBLEM - Host ms-be1002 is DOWN: PING CRITICAL - Packet loss = 100% [18:50:17] cool [18:50:46] !log powercycling ms-be1002, down, console unresponsive [18:50:54] Logged the message, Master [18:53:50] RECOVERY - Host ms-be1002 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [18:57:16] ^d: I'm not sure our requests are being logged [18:57:30] I might have renamed that file in a new version of that commit without you noticing [18:57:47] <^d> Hm? [18:57:58] paravoid, approx what day/time you get here? [18:58:05] trying to schedule a meeting [18:58:32] ^d: It is just that CirrusSearch-all doesn't really have anythign in it [18:58:43] <^d> manybubbles: Yeah. Most wikis haven't moved to new code yet. [18:58:47] <^d> Only wmf11 wikis have. [18:58:52] oh yeah! [18:58:54] <^d> :) [19:00:07] <^d> manybubbles: Future reference, e-mail subjects with "Yikes" in them are likely to get me to stop what I'm doing and panic for you :p [19:00:21] sorry [19:00:28] I'm not really sure where they are all coming from [19:00:33] but yeah, not panic worthy [19:01:31] <^d> Ouch, I see what you're talking about with enwiki now. 
[19:01:37] <^d> Weird, they shouldn't be logging there yet...old code [19:03:31] funky [19:03:54] my head is in Elasticsearch land and I'm starting to get into the upgrading groove but that is kinda funky [19:04:35] (03CR) 10Reedy: [C: 032] Wikipedias to 1.23wmf10 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/107871 (owner: 10Reedy) [19:04:41] (03Merged) 10jenkins-bot: Wikipedias to 1.23wmf10 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/107871 (owner: 10Reedy) [19:04:54] (03PS1) 10Ottomata: Adding class icinga::ganglia::check to install check_ganglia [operations/puppet] - 10https://gerrit.wikimedia.org/r/107885 [19:06:04] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikipedias to 1.23wmf10 [19:06:11] Logged the message, Master [19:06:19] (03CR) 10Ottomata: [C: 032 V: 032] Adding class icinga::ganglia::check to install check_ganglia [operations/puppet] - 10https://gerrit.wikimedia.org/r/107885 (owner: 10Ottomata) [19:06:31] <^d> manybubbles: We're at 2.6mil cirrusSearchLinksUpdate jobs now. We're falling further behind. [19:06:44] <^d> I'm inclined to turn back off for enwiki for the time being. [19:07:08] ^d: why not just clear the jobs and keep plowing forwards while we think about them?
[19:07:41] turning off enwiki will probably make them fail spectacularly [19:07:42] <^d> We could just drop the jobs :p [19:07:46] yeah [19:08:03] just smash the backlog and think about how to stop it from building back up [19:08:20] <^d> If we turned off the search update variable we would turn them all into a no-op :p [19:08:20] we won't really lose any data because we'll reindex enwiki eventually anyway [19:08:30] (03CR) 10Reedy: [C: 032] Undeploy AssertEdit completely [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/107872 (owner: 10Reedy) [19:08:32] including indexing I think [19:08:36] (03Merged) 10jenkins-bot: Undeploy AssertEdit completely [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/107872 (owner: 10Reedy) [19:08:36] <^d> Oh yeah duh. [19:08:38] <^d> Ignore m. [19:09:09] <^d> Bah, what was that one liner to drop all jobs of a given type again. [19:09:19] service redis restart [19:09:41] I think you emailed it to me [19:10:02] <^d> Got it [19:10:13] <^d> JobQueueGroup::singleton()->get( 'cirrusSearchLinksUpdate' )->delete(); [19:12:09] <^d> !log dropped all cirrusSearchLinksUpdate jobs from enwiki job queue. 
Job queue back to ~330k entries, far better than the ~3mil [19:12:15] Logged the message, Master [19:12:22] (03PS1) 10Ottomata: Adding define monitor_ganglia [operations/puppet] - 10https://gerrit.wikimedia.org/r/107887 [19:12:31] (03CR) 10Reedy: [C: 032] phase1 wikis to 1.23wmf11 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/107873 (owner: 10Reedy) [19:12:37] (03Merged) 10jenkins-bot: phase1 wikis to 1.23wmf11 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/107873 (owner: 10Reedy) [19:13:32] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: testwiki, testwiki, testwikidatawiki and mediawikiwiki to 1.23wmf11 [19:13:38] Logged the message, Master [19:14:36] (03CR) 10Ottomata: [C: 032 V: 032] Adding define monitor_ganglia [operations/puppet] - 10https://gerrit.wikimedia.org/r/107887 (owner: 10Ottomata) [19:17:16] (03PS1) 10Ottomata: Trying out monitor_ganglia with kafka-broker-MessagesIn check. [operations/puppet] - 10https://gerrit.wikimedia.org/r/107888 [19:17:35] (03PS2) 10Reedy: Add global permissions for Flow [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/106306 (owner: 10EBernhardson) [19:17:40] (03CR) 10Reedy: [C: 032] Add global permissions for Flow [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/106306 (owner: 10EBernhardson) [19:17:49] (03Merged) 10jenkins-bot: Add global permissions for Flow [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/106306 (owner: 10EBernhardson) [19:18:21] (03CR) 10Ottomata: [C: 032 V: 032] Trying out monitor_ganglia with kafka-broker-MessagesIn check. [operations/puppet] - 10https://gerrit.wikimedia.org/r/107888 (owner: 10Ottomata) [19:18:36] !log reseating pem0 on cr2-ulsfo to try and clear problem [19:18:42] Logged the message, Mistress of the network gear. [19:20:29] ^d: elastic1010 is very sad now [19:20:40] <^d> Ugh [19:20:41] ^d: why are those link update jobs slow? They don't have to do any parsing right? 
[19:20:44] !log Created zhwikivoyage echo tables on db1029 [19:20:50] (03PS1) 10Ottomata: monitor_ganglia cannot require icinga::ganglia::check [operations/puppet] - 10https://gerrit.wikimedia.org/r/107889 [19:20:51] Logged the message, Master [19:21:00] (03PS2) 10Reedy: Enable Thanks and Echo on zhwikivoyage [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/107800 [19:21:01] <^d> AaronSchulz: No, they shouldn't be. And they're not slow...each one runs quickly. [19:21:05] (03CR) 10Reedy: [C: 032] Enable Thanks and Echo on zhwikivoyage [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/107800 (owner: 10Reedy) [19:21:09] <^d> I think we're just doing too many and can't keep up. [19:21:12] (03Merged) 10jenkins-bot: Enable Thanks and Echo on zhwikivoyage [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/107800 (owner: 10Reedy) [19:21:26] (03CR) 10jenkins-bot: [V: 04-1] monitor_ganglia cannot require icinga::ganglia::check [operations/puppet] - 10https://gerrit.wikimedia.org/r/107889 (owner: 10Ottomata) [19:21:28] (03PS2) 10Reedy: Let bureaucrats add users to accountcreator on elwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/107599 (owner: 10Odder) [19:21:32] (03PS2) 10Ottomata: monitor_ganglia cannot require icinga::ganglia::check [operations/puppet] - 10https://gerrit.wikimedia.org/r/107889 [19:21:34] (03CR) 10Reedy: [C: 032] Let bureaucrats add users to accountcreator on elwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/107599 (owner: 10Odder) [19:21:47] (03Merged) 10jenkins-bot: Let bureaucrats add users to accountcreator on elwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/107599 (owner: 10Odder) [19:21:55] (03PS2) 10Reedy: Let admins add users to three groups on zhwikivoyage [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/107596 (owner: 10Odder) [19:21:57] (03CR) 10Ottomata: [C: 032 V: 032] monitor_ganglia cannot require 
icinga::ganglia::check [operations/puppet] - 10https://gerrit.wikimedia.org/r/107889 (owner: 10Ottomata) [19:21:59] (03CR) 10Reedy: [C: 032] Let admins add users to three groups on zhwikivoyage [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/107596 (owner: 10Odder) [19:22:01] ^d: are you actually tracking what links where or just the counts? [19:22:14] (03Merged) 10jenkins-bot: Let admins add users to three groups on zhwikivoyage [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/107596 (owner: 10Odder) [19:22:25] <^d> AaronSchulz: For these jobs? Just counts if memory serves. [19:22:39] !log reedy synchronized database lists files: [19:22:45] Logged the message, Master [19:22:53] (03PS1) 10Ottomata: Fixing missing comma [operations/puppet] - 10https://gerrit.wikimedia.org/r/107890 [19:23:58] (03CR) 10Reedy: [C: 04-1] "This won't do what you think it should." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/107340 (owner: 10Hydriz) [19:24:27] (03CR) 10Ottomata: [C: 032 V: 032] Fixing missing comma [operations/puppet] - 10https://gerrit.wikimedia.org/r/107890 (owner: 10Ottomata) [19:24:33] (03PS3) 10Reedy: Set $wgExportFromNamespaces to true on MediaWiki.org [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/106911 (owner: 10Odder) [19:24:37] (03CR) 10Reedy: [C: 032] Set $wgExportFromNamespaces to true on MediaWiki.org [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/106911 (owner: 10Odder) [19:24:44] (03Merged) 10jenkins-bot: Set $wgExportFromNamespaces to true on MediaWiki.org [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/106911 (owner: 10Odder) [19:25:01] (03PS4) 10Reedy: Completely undeploy AssertEdit (merged into core) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96931 (owner: 10Legoktm) [19:25:09] (03Abandoned) 10Reedy: Completely undeploy AssertEdit (merged into core) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96931 (owner: 
10Legoktm) [19:25:14] :( [19:25:29] (03PS2) 10Reedy: Add aliases for NS 100, 106 on bnwikisource [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/106776 (owner: 10Odder) [19:25:36] (03CR) 10Reedy: [C: 032] Add aliases for NS 100, 106 on bnwikisource [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/106776 (owner: 10Odder) [19:25:43] (03Merged) 10jenkins-bot: Add aliases for NS 100, 106 on bnwikisource [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/106776 (owner: 10Odder) [19:25:44] legoktm: I forgot about that revision existing [19:25:47] I made one myself [19:25:49] ah [19:25:50] ok :D [19:26:06] If you look it rebased to be a no-op [19:26:18] (03PS2) 10Reedy: $wgMemoryLimit up to 220MB [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/106158 [19:26:26] (03CR) 10Reedy: [C: 032] $wgMemoryLimit up to 220MB [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/106158 (owner: 10Reedy) [19:26:34] lol [19:26:47] (03CR) 10Legoktm: "Was actually done in I86660dc9947de2b2c771692d1d6a89d9ea15236e" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96931 (owner: 10Legoktm) [19:27:16] (03CR) 10Reedy: "It might actually make sense to move Scribunto earlier rather than WikimediaIncubator later" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/107340 (owner: 10Hydriz) [19:28:58] (03CR) 10Reedy: [V: 032] $wgMemoryLimit up to 220MB [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/106158 (owner: 10Reedy) [19:30:08] !log reedy synchronized wmf-config/ [19:34:24] PROBLEM - Disk space on elastic1010 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 11159 MB (3% inode=99%): [19:35:12] heh [19:35:19] manybubbles: ^^ [19:35:33] Reedy: I been watching it..... 
[19:35:44] I isn't enjoying the restart [19:35:58] I might see if I can free some space across the cluster by lowering redundancy to 2 [19:43:24] RECOVERY - Disk space on elastic1010 is OK: DISK OK [19:47:24] 92% = OK? [19:47:25] (03PS1) 10Ottomata: Using monitor_ganglia for udp2log packet_loss_avg and vanrishkafka drerr [operations/puppet] - 10https://gerrit.wikimedia.org/r/107893 [19:47:57] better than 97% [19:48:41] internet connection is flaking out [19:48:44] not a good time for that [19:50:00] (03CR) 10Aaron Schulz: Bump ParsoidCacheUpdateJobOnDependencyChange runners (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/107420 (owner: 10Aaron Schulz) [19:50:09] wow, https://gerrit.wikimedia.org/r/#/c/107420/2 wasn't merged yet [19:50:11] (03CR) 10Ottomata: [C: 032 V: 032] Using monitor_ganglia for udp2log packet_loss_avg and vanrishkafka drerr [operations/puppet] - 10https://gerrit.wikimedia.org/r/107893 (owner: 10Ottomata) [19:50:12] ori? [19:54:41] (03PS3) 10Ori.livneh: Bump ParsoidCacheUpdateJobOnDependencyChange runners [operations/puppet] - 10https://gerrit.wikimedia.org/r/107420 (owner: 10Aaron Schulz) [19:54:47] (03CR) 10Ori.livneh: [C: 032 V: 032] Bump ParsoidCacheUpdateJobOnDependencyChange runners [operations/puppet] - 10https://gerrit.wikimedia.org/r/107420 (owner: 10Aaron Schulz) [20:07:00] ohhh [20:08:11] (03PS1) 10Ottomata: Fix for check_ganglia, removing unused checkcommands [operations/puppet] - 10https://gerrit.wikimedia.org/r/107896 [20:10:02] (03PS2) 10Ottomata: Fix for check_ganglia, removing unused checkcommands [operations/puppet] - 10https://gerrit.wikimedia.org/r/107896 [20:10:09] (03PS3) 10Ottomata: Fix for check_ganglia, removing unused checkcommands [operations/puppet] - 10https://gerrit.wikimedia.org/r/107896 [20:12:40] (03CR) 10Ottomata: [C: 032 V: 032] Fix for check_ganglia, removing unused checkcommands [operations/puppet] - 10https://gerrit.wikimedia.org/r/107896 (owner: 10Ottomata) [20:12:42] ^d: I had to lower 
some of those "aggressive" elasticsearch shard movement settings because they turned out to be "exciting" [20:12:44] PROBLEM - Host mw27 is DOWN: PING CRITICAL - Packet loss = 100% [20:12:52] <^d> exciting! [20:13:06] they were trying to write more data to a recovering node's disk than it was capable of handling at any one time [20:13:17] because, I suppose, we don't have an omgfast disk in there [20:13:28] which made that node very sad [20:13:33] RECOVERY - Host mw27 is UP: PING OK - Packet loss = 0%, RTA = 35.77 ms [20:14:02] so the whole process is going to take _longer_ [20:14:12] but be less exciting, I hope [20:14:49] I really don't like the default elasticsearch restart process - just bounce the node and recover any changes from replicas. it is faster but you lose all redundancy while the node is recovering [20:14:59] rather, you lose the redundancy that the node would have provided [20:15:45] <^d> Such is life. [20:16:01] * YuviPanda gives ^d a potato [20:16:11] <^d> Potatoes? [20:16:19] you can cook it! [20:16:34] You can make French fries out of it! [20:18:31] what's with the increase in HTTP req/s & traffic? [20:18:59] http://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&c=Bits+caches+esams&m=cpu_report&s=by+name&mc=2&g=network_report [20:19:03] http://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&c=Bits+caches+eqiad&m=cpu_report&s=by+name&mc=2&g=network_report [20:19:23] 1.4-1.5x [20:20:03] spikes of fatals in prod? https://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&title=MediaWiki+errors&vl=errors+%2F+sec&x=0.5&n=&hreg[]=vanadium.eqiad.wmnet&mreg[]=fatal|exception&gtype=stack&glegend=show&aggregate=1&embed=1 [20:20:18] do those get counted? they result in http status 500, so they should, right?
[20:20:23] also, https://graphite.wikimedia.org/render/?title=HTTP%20Requests/sec%20%28excludes%20bits.wikimedia.org:%20css/js%29%20-8hours&from=-8hours&width=1024&height=500&until=now&areaMode=none&hideLegend=false&lineWidth=1&lineMode=connected&target=color%28cactiStyle%28alias%28scale%28reqstats.requests,%220.01666%22%29,%20%22requests/sec%22%29%29,%22blue%22%29 [20:21:19] 19:06 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikipedias to 1.23wmf10 [20:24:11] network isn't very happy [20:25:17] esams is getting saturated across the board [20:25:35] ori, what are these graphs based on? I see errors in fatals.log but not in fatalmonitor [20:25:48] also the monthly: http://ganglia.wikimedia.org/latest/graph.php?r=month&z=xlarge&c=Bits+caches+esams&m=cpu_report&s=by+name&mc=2&g=network_report [20:25:51] what the hell? [20:26:22] we hit 687M/s on bits today, while our normal is at 350M/s [20:27:24] ori, looks like most fatals are from job runners [20:29:49] * AaronSchulz wonders wtf infoaction uses getTemplateLinksFrom [20:30:09] so, something got deployed today and is causing havoc [20:30:33] paravoid: havoc on the inside or causing clients to request more stuff? [20:30:49] see the links above [20:30:49] varnishstat [20:30:59] varnishtop? [20:31:04] that [20:31:15] got output handy?
[20:31:57] it's easy enough, but note that we have two separate effects [20:32:09] we have https://graphite.wikimedia.org/render/?title=HTTP%20Requests/sec%20%28excludes%20bits.wikimedia.org:%20css/js%29%20-8hours&from=-8hours&width=1024&height=500&until=now&areaMode=none&hideLegend=false&lineWidth=1&lineMode=connected&target=color%28cactiStyle%28alias%28scale%28reqstats.requests,%220.01666%22%29,%20%22requests/sec%22%29%29,%22blue%22%29 [20:32:14] which *excludes* bits [20:32:46] interesting spike at 18:45 [20:32:50] and it's > 10% increase in HTTP req/s [20:33:04] then we have bits [20:33:34] 22331.75 RxURL /geoiplookup [20:33:34] 9740.95 RxURL /favicon/wikipedia.ico [20:33:34] 9488.79 RxURL /static-1.23wmf10/extensions/UniversalLanguageSelector/resources/images/cog-sprite.svg?2014-01-09T16:51:40Z [20:33:37] 8995.63 RxURL /static-1.23wmf10/skins/common/images/poweredby_mediawiki_88x31.png [20:33:39] PROBLEM - Varnishkafka Delivery Errors on cp4020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.0 [20:33:39] PROBLEM - Varnishkafka Delivery Errors on cp1046 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.0 [20:33:40] 8927.57 RxURL /static-1.23wmf10/skins/vector/images/search-ltr.png?303-4 [20:33:49] PROBLEM - Varnishkafka Delivery Errors on cp1059 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.0 [20:33:49] PROBLEM - Varnishkafka Delivery Errors on cp1047 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.0 [20:33:59] PROBLEM - Varnishkafka Delivery Errors on cp4011 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.0 [20:34:05] working on it! 
[20:34:09] PROBLEM - Varnishkafka Delivery Errors on cp1060 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.0 [20:34:20] that's top-5 [20:35:18] bits is gradually getting better compared to the ~19:00 spike [20:35:19] PROBLEM - Varnishkafka Delivery Errors on cp4019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.0 [20:35:19] PROBLEM - Varnishkafka Delivery Errors on cp4012 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.0 [20:35:39] search-ltr is the little magnifying glass in the search box. [20:35:49] PROBLEM - Varnishkafka Delivery Errors on cp3011 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.0 [20:35:49] PROBLEM - Varnishkafka Delivery Errors on cp3012 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 0.0 [20:35:51] https://en.wikipedia.org/wiki/Template:Cite_news?action=info [20:35:53] poweredby is well, the poweredby logo :) [20:35:55] domas: "Pages transcluded on (446,448)" [20:36:00] fast query is fast ;) [20:36:01] however the traffic levels are still elevated and they have also been much more elevated than the past two weeks [20:36:09] cf. http://ganglia.wikimedia.org/latest/graph.php?r=month&z=xlarge&c=Bits+caches+esams&m=cpu_report&s=by+name&mc=2&g=network_report [20:39:20] (03CR) 10Hashar: [C: 031] "Change is fine, makes the declarations consistent which is a good thing :-]" [operations/puppet] - 10https://gerrit.wikimedia.org/r/107347 (owner: 10Matanya) [20:40:21] %35-40 of bits traffic is ULS, but I'm not sure if that's new [20:40:26] 35-40% even :) [20:40:38] we accept either notation [20:41:10] %3%5%-%4%0% [20:41:37] (03PS1) 10Ottomata: Fixing threshold for varnishkafka drerr alert [operations/puppet] - 10https://gerrit.wikimedia.org/r/107904 [20:41:40] Nikerabbit: ? 
[20:41:57] (03CR) 10Ottomata: [C: 032 V: 032] Fixing threshold for varnishkafka drerr alert [operations/puppet] - 10https://gerrit.wikimedia.org/r/107904 (owner: 10Ottomata) [20:42:22] flying [20:43:58] Rates:15.8Gbps/42.4Gbps - ifspeed: 40Gbps [20:44:01] paravoid: didn't understand your CR on udp2log [20:44:33] (03CR) 10Hashar: "Thanks for the review Daniel The 3 issues you reported were already present and I just fixed the most obvious puppet-lint issues." (033 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/104743 (owner: 10Hashar) [20:46:40] !log Deleted pacct.0 on neon; activity still silly high [20:47:00] that bits spike looks very similar to the one we had at the last upgrade [20:47:09] paravoid: is the 35-40% cog-sprite.svg, or is there anything else? [20:47:53] no, everything that matches UniversalLanguageSelector [20:48:10] paravoid: can you provide a few other URLs? [20:48:34] autonym, some other fonts [20:48:34] paravoid: Nemo_bis is right that pushing out a new release tends to cause a spike, because many bits URLs encode the MediaWiki release, so a lot of cached resources become invalid all at once [20:48:50] wait a sec for the urls [20:48:54] yes, I'm aware of this [20:49:07] so the deploy spike is getting better by the minute [20:49:19] !log Deleted all but last 50K lines from /var/log/ganglia/ganglia_parser.log (never rotated, not at 11G) from neon [20:49:23] actually last time was much worse, bits had hit 100 load [20:49:27] however the monthly shows an abnormal elevation of traffic the past 2-3 weeks [20:49:43] that + the deploy spike is filling up all kinds of pipes [20:49:49] there is a live bug with ULS that causes it to load too many fonts [20:50:05] niklas merged a workaround but i'm not sure it made it to wmf10, let me check [20:50:19] Total bytes: 1616669434 [20:50:19] Universal Language Selector: 585067766 [20:50:20] I want to be careful not to attribute it to ULS just yet though [20:50:22] 100943578 
/static-current/extensions/UniversalLanguageSelector/data/fontrepo/fonts/Autonym/Autonym.woff?version=20131205
[20:50:31] 64545146 /static-current/extensions/UniversalLanguageSelector/data/fontrepo/fonts/SiyamRupali/SiyamRupali.woff?version=1.070
[20:50:35] 36328294 /static-current/extensions/UniversalLanguageSelector/data/fontrepo/fonts/Jomolhari/Jomolhari.woff?version=0.003
[20:50:42] etc.
[20:50:49] yes, I'm not blaming it on ULS yet either
[20:50:58] as I said, I'm not sure what the number was e.g. 2 weeks ago
[20:52:09] a weekly cron job that runs varnishtop for a minute and emails the output to ops@ would be very useful
[20:52:23] runs varnishtop where? :)
[20:52:44] oh, right. i guess you need to collate the output of multiple bits. but even a single bits cache would be pretty good.
[20:53:03] because you're really checking for misbehaving mediawiki code, which is not dc-aware.
[20:53:08] (text, bits, mobile, upload) x (front, back) x (eqiad, esams, ulsfo)
[20:53:09] (PS1) Ottomata: Escaping '!' negation in varnishkafka drerr icinga threshold [operations/puppet] - https://gerrit.wikimedia.org/r/107910
[20:53:15] front bits
[20:53:21] any DC
[20:53:27] well yes, bits doesn't have backend
[20:53:28] (CR) Ottomata: [C: 2 V: 2] Escaping '!' negation in varnishkafka drerr icinga threshold [operations/puppet] - https://gerrit.wikimedia.org/r/107910 (owner: Ottomata)
[20:53:40] DC matters
[20:53:59] DCs tend to attract specific languages
[20:54:15] yeah, not saying it's perfect, just saying that even a single host would be very useful
[20:54:30] i'm chatty because i'm waiting for the ULS repo to clone, 40mb or so
[20:55:05] what's the commit?
[20:56:41] paravoid: https://gerrit.wikimedia.org/r/#/c/107028/ , did not make it to wmf10
[20:56:43] i am backporting
[20:58:29] (CR) Matanya: "nitpicks."
(8 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/104743 (owner: Hashar)
[21:00:45] !log ori synchronized php-1.23wmf10/extensions/UniversalLanguageSelector/resources/js/ext.uls.webfonts.js 'Update UniversalLanguageSelector to master for I2da436caa: Wait till rendering thread completion before applying webfonts (Bug: 59958)'
[21:02:01] let's see if that negates the trend
[21:02:10] it doesn't look like a very likely match
[21:02:45] between the issue and ULS, or between the issue and this specific patch?
[21:03:17] the latter for sure
[21:03:24] I'm not sure about the former
[21:04:00] RECOVERY - Varnishkafka Delivery Errors on cp4011 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[21:04:10] RECOVERY - Varnishkafka Delivery Errors on cp1060 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[21:04:20] RECOVERY - Varnishkafka Delivery Errors on cp4019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[21:04:20] RECOVERY - Varnishkafka Delivery Errors on cp4012 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[21:04:30] RECOVERY - Varnishkafka Delivery Errors on cp4020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[21:04:40] RECOVERY - Varnishkafka Delivery Errors on cp1047 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[21:04:40] RECOVERY - Varnishkafka Delivery Errors on cp1046 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[21:04:40] RECOVERY - Varnishkafka Delivery Errors on cp3012 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[21:04:41] RECOVERY - Varnishkafka Delivery Errors on cp3011 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[21:04:50] RECOVERY - Varnishkafka Delivery Errors on cp1059 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[21:08:36] hrm
[21:08:40] we're back to November levels
[21:08:47] which isn't too bad
[21:09:15] it's different than two weeks ago, but otoh two weeks ago was new year's
[21:10:08]
https://graphite.wikimedia.org/render/?title=HTTP%20Requests/hour%20%28excludes%20bits.wikimedia.org:%20css/js%29%20-6week&from=-6%20weeks&width=1024&height=500&until=now&areaMode=none&hideLegend=false&lineWidth=1&lineMode=connected&target=color%28cactiStyle%28alias%28hitcount%28scale%28reqstats.requests,%220.01666%22%29,%20%221hour%22%29,%20%22requests/hour%22%29%29,%22blue%22%29
[21:10:22] there was definitely a drop
[21:10:44] all these URLs give me "no data" graphs btw
[21:11:16] did you open the whole URL?
[21:11:30] maybe it's getting trimmed by the server
[21:11:39] it should end in %22blue%22%29
[21:11:49] yes it's not trimmed, not that long
[21:12:16] hmpf it's firefox following another religion of percent-encoding it seems
[21:12:27] I use firefox too
[21:14:24] paravoid: what is generating that data again?
[21:14:28] mediawiki itself -> profiler, right?
[21:14:30] what data?
[21:14:36] in the graph you linked to above
[21:14:44] http req/hr?
[21:14:51] no that's not mediawiki
[21:14:52] yes
[21:15:05] oh, that's reqstats
[21:15:07] that's the perl script, yes?
[21:15:09] yes
[21:15:21] consumes from udp2log, writes to carbon
[21:15:57] collector on professor was saturated and dropping some % of data. i don't remember if carbon was, too
[21:17:02] (PS5) Matanya: gitblit: convert into a module [operations/puppet] - https://gerrit.wikimedia.org/r/107555
[21:20:43] (CR) Matanya: "The .erb file doesn't have any refs to git.wikimedia.org. I only changed the path there + addressing the rest of the comments." [operations/puppet] - https://gerrit.wikimedia.org/r/107555 (owner: Matanya)
[21:20:44] paravoid: ULS backport seems to have helped, no?
[21:22:59] indeed
[21:23:12] good catch!
[21:24:38] now that things are calm(er) does anyone have any objection to me turning on disk space awareness for Elasticsearch?
[21:24:48] If you want to read about it, go here: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-allocation.html#disk
[21:32:49] manybubbles: seems like a good idea?
[21:34:00] <^d> I think I +1 the idea and bd808 does too.
[21:34:04] <^d> So I say go for it :)
[21:34:14] since no one objects I'm going to go turn it on. It hasn't hurt beta or development. I'll turn it on manually and submit a puppet patch this evening if it proves non-stupid
[21:34:48] !log enabled disk space aware allocator on Elasticsearch cluster. now it won't do as many stupid things!
[21:35:04] <^d> manybubbles: What did we decide? 85/95 for lower/upper and 60s check?
[21:35:16] ^d: yes. that.
[21:35:20] <^d> mmlk
[21:35:22] <^d> *mmk
[21:35:23] here I'll propose the puppet patch now
[21:36:33] <^d> Things to prevent out of disk errors ++
[21:36:39] <^d> I think everyone's in favor of that
[21:36:54] rm -rf /*
[21:37:00] cmjohnson1: so cr2-ulsfo has a bad power supply - would you like to take care of it? you can see if someone in sf wants to swap it or it's totally something smarthands can handle
[21:37:14] <^d> bd808: I tried that but then elasticsearch didn't work right
[21:37:40] lesliecarr: i can take care of
[21:37:47] or robh
[21:38:39] i'll handle and ill have jgage shadow me
[21:38:51] so we have two techs here with recent juniper experience
[21:38:58] seems like best option to me
[21:39:16] (i hate warranty returns, but i havent done a juniper one in over a year and we should have local folks to SF know how)
[21:39:33] LeslieCarr: Can you drop a ticket in the ulsfo queue with the full details, including serial of stuff?
[21:39:36] if you dont mind =]
[21:39:42] (if you do, well... shit)
[21:40:14] ^d: On day 2 of my first "real" job out of school I did `rm -rf .*` with / as $PWD in a brilliantly misguided attempt to clean up dot files left over from ~root being /. I realized my error and killed the process after all of /dev and half of /etc was gone.
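For reference, a minimal sketch of the settings agreed at 21:35 (85/95 for the lower/upper watermarks and a 60s check), written as a cluster settings payload. The setting names are Elasticsearch's documented disk-based allocation settings. The percentage notation, the choice of `transient` (matching "turn it on manually" before the puppet patch), and the helper name `disk_allocation_settings` are assumptions for illustration; the log does not show the command that was actually run.

```python
import json


def disk_allocation_settings(low_pct, high_pct, interval_s):
    """Build a cluster-settings body for disk-aware shard allocation.

    Hypothetical helper: the values come from the chat (85/95, 60s),
    the payload shape is an assumption about how they were applied.
    """
    if not 0 < low_pct < high_pct <= 100:
        raise ValueError("expected 0 < low < high <= 100")
    return {
        "transient": {
            # Enable the disk threshold decider.
            "cluster.routing.allocation.disk.threshold_enabled": True,
            # Above the low watermark a node receives no new shards.
            "cluster.routing.allocation.disk.watermark.low": f"{low_pct}%",
            # Above the high watermark shards are moved off the node.
            "cluster.routing.allocation.disk.watermark.high": f"{high_pct}%",
            # How often disk usage is re-checked.
            "cluster.info.update.interval": f"{interval_s}s",
        }
    }


payload = disk_allocation_settings(85, 95, 60)
print(json.dumps(payload, indent=2))
```

A body like this would typically be sent with a PUT to the cluster settings endpoint. The low watermark is what produces the behaviour reported at 21:41: a node at 85% utilization stops being sent more shards.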
[21:40:39] (if you do i'll make ticket ;)
[21:40:43] ^d: you should file a bug
[21:41:16] it is working, btw
[21:41:22] <^d> awesome
[21:41:30] elastic1010 has 85% utilization and it isn't being sent more shards
[21:41:40] they are going to other nodes with less data
[21:41:49] <^d> Ok that's cool.
[21:42:10] bd808, was there day 3 after that?
[21:42:13] <^d> I mean, that's what it says on the tin, so I'm glad it works.
[21:42:20] <^d> But still, cool to see it in action.
[21:42:22] ok
[21:42:45] MaxSem: I lucked out and found a matching server to recover from. Got a dump on tape, recreated enough of /dev by hand to mount the tape and restored. I didn't tell anybody for a week or so out of raw shame.
[21:43:06] bd808: nice job!
[21:44:19] <^d> bd808: Reminds me of first "real job." I was performing some database maintenance on our CMS system and I dropped the wrong rows. NOT IN is not the same as IN.
[21:44:28] <^d> Had to recreate most of the data by hand from staging.
[21:44:56] DROP TABLE revision; -- OH SHI
[21:45:03] bd808: when I was in college I was sysadmin for the cs department and I had to nuke some dude's home directory but it wasn't dying right so I wrote some one liner bash script that didn't work either.
[21:45:19] it would have recursed upwards but for AFS's permissions saying no
[21:46:07] i rebooted production instead of test once.
[21:46:26] too many terminal tabs issue
[21:46:53] * MaxSem is so happy of not having uid 0:P
[21:46:55] so when we kickstarted new machines we used my boss' desktop as the source for the kickstart file. we'd frequently use the red hat install cd to shell into his machine and then fix the file for the new machine
[21:47:23] the problem is, the next step after you fix the file is to reboot _the machine in front of you_ then kickstart from that file
[21:47:39] but if you type `sudo reboot` you reboot another machine....
[21:47:58] bd808: what year that was?
[21:48:15] bd808: rm -rf / doesn't work nowadays, you'll get thrown a nice error
[21:48:29] twkozlowski: 1995 on HP-UX 8.something
[21:49:02] hardcore era, that explains things :)
[21:49:10] twkozlowski: did you check it? i saw it lately on 2010ish
[21:49:15] paravoid, ping
[21:49:20] pong
[21:49:28] on a dedicated machine
[21:49:29] hey
[21:49:40] hi
[21:49:50] I split off the first part of the move to the new repo and upstart into https://gerrit.wikimedia.org/r/#/c/107492/
[21:50:22] the idea is to only add the new repo and upstart config, but keep the old repo and init in place
[21:50:35] then init the new repo and do some manual testing
[21:50:36] https://en.wikipedia.org/wiki/Rm_(Unix)#Protection_of_the_filesystem_root matanya
[21:51:09] when that is all good, the second puppet change can then actually use the new repo & upstart
[21:51:52] twkozlowski, 2010 - 2006 is just 4 years. not everyone upgrades to bleeding edge at all times
[21:52:10] twkozlowski: hmm, interesting, i wonder if they used old gnu utils or the OS was just old.
[21:52:47] MaxSem: :)
[21:53:08] i think it was a solaris, but can't remember. I do remember though the uptime was since 1991
[21:54:18] two identical machines, one is still running, from what i hear from folks
[21:54:40] so 23 years uptime. not bad
[21:57:31] beta cluster has a php fatal error
[21:57:36] matanya, booring. haven't they heard what Microsoft says about Windows' lower cost of ownership? they should go professional and install Windows
[21:57:40] (PS1) Manybubbles: WIP:Make Elasticsearch less exciting [operations/puppet] - https://gerrit.wikimedia.org/r/107920
[21:57:48] jackmcbarn: link?
[21:58:09] Reedy: had, rather. it was missing something with MultimediaViewer.
when i refreshed, it went back to normal
[21:58:31] (CR) Manybubbles: [C: -1] "Still WIP:" [operations/puppet] - https://gerrit.wikimedia.org/r/107920 (owner: Manybubbles)
[21:58:47] (i assume any page would have done it)
[21:58:52] Fatal error: require_once() [function.require]: Failed opening required '/usr/local/apache/common-local/php-master/extensions/MultimediaViewer/MultimediaViewer.php' (include_path='/data/project/apache/common-local/php-master/extensions/TimedMediaHandler/handlers/OggHandler/PEAR/File_Ogg:/usr/local/apache/common-local/php-master:/usr/local/lib/php:/usr/share/php') at
[21:59:09] yeah that
[21:59:24] .. at /data/project/apache/common-local/wmf-config/CommonSettings.php on line 1828
[22:01:25] looks as if files are there now
[22:04:34] paravoid, does that sound like a good plan to you?
[22:04:55] hm
[22:05:17] why are we doing the repo/path change and the upstart change together?
[22:08:15] paravoid, the paths are changing etc
[22:08:27] and we can test it both without switching to it
[22:09:55] hrm
[22:10:03] ok, I'll have a look at the patchsets
[22:10:14] but I don't think we can reasonably deploy this on a Friday
[22:11:05] paravoid: agreed- next Tuesday or Wednesday maybe
[22:11:51] maybe...
[22:12:04] I'll be at the office, people keep adding meetings to my calendar :)
[22:12:12] (and sometimes I do too, to be fair)
[22:17:11] paravoid, I'll be busy too- so maybe after next week
[22:17:24] yeah I can imagine
[22:17:36] but let's see
[22:18:11] hey cmjohnson1,
[22:18:13] you there?
[22:18:14] so not super-urgent to do the review now if you have more pressing things on your plate
[22:18:48] any idea on how to force console com2 to lose its session when you are not in it?
[22:18:56] my network connection died while i was in a console on analytics1012
[22:19:00] now I can't get back in
[22:19:39] ottomata racreset
[22:19:50] racadm racreset
[22:20:25] (PS1) Jgreen: move several fundraising conf files to aluminium:/etc/fundraising [operations/puppet] - https://gerrit.wikimedia.org/r/107921
[22:22:01] thanks cmjohnson1, trying it
[22:23:46] (CR) Ottomata: [C: 1] WIP:Make Elasticsearch less exciting (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/107920 (owner: Manybubbles)
[22:24:11] (PS2) Jgreen: remove and/or move several fundraising conf files to aluminium:/etc/fundraising [operations/puppet] - https://gerrit.wikimedia.org/r/107921
[22:25:36] (CR) Adamw: [C: 1] "This is exciting!" [operations/puppet] - https://gerrit.wikimedia.org/r/107921 (owner: Jgreen)
[22:26:37] (CR) Jgreen: [C: 2 V: 1] remove and/or move several fundraising conf files to aluminium:/etc/fundraising [operations/puppet] - https://gerrit.wikimedia.org/r/107921 (owner: Jgreen)
[22:29:46] (PS2) Manybubbles: WIP:Make Elasticsearch less exciting [operations/puppet] - https://gerrit.wikimedia.org/r/107920
[22:32:31] (PS1) Ottomata: analytics1012 is no longer a journalnode [operations/puppet] - https://gerrit.wikimedia.org/r/107922
[22:33:00] (CR) Ottomata: [C: 2 V: 2] analytics1012 is no longer a journalnode [operations/puppet] - https://gerrit.wikimedia.org/r/107922 (owner: Ottomata)
[22:36:29] (PS1) QChris: Log cron job output to files for geowiki's process_data step [operations/puppet] - https://gerrit.wikimedia.org/r/107923
[22:45:46] (PS2) Ottomata: Log cron job output to files for geowiki's process_data step [operations/puppet] - https://gerrit.wikimedia.org/r/107923 (owner: QChris)
[22:45:52] (CR) Ottomata: [C: 2 V: 2] Log cron job output to files for geowiki's process_data step [operations/puppet] - https://gerrit.wikimedia.org/r/107923 (owner:
QChris)
[23:02:07] cmjohnson1: do you remember (i don't remember if there was a ticket) if you got the zayo cross connect plugged into cr1-eqiad ?
[23:02:11] i'm looking to see if there's a ticket
[23:02:19] there was and he did
[23:02:27] I remember :)
[23:02:33] oh you did
[23:02:34] hehe
[23:02:36] woot
[23:02:43] yay people with memories
[23:02:46] yes!...thx paravoid
[23:03:07] so status of giglinx/zayo/abovenet is ulsfo has light, eqiad does not
[23:03:11] i'm going to put an ip on it now
[23:04:56] RECOVERY - Disk space on analytics1012 is OK: DISK OK
[23:05:06] RECOVERY - RAID on analytics1012 is OK: OK: no disks configured for RAID
[23:05:06] RECOVERY - Host analytics1012 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[23:05:16] RECOVERY - puppet disabled on analytics1012 is OK: OK
[23:05:36] RECOVERY - SSH on analytics1012 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0)
[23:05:36] RECOVERY - DPKG on analytics1012 is OK: All packages OK
[23:06:07] awwwww analytics1012 is waking up!
[23:06:09] thanks cmjohnson1!
[23:06:19] woot
[23:06:42] i didn't have to reinstall, btw
[23:06:47] just change the ip to the new one for the rack it is in
[23:06:53] i'm wiping the hdfs data dirs and starting over there
[23:06:59] but system is still as was before
[23:08:40] ottomata: great news
[23:09:59] (PS1) Jgreen: redo install layout for legacy paypal IPN listener [operations/puppet] - https://gerrit.wikimedia.org/r/107971
[23:11:45] (CR) Jgreen: [C: 2 V: 1] redo install layout for legacy paypal IPN listener [operations/puppet] - https://gerrit.wikimedia.org/r/107971 (owner: Jgreen)
[23:16:43] (PS1) Jgreen: fix duplicate file install error [operations/puppet] - https://gerrit.wikimedia.org/r/107978
[23:17:52] out for the eve, laters all!
[23:18:58] (CR) Jgreen: [C: 2 V: 1] fix duplicate file install error [operations/puppet] - https://gerrit.wikimedia.org/r/107978 (owner: Jgreen)
[23:18:58] cmjohnson1: we tried rolling, didn't we ?
[23:19:18] i'll put in a ticket to try again tomorrow, if you'll be in dc
[23:27:48] (PS2) Ori.livneh: kibana: Restrict URLs proxied without authentication [operations/puppet] - https://gerrit.wikimedia.org/r/107639 (owner: BryanDavis)
[23:28:16] (CR) Ori.livneh: [C: 2 V: 2] kibana: Restrict URLs proxied without authentication [operations/puppet] - https://gerrit.wikimedia.org/r/107639 (owner: BryanDavis)
[23:28:18] (PS1) Lcarr: adding in rdns for new link between eqiad and ulsfo [operations/dns] - https://gerrit.wikimedia.org/r/107980
[23:29:08] paravoid: are you still around, by any chance? if so, thoughts re: https://gerrit.wikimedia.org/r/#/c/107609/ ?
[23:29:29] I am, give me a sec
[23:29:48] no problem, thanks.
[23:30:45] (PS2) Ori.livneh: Logstash: Configure Elasticsearch to automatically create indices [operations/puppet] - https://gerrit.wikimedia.org/r/107716 (owner: BryanDavis)
[23:30:50] (CR) Ori.livneh: [C: 2 V: 2] Logstash: Configure Elasticsearch to automatically create indices [operations/puppet] - https://gerrit.wikimedia.org/r/107716 (owner: BryanDavis)
[23:31:35] (PS1) Faidon Liambotis: varnish: add Ganglia metrics for memory statistics [operations/puppet] - https://gerrit.wikimedia.org/r/107982
[23:31:48] (CR) Lcarr: [C: 2] adding in rdns for new link between eqiad and ulsfo [operations/dns] - https://gerrit.wikimedia.org/r/107980 (owner: Lcarr)
[23:35:48] (CR) Faidon Liambotis: [C: 2] varnish: add Ganglia metrics for memory statistics [operations/puppet] - https://gerrit.wikimedia.org/r/107982 (owner: Faidon Liambotis)
[23:38:36] RECOVERY - Varnishkafka Delivery Errors on cp3014 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[23:43:20] come back ganglia!
[23:43:25] I love you
[23:45:35] that would be my change + a ganglia/rrd bug, sorry about that
[23:47:34] (PS1) Faidon Liambotis: Revert "varnish: add Ganglia metrics for memory statistics" [operations/puppet] - https://gerrit.wikimedia.org/r/107985
[23:47:39] (CR) Faidon Liambotis: [C: 2] Revert "varnish: add Ganglia metrics for memory statistics" [operations/puppet] - https://gerrit.wikimedia.org/r/107985 (owner: Faidon Liambotis)
[23:48:11] (CR) Faidon Liambotis: [V: 2] Revert "varnish: add Ganglia metrics for memory statistics" [operations/puppet] - https://gerrit.wikimedia.org/r/107985 (owner: Faidon Liambotis)
[23:49:06] (PS1) Jgreen: a little file ownership and formatting cleanup in fundraising.pp [operations/puppet] - https://gerrit.wikimedia.org/r/107986
[23:49:47] ^d: can you bother someone about ganglia that is online this time of day?
[23:50:08] <^d> Oh snap, that sucks.
[23:50:10] see above
[23:50:11] it's being fixed
[23:50:13] I'm on it
[23:50:18] <^d> Ah, ok :)
[23:50:18] (CR) Jgreen: [C: 2 V: 1] a little file ownership and formatting cleanup in fundraising.pp [operations/puppet] - https://gerrit.wikimedia.org/r/107986 (owner: Jgreen)
[23:50:31] * ^d was in the middle of a dozen other things, hadn't been reading channel
[23:52:51] thanks!