[00:07:19] ori_: are the tmh boxes much better than regular mw1001-mw1016 boxen?
[00:07:47] * AaronSchulz wonders if we really need that separation
[00:13:37] AaronSchulz: I think you could glean that from Ganglia, no?
[00:13:47] or by sshing in?
[00:14:00] if you hit a blocker I can look too
[00:14:15] I don't know off the top of my head, if that's what you're asking
[00:20:00] I see 16G vs 12G ram, 8 vs 12 cores, and 900G vs 200G disk (all tmh vs mw1001)
[00:20:22] I guess the question is...do they make use of all that disk
[00:21:48] hey ^demon|away, I don't want to wreck your weekend, but I was watching the logs due to my oauth deploy, and noticed a bunch of Fatal error: Class 'Elastica\Exception\ElasticsearchException' not found at /usr/local/apache/common-local/php-1.24wmf10/extensions/Elastica/Elastica/lib/Elastica/Exception/ResponseException.php on line 74
[00:22:19] looks like they run 5 transcode job procs max (according to puppet)
[00:23:07] ori_: seems like you could easily just fold tmh boxen into two more job runners and give 1 proc to all 18 (vs 10 procs we have right now)
[00:23:24] * AaronSchulz likes simplicity :)
[00:23:58] but maybe there is some special reason for separating them I'm not seeing
[00:25:38] AaronSchulz: probably not
[00:26:24] ideally there'd be no differentiation between job runners, not just in terms of hardware specs, but in terms of configuration (job queues) too
[00:26:49] yeah, the cpu difference might be annoying, since the etc file might need to be templated to vary on cpu count
[00:27:05] that might be a reason to just use mw10** (maybe get more)
[00:30:17] i think stepping back and thinking about what we can do on the software level to distribute the workload
[00:30:21] could be productive
[00:32:55] different scheduling algorithms, having some unit of computing resource like amazon's ECU
[00:35:23] there's so much industry hype about that now, there must be some good papers to read or something
[00:36:27] AaronSchulz: doesn't the toolserver use some grid engine thing?
[00:36:34] maybe Coren has some ideas?
[00:37:28] * AaronSchulz knows little of ts
[00:37:37] ori_: yes, it uses SGE
[00:39:10] https://en.wikipedia.org/wiki/Sun_Grid_Engine
[00:41:29] AaronSchulz: https://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management looks interesting too
[00:41:37] "SLURM uses a best fit algorithm based on Hilbert curve scheduling or fat tree network topology in order to optimize locality of task assignments on parallel computers."
[00:41:42] * ori_ knows what some of those words mean
[00:42:07] like "computers" and "uses" ;)
[00:42:57] what's interesting is:
[00:43:03] 'Job profiling (periodic sampling of each tasks CPU use, memory use, power consumption, network and file system use) '
[00:43:13] 'Fair-share scheduling with hierarchical bank accounts '
[00:43:20] 'Integrated with database for accounting and configuration '
[00:43:21] etc.
[00:43:44] 'Real-time accounting down to the task level (identify specific tasks with high CPU or memory usage) '
[00:43:55] and it's GPL!
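The thread above (fold the tmh transcoders into the general job runner pool and weight job types by some abstract compute unit, like Amazon's ECU) boils down to a small scheduling idea. A minimal sketch to make it concrete; the job types, weights and runner capacities below are made up for illustration, not real WMF figures:

    # Hypothetical per-job-type cost in abstract "compute units".
    JOB_COST = {
        "refreshLinks": 1,
        "htmlCacheUpdate": 1,
        "webVideoTranscode": 8,   # transcodes are far heavier than ordinary jobs
    }

    def pick_runner(runners, job_type):
        """Greedy best-fit: hand the job to the runner with the most free capacity."""
        cost = JOB_COST.get(job_type, 1)
        best = max(runners, key=lambda r: r["capacity"] - r["load"])
        if best["capacity"] - best["load"] < cost:
            return None           # nothing has room; leave the job queued
        best["load"] += cost
        return best["name"]

    runners = [
        {"name": "mw1001", "capacity": 12, "load": 0},
        {"name": "tmh1001", "capacity": 16, "load": 0},
    ]
    print(pick_runner(runners, "webVideoTranscode"))   # -> tmh1001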
[00:45:50] it's running on 5 of the top 10 computers in the TOP500
[00:46:14] anyways a comprehensive solution like that may have been overkill ten years ago but i don't think it would be now
[02:16:41] !log LocalisationUpdate completed (1.24wmf10) at 2014-06-28 02:15:38+00:00
[02:16:50] Logged the message, Master
[02:25:16] !log LocalisationUpdate completed (1.24wmf11) at 2014-06-28 02:24:12+00:00
[02:25:21] Logged the message, Master
[02:42:02] (03PS1) 10Yurik: Fix LABS url for Zero portal. Ready to commit any time. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142764
[02:43:42] greg-g, would it be ok to +2 it now? Should be prod noop
[02:45:36] MaxSem, if you are around ^
[02:45:49] yup
[02:46:09] MaxSem, need a second pair of eyes on that patch ^
[02:46:25] fixing labs
[02:46:43] set it in mobile-labs.php?
[02:47:11] MaxSem, i would have to duplicate the whole section and somehow prevent it from executing in the prod mobile
[02:47:56] !log LocalisationUpdate ResourceLoader cache refresh completed at Sat Jun 28 02:46:49 UTC 2014 (duration 46m 48s)
[02:48:01] Logged the message, Master
[02:48:03] why?
[02:48:11] MaxSem, ?
[02:49:12] "duplicate the whole section" and "somehow prevent it from executing in the prod mobile"
[02:49:58] $wgJsonConfigs['JsonZeroConfig']['remote']['url'] = ...
[02:54:49] (03PS1) 10Yurik: Fix Labs URL for ZeroBanner. Prod noop. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142765
[02:56:23] (03CR) 10MaxSem: [C: 032] Fix Labs URL for ZeroBanner. Prod noop. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142765 (owner: 10Yurik)
[02:56:29] (03Merged) 10jenkins-bot: Fix Labs URL for ZeroBanner. Prod noop. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142765 (owner: 10Yurik)
[02:57:13] MaxSem, thx
[02:57:52] (03Abandoned) 10MaxSem: Fix LABS url for Zero portal. Ready to commit any time. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142764 (owner: 10Yurik)
[03:19:27] PROBLEM - MySQL InnoDB on es1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:20:17] RECOVERY - MySQL InnoDB on es1001 is OK: OK longest blocking idle transaction sleeps for 0 seconds
[03:53:27] PROBLEM - Puppet freshness on db1006 is CRITICAL: Last successful Puppet run was Sat 28 Jun 2014 01:53:10 UTC
[04:12:58] RECOVERY - Puppet freshness on db1006 is OK: puppet ran at Sat Jun 28 04:12:50 UTC 2014
[04:23:49] ori_: Gridengine is good at what it does, but does not have dynamic /re/scheduling of tasks according to load; so it's better if you have lots of jobs with roughly the same use pattern or you can predict the load they will generate in advance.
[04:38:38] Coren: it may be a good fit, then. we can (and do) predict the resource cost of different job types. the queue is currently manually distributed across different jobrunner classes based on these predictions.
[04:39:09] That sounds like exactly what gridengine does then.
[04:39:56] It has "consumable resources" you can assign to jobs, and they will be distributed according to availability.
[04:40:38] And it's easy to add or remove nodes from the different queues.
[04:40:38] does it work well when most of the jobs are fairly brief?
[04:41:01] many are under a second, in fact. (but some, like video transcoding, take much longer.)
[04:41:30] ori_: If you can live with ~10s median delay for scheduling. Worst case is, IIRC, 30s; but it's all very parallelized so while that causes lag it doesn't consume significant resources.
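Coren's "consumable resources" above map to gridengine complexes: a job requests them at submission time with qsub -l, and the scheduler only places it where the request fits. A rough sketch of what submitting one heavy job that way could look like from Python; the queue name, script path and resource values are made up, only the qsub options themselves are standard SGE:

    import subprocess

    # Illustrative only: submit a hypothetical transcode command to a hypothetical
    # "transcode" queue, requesting memory and runtime via standard SGE complexes.
    subprocess.check_call([
        "qsub",
        "-b", "y",               # run a binary/command rather than a job script
        "-q", "transcode",       # hypothetical queue name
        "-l", "h_vmem=2G",       # per-job memory limit (a consumable resource)
        "-l", "h_rt=02:00:00",   # hard runtime limit
        "/usr/local/bin/transcode-one.sh", "File:Example.webm",
    ])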
[04:41:59] * ori_ sighs
[04:42:09] we'd be doing very well if we had a 10s median delay :(
[04:42:26] editors are accustomed to much worse i think
[04:43:03] It's also possible to tweak the scheduling interval but I know there are some caveats when you do that and you quickly end up with diminishing returns.
[04:43:21] how would mediawiki enqueue jobs? is there a standard way to do that via the network?
[04:43:38] i suppose i should RTM
[04:44:01] ori_: It's fairly easy with a command-line tool; and I _think_ there is an API if you want to do it by hand.
[04:45:15] Yeah, it speaks DRMAA
[04:46:11] https://en.wikipedia.org/wiki/DRMAA
[04:46:23] Which has the advantage that it's not tied to a specific cluster/grid system.
[04:47:27] Do you think it worthwhile to set up a labs cluster for experimenting?
[04:48:50] Right now tool labs' is poorly puppetized; the packages are in place but much of gridengine's configuration is runtime; I keep meaning to spend the dev effort to create a puppet module that does the whole thing right -- having a second user would make the exercise much more worthwhile.
[04:49:29] It's old Sun software. Its configuration is... baroque. :-)
[04:50:32] Part of the issue is that, without puppet resource collection, it's tricky to do right -- node configuration needs to be done from an administrative box and not on the nodes proper.
[04:52:02] i think it'd be very worthwhile, yeah
[04:52:09] i bet aaron would be pretty excited too
[04:53:18] the DRMAA article manages to say almost nothing concrete about the protocol, heh
[04:54:42] the master daemon speaks with individual execution agents via a network link but i'm not finding a lot about that protocol either
[04:56:45] i'm a hipster, i want something i can cURL :P
[05:04:53] over on #wikimedia-dev bd808 and legoktm are battling with another piece of sun technology :D
[05:05:13] RIP :(
[05:05:20] what happened?
[05:05:30] sun + bsd + linux all in one bundle!
[05:07:06] i'm on OS X and it works for me
[05:07:26] That's even weirder
[05:07:40] Are you 10.8 or 10.9?
[05:07:54] 10.9.3
[05:08:19] And I'm 10.8.5
[05:08:38] that's the whole:
[05:08:44] config.nfs.map_uid = Process.uid
[05:08:44] config.nfs.map_gid = Process.gid
[05:09:10] i'm on 1.9 but it worked for me pre-upgrade too
[05:09:37] I'm using vagrant 1.4.3 if it matters.
[05:09:38] Yeah, I see the -mapall setting in my /etc/exports file
[05:09:59] 1.4.3 is pretty ancient
[05:10:13] But for some random reason it seems to blow up for me just like it does for legoktm
[05:12:17] what version are you on, bd808?
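For the "is there an API" question in the gridengine discussion above: the DRMAA interface Coren mentions has Python bindings (the python-drmaa package), and submitting a job through them looks roughly like the sketch below. The command, arguments and resource string are hypothetical; only the session and job-template calls are the actual drmaa-python API.

    import drmaa   # python-drmaa; talks to whatever DRM (e.g. gridengine) is configured

    s = drmaa.Session()
    s.initialize()
    jt = s.createJobTemplate()
    jt.remoteCommand = "/usr/bin/php"        # hypothetical job command
    jt.args = ["maintenance/runJobs.php", "--type=webVideoTranscode", "--maxjobs=1"]
    jt.nativeSpecification = "-l h_vmem=2G"  # resource request passed through to the DRM
    job_id = s.runJob(jt)
    print("submitted job", job_id)
    s.deleteJobTemplate(jt)
    s.exit()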
[05:12:42] ori: 1.6.3
[05:13:20] oh, no, i lied
[05:13:32] i meant that nfs worked in general, but i just enabled the centralauth role and it failed
[05:13:38] :/
[06:13:37] PROBLEM - Puppet freshness on db1006 is CRITICAL: Last successful Puppet run was Sat 28 Jun 2014 04:12:50 UTC
[07:17:55] PROBLEM - Puppet freshness on db1009 is CRITICAL: Last successful Puppet run was Sat 28 Jun 2014 05:17:04 UTC
[07:57:15] RECOVERY - Puppet freshness on db1009 is OK: puppet ran at Sat Jun 28 07:57:05 UTC 2014
[08:13:24] RECOVERY - Puppet freshness on db1006 is OK: puppet ran at Sat Jun 28 08:13:21 UTC 2014
[08:20:26] PROBLEM - Puppet freshness on mw1182 is CRITICAL: Last successful Puppet run was Sat 28 Jun 2014 08:18:01 UTC
[08:22:26] PROBLEM - Puppet freshness on mw1182 is CRITICAL: Last successful Puppet run was Sat 28 Jun 2014 08:18:01 UTC
[08:24:26] PROBLEM - Puppet freshness on mw1182 is CRITICAL: Last successful Puppet run was Sat 28 Jun 2014 08:18:01 UTC
[08:26:26] PROBLEM - Puppet freshness on mw1182 is CRITICAL: Last successful Puppet run was Sat 28 Jun 2014 08:18:01 UTC
[08:28:26] PROBLEM - Puppet freshness on mw1182 is CRITICAL: Last successful Puppet run was Sat 28 Jun 2014 08:18:01 UTC
[08:30:26] PROBLEM - Puppet freshness on mw1182 is CRITICAL: Last successful Puppet run was Sat 28 Jun 2014 08:18:01 UTC
[08:32:26] PROBLEM - Puppet freshness on mw1182 is CRITICAL: Last successful Puppet run was Sat 28 Jun 2014 08:18:01 UTC
[08:34:26] PROBLEM - Puppet freshness on mw1182 is CRITICAL: Last successful Puppet run was Sat 28 Jun 2014 08:18:01 UTC
[08:36:26] PROBLEM - Puppet freshness on mw1182 is CRITICAL: Last successful Puppet run was Sat 28 Jun 2014 08:18:01 UTC
[08:38:26] PROBLEM - Puppet freshness on mw1182 is CRITICAL: Last successful Puppet run was Sat 28 Jun 2014 08:18:01 UTC
[08:38:36] RECOVERY - Puppet freshness on mw1182 is OK: puppet ran at Sat Jun 28 08:38:34 UTC 2014
[09:27:23] (03PS1) 10Yuvipanda: toollabs: Remove superfluous setting in redis monitoring [operations/puppet] - 10https://gerrit.wikimedia.org/r/142790
[09:36:01] (03PS1) 10Yuvipanda: toollabs: Collect active users metric for bastion hosts [operations/puppet] - 10https://gerrit.wikimedia.org/r/142792
[09:40:20] (03PS1) 10Yuvipanda: toollabs: Collect NFS Mount stats from all toollabs nodes [operations/puppet] - 10https://gerrit.wikimedia.org/r/142793
[10:10:29] PROBLEM - puppetmaster backend https on palladium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:11:29] RECOVERY - puppetmaster backend https on palladium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 3.005 second response time
[10:12:39] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out
[10:18:19] PROBLEM - Lucene on search1015 is CRITICAL: Connection timed out
[10:21:42] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 3.002 second response time on port 8123
[10:24:42] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out
[10:27:42] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 3.002 second response time on port 8123
[10:30:02] RECOVERY - Lucene on search1015 is OK: TCP OK - 0.000 second response time on port 8123
[10:30:42] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out
[10:38:42] <_joe_> no one apart me around?
[10:38:47] <_joe_> :(
[10:40:08] apparently not, no
[10:40:56] <_joe_> !log restarting lucene on search1015, stuck. again.
[10:41:01] Logged the message, Master
[10:41:04] PROBLEM - Puppet freshness on db1007 is CRITICAL: Last successful Puppet run was Sat 28 Jun 2014 08:40:04 UTC
[10:41:14] <_joe_> in about ~ 30 mins it will be ok
[10:41:34] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on port 8123
[10:41:44] <_joe_> hoo: I'm at the italian hackmeeting this weekend, I really hoped someone else would show up :)
[10:42:08] <_joe_> and I have to get off to a meeting
[10:42:54] a hackmeeting with you as the only participant? sounds fun.
[10:43:15] <_joe_> twkozlowski: lol
[10:43:23] <_joe_> as in show up here :)
[10:43:46] <_joe_> and that sounded wrong again
[10:45:06] hey
[10:45:10] I got pages
[10:45:20] _joe_: lucene stuck again ?
[10:45:35] <_joe_> akosiaris: yep
[10:45:57] reading backlog
[10:46:11] <_joe_> I'm off sorry
[10:47:07] ok. I assume we are ok. Will be near a PC should anything else show up
[10:48:01] <_joe_> ok
[10:51:11] ACKNOWLEDGEMENT - puppet last run on dobson is CRITICAL: Connection refused by host alexandros kosiaris hardy machine. To be decom soon
[10:52:05] ACKNOWLEDGEMENT - puppet last run on pdf2 is CRITICAL: Connection refused by host alexandros kosiaris hardy machine. To be decom soon
[11:00:36] RECOVERY - Puppet freshness on db1007 is OK: puppet ran at Sat Jun 28 11:00:32 UTC 2014
[11:18:46] PROBLEM - Puppet freshness on db1009 is CRITICAL: Last successful Puppet run was Sat 28 Jun 2014 09:18:07 UTC
[12:58:06] !log Cirrus reindex status: enwiki has almost finished its in place reindex, alphabetical wikipedias are at frwiki, all group1 wikis have finished their in place reindex. all group1 wikis are running from mediawiki reindex. itwiki and cawiki both finished both the in place and from mediawiki reindex. Haven't started alphabetical from mediawiki reindex yet for wikipedias. that is the only
[12:58:08] thing left to start.
[12:58:12] Logged the message, Master
[13:00:45] PROBLEM - Puppet freshness on db1007 is CRITICAL: Last successful Puppet run was Sat 28 Jun 2014 11:00:32 UTC
[13:19:54] PROBLEM - Puppet freshness on db1009 is CRITICAL: Last successful Puppet run was Sat 28 Jun 2014 09:18:07 UTC
[13:57:12] RECOVERY - Puppet freshness on db1009 is OK: puppet ran at Sat Jun 28 13:57:08 UTC 2014
[13:59:58] RECOVERY - Puppet freshness on db1007 is OK: puppet ran at Sat Jun 28 13:59:55 UTC 2014
[15:26:30] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data exceeded the critical threshold [500.0]
[15:39:25] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% data above the threshold [250.0]
[17:16:39] !log restarted lucene on search1016 per _joe_
[17:16:44] Logged the message, Master
[17:17:41] PROBLEM - SSH on lvs1002 is CRITICAL: Server answer:
[17:18:41] RECOVERY - SSH on lvs1002 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0)
[18:13:20] (03PS2) 10Yuvipanda: dynamicproxy: Enable diamond collector for nginx [operations/puppet] - 10https://gerrit.wikimedia.org/r/142732
[18:16:31] (03PS1) 10Yuvipanda: dynamicproxy: Send proxy redis stats to graphite as well [operations/puppet] - 10https://gerrit.wikimedia.org/r/142812
[18:17:07] (03CR) 10Yuvipanda: [C: 031] "Tested" [operations/puppet] - 10https://gerrit.wikimedia.org/r/142631 (owner: 10Yuvipanda)
[18:19:32] (03CR) 10coren: [C: 032] "Doesn't look like monitoring to me. :-)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/142631 (owner: 10Yuvipanda)
[18:20:15] (03CR) 10coren: [C: 032] "Seems sane." [operations/puppet] - 10https://gerrit.wikimedia.org/r/142732 (owner: 10Yuvipanda)
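The diamond collector changes being merged here (nginx, redis, NFS mounts, active users) are small Python classes that the Diamond daemon loads and runs periodically; each one computes a value and publishes it, and Diamond ships it to Graphite under the host's metric path. A minimal sketch of what such a collector looks like, illustrative rather than the actual code in those patches:

    import subprocess
    import diamond.collector

    class ActiveUsersCollector(diamond.collector.Collector):
        """Toy example of a custom Diamond collector: publish the number of
        distinct logged-in users, roughly what a bastion 'active users'
        metric might track."""

        def collect(self):
            out = subprocess.check_output(["who"]).decode("utf-8", "replace")
            users = {line.split()[0] for line in out.splitlines() if line.strip()}
            self.publish("active_users", len(users))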
[18:20:49] (03CR) 10coren: [C: 032] "Straightforward enough." [operations/puppet] - 10https://gerrit.wikimedia.org/r/142812 (owner: 10Yuvipanda)
[18:21:50] (03CR) 10coren: [C: 032] "Should be a noop." [operations/puppet] - 10https://gerrit.wikimedia.org/r/142790 (owner: 10Yuvipanda)
[18:22:43] (03CR) 10coren: [C: 032] "Not sure how actually useful that metric may be, but it can't harm." [operations/puppet] - 10https://gerrit.wikimedia.org/r/142792 (owner: 10Yuvipanda)
[18:23:13] (03CR) 10coren: [C: 032] "Sane." [operations/puppet] - 10https://gerrit.wikimedia.org/r/142793 (owner: 10Yuvipanda)
[18:24:17] Oh, hm. The end result doesn't seem to be mergable.
[18:24:28] * Coren tries to figure out why.
[18:25:10] oh?
[18:25:43] Oh, I seem to have tried to puppet-merge while one of the patches was halfway merged. Worked the second time.
[18:25:49] ;D cool!
[18:41:26] (03PS1) 10Yuvipanda: dynamicproxy: Fix puppet include for nginx monitoring [operations/puppet] - 10https://gerrit.wikimedia.org/r/142818
[18:41:41] Coren: ^
[18:43:13] (03PS1) 10Yuvipanda: toollabs: Remove libvips-dev from dev_environment [operations/puppet] - 10https://gerrit.wikimedia.org/r/142819
[18:44:54] (03PS1) 10Yuvipanda: toollabs: Fix path to pastebinit.conf config [operations/puppet] - 10https://gerrit.wikimedia.org/r/142820
[18:45:03] Coren: there were puppet failures for a long time on -login as well, ^ has two simple fixes
[18:45:31] Yeah, I'm on the other issue atm. Hang on.
[18:45:36] Coren: yeah, ok. ty!
[18:45:46] Coren: I'm going to look at other types of nodes and see if there are puppet failures
[18:45:55] Ah, I see why the ec2id isn't set -- that's not the previous cause.
[18:46:15] For some reason, 'hostname -d' doesn't return 'eqiad.wmflabs' as it should on -webproxy
[18:46:23] oh wow. I see.
[18:47:34] So ec2id.rb facter fails.
[18:48:59] I can't see why the box ended up in that state, but setting the fqdn fixed it.
[18:49:08] And now it errors out but it's your fault. :-)
[18:49:20] Error 400 on SERVER: Could not find class diamond::collector::nginx for i-000000e6.eqiad.wmflabs on node i-000000e6.eqiad.wmflabs
[18:49:26] Coren: yeah, I've a patch for that :)
[18:49:32] Coren: https://gerrit.wikimedia.org/r/#/c/142818/
[18:50:24] (03CR) 10coren: [C: 032] "Seems to be sane-ish?" [operations/puppet] - 10https://gerrit.wikimedia.org/r/142818 (owner: 10Yuvipanda)
[18:51:35] Coren: :D ty. two more for fixes to -login and -dev
[18:52:07] Want to throw in a fix for Error 400 on SERVER: Duplicate declaration: Package[python-redis] is already declared in file /etc/puppet/modules/dynamicproxy/manifests/init.pp:90; cannot redeclare at /etc/puppet/modules/toollabs/manifests/proxy.pp:18 on node i-000000e6.eqiad.wmflabs
[18:52:37] that's weird. let me check
[18:53:51] (03CR) 10coren: [C: 032] toollabs: Fix path to pastebinit.conf config [operations/puppet] - 10https://gerrit.wikimedia.org/r/142820 (owner: 10Yuvipanda)
[18:54:16] (03PS1) 10Yuvipanda: toollabs: Fix package conflict for python-redis [operations/puppet] - 10https://gerrit.wikimedia.org/r/142821
[18:54:19] Yeah, the libvips-dev one I don't want to just remove.
[18:54:33] Coren: fixed ^
[18:54:40] Coren: why?
[18:54:52] Coren: let me take it out of the patch series then.
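On the "look at other types of nodes and see if there are puppet failures" front (and the "puppet error checks on tools" wish that comes up a bit later): one cheap check is to read the agent's last_run_summary.yaml and look at the failure counters. A sketch, assuming the standard agent state path and summary keys; the exact layout varies a little between puppet versions:

    import sys
    import yaml   # PyYAML

    SUMMARY = "/var/lib/puppet/state/last_run_summary.yaml"

    with open(SUMMARY) as f:
        summary = yaml.safe_load(f)

    # 'events' -> 'failure' counts resources that failed on the last agent run.
    failed = (summary.get("events") or {}).get("failure", 0)
    if failed:
        print("puppet: %d failed events on last run" % failed)
        sys.exit(1)
    print("puppet: ok")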
[18:55:01] (03CR) 10coren: [C: 032] toollabs: Fix package conflict for python-redis [operations/puppet] - 10https://gerrit.wikimedia.org/r/142821 (owner: 10Yuvipanda)
[18:55:24] Because it's actually needed; I need to fix the dependency issue instead.
[18:55:45] (03PS2) 10Yuvipanda: toollabs: Fix package conflict for python-redis [operations/puppet] - 10https://gerrit.wikimedia.org/r/142821
[18:55:47] (03PS2) 10Yuvipanda: toollabs: Fix path to pastebinit.conf config [operations/puppet] - 10https://gerrit.wikimedia.org/r/142820
[18:55:56] Coren: right, so rebased the other two patches
[19:00:25] Coren: welcome back :)
[19:05:15] Coren: you've to +2 those two patches again, since I rebased them (to remove dependency on the vips patch)
[19:09:43] (03CR) 10coren: [C: 032] "Is a fix" [operations/puppet] - 10https://gerrit.wikimedia.org/r/142820 (owner: 10Yuvipanda)
[19:09:54] (03CR) 10coren: "It's also a fix." [operations/puppet] - 10https://gerrit.wikimedia.org/r/142821 (owner: 10Yuvipanda)
[19:10:50] Coren: ty!
[20:02:04] any ideas why I can't find the built debs at https://launchpad.net/~nginx/+archive/development/+build/4617469
[20:02:04] ?
[20:02:06] kart_: ^
[20:24:09] (03PS1) 10Yuvipanda: Do not install debug symbols [operations/puppet/nginx] - 10https://gerrit.wikimedia.org/r/142828
[20:24:15] Coren: so... I broke puppet on -proxy still. fix ^
[20:24:28] Again? :-)
[20:24:44] Coren: yeah. sigh.
[20:24:50] Coren: this hand-rolled package is causing so much trouble.
[20:25:25] (03CR) 10coren: [C: 032] "Can't wait for Trusty." [operations/puppet/nginx] - 10https://gerrit.wikimedia.org/r/142828 (owner: 10Yuvipanda)
[20:25:49] (03PS1) 10Yuvipanda: nginx: Submodule bump [operations/puppet] - 10https://gerrit.wikimedia.org/r/142829
[20:25:51] Coren: ^ submodule bump to include that one.
[20:26:30] (03CR) 10coren: [C: 032] "At least, it's not in the night." [operations/puppet] - 10https://gerrit.wikimedia.org/r/142829 (owner: 10Yuvipanda)
[20:26:32] (03CR) 10Krinkle: "Indeed, per Tim "the mergehistory right can be given to sysops by default in the MW core and on all WMF wikis[..]"" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/141892 (owner: 10Aaron Schulz)
[20:26:36] Coren: heh :)
[20:28:37] Coren: runs fine :)
[20:28:47] Coren: need to setup puppet error checks on tools. wonder how to do that
[20:35:56] (03PS2) 10Tim Landscheidt: Tools: Fix pastebinit configuration [operations/puppet] - 10https://gerrit.wikimedia.org/r/135500
[20:36:39] (03CR) 10Tim Landscheidt: "Rebased after merge of Iaa7286d39531ccb6a8444d0d809e45b3dd55ba9d." [operations/puppet] - 10https://gerrit.wikimedia.org/r/135500 (owner: 10Tim Landscheidt)
[20:41:39] (03CR) 10Krinkle: "Hm.. it's just a gut feeling, but I'm curious whether it makes sense to use 127.0.0.1 for this as ServerName. I guess the reason you use t" [operations/puppet] - 10https://gerrit.wikimedia.org/r/142250 (owner: 10Giuseppe Lavagetto)
[20:45:04] Coren: everything seems fine now :)
[21:02:03] Coren: uhm, why is tools-webproxy's fqdn 'tools.wmflabs.org' and not tools-webproxy.eqiad.wmflabs?
[21:02:25] >>> socket.getfqdn()
[21:02:25] YuviPanda: ReferenceError: socket is not defined
[21:02:27] ... because I'm an idiot?
[21:02:30] 'tools.wmflabs.org'
[21:02:33] Coren: oh? :)
[21:03:02] Coren: hostname still returns tools-webproxy.eqiad.wmflabs, but graphite was recording tools.tools instead of tools-webproxy, so went and investigated and found this...
[21:03:19] No, I thought I had made a typo but:
[21:03:29] root@tools-webproxy:~# hostname
[21:03:29] tools-webproxy.eqiad.wmflabs
[21:04:04] Coren: yeah, so that is fine. I don't know why getfqdn returns tools.wmflabs.org though
[21:05:01] I dunno how sock.getfqdn() works. That host is also 'tools.wmflabs.org' for nginx though so perhaps that's what's confusing it?
[21:05:30] Coren: no, this is diamond. I think the issue is that there's a /etc/hosts entry for tools.wmflabs.org on tools.wmflabs.org
[21:05:48] Coren: is that needed? Nobody should be connecting to tools from tools...
[21:05:57] err, tools.wmflabs.org from tools.wmflabs.org
[21:06:16] Coren: also is /etc/hosts managed in puppet?
[21:06:21] No, it's not.
[21:06:27] I should write a role that lets anyone access the replica dbs.
[21:06:29] * YuviPanda puts it on his list
[21:06:41] Coren: any objections to me removing that /etc/hosts entry?
[21:06:57] Shouldn't be an issue. It was useful for apache, but shouldn't be needed now.
[21:07:23] Coren: cool!
[21:08:28] Coren: boom, everything seems to work fine now! :)
[21:08:35] tools-webproxy reporting as tools-webproxy now
[21:29:14] Bayes-scoring on OTRS/iodine seems to have stopped working about 3d10h ago. I dropped Jeff an email on Friday but it may not have reached him in time or he's on vacation or whatever. Maybe someone could check if there were any changes that roughly correspond to that time?
[21:31:06] Yeah, he's on vacation this weekend
[21:31:25] and possibly for a bit longer
[21:40:20] Reedy, thanks, who can look into this while he's away?
[21:41:14] pajz: I saw such a change
[21:41:20] let me look into logs
[21:42:49] pajz: https://gerrit.wikimedia.org/r/#/c/141919/ ?
[21:44:35] if this is in fact the change you are talking about you should talk to faidon irc nick paravoid
[21:46:32] matanya, provided the time is UTC, that seems to fit. To be clear, I have absolutely no idea about the underlying software, so I cannot tell you if this particular change broke something.
[21:46:54] it is UTC
[21:47:26] and it touches the bayes mail subsystem, so i guess that is the change
[21:48:31] I only see the result which is that the level of undetected spam is extraordinarily high, and that seems to be because no bayes scores get added anymore.
[21:48:50] Ok.
[21:54:23] Well then, /me pokes paravoid
[21:54:33] Hope he'll scroll back.
[21:54:56] sure, I guess he will reply on Monday, if i will see him before, i'll let him know
[22:00:34] (03CR) 10Reedy: "Seems this might have upset bayes scoring for OTRS" [operations/puppet] - 10https://gerrit.wikimedia.org/r/141919 (owner: 10Faidon Liambotis)
[22:40:42] PROBLEM - Puppet freshness on db1007 is CRITICAL: Last successful Puppet run was Sat 28 Jun 2014 20:40:07 UTC
[23:00:44] RECOVERY - Puppet freshness on db1007 is OK: puppet ran at Sat Jun 28 23:00:41 UTC 2014
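A closing note on the tools-webproxy naming puzzle from around 21:02 above: socket.getfqdn() does not just echo the kernel hostname, it resolves the name and returns the first dotted name the resolver hands back, and on a default nsswitch setup /etc/hosts wins over DNS, so a local alias line can change the answer. A small sketch; the hosts-file entry shown in the comment is hypothetical, just to show the shape of the problem:

    import socket

    # What `hostname` prints (the kernel's idea of the name):
    print(socket.gethostname())   # e.g. tools-webproxy

    # getfqdn() resolves that name (via gethostbyaddr) and takes the first
    # fully-qualified result. With an /etc/hosts entry along the lines of
    #   10.68.16.4   tools.wmflabs.org   tools-webproxy     (hypothetical)
    # the canonical name comes back as tools.wmflabs.org rather than
    # tools-webproxy.eqiad.wmflabs, which is why the metrics ended up under
    # the wrong node name until the entry was removed.
    print(socket.getfqdn())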