[00:00:54] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 3 hours [00:04:16] I'm seeing occasional 503s in one of the Parsoid backend varnishes (cp1058). We don't emit those from Parsoid, so I am wondering whether that could be a varnish or LVS issue. [00:07:44] anybody around with varnish / LVS knowledge and the rights? [00:12:20] according to https://wikitech.wikimedia.org/wiki/Parsoid, this might provide clues on parsoid.svc.eqiad.wmnet: tail -f /var/log/pybal.log | grep parsoid [00:13:00] ottomata seems to be offline [00:13:04] any roots around? [00:14:11] what's up? [00:14:56] see above [00:15:08] do you have the output of such a 503? [00:15:23] cp1058.eqiad.wmnet 701 2013-11-19T00:06:20 0.000141859 10.64.32.97 miss/503 419 GET http://parsoid/enwiki/Princess_Theatre,_Torquay?oldid=527921922 - - - 10.64.0.32, 10.64.0.32 - [00:15:51] the page itself is working: http://parsoid-lb.eqiad.wikimedia.org/enwiki/Princess_Theatre,_Torquay?oldid=527921922 [00:16:28] the load on the backends and the parsoid logs look normal [00:16:36] is it just cp1058? [00:16:40] yes, oddly [00:16:46] cp1058 backend [00:17:02] I did varnishncsa | egrep '/(4|5)' | grep -v 'miss/412' [00:17:03] 670 VCL_call c miss fetch [00:17:04] 670 FetchError c no backend connection [00:17:04] 670 VCL_call c error deliver [00:18:18] 5886 tcp connections open [00:18:54] slightly more than cp1045 [00:20:05] should be too low to be an issue in itself [00:21:02] paravoid: could you restart varnish to see if that makes a difference? [00:21:05] <^demon|busy> mwalker: I'm gonna set them up now. Is collectoid a repo as well? [00:21:18] yep [00:21:28] it'll be a submodule of Collection [00:21:45] <^demon|busy> And then the other 3 will be submodules of that? [00:21:57] yep [00:22:24] deployment is going to be a pita [00:22:43] <^demon|busy> The other option is not using submodules then ;-) [00:22:48] <^demon|busy> Then deployment isn't a pita. [00:23:08] not sure that's so true; parsoid is taking the no submodules approach [00:23:18] we'll probably end up somewhere in the middle eventually [00:23:40] we are moving towards submodules [00:23:57] gwicke: do you have a ping setup for parsoid? [00:24:07] or are you just omnicient [00:24:09] <^demon|busy> Also, as I do anytime this happens, I'm lodging my single objection towards the word -oid, as I think it's completely lame. [00:24:23] mwalker: I do have a ping, but am on this channel currently anyway [00:24:26] if you have a better name; please for the love of god use it :) [00:24:31] debugging a varnish issue [00:24:31] * ^demon|busy knows his complaints go ignored, and will just get linked to the wiktionary article on -oid, so he makes the repo anyway [00:25:06] ^demon|busy: no! I dislike collectoid; it's lame and generic [00:25:16] ori-l: quick! we need an awesome name! [00:25:27] <^demon|busy> I dislike -oid because I think -oid always makes it sound like a 2-bit android app. [00:26:25] <^demon|busy> "Let's make a sudoku app" "What shall we call it?" "Well, it's for android, let's call it sudokoid" [00:27:05] the suffix principle is a good one IMO, as the names are fairly self-explanatory [00:27:23] do you have a better suffix? [00:27:32] <^demon|busy> Yes but if you have to explain the suffix to people then you've missed the point. [00:27:51] many android apps seem to be called AndFoo [00:28:06] error 420: boring suffix [00:28:08] <^demon|busy> AndFoo or Baroid. 
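A rough Python equivalent of the egrep pipeline gwicke ran above, for turning varnishncsa output into per-status error counts rather than raw lines. It assumes the same whitespace-separated field layout as the sample request line quoted at 00:15:23, with the cache/status value such as "miss/503" in the sixth field; any other log format needs the index adjusted.

    #!/usr/bin/env python
    """Tally 4xx/5xx responses from varnishncsa-style lines on stdin,
    skipping miss/412 just like the egrep pipeline quoted above."""
    import sys
    from collections import Counter

    counts = Counter()
    for line in sys.stdin:
        fields = line.split()
        if len(fields) < 6:
            continue
        cache_status = fields[5]                 # e.g. "miss/503"
        try:
            status = int(cache_status.split("/")[-1])
        except ValueError:
            continue
        if status >= 400 and status != 412:
            counts[cache_status] += 1

    for cache_status, n in counts.most_common():
        print("%6d  %s" % (n, cache_status))

Piped from varnishncsa on cp1058, something like this would have shown the miss/503s accumulating without the surrounding request noise.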
[00:28:21] something like rashomon is less informative to most [00:28:28] do we have Foooid already? [00:28:50] I like Erik's quip 'avoid the oid' [00:29:26] the apple folks seem to use *Kit [00:29:52] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Tue Nov 19 00:29:43 UTC 2013 [00:29:53] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 3 hours [00:30:21] paravoid: ping [00:30:30] pong [00:30:31] found it [00:30:38] backend ipv4_10_2_2_28 { [00:30:42] .max_connections = 600; [00:30:42] } [00:30:52] ahh [00:31:02] root@cp1058:~# netstat -n |grep 10.2.2.28 |wc -l [00:31:02] 600 [00:31:11] ^demon|busy: I guess just create the repos as collectoid [00:31:13] root@cp1045:~# netstat -n |grep 10.2.2.28 |wc -l [00:31:13] 594 [00:31:15] it's not far off [00:31:35] ^demon|busy: I don't know how hard it is to rename things once created; but I do need the workspace [00:31:47] ^demon|busy: so the status quo shall be retained :'( [00:31:50] <^demon|busy> It's basically not worth it to ever rename anything. [00:31:51] <^demon|busy> :p [00:32:13] <^demon|busy> I shall create collectoid, but I'm going to grumble and not like it :) [00:32:20] heh [00:32:29] in the absence of fancy names, we can always go for descriptive ones :) [00:32:47] <^demon|busy> Or go for completely opaque but definitely unused. [00:33:13] mwalker, how would you describe the contents of the collectoid project? [00:33:15] yep; Jeff has named the puppet group 'offline content generator' [00:33:29] OfflineContentGenerator sounds like a totally suitable repo name to me :) [00:33:50] paravoid: can you restart the backend varnish to fix it in the short term? [00:34:01] I wonder what a good value would be [00:34:01] <^demon|busy> Or just see how many more things with can name via permutations of /[mediawk]+/ ;-) [00:34:06] 1000 sound okay? [00:34:12] paravoid: yes [00:34:13] very unscientific [00:34:51] <^demon|busy> Sorry, /[mediawkp]+/ [00:34:52] the backends take many more connections, but they shouldn't be needed in theory [00:34:54] (03PS1) 10Faidon Liambotis: varnish: bump parsoid's max_connections to 1000 [operations/puppet] - 10https://gerrit.wikimedia.org/r/96176 [00:35:19] (03CR) 10GWicke: [C: 031] varnish: bump parsoid's max_connections to 1000 [operations/puppet] - 10https://gerrit.wikimedia.org/r/96176 (owner: 10Faidon Liambotis) [00:35:20] ^demon|busy: heh; ok, per Eloquence: /mediawiki/extension/Collection/OfflineContentGenerator/* [00:35:30] (03CR) 10Faidon Liambotis: [C: 032] varnish: bump parsoid's max_connections to 1000 [operations/puppet] - 10https://gerrit.wikimedia.org/r/96176 (owner: 10Faidon Liambotis) [00:35:36] and thank god for tab complete [00:35:36] (03CR) 10Faidon Liambotis: [V: 032] varnish: bump parsoid's max_connections to 1000 [operations/puppet] - 10https://gerrit.wikimedia.org/r/96176 (owner: 10Faidon Liambotis) [00:36:10] (03PS1) 10Dzahn: add ferm rule to only allow nrpe/5666 from intern [operations/puppet] - 10https://gerrit.wikimedia.org/r/96177 [00:38:24] (03PS2) 10Dzahn: add ferm rule to only allow nrpe/5666 from intern [operations/puppet] - 10https://gerrit.wikimedia.org/r/96177 [00:39:29] gwicke: it's fixed [00:39:49] 619 connections open now, no TxStatus:503 that I can see of [00:40:16] paravoid: thanks! [00:40:29] <^demon|busy> mwalker: You're ready for cloning. 
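In other words: the backend stanza capped varnish at 600 outbound connections to the parsoid service IP, netstat showed exactly 600 sockets open, and every further fetch failed with "no backend connection", which varnish surfaced as the 503s seen on cp1058. A minimal sketch of the same check in Python, with the 10.2.2.28 service IP and the 600/1000 limits taken from the log above (illustrative only, not how the check was actually run):

    #!/usr/bin/env python
    """Count open connections to the parsoid LVS service IP and compare
    against the max_connections ceiling from the VCL backend stanza."""
    import subprocess

    SERVICE_IP = "10.2.2.28"
    MAX_CONNECTIONS = 600        # raised to 1000 in Gerrit change 96176

    netstat = subprocess.check_output(["netstat", "-n"]).decode("utf-8", "replace")
    open_conns = sum(1 for line in netstat.splitlines() if SERVICE_IP in line)

    print("%d connections to %s (ceiling %d)" % (open_conns, SERVICE_IP, MAX_CONNECTIONS))
    if open_conns >= MAX_CONNECTIONS:
        print("at the ceiling: new fetches will fail with 'no backend connection'")

Restarting the backend varnish or raising the ceiling both clear the symptom; the bump to 1000 merged above does the latter.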
[00:40:38] am still wondering if all those connections are actually active, or if many are actually just lingering [00:40:46] ^demon|busy: thanks kindly [00:40:49] <^demon|busy> yw [00:41:30] <^demon|busy> That actually sounds kind of creepy. Step into your cloning device, you're ready for cloning. [00:46:47] sorta sounds like the transmorgifier from calvin and hobbes [00:59:53] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Tue Nov 19 00:59:51 UTC 2013 [01:00:52] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 3 hours [01:04:52] ^demon|busy: we're going to stage gerrit development in a branch so ignore the .gitreview file but... is this the correct way we want to add submodules to extensions? e.g. doing it this way isn't going to cause me trouble later on? https://gerrit.wikimedia.org/r/#/c/96181/ [01:06:47] mwalker: That should be fine [01:07:10] ok; so git isn't going to give me crap about having .gitmodules files not in the root [01:07:11] We've been doing a recursive submodule checkout for deployment for quite a while now [01:07:48] not in the root? [01:08:06] not in mediawiki/core or something [01:08:28] shouldn't do... [01:08:39] git submodule update --init --recursive extensions/Collection [01:08:42] *thumbs up* -- that jives with what I just experimented with [01:09:02] so at least two people think it behaves rationally [01:23:55] (03PS3) 10Dzahn: add ferm rule to only allow nrpe/5666 from intern [operations/puppet] - 10https://gerrit.wikimedia.org/r/96177 [01:24:57] mutante: thanks :-] [01:25:31] (03PS4) 10Dzahn: add ferm rule to only allow nrpe/5666 from intern [operations/puppet] - 10https://gerrit.wikimedia.org/r/96177 [01:25:45] arr, that also needs retabbing, it never stops [01:26:30] (03PS1) 10Dzahn: retab role/gitblit to spaces [operations/puppet] - 10https://gerrit.wikimedia.org/r/96186 [01:27:06] whenever tab got killed off puppet manifest, we can add a step in the lint check to bail out whenever a .pp contains a leading tab [01:30:01] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Tue Nov 19 01:29:55 UTC 2013 [01:30:51] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 3 hours [01:38:46] (03PS1) 10Reedy: Tabs to spaces [operations/dns] - 10https://gerrit.wikimedia.org/r/96187 [01:38:48] (03CR) 10jenkins-bot: [V: 04-1] Tabs to spaces [operations/dns] - 10https://gerrit.wikimedia.org/r/96187 (owner: 10Reedy) [01:39:05] (03Abandoned) 10Reedy: Tabs to spaces [operations/dns] - 10https://gerrit.wikimedia.org/r/96187 (owner: 10Reedy) [01:40:08] (03PS1) 10Reedy: Tabs to spaces [operations/dns] - 10https://gerrit.wikimedia.org/r/96188 [01:40:22] all of the merge conflicts [01:40:42] you don't have to abandon to rebase [01:40:47] just use the same change-id [01:40:54] even if you start from scratch [01:41:01] Yeah [01:41:05] Mostly laziness [01:41:22] Though it's probably more steps to abandon and do again... [01:46:53] hashar: lol, 2 people, same thought https://wikitech.wikimedia.org/wiki/Talk:Puppet [01:48:23] mutante: very easily fixed... [01:50:14] Reedy: yea. https://wikitech.wikimedia.org/wiki/Puppet_coding#tab_character_found_on_line_.. 
but we don't want huge "retab them all" patches either [01:50:25] awwwwww [02:00:00] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Tue Nov 19 01:59:51 UTC 2013 [02:00:42] (03PS1) 10Dzahn: retab misc/pdf and role/pdf [operations/puppet] - 10https://gerrit.wikimedia.org/r/96190 [02:00:50] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 3 hours [02:01:28] (03CR) 10Dzahn: "btw, did the pdf sprint come up with things to add here?" [operations/puppet] - 10https://gerrit.wikimedia.org/r/96190 (owner: 10Dzahn) [02:04:53] (03PS1) 10Springle: try to reduce "no working slave" notice flood in pmtpa dberror.log [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96191 [02:05:33] (03CR) 10Springle: [C: 032] try to reduce "no working slave" notice flood in pmtpa dberror.log [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96191 (owner: 10Springle) [02:06:31] !log springle synchronized wmf-config/db-pmtpa.php [02:06:48] Logged the message, Master [02:10:34] !log LocalisationUpdate completed (1.23wmf3) at Tue Nov 19 02:10:34 UTC 2013 [02:10:47] Logged the message, Master [02:10:54] (03PS1) 10Springle: try to reduce "no working slave" notice flood in pmtpa dberror.log (attempt 2) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96192 [02:11:28] (03CR) 10Springle: [C: 032] try to reduce "no working slave" notice flood in pmtpa dberror.log (attempt 2) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96192 (owner: 10Springle) [02:12:11] !log springle synchronized wmf-config/db-pmtpa.php [02:12:23] Logged the message, Master [02:19:12] !log LocalisationUpdate completed (1.23wmf4) at Tue Nov 19 02:19:11 UTC 2013 [02:19:24] Logged the message, Master [02:29:54] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Tue Nov 19 02:29:49 UTC 2013 [02:30:53] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 3 hours [02:50:35] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue Nov 19 02:50:34 UTC 2013 [02:50:47] Logged the message, Master [03:00:03] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Tue Nov 19 02:59:53 UTC 2013 [03:00:53] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 3 hours [03:27:52] (03PS1) 10Tim Starling: Updated wikipedia.org etc. A/AAAA records to point to eqiad [operations/dns] - 10https://gerrit.wikimedia.org/r/96198 [03:28:00] (03CR) 10jenkins-bot: [V: 04-1] Updated wikipedia.org etc. A/AAAA records to point to eqiad [operations/dns] - 10https://gerrit.wikimedia.org/r/96198 (owner: 10Tim Starling) [03:29:57] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Tue Nov 19 03:29:48 UTC 2013 [03:30:47] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 3 hours [03:59:57] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Tue Nov 19 03:59:50 UTC 2013 [04:00:47] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 3 hours [04:24:23] (03PS2) 10Tim Starling: Updated wikipedia.org etc. A/AAAA records to point to eqiad [operations/dns] - 10https://gerrit.wikimedia.org/r/96198 [04:24:31] (03CR) 10jenkins-bot: [V: 04-1] Updated wikipedia.org etc. A/AAAA records to point to eqiad [operations/dns] - 10https://gerrit.wikimedia.org/r/96198 (owner: 10Tim Starling) [04:26:06] (03PS3) 10Tim Starling: Updated wikipedia.org etc. 
A/AAAA records to point to eqiad [operations/dns] - 10https://gerrit.wikimedia.org/r/96198 [04:30:19] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Tue Nov 19 04:30:10 UTC 2013 [04:30:49] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 3 hours [04:32:30] /away afk [04:49:12] ori-l: you know https://github.com/crucially/timesplicedb ? [04:49:30] sounds vaguely familiar but i'm not certain if i've seen it before [04:49:41] no, looks interesting [04:49:57] hrmmmm, on second look the commits are really old [04:55:27] (03CR) 10Tim Starling: "* PS2: tried adding "0" after "::" to fix test failure" [operations/dns] - 10https://gerrit.wikimedia.org/r/96198 (owner: 10Tim Starling) [04:58:54] ori-l: http://dpaste.com/1472454/plain/ [04:59:33] (03CR) 10Tim Starling: [C: 032] Updated wikipedia.org etc. A/AAAA records to point to eqiad [operations/dns] - 10https://gerrit.wikimedia.org/r/96198 (owner: 10Tim Starling) [05:00:27] !log updating DNS to I3adffd88 [05:00:41] Logged the message, Master [05:03:39] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 3 hours [05:06:46] heh [05:25:55] stupid bugzilla [05:26:46] want to add a whiteboard entry to more than one bug, see the "Change several bugs at once" option "YAY!"... in the whiteboard field "--do_not_change--" :( :( [05:30:09] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Tue Nov 19 05:29:59 UTC 2013 [05:30:39] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 3 hours [05:36:54] I just don't even [05:38:36] my favorite aspect of BZ is how it shouts your mistakes from the mountaintop [05:38:48] like, you file a bug, and then: "oh yeah, forgot to CC so-and-so" [05:39:10] so you add a CC. quoth bugzilla: "I just told these 40 people that you screwed up" [05:39:23] "I'm sure they appreciate the spam." [05:39:31] :) [05:46:40] I think most people have CC notifications disabled. [05:46:54] The default got flipped at some point. [05:49:09] (03PS1) 10Springle: not 5.1.53 compatible (5.1.56+) [operations/puppet] - 10https://gerrit.wikimedia.org/r/96204 [05:50:14] (03CR) 10Springle: [C: 032] not 5.1.53 compatible (5.1.56+) [operations/puppet] - 10https://gerrit.wikimedia.org/r/96204 (owner: 10Springle) [05:59:49] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Tue Nov 19 05:59:47 UTC 2013 [06:00:34] TimStarling: if the udpprofiler collector is choking on the volume of stats, is there any reason not to halve the % of requests that are randomly sampled for profiling, changing it from 2% to 1%? [06:00:39] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 3 hours [06:01:04] it's been choking on the volume of stats for a couple of years, I think [06:01:15] it needs to be rewritten [06:01:33] I'm in the process of doing that [06:01:50] stats aren't sampled, only profiling is sampled [06:03:20] I'm not sure what you mean. A profiling data is collected from a random sample of web requests [06:03:47] by the 'mt_rand() % 50 ) == 0' check in StartProfiler.php [06:03:54] yes, and the collector also collects packets sent by wfIncrStats() [06:04:06] stats and profiling, the two sources of data [06:04:08] oh, that's what you meant [06:04:08] right [06:05:11] what causes most of the UDP traffic at the collector at present? [06:05:28] let me check. [06:06:51] Special:Random uses mt_rand(). [06:08:24] Elsie: thank you for that pearl of wisdom [06:08:33] No problem. 
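The gate Tim and ori are discussing is PHP (the mt_rand() % 50 check in StartProfiler.php); purely to make the arithmetic concrete, here is the same 1-in-N sampling as a short Python simulation: a modulus of 50 profiles roughly 2% of requests, and doubling it to 100 halves that to 1%.

    #!/usr/bin/env python
    """Simulate the 1-in-N profiling gate: modulus 50 ~= 2% of requests,
    modulus 100 ~= 1%. The production check itself lives in PHP."""
    import random

    def sampled(modulus):
        """True for roughly one out of every `modulus` requests."""
        return random.randint(0, modulus - 1) == 0

    trials = 1000000
    for modulus in (50, 100):
        hits = sum(1 for _ in range(trials) if sampled(modulus))
        print("modulus %3d -> %.2f%% of requests profiled" % (modulus, 100.0 * hits / trials))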
[06:08:38] I learned from you. [06:08:58] https://en.wikipedia.org/wiki/Wikipedia:FAQ/Technical#Is_the_.22random_article.22_feature_really_random.3F [06:09:05] TimStarling: stats are about 30% [06:10:20] what limits the performance of the collector? [06:13:35] TimStarling: I'm not sure; probably the fact that for each sample it has to look up the entry in the bdb file and then write it back [06:14:25] I don't think persistence is actually required [06:14:35] it wipes all its data with every clear command anyway [06:15:13] so presumably you could replace that with a hash_map or something [06:15:24] but that wouldn't be multithreaded [06:15:29] who/what sends the clear command? [06:15:51] there's a script called clear-profile which is run occasionally [06:16:15] it is useful if you want something other than aggregate statistics out of the old web interface [06:16:47] i.e. if you want to know what is going on now, rather than over the last 6 months or whatever [06:18:12] any reason not to get that from graphite? [06:18:18] I mean, it doesn't provide an ordering by default [06:18:39] but one could presumably write something that queries its API for keys in a certain keyspace [06:18:59] and then queries each key for the mean (or whatever) for a given time range [06:19:11] it is difficult to get numbers out of graphite [06:20:04] and it only shows you a time series, if you want any other sort of data, graphite is not the best solution [06:20:41] it's basically not profiling at all [06:21:56] i've been stuffing them into redis [06:21:57] https://dpaste.de/YYts/raw/ [06:22:01] as an experiment [06:22:18] but i set out to reproduce exactly the functionality of collector [06:22:37] redis is not multithreaded either [06:22:49] no, i was going to shard the load over multiple instances [06:22:54] on the same host [06:23:34] but do you have other recommendations? [06:23:37] i was looking at leveldb [06:24:04] i guess just a hash-map in memory? [06:24:17] if it doesn't crash i guess that's all the persistence you need [06:24:20] well, I would start out by replacing BDB with hash_map or judy or something and see if that is fast enough [06:25:17] OK, sounds like a plan. anything else on the collector wishlist, while I'm at it? [06:25:36] (that wasn't a sarcastic offer, in case it sounded like one) [06:26:25] I think the main problem with just using a different hashtable would be packet loss due to stalls during queries [06:26:48] i.e. queries from graphite [06:27:53] PROBLEM - udp2log log age for lucene on oxygen is CRITICAL: CRITICAL: log files /a/log/lucene/lucene.log, have not been written in a critical amount of time. For most logs, this is 4 hours. For slow logs, this is 4 days. 
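A sketch of the replacement Tim suggests above: since the collector discards everything on each "clear" anyway, per-event aggregates can live in an in-memory map instead of being looked up in and written back to BerkeleyDB for every sample. The real collector is not Python and none of these names come from it; this only shows the shape of the idea.

    #!/usr/bin/env python
    """Keep per-event aggregates in an in-memory map instead of a BDB
    file. Names and the event format are invented for illustration."""
    from collections import defaultdict

    class ProfileAggregator(object):
        def __init__(self):
            # key -> [call count, total elapsed time in seconds]
            self.stats = defaultdict(lambda: [0, 0.0])

        def record(self, key, elapsed):
            entry = self.stats[key]
            entry[0] += 1
            entry[1] += elapsed

        def clear(self):
            # mirrors the existing "clear" command: all data is thrown
            # away, so durable storage buys nothing here
            self.stats.clear()

        def report(self):
            for key, (count, total) in sorted(self.stats.items()):
                print("%-40s calls=%d total=%.6fs mean=%.6fs" %
                      (key, count, total, total / count))

    agg = ProfileAggregator()
    agg.record("Parser::parse", 0.123)
    agg.record("Parser::parse", 0.098)
    agg.record("wfIncrStats", 0.001)
    agg.report()

The concern raised above still applies: while the process is answering a query for graphite or the old web UI it has to keep draining the UDP socket, so some buffering or a separate reader thread would sit in front of this.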
[06:27:59] heh [06:29:06] maybe it just needs a buffer thread [06:29:47] one very simple solution that I used in eventlogging is basically that -- having a separate standalone executable that reads from a UDP socket and publishes it over a ZeroMQ PUB socket [06:29:53] RECOVERY - udp2log log age for lucene on oxygen is OK: OK: all log files active [06:30:03] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Tue Nov 19 06:29:53 UTC 2013 [06:30:08] which gives you the ability to have an arbitrary number of subscribers and configurable per-subscriber buffering on the publisher side [06:30:33] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 3 hours [06:31:57] then you could have a separate consumer process the data for graphite [06:33:59] yes, that would work [06:34:08] you would still need a collector for the profiling interface though [06:35:31] yeah, and it would still need to not stall while serving aggregate figures for the web interface [06:36:36] but that's not too hard [06:39:49] zeromq subscribers also handle I/O in a background thread [06:40:44] but the API hides that from you, you're just dealing with a socket [06:41:30] http://zeromq.org/topics:omq-is-just-sockets [06:58:56] (03CR) 10ArielGlenn: [C: 032] pass pep8 E128 (continuation lines under-indented) [operations/software] - 10https://gerrit.wikimedia.org/r/95587 (owner: 10Hashar) [06:59:53] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Tue Nov 19 06:59:46 UTC 2013 [07:00:08] (03CR) 10ArielGlenn: [C: 032] pep8: ignore E128 [operations/software] - 10https://gerrit.wikimedia.org/r/95588 (owner: 10Hashar) [07:00:33] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 3 hours [07:03:54] (03CR) 10ArielGlenn: [C: 032] checkhost.py: report hosts in various manifests and lists [operations/software] - 10https://gerrit.wikimedia.org/r/95586 (owner: 10ArielGlenn) [07:18:46] (03CR) 10Hashar: "Lacking time to review, sorry." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95996 (owner: 10Aude) [07:29:59] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Tue Nov 19 07:29:54 UTC 2013 [07:35:08] PROBLEM - SSH on amslvs1 is CRITICAL: Server answer: [07:36:09] RECOVERY - SSH on amslvs1 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [07:51:12] (03PS1) 10ArielGlenn: remove hardcoded salt command timeout, replace with option [operations/software] - 10https://gerrit.wikimedia.org/r/96214 [07:53:16] (03CR) 10ArielGlenn: [C: 032] remove hardcoded salt command timeout, replace with option [operations/software] - 10https://gerrit.wikimedia.org/r/96214 (owner: 10ArielGlenn) [08:01:51] (03PS1) 10ArielGlenn: remove dysprosium from decom temporarily (to be reclaimed) [operations/puppet] - 10https://gerrit.wikimedia.org/r/96216 [08:03:30] (03CR) 10ArielGlenn: [C: 032] remove dysprosium from decom temporarily (to be reclaimed) [operations/puppet] - 10https://gerrit.wikimedia.org/r/96216 (owner: 10ArielGlenn) [08:05:11] PROBLEM - SSH on amslvs1 is CRITICAL: Server answer: [08:06:11] RECOVERY - SSH on amslvs1 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [08:46:05] (03CR) 10Hashar: [C: 031] "Code is good, might want to make the list of fonts one element per line." 
(031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/96190 (owner: 10Dzahn) [08:46:39] (03CR) 10Hashar: [C: 031] retab role/gitblit to spaces [operations/puppet] - 10https://gerrit.wikimedia.org/r/96186 (owner: 10Dzahn) [08:54:51] (03PS1) 10ArielGlenn: qualify vars planet_domain_name, planet_languages [operations/puppet] - 10https://gerrit.wikimedia.org/r/96225 [09:01:17] hashar: I am looking at the ferm policy thing. Anything else needed before tomorrow's zuul upgrade ? [09:01:49] (03CR) 10Akosiaris: [C: 032] retab role/gitblit to spaces [operations/puppet] - 10https://gerrit.wikimedia.org/r/96186 (owner: 10Dzahn) [09:26:00] (03PS2) 10Ori.livneh: Use log scale for 5xx errors in "(cdn) HTTP Error Rate" [operations/puppet] - 10https://gerrit.wikimedia.org/r/95064 (owner: 10Nemo bis) [09:26:24] (03PS2) 10Ori.livneh: Also add 2 months and 1 year graphs in "(cdn) HTTP Error Rate" [operations/puppet] - 10https://gerrit.wikimedia.org/r/95068 (owner: 10Nemo bis) [09:30:52] (03CR) 10Ori.livneh: [C: 032] Use log scale for 5xx errors in "(cdn) HTTP Error Rate" [operations/puppet] - 10https://gerrit.wikimedia.org/r/95064 (owner: 10Nemo bis) [09:31:00] (03CR) 10Ori.livneh: [C: 032] Also add 2 months and 1 year graphs in "(cdn) HTTP Error Rate" [operations/puppet] - 10https://gerrit.wikimedia.org/r/95068 (owner: 10Nemo bis) [09:39:39] (03PS1) 10Akosiaris: ferm rule for bacula director [operations/puppet] - 10https://gerrit.wikimedia.org/r/96226 [09:42:25] (03CR) 10Akosiaris: [C: 04-2] "There is a bug in ferm with hostname resolution and IPv4/IPv6 in case a hostname does not resolv for both IPv4 and IPv6. Will update this " [operations/puppet] - 10https://gerrit.wikimedia.org/r/96226 (owner: 10Akosiaris) [09:53:43] (03CR) 10Akosiaris: [C: 032] "I like Antoine's suggestion, so please do what he suggests. Other than that, LGTM" [operations/puppet] - 10https://gerrit.wikimedia.org/r/96190 (owner: 10Dzahn) [10:08:44] (03PS1) 10ArielGlenn: fix case of multiple filters broken in earlier refactor [operations/software] - 10https://gerrit.wikimedia.org/r/96231 [10:09:53] (03CR) 10ArielGlenn: [C: 032] fix case of multiple filters broken in earlier refactor [operations/software] - 10https://gerrit.wikimedia.org/r/96231 (owner: 10ArielGlenn) [10:19:58] akosiaris: the serial number in /var/lib/puppet/server/ssl/ca/serial is wrong (there are the elasticsearch hosts with certs... only on sockpuppet sadly... created afterwords) [10:20:22] not sure what else might be out of date, I can copy over certs and the new serial file but what else might not be current? [10:21:10] strontium is the other missing cert [10:21:17] lol ? [10:21:28] so let me get this straight [10:21:57] sockpuppet has more certificates than palladium in its store ? [10:22:02] oh yes [10:22:08] so these other certs were created [10:22:11] thankfully that is solvable [10:22:25] nov 11 (stronitum) [10:22:38] nov 17 & 18 (elasticxxxx) [10:22:57] palladium does not fortunately have new certs after nov 5 when the serial file went over [10:23:08] thank god [10:23:17] so yes, it's solvable realtively easy, I'm jut not sure what other files besides those (certs + serial) we need [10:23:28] ther emay be some other lists [10:23:34] just tar the entire dir [10:23:52] and it should work [10:24:09] actually let me just double check one thing [10:24:09] but we also need to stop this from happening again.... [10:24:19] hold on a sec [10:24:40] probably purge all puppet/puppetmaster packages on sockpuppet [10:24:41] ? 
[10:25:04] yeah the formey cert is the most recent on palladium and it's also on sockpuppet so we are good there [10:25:19] ok... i 'll fix then [10:25:21] I changed all the docs yesterday and let people know but [10:25:28] (to use palladium as ca) [10:25:39] but I didn't know then people had already been doing installs :-D [10:25:48] well i suppose we need to disable all puppet stuff on stafford/sockpuppet [10:25:53] disable, yes [10:25:58] I'm still reluctant to purge it [10:26:04] i had disable puppet-merge and commits [10:26:09] cause eg today here we are copying things right? [10:26:14] i had not thought of puppetca unfortunately [10:26:20] it's hard to get them all [10:26:44] yeah no purge... just uninstall the packages i think [10:26:54] and keep a backup too... [10:27:08] ok I 'll fix those. Thanks for reporting [10:27:20] yep, yay for daily cleanup scripts [10:27:37] I should have seen this yesterday but I had a bug in them from refactoring over the weekend :-/ [10:27:39] so... some certs we revoked/cleaned yesterday ? [10:27:50] yes [10:27:54] those have not happenned on sockpuppet [10:28:02] so we need to do them again [10:28:04] not sure about those [10:28:11] that's fine: just tar over [10:28:18] ok [10:28:28] I'll run my report and clean em again, for good this time [10:28:42] can you do a favor [10:28:52] ? [10:29:00] can we see the list of things revoked in the last 4 days on palladium [10:29:19] because I may not have been the only person doing cleanup [10:29:24] *maybe* [10:30:45] I guess not, I can just keep a copy of the whole crl list, lemme do that [10:35:14] (03PS1) 10Akosiaris: Remove puppetmaster from sockpuppet/stafford [operations/puppet] - 10https://gerrit.wikimedia.org/r/96234 [10:36:27] apergos: so after submitting ^ i will aptitude remove puppetmaster packages on sockpuppet and stafford manually [10:36:39] looking [10:37:15] ok [10:37:24] that's good [10:40:52] (03CR) 10Akosiaris: [C: 032] Remove puppetmaster from sockpuppet/stafford [operations/puppet] - 10https://gerrit.wikimedia.org/r/96234 (owner: 10Akosiaris) [10:50:57] heh 16 of em [10:58:46] (03PS1) 10Akosiaris: ferm rule for bacula connections from internal [operations/puppet] - 10https://gerrit.wikimedia.org/r/96237 [11:00:00] (03CR) 10Akosiaris: [C: 032] ferm rule for bacula connections from internal [operations/puppet] - 10https://gerrit.wikimedia.org/r/96237 (owner: 10Akosiaris) [11:01:22] is report.py documented or even just mentioned anywhere?? [11:01:34] (the wikitech page where I just linked it doesn't count ;) ) [11:05:55] so there is /var/lib/puppet/server/ssl/ca.new waiting to be moved into place; can I just move the old one out of the way and the new one into place and not disrupt things too much? [11:05:58] akosiaris: [11:06:11] ca.new ? [11:06:21] I just lost you [11:06:29] I took a tarball from sockpuppet [11:06:41] untarred it into ca.new on palladium [11:06:43] huh you did too ? [11:06:46] oh? [11:06:48] ok [11:06:52] i was about to deploy it [11:06:56] so no harm done [11:06:57] hahaha [11:07:06] so ... you must restart apache [11:07:11] but other than that yes [11:07:14] just swap them [11:07:19] ok great [11:07:26] sorry, didn't realize you were doing that bit too :-D [11:07:49] no worries. 
As long as we didn't cause each other problems it is cool [11:07:55] heh [11:08:06] and now let's make sure some run is ok [11:09:24] yes, good [11:09:39] so stafford is clean now [11:09:51] no longer running apache or puppetmaster [11:10:15] I 've kept /var/lib/puppet just in case [11:10:18] good [11:10:19] moving to sockpuppet [11:10:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: Connection refused [11:10:35] I'm going to do my revocations again on palladium [11:10:39] huh [11:10:56] hmmm shouldn't neon be updated already ? [11:11:01] let's run puppet manually [11:12:42] so palladium/strontium have spiky load but are holding up [11:13:06] maybe we could consider lowering the interval to 20 ? [11:13:37] I dn't think so, there were spikes to 100% cpu from time to time [11:13:51] yes spikes... not sustained load [11:14:16] stafford had sustained load 100% like forever before we had problems [11:14:26] we had problems for a long time with stafford [11:14:41] hmmm [11:15:04] we just lived with it in that state for a long time, not really sure why [11:15:09] well it is not like we can't revert... [11:15:12] that's true [11:15:14] we can just try it [11:15:42] I 'll finish up first with the other thingies and then just do it for test [11:15:56] worse case scenario? revert [11:15:59] :-D [11:16:05] you're itching to get it down to 20 eh? [11:16:26] which is a 50% increase btw [11:16:42] I kind of left 15 (100% increase) is a bit too much [11:16:46] felt* [11:16:58] you think :-D [11:17:12] I don't want our runs to start getting really slow [11:17:42] there is that too... [11:18:52] !log remove puppetmaster and apache packages from stafford. It no longer is a puppetmaster [11:19:06] yay [11:19:07] Logged the message, Master [11:22:50] (03CR) 10Ori.livneh: "> We were setting Receive Packet Steering as "ff" (i.e. all CPUs, up to 16" [operations/puppet] - 10https://gerrit.wikimedia.org/r/95963 (owner: 10Faidon Liambotis) [11:25:06] !log remove puppetmaster and apache packages from sockpuppet. It no longer is a puppetmaster [11:25:19] Logged the message, Master [11:33:13] PROBLEM - SSH on amslvs1 is CRITICAL: Server answer: [11:33:51] damn that is not good [11:34:12] RECOVERY - SSH on amslvs1 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [11:34:48] phew.... heavy load but it looks ok [11:37:27] what's our plan b if amslvs1 falls over? [11:37:31] or maybe i shouldn't ask that [11:37:47] amslvs3 ? [11:37:57] and then start moving traffic to eqiad ? [11:38:02] ergh [11:38:17] eh I mean amslvs2 [11:38:29] amslvs3/4 server other stuff right ? [11:38:49] lemme check because I don't remember which has which in esams [11:38:49] well it is not in peak yet... though rising [11:39:00] yeah that's what's iffy [11:39:30] 1,3 are redundant and 2,4 are redundant [11:43:10] palladium is ready to become a second puppetmaster; that is, it is already, and now I just need to tell the minions about it [11:43:15] (tested one manually, works fine) [11:52:52] ?? [11:53:00] second saltmaster you mean [11:53:03] apergos: ^ [11:53:06] er yes [11:53:09] :-D :-D [11:53:23] PROBLEM - swift-container-server on ms-be1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:53:23] PROBLEM - swift-account-replicator on ms-be1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:53:23] PROBLEM - RAID on ms-be1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:53:23] PROBLEM - swift-account-server on ms-be1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[11:53:32] PROBLEM - DPKG on ms-be1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:53:33] PROBLEM - swift-container-auditor on ms-be1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:53:33] PROBLEM - swift-container-replicator on ms-be1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:53:33] PROBLEM - puppet disabled on ms-be1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:53:38] er? [11:53:42] PROBLEM - swift-object-replicator on ms-be1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:53:42] PROBLEM - swift-container-updater on ms-be1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:53:42] PROBLEM - swift-account-auditor on ms-be1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:53:42] PROBLEM - swift-object-auditor on ms-be1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:53:42] PROBLEM - swift-account-reaper on ms-be1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:53:46] ok ok [11:53:52] PROBLEM - swift-object-updater on ms-be1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:53:52] PROBLEM - swift-object-server on ms-be1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:54:22] PROBLEM - SSH on ms-be1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:55:10] ah, that's gonna be a power cycle. yuck [11:55:18] it's had a rough day: http://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&h=ms-be1002.eqiad.wmnet&m=cpu_report&s=by+name&mc=2&g=load_report&c=Swift+eqiad [11:55:42] pile of soft lockup, can't get in [11:56:26] !log powercycled ms-be1002, couldn't get in (lots of BUG: soft lockup - CPU#18 stuck for 22s etc on console) [11:56:38] Logged the message, Master [11:58:02] PROBLEM - Host ms-be1002 is DOWN: PING CRITICAL - Packet loss = 100% [11:58:24] RECOVERY - puppet disabled on ms-be1002 is OK: OK [11:58:32] RECOVERY - swift-object-replicator on ms-be1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [11:58:33] RECOVERY - swift-container-updater on ms-be1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [11:58:33] RECOVERY - swift-account-auditor on ms-be1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [11:58:33] RECOVERY - swift-object-auditor on ms-be1002 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [11:58:33] RECOVERY - Host ms-be1002 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [11:58:33] RECOVERY - swift-account-reaper on ms-be1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [11:58:42] RECOVERY - swift-object-server on ms-be1002 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [11:58:42] RECOVERY - swift-object-updater on ms-be1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [11:59:02] nice..... 
[11:59:11] looks ok [11:59:12] RECOVERY - SSH on ms-be1002 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [11:59:13] RECOVERY - swift-account-server on ms-be1002 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [11:59:13] RECOVERY - swift-container-server on ms-be1002 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [11:59:13] RECOVERY - swift-account-replicator on ms-be1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [11:59:13] RECOVERY - RAID on ms-be1002 is OK: OK: optimal, 14 logical, 14 physical [11:59:23] RECOVERY - DPKG on ms-be1002 is OK: All packages OK [11:59:23] RECOVERY - swift-container-auditor on ms-be1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [11:59:23] RECOVERY - swift-container-replicator on ms-be1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [11:59:31] nothing weird during the boot so [12:19:25] hey [12:19:27] what's up? [12:20:13] quiet righ now [12:21:50] (03PS1) 10ArielGlenn: set multiple salt masters for minions [operations/puppet] - 10https://gerrit.wikimedia.org/r/96242 [12:25:49] (03PS2) 10ArielGlenn: set multiple salt masters for minions [operations/puppet] - 10https://gerrit.wikimedia.org/r/96242 [12:26:42] (03PS3) 10ArielGlenn: set multiple salt masters for minions [operations/puppet] - 10https://gerrit.wikimedia.org/r/96242 [12:26:43] better go eat lunch or something [12:36:25] (03PS1) 10Akosiaris: Change the way cron run times are calculated [operations/puppet] - 10https://gerrit.wikimedia.org/r/96247 [12:44:53] (03PS1) 10Akosiaris: Allow HTTP(S) access in gitblit [operations/puppet] - 10https://gerrit.wikimedia.org/r/96250 [12:48:30] akosiaris: nope [12:48:47] ? [12:49:13] (03CR) 10Faidon Liambotis: [C: 04-2] "gitblit is behind the misc-lb Varnish cluster, it doesn't need to have HTTP exposed to users directly." [operations/puppet] - 10https://gerrit.wikimedia.org/r/96250 (owner: 10Akosiaris) [12:49:43] yeah i thought about that [12:50:04] we are sure we don't want to have it open anyway ? [12:51:45] yes [12:53:17] gitblit is very easy to overload [12:53:37] having it exposed directly might be an easy way to DoS it, inadvertedly or not [12:57:18] ok abandoning then. So... that closes the things that need to be changed to have a default DROP policy in ferm [12:57:23] :-) [12:58:12] (03Abandoned) 10Akosiaris: Allow HTTP(S) access in gitblit [operations/puppet] - 10https://gerrit.wikimedia.org/r/96250 (owner: 10Akosiaris) [12:58:36] (03PS1) 10Faidon Liambotis: Varnish: sync mobile UA redirects with squid [operations/puppet] - 10https://gerrit.wikimedia.org/r/96251 [12:59:13] ori-l: ^^ [13:05:55] (03CR) 10Ori.livneh: [C: 031] "LGTM, thanks" [operations/puppet] - 10https://gerrit.wikimedia.org/r/96251 (owner: 10Faidon Liambotis) [13:06:25] (03CR) 10Faidon Liambotis: [C: 032] Varnish: sync mobile UA redirects with squid [operations/puppet] - 10https://gerrit.wikimedia.org/r/96251 (owner: 10Faidon Liambotis) [13:15:12] (03PS4) 10ArielGlenn: set multiple salt masters for minions [operations/puppet] - 10https://gerrit.wikimedia.org/r/96242 [13:18:19] (03CR) 10ArielGlenn: [C: 032] set multiple salt masters for minions [operations/puppet] - 10https://gerrit.wikimedia.org/r/96242 (owner: 10ArielGlenn) [13:40:56] huh [13:41:05] grrrit-wm: doesn't show C+0 comments? 
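For the salt change merged above: with palladium joining sockpuppet as a master, each minion's /etc/salt/minion needs both masters listed. The minion config is YAML, and assuming salt's multi-master syntax (a list under the master: key) and the FQDNs of the two hosts, the stanza would look roughly like what this fragment prints; the syntax and hostnames here are assumptions, not taken from the merged puppet change.

    #!/usr/bin/env python
    """Print a multi-master stanza for /etc/salt/minion. The master:
    list syntax and both FQDNs are assumptions for illustration."""
    import yaml

    minion_config = {
        "master": [
            "palladium.eqiad.wmnet",    # new eqiad salt master (assumed FQDN)
            "sockpuppet.pmtpa.wmnet",   # existing pmtpa master (assumed FQDN)
        ],
    }

    print(yaml.safe_dump(minion_config, default_flow_style=False))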
[13:59:49] paravoid: it does, there is one by you a bit above (2 hours ago) [14:00:35] but there are some exceptions, so it's not always obvious what made an event be relayed (or not) [14:17:05] (03PS5) 10Aude: Enable Wikidata build on beta labs [WIP] [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95996 [14:17:32] (03CR) 10Aude: [C: 04-1] "(rebased)" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95996 (owner: 10Aude) [14:26:55] (03PS2) 10Akosiaris: misc::contint merged in role::ci::master [operations/puppet] - 10https://gerrit.wikimedia.org/r/95706 (owner: 10Hashar) [14:28:52] (03CR) 10Akosiaris: [C: 032] misc::contint merged in role::ci::master [operations/puppet] - 10https://gerrit.wikimedia.org/r/95706 (owner: 10Hashar) [14:33:39] (03PS1) 10ArielGlenn: put salt minion multiple masters in yaml format [operations/puppet] - 10https://gerrit.wikimedia.org/r/96260 [14:34:41] akosiaris: thank you :) [14:34:57] (03CR) 10ArielGlenn: [C: 032] put salt minion multiple masters in yaml format [operations/puppet] - 10https://gerrit.wikimedia.org/r/96260 (owner: 10ArielGlenn) [14:35:20] (03CR) 10Akosiaris: [C: 032] deployment: integration/jenkins for Jenkins CI slaves [operations/puppet] - 10https://gerrit.wikimedia.org/r/95705 (owner: 10Hashar) [14:37:48] hashar: :-) [14:41:00] (03CR) 10MaxSem: "Eh, the UA removals were intentional as they are covered by other strings such as midp." [operations/puppet] - 10https://gerrit.wikimedia.org/r/96251 (owner: 10Faidon Liambotis) [14:41:08] paravoid, ^^^ [14:41:11] hey [14:41:22] this wasn't obvious, I git blame'd [14:41:46] I doubt the second part (lg/nec -> case insensitive) was intentional though [14:42:50] MaxSem: my impression is that most of these UAs are redundant with "mobi" [14:42:57] nope [14:43:22] a lot of crap refuses to follow this convention [14:43:47] I said "most" :) [14:44:26] I tried really hard to simplify the regex, but that's the only removals I could come up with [14:44:42] really?? [14:47:22] anyway, I have tests that confirm that these removals dont break anything;) [14:47:56] I wish we had some vcl tests in the repo :D [14:48:58] 15:44 < ori-l> in october, 53652 events out of 7567429 had more than one redirect (0.7%) [14:49:01] 15:45 < ori-l> in november it's 61151 / 4602964, or 1.3%, nearly double [14:49:11] we were investigating http://ganglia.wikimedia.org/latest/graph.php?r=year&z=xlarge&title=Mobile+Sending+%28navStart+to+fetchStart%29+and+Redirecting&vl=&x=&n=&hreg[]=client-side&mreg[]=browser.%28redirecting%7Csending%29.mobile_median>ype=line&glegend=show&aggregate=1&embed=1&_=1384864168557 [14:50:48] what does it measure, exactly? [14:52:54] (03CR) 10Hashar: [C: 04-1] "There is an Icinga check that might use that value as well, I am referring to the one that complains from puppet freshness." [operations/puppet] - 10https://gerrit.wikimedia.org/r/96247 (owner: 10Akosiaris) [14:57:22] akosiaris: mind merging in the Zuul configuration for gearman?it is back compatible with the current version (tested on labs) https://gerrit.wikimedia.org/r/#/c/93457/ :D [14:57:26] that will be one less step for tomorrow [14:57:59] hashar: how do you figure that? 
I thought it just kept track of when the last smpttrap 'ok' arrived and if the time was too long then we got a whine [14:58:37] apergos: haven't looked at the freshness check in puppet, but IIRC the message is something like: "puppet hasn't run for 10 hours" or something like that [14:58:42] hashar paravoid I am seeing intermittent 503s on beta labs, is that known? [14:58:48] yes, the 10 or 3 is hardcoded [14:59:13] chrismcmahon: not afaik [14:59:41] apergos: so the icinga check could use a multiple of the puppet cron frequency [14:59:44] paravoid: thanks, I do BZ [14:59:57] it could but it does not afaik [15:00:16] (03PS1) 10Akosiaris: Cleanup puppetmaster's apache Listen ports [operations/puppet] - 10https://gerrit.wikimedia.org/r/96262 [15:00:33] I mean if we changed the cron (which we have done in the past) to be a half hour then should the whine occur in 5 hours instead of 10? (or 1.5 instead of three, which we have now) [15:02:13] apergos: I guess so [15:02:17] aka after X occurences [15:02:19] hashar: so you want to link the interval to the puppet freshness whine period ? [15:02:26] but maybe you guys would prefer after 12 hours (aka twice per day) [15:02:38] well you weren't there for the discussion but [15:02:48] na I don't want anything, just pointing that the cron interval is linked to the icinga check freshness [15:03:00] I wanted puppet freshnes to be long enough not to give a ton of false positives [15:03:02] in case akosiaris forgot about it :] [15:03:15] but short enough that the person who broke something might see it and try to fix it [15:03:17] I never even thought about linking them [15:03:21] 3 hours seemed like worth trying [15:03:26] but it might make sense.... [15:03:47] we could say.... 3*interval ? [15:04:03] or you can have the freshness delay to be a variable that would be defined next to the one in cron [15:04:09] so you can easily tweak them [15:04:46] 1.5 hours is pretty short [15:05:00] (and if we get runs every 20 minutes, whines after an hour are really really short) [15:05:04] don't you want to be warned early whenever puppet dies / don't run ? [15:05:11] i just touk 3 out of an RNG... we can have whatever we feel like [15:05:37] I want to be warned early enough that the person who knows what they touched has a chance to fix it [15:05:54] but not so early that I know I"m in the middle of fixing it and I get the pile of icinga [15:06:27] akosiaris: are you still working on ferm? [15:06:41] if yes, try doing blog next [15:06:44] (holmium) [15:06:58] paravoid: I just started looking at changing the default policy [15:07:03] okay [15:07:03] blog ? what blog ? [15:07:15] wasn't this given away ? (outsourced.....) [15:07:16] the blog.wm.org host [15:07:19] it wasn't, no [15:07:28] who knows [15:07:42] sigh... [15:09:31] we have a memcached alert broken for quite some time [15:09:48] because we run memcached on localhost and we have no nrpe there since it's public [15:09:51] annoying [15:16:08] (03PS5) 10Akosiaris: add ferm rule to only allow nrpe/5666 from intern [operations/puppet] - 10https://gerrit.wikimedia.org/r/96177 (owner: 10Dzahn) [15:16:57] damn ... i had to manually rebase this one [15:18:58] (03CR) 10Akosiaris: [C: 032] "Had to manually rebase this one due to tab vs spaces change in" [operations/puppet] - 10https://gerrit.wikimedia.org/r/96177 (owner: 10Dzahn) [15:19:03] \O/ [15:20:05] (03CR) 10Hashar: "I think I based that change out of the varnish role class which use inheritance." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/77034 (owner: 10Hashar) [15:21:16] do we have any infrastructure to run recurring jobs beside using crontabs ? [15:21:37] why beside crontabs ? [15:22:00] even things like celery use crontab like behaviour [15:25:28] I guess I should use puppet / crontab indeed [15:25:41] or maybe I could just use Jenkins *evil* [15:25:57] I have to write down some reporting tasks for ci [15:26:26] * akosiaris feels sorry for ya [15:27:49] (03PS1) 10Akosiaris: Change default ferm policy to DROP [operations/puppet] - 10https://gerrit.wikimedia.org/r/96265 [15:28:36] PROBLEM - jenkins_service_running on gallium is CRITICAL: Timeout while attempting connection [15:29:06] PROBLEM - zuul_service_running on gallium is CRITICAL: Timeout while attempting connection [15:42:57] ahhh [15:43:52] might be nrpe not being able to reach gallium [15:44:22] akosiaris: nrpe on gallium (port 5666) is being dropped, the accept rule is for 10.0.0.0/8 [15:44:40] maybe nrpe reach that machine using some public IP [15:44:52] huh ? [15:45:11] huh... so [15:45:26] neon has a public ip and gallium has a public ip as well [15:45:32] so that makes sense [15:47:13] poor modules/base/files/firewall/defs.production doesn't have the wikimedia public networks though :/ [15:47:29] yes it doesn't [15:47:46] i have a pending change waiting for us to populate that as well [15:48:40] if you can fix the check by hardocding neon IP in nrpe ferm rule, that would be nice. [15:48:45] (03PS1) 10Manybubbles: Logging changes for elasticsearch [operations/puppet] - 10https://gerrit.wikimedia.org/r/96266 [15:48:48] the checks will be handy for tomorrow [15:49:13] I can do that in a temporary way but we really need to fix it correctly [15:49:41] if I get some time tomorrow it would be nice [15:50:11] temp fix with a # FIXME HACK [15:54:10] (03PS1) 10Hashar: nrpe: iptables accept neon public IP address [operations/puppet] - 10https://gerrit.wikimedia.org/r/96267 [15:54:16] akosiaris: ^^^ [15:56:22] akosiaris: btw, I left you a comment on that bacula/ferm thing [15:56:29] about using ferm's @resolve [15:56:37] which I then realized it's broken for dual-stack [15:56:44] the fix isn't hard but isn't trivial either [15:56:51] I'll work on it when we start needing it [16:01:03] (03CR) 10Hashar: "If you want to test it out on gallium (Jenkins server) during European morning, I am volunteering the box. For an internal box we can u" [operations/puppet] - 10https://gerrit.wikimedia.org/r/96265 (owner: 10Akosiaris) [16:01:33] ori-l: would you have a few minutes to look at a zmq issue on vanadium for me (or tell me who would know about this)? [16:01:47] s/for/with/ [16:01:52] (03CR) 10Faidon Liambotis: [C: 04-1] "We still have files/firewall/main-input-default-drop.conf which is essentially that. This was missed when the stanzas were never moved to " [operations/puppet] - 10https://gerrit.wikimedia.org/r/96265 (owner: 10Akosiaris) [16:02:10] ori-l: vanadium is analytics now [16:02:12] er [16:02:17] apergos: vanadium is analytics now [16:03:04] RECOVERY - puppet disabled on analytics1021 is OK: OK [16:03:04] RECOVERY - SSH on analytics1021 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [16:03:04] RECOVERY - Disk space on analytics1021 is OK: DISK OK [16:03:04] RECOVERY - RAID on analytics1021 is OK: OK: no disks configured for RAID [16:03:09] who would be a good person to ask about libraries over there? otto mata? 
[16:03:13] RECOVERY - Host analytics1021 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [16:03:14] (03CR) 10Akosiaris: [C: 04-1] "https://gerrit.wikimedia.org/r/#/c/96177/" (036 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/75777 (owner: 10Chad) [16:03:20] ori-l has made himself available if needed, but the first point of contact should be ops & analytics, i.e. otto [16:03:23] RECOVERY - DPKG on analytics1021 is OK: All packages OK [16:03:37] great, I'll check in with him [16:04:15] ottomata: would you have a few minutes to look at a zmq issue on vanadium with me? [16:04:48] that might have to be asked again later closer to sf morning [16:06:34] paravoid: ok thanx I just read the comment on ferm and @resolve [16:06:58] would love to, in meeting atm, almost out [16:07:22] I'm here for a good while yet so whenever's good, thanks [16:09:22] akosiaris: I just left you another one too :) [16:10:52] argh: ETOOMANYINTERRUPTS [16:18:53] (03CR) 10Faidon Liambotis: "It's a bitmask, so ff is actually for 8 CPUs." [operations/puppet] - 10https://gerrit.wikimedia.org/r/95963 (owner: 10Faidon Liambotis) [16:24:13] (03CR) 10Akosiaris: [C: 032] nrpe: iptables accept neon public IP address [operations/puppet] - 10https://gerrit.wikimedia.org/r/96267 (owner: 10Hashar) [16:26:33] RECOVERY - jenkins_service_running on gallium is OK: PROCS OK: 1 process with regex args ^/usr/bin/java .*-jar /usr/share/jenkins/jenkins.war [16:26:43] PROBLEM - Puppet freshness on analytics1021 is CRITICAL: No successful Puppet run in the last 3 hours [16:26:53] RECOVERY - zuul_service_running on gallium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/local/bin/zuul-server [16:33:23] apergos: what's up? [16:33:30] i need to run to city to meet yuri to check out a cowork space soon [16:33:33] but I think I can help for a bit [16:33:38] (03PS2) 10Ottomata: Logging changes for elasticsearch [operations/puppet] - 10https://gerrit.wikimedia.org/r/96266 (owner: 10Manybubbles) [16:33:43] (03CR) 10Ottomata: [C: 032 V: 032] Logging changes for elasticsearch [operations/puppet] - 10https://gerrit.wikimedia.org/r/96266 (owner: 10Manybubbles) [16:34:12] so I'm trying to get the salt clint to be happy over there, apparently it's been unhappy since oct 23 [16:34:31] and it's unhappy because it thinks the version of libzmq over ther eis < 3.2 [16:34:55] which is true indeed for the version in /usr/local/lib/python2.7/dist-packages/zmq [16:35:57] but not true for the version in [16:36:06] sorry I have already forgotten where the system one is [16:36:08] sec [16:37:07] /usr/lib/x86_64-linux-gnu/ [16:37:23] hmm, i think i see both [16:37:26] so I am wondering if that copy in /usr/local/lib/blah is really necessary [16:37:27] libzmq1 and libzmq3 installed [16:38:40] I actually did an strace of the check it does, to see what it looks at first, and /usr/local/lib/python2.7/dist-packages/zmq gets hit first, sadly [16:38:43] a lib in /usr/local/ ? [16:38:52] why ? why ? why ? [16:39:03] hm [16:39:14] that is not in the python-zmq package [16:39:28] the /usr/local/lib bit [16:39:32] is it in a package ? please don't tell me someone installed it with pip .... [16:39:35] argh [16:39:37] snif [16:39:54] well I was hoping you might know the packages over there [16:39:57] since I have no idea [16:40:12] you are not alone [16:40:12] i don't, ori-l would know better, i'm looking though [16:40:22] ah [16:41:17] heh [16:41:28] it has an entire python installation in /usr/local/ [16:42:38] /usr/local/cellar [16:42:39] ?????? 
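What the strace above is showing: on an Ubuntu box, /usr/local/lib/python2.7/dist-packages typically sorts ahead of the packaged dist-packages on sys.path, so the stale pyzmq left under /usr/local shadows the python-zmq package even though libzmq3 is installed. A quick way to confirm which copy the interpreter loads (only pyzmq's own version helpers are used here):

    #!/usr/bin/env python
    """Report which pyzmq module gets imported and which libzmq it is
    bound against; the salt minion's transport wants libzmq >= 3.2."""
    import sys
    import zmq

    print("module loaded from : %s" % zmq.__file__)
    print("pyzmq version      : %s" % zmq.pyzmq_version())
    print("libzmq version     : %s" % zmq.zmq_version())

    major, minor = [int(part) for part in zmq.zmq_version().split(".")[:2]]
    if (major, minor) < (3, 2):
        print("libzmq too old for the salt minion; sys.path order is:")
        for path in sys.path:
            print("    %s" % path)

If the copy under /usr/local is the one reported, moving that directory aside lets the packaged pyzmq and libzmq3 take over.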
[16:42:47] well I think that wasn't put there by apt/dpkg [16:42:53] soooooo [16:43:04] I think i will call it a day ... [16:43:14] celler? isn't that mac homebrew? [16:43:15] hahaha [16:43:19] I will go to bed tonight... and NOT think about it [16:43:25] I dunno (not mac person) [16:43:33] go to bed, ottomata and I are looking at it [16:43:40] tomorrow I will find a big flamethrower and figure this out [16:43:45] well I mean, go to bed when it is time to go to bed [16:43:48] sooooo ok apergosthe problem is that zmq is the wrong version? [16:43:49] lol [16:43:50] and do other things til then [16:43:56] the /usr/loca/lib/ one is 2.2.0.1 [16:44:00] yes, the version in /usr/local is crap [16:44:02] probably /usr/local gets loaded in preference [16:44:07] yep [16:44:20] what if we just (temp?) move the /usr/local/lib/pythonx/zmq out of the way and try? [16:44:24] i'm not sure how you are testing [16:44:27] well I wonder what we break there [16:44:44] and that is why I asked you to look in on it [16:44:45] that is probably because it has a python3.3 in there [16:44:45] yeah, you might want to get ori-l on this [16:44:52] cause otherwise I would be fine with nuking it or whatever [16:45:03] this is eventlogging server [16:45:05] i know very little about it [16:45:07] nuke it nuke it!!!! [16:45:10] hahaha [16:45:11] maybe [16:45:14] i do know that [16:45:21] originally eventlogging was not well puppetized [16:45:26] hold yer horses, we will nuke when the time is right :-D [16:45:28] and then ori spent a lot of time productionizing and puppetizing [16:45:33] so its possible this is leftover cruft [16:45:34] ok so I will ping the heck out of ori-l [16:45:46] and see what he has to say about cruft removal [16:45:49] k [16:45:58] in the meantime real quick before you disappear [16:46:07] the analytics boxen *.eqiad.wmnet [16:46:15] what is the story with network in / out? [16:46:31] oh that should be fixed [16:46:39] 014188 4 -rw-r--r-- 1 root staff 379 Aug 27 00:12 ./lib/python2.7/dist-packages/easy-install.pth [16:46:40] I ask because I have the weirdest symptom ever with them: can't get new salt key from those clients to palladium [16:46:46] puppet runs, so how does that work? [16:46:55] 1851410 8 -rw-r--r-- 1 root staff 7550 Aug 27 00:12 ./lib/python2.7/dist-packages/eventlogging-0.6_20130827-py2.7.egg/eventlogging/jrm.pyc [16:46:56] nonetheless... they and vanadium are about the only issue [16:46:56] s [16:46:58] what port is salt on ? any new ports ? [16:47:04] good q [16:47:13] so this looks like eventlogging change indeed [16:48:46] http://docs.saltstack.com/topics/tutorials/firewall.html according to this, 4505 and 4506 [16:48:51] on the master [16:49:56] Snaps_: so, varnish people rewrote varnishncsa, this just landed in master... :/ [16:50:26] and they are very vague about ports on the client side so maybe no fixed set there [16:50:31] (03PS1) 10Hashar: rename misc::parsoid to role::parsoid::production [operations/puppet] - 10https://gerrit.wikimedia.org/r/96270 [16:50:37] (03PS1) 10Hashar: lint role/parsoid [operations/puppet] - 10https://gerrit.wikimedia.org/r/96271 [16:51:47] (03CR) 10Hashar: "I have no clue who from ops is the parsoid referent." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/96270 (owner: 10Hashar) [16:51:59] apergos: sorry, yeah the networking seems fixed, but there are very specific network ACL firewall rules in for how the analytics subnets are allowed to talk to the rest of the cluster [16:52:04] (03CR) 10Hashar: "I have no clue who from ops is the parsoid referent." [operations/puppet] - 10https://gerrit.wikimedia.org/r/96271 (owner: 10Hashar) [16:52:12] is the salt stuff new? [16:52:23] e.g. after februaryish? [16:52:24] palladium being an eqiad saltmaster is new [16:52:32] it's been sockpuppet (pmtpa) til now [16:52:49] ah hm ok [16:52:59] so yeah if I need to open an rt ticket or something that's fine [16:53:00] I did not specify any salt stuff in the orignal ACL ticket: https://rt.wikimedia.org/Ticket/Display.html?id=4433 [16:53:15] LeslieCarr: can confirm whether or not salt to palladium is allowed [16:53:21] ok [16:53:47] paravoid: good for them :) [16:54:58] paravoid: hi, any progress on https://gerrit.wikimedia.org/r/#/c/88261/ ? Its really a one liner change, nothing surprising there :) [16:54:58] then I think anyways you are off the hook no matter what ottomata, than you [16:55:00] *thank [16:55:54] yurik_: no it's not [16:56:04] paravoid: what do you mean? [16:56:05] and stop pinging me every day twice, one from adam and one from you [16:56:21] paravoid: adam is not here atm, didn't know he pinged you [16:56:30] I'll get to it when I get the chance [16:56:37] paravoid: so what do you mean its not ? [16:57:54] it unsets a header (which we never had before - so a noop), and it sets that header in one of the cases. Btw, could bblack assist with it? It would make all zero users very happy :) [16:58:15] no, I'll do it [16:59:59] I'm still waiting for that homepage fix for a while [17:00:06] maybe I should start pinging you twice per day too :) [17:00:36] paravoid: please don't compare a two line code change in varnish with a significant + code rewrite [17:00:43] a significant virtualhost? [17:00:49] it's like 10 lines of virtualhost at most [17:00:59] and I've commited of doing that if you fix the mediawiki side [17:01:06] and we've been saying that for 6 months [17:01:38] yes, 10 lines of that plus the change to our side of code. It still amounts to much more than the 2 line varnish change. Regardless, I am working on it right now :) [17:02:25] and btw, I have been very busy trying to get the ESI that mark asked for, only to find out that the ESI is broken on the varnish side and it is now stalled :( [17:03:14] everybody is busy and everything is broken, wooooooo! [17:03:15] :) [17:03:16] because ops were complaining a lot about cache fragmentation :) [17:03:37] ottomata, that's called business as usual :) [17:04:33] yurik_, i'm leaving in here in 20 mins, will come say hi, then have scrum of scrums at 1:30, probably don't have time to go to lunch with y'all today [17:04:37] i will grab somethjing on my way in [17:05:24] but i'll be around and hanging and working for the whole afternoon [17:05:34] hmm, maybe I hsould just wait til after scrum of scrums [17:05:34] hmmm [17:05:42] ottomata: no worries. The place is cool, adam just got it [17:05:44] in [17:05:46] awesome [17:05:52] its ok if I show up around 2:30 then? 
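On the salt master ports noted above (4505 and 4506): a small illustrative probe, not an official check, for testing from an analytics host whether those ports on the new eqiad master are reachable through the ACLs. The fully qualified hostname used below is an assumption:

# Probe the salt master's ZeroMQ ports (4505 = publish, 4506 = return) from a
# minion host, to tell an ACL/firewall block apart from a salt-side problem.
import socket

MASTER = "palladium.eqiad.wmnet"  # new eqiad salt master named above; FQDN assumed

for port in (4505, 4506):
    try:
        conn = socket.create_connection((MASTER, port), timeout=5)
        conn.close()
        print("%s:%d reachable" % (MASTER, port))
    except (socket.timeout, socket.error) as exc:
        print("%s:%d blocked or unreachable: %s" % (MASTER, port, exc))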
[17:06:07] up to you, but there are lots of good spots here to have avideo conf [17:06:18] hmmmm [17:06:20] ok [17:06:20] yurik_: I think that ESI bug got fixed a while ago [17:06:21] might as well come and use the place to see if it works ;) [17:06:26] yeah [17:06:28] true [17:06:34] yurik_: also, simplifying the VCL config was way way higher in priority than ESI. [17:07:39] paravoid: nope, bblack helped to patch it, but then it turned out that the patch is much bigger than what we applied, and that patch touches the frequently used zip code, so its on a backburner, especially considering that mark wants to migrate to v4 soon [17:08:30] v4….varnish? [17:08:35] yep [17:08:38] 3.0.4 i think [17:08:41] oh phew [17:08:47] not 4.x [17:08:48] ok [17:08:56] is there 4.x? checking... [17:08:58] no [17:08:59] ha [17:09:06] not stable [17:09:14] ottomata, you're adorable [17:09:33] uhhhhhh, not sure why but thanks? [17:09:35] than again, i am never sure with mark, i am sure he would love to get us to v4 ;) [17:10:08] bleeding edge is coool! [17:11:02] no, i just had some discussions with ma rk and fa idon recently about 4.x and how far off it is, so it was on my mind, and when you said v4 i was like uhhhhh [17:11:03] cool. [17:11:36] ori-l: would you have time later today to look at a zmq library issue on vanadium with me? [17:12:30] sure, what's up? [17:12:49] well the stuff in /usr/local/ is breaking salt-client [17:13:02] salt-minion [17:13:03] gah [17:13:31] anyways, is the stuff in /usr/local/lib/python2.7/dist-packages/zmq really necessary? [17:13:37] can probably just delete it [17:13:39] I couldn't actually even see what installed that tbh [17:13:47] ottomata: there is a shake shack downstairs! :) [17:13:52] this is from way back in the day [17:13:55] ohhhHHHh that would be eeasy [17:13:56] HMM [17:13:56] i got a special amnesty from mark [17:14:03] before i knew any puppet [17:14:09] ok, I'm not gonna break something by tossing that right? [17:14:10] :-D [17:14:16] yurik_, if I left like right now, got there by 1, would we have time to scarf and then get me to my meeting in time? [17:14:27] ottomata: yep [17:14:30] apergos: probably not, [17:14:30] ok on it, be there asap [17:14:37] if so it would be my fault and i'll fix it [17:14:37] yep yep [17:14:54] it's fully puppetized, should not depend on random bits [17:15:04] ok [17:15:15] I'll move it for now, so that just in case, etc. [17:15:20] thanks! [17:15:23] caution schmaution [17:15:26] thank you [17:15:26] ottomata: sorry, misleading, witchcraft sandwithes is downstairs [17:15:37] the shackshack is in madison sq park :( [17:15:43] chelsy market is nearbyish [17:15:54] sorry for channel spam :) [17:16:04] witchcraft in tribeca? [17:16:14] i used to work right by there [17:16:21] we are trying out a coworking space at 26 & west side highway [17:16:27] (03CR) 10Faidon Liambotis: [C: 04-1] "I have multiple reservations about this." [operations/puppet] - 10https://gerrit.wikimedia.org/r/88261 (owner: 10Dr0ptp4kt) [17:16:33] there you go yurik_ [17:19:31] paravoid: thanks, those are important points, but the current burning issue is the fact that many carriers who do not support opera are getting free banners incorrectly, causing massive mis-advertising for their users - seeing it as free when they are not. 
I agree that we should have a more thorough solution to proxies, but that would take much longer to solve [17:19:36] dr0ptp4kt: ^ [17:19:58] we are talking about millions of users btw [17:20:02] yurik_ thx [17:20:10] ori-l, still not sure what you mean, but perhaps you are advising me to wear my helmet! [17:20:11] ok! [17:20:13] why do ALL of your requests come as "urgent we need this now"? [17:20:24] back in a bit [17:20:29] sorry, your emergency is not my urgency [17:20:43] reminder for room! send me your updates for scrum of scrums, need them within an hour [17:20:45] :) [17:20:49] this is wrong, it has the potential of destroying our caches and our appservers currently don't have much leeway [17:21:59] paravoid: this patch has been in the gerrit for the past month and a half, and Zero has been on our cases for a while. If cache fragmentation was a concern, mark should have gotten the ESI in. He said two weeks ago that ESI is NOT his priority. [17:22:11] see logs [17:23:23] month and a half is definitly NOT burning, its just that we have had received a lot of heat over this from partners [17:23:46] we had the text-eqiad migration ongoing, so this couldn't get attention [17:24:12] plus, you had two patches to the same effect for a while, one from you and one from adam, so we were waiting for you to figure it out between each other first [17:24:29] anyway, I prioritized it and gave it a review as you requested [17:24:52] the review is factually correct, which has nothing to do with how long that patch has been sitting in gerrit :) [17:25:21] paravoid: we need a solution, not a review. having valid concerns do not solve issues [17:25:42] so as i said - if cache fragmentation is a concern, lets get ESI fixed [17:25:56] that would go from N carriers to 1 [17:26:03] that review is orthogonal to ESI [17:26:17] guys, let me respond in bugzilla. i'm not *too* worried about cache object proliferation here, as opera is really the only X-Forwarded-By field that will create measurably more objects...and in reality it's a core set of languages where this would occur. [17:26:29] getting to the blame game.... [17:26:35] let's face it, everyone is swamped. [17:26:50] esi could have gone in earlier if staffing hadn't drained down. [17:27:00] but there's nothing we can do about that. [17:27:19] paravoid: your core argument is cache fragmentation, and ESI was proposed as the solution, but later mark said it wasn't high on priorities, hence you can't use it as an argument :) [17:27:59] I'm not sure what you want as an outcome [17:28:08] i would have appreciated if the concerns were addressed in earlier comments, but paravoid i understand you weren't watching it too closely because it wasn't centrally on your plate. the thing i'm worried about here is the continued presentation of the inaccurate banners. i would like to move forward with this as a short-term solution, then plan for medium and longer term. can we live with that? [17:28:45] lemme respond in bugzilla, though... [17:29:08] there's no Bug header in that changeset, so please either point me to it or add me to Cc [17:29:16] (and thanks for being reasonable :) [17:29:34] paravoid: the outcome is that we should stop displaying banners to people who are not getting it for free [17:29:58] as they complain to the carriers and we get the blame. I understand your concerns - that's why i was trying to get ESI stuff in as fast as we could [17:30:56] and being severely understaffed and swamped is totally understood. 
I don't mean to blame anyone with my posts, just trying to get what's best for everyone [17:32:13] !log FPL link CV71028 flapped 20 minutes ago [17:32:27] Logged the message, Mistress of the network gear. [17:33:28] yurik_: my point is, I don't have the time to fix this and I'm not going to merge something that I know is wrong [17:33:53] I'm fine with finding a short/mid term fix, though, as dr0ptp4kt proposed. [17:34:42] !log CV71028 flapped again [17:34:57] Logged the message, Mistress of the network gear. [17:35:12] !log added palladium as a salt master for common-infrastructure4 firewall [17:35:27] Logged the message, Mistress of the network gear. [17:37:08] (03PS2) 10Umherirrender: enable Echo on all beta.wmflabs.org-wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95450 [17:39:44] paravoid: this header IS the short term solution :) (Adam is here now, we are discussing it together) For each carrier, we have two types of users - those that come directly, and those going through Opera. At least 7 carriers (one having 60+mil users) support opera, the rest do not. Those who support it need to be warned of navigating outside of Zero and need proper attributions, those who... [17:39:46] ...do not should not. [17:40:03] We could disable all of opera detection [17:40:12] in that case we are under-reporting banners [17:43:03] all of opera users will be treated as non-free, and if that carrier is "xx.zero.*" only, they won't be able to even access zero. site because it would show red banner and redirect to m. which is not free [17:43:10] (03CR) 10Ori.livneh: "@Faidon: Sure, sounds reasonable to me. I don't really have any claim to competence in this domain." [operations/puppet] - 10https://gerrit.wikimedia.org/r/95963 (owner: 10Faidon Liambotis) [17:47:55] paravoid: dr0ptp4kt how about we discuss this in a bit over hangout - this way we can all get to the same page instead of posting here and letting the issue drag out? This is a burning issue for us, and as time goes by we have more and more negativity from half of the carriers and users [17:49:23] guys, i said "bugzilla", but i meant gerrit for my notes. will post those soon enough. [17:50:45] back in a little while (food)... and then mostly not here for the evening (long day is long)... [17:54:39] (03CR) 10Faidon Liambotis: [C: 032] rename misc::parsoid to role::parsoid::production [operations/puppet] - 10https://gerrit.wikimedia.org/r/96270 (owner: 10Hashar) [17:55:10] Bit of a query courtesy of myself and tgr: Would testcommons.wikimedia.org, i.e. a stage0 wiki that acts as a foreign repo for testwikis, be a possibility at some point? [17:56:16] We'd mostly like it for testing things that we can't test locally or on the beta cluster - load balanced shared-db foreign repos primarily [17:57:30] (03PS2) 10Faidon Liambotis: Indent & format role/parsoid.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/96271 (owner: 10Hashar) [17:58:11] paravoid, dr0ptp4kt, I just sent an invite for a hangout at 15:00 EST/12:00 PST, lets get it sorted out [18:01:32] (03CR) 10Dr0ptp4kt: ""First of all, this is a wrong approach for doing what you are trying to do (treating opera as a carrier? lookups in the zero database? X-" [operations/puppet] - 10https://gerrit.wikimedia.org/r/88261 (owner: 10Dr0ptp4kt) [18:05:55] (03CR) 10Dr0ptp4kt: "Two post-notes. I meant "knew" instead of "new"." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/88261 (owner: 10Dr0ptp4kt) [18:14:42] (03CR) 10Faidon Liambotis: "> Who can do this, and how long will it take? I'm truly concerned we're going to be doubling the existing wait :(" [operations/puppet] - 10https://gerrit.wikimedia.org/r/88261 (owner: 10Dr0ptp4kt) [18:18:37] (03CR) 10Faidon Liambotis: [C: 032] Indent & format role/parsoid.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/96271 (owner: 10Hashar) [18:25:17] (03CR) 10Dr0ptp4kt: "On the MediaWiki side are you referring to setting up a separate namespace like NS_PROXIES? Or just having a set of well-defined URLs usin" [operations/puppet] - 10https://gerrit.wikimedia.org/r/88261 (owner: 10Dr0ptp4kt) [18:26:15] ottomata: going well. not like it matters as much but now that we've fixed all the issues I can do these rolling restarts while keeping the cluster green [18:28:36] paravoid: would it be ok with you if i take https://rt.wikimedia.org/Ticket/Display.html?id=6344 ? [18:31:38] !log ori synchronized php-1.23wmf3/resources/mediawiki/mediawiki.js 'Id2835eca4: Enable module storage for 0.05% of visitors w/storage-capable browsers' [18:31:52] Logged the message, Master [18:32:24] !log ori synchronized php-1.23wmf4/resources/mediawiki/mediawiki.js 'Id2835eca4: Enable module storage for 0.05% of visitors w/storage-capable browsers' [18:32:39] Logged the message, Master [18:38:56] apergos: fyi for later bookmark: https://wikitech.wikimedia.org/wiki/Nova_Resource:Planet [18:39:18] (there's instance venus and mars, afair one of them was puppetmaster::self and the other was regular) [18:39:20] thanks [18:39:29] for testing either way.. yw [18:40:05] qchris: *waves* when you have a moment; it appears like the extension-Collection group does not exist in gerrit -- I was hoping you could create it; re-add it to the extension/Collection repo; and add me to it [18:41:15] mwalker: Let me take a look ... [18:42:42] mwalker: The group exists ... but it's not publicly visible. [18:42:55] ah; ok; that sort of makes sense [18:43:10] in that case; can you add me, MaxSem, and cscott to it [18:43:11] Isn't this the repo that has been created a few days ago by ^d [18:43:23] Oh ... ^d is not around. [18:43:25] extension/Collection has been in existence for a while [18:43:33] chad created all the submodules for me [18:44:00] *yesterday he created them [18:44:12] Let me look at the setup more closely. [18:47:20] mwalker: Done. [18:47:27] qchris: thanks :) [18:48:23] !log kaulen overloaded at 100 % CPU since 5 min ago, bugzilla almost unusable [18:48:37] Logged the message, Master [18:49:07] mutante: wanna kick it? [18:51:22] uh it's better [18:52:48] Nemo_bis: eh, was gone to get food. yea, looks ok now [18:53:09] mutante: can you help qchris get RT access? [18:53:20] oh, sorry [18:53:24] he's already pinged you [18:53:31] !log but now ok [18:53:46] Logged the message, Master [18:54:16] ottomata: has he? [18:54:21] qchris?^ [18:54:23] qchris: need RT? [18:54:32] mutante: Yes please :-) [18:54:37] mutante: That sounds like a meme [18:56:12] qchris: please send a quick mail to ops-requests@rt.wikimedia.org (that autocreates a user if you dont already have one and i can just give it the permissions) [18:57:25] mutante: Do I have to send from a @wikimedia.org address? [18:58:21] qchris: i found your personal mail and forwarding it, so dont worry [18:58:28] mutante: Ok. Thanks. 
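Returning to the cache-fragmentation concern raised about change 88261: a rough sketch, with made-up numbers, of why varying cached pages on an extra carrier/proxy header worries ops, and why an ESI split (page body cached once, only the banner fragment varying per carrier) would shrink the problem from "N carriers" to roughly one:

# Made-up numbers, purely to show the shape of the problem.
urls = 1000000      # hypothetical count of cacheable mobile page URLs
carriers = 20       # hypothetical count of distinct carrier/header values

vary_whole_page = urls * carriers   # one full page object per (URL, carrier) pair
esi_fragments = urls + carriers     # body cached once, plus one banner per carrier
print("header-varied pages: %d" % vary_whole_page)
print("ESI objects:         %d" % esi_fragments)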
[18:58:59] (it's better to mail RT instead of people but i got it, will get back to you shortly via ticket) [18:59:37] mutante: I tried, but it told me that I lack permission to ask for permission :-( [19:00:13] hmm.. ops-requests@ should be unrestricted, we use it all the time [19:00:46] but yea, you'll get a login user [19:01:04] mutante: https://wikitech.wikimedia.org/wiki/RT <-- tells to use access-requests@... for all access requests [19:04:20] qchris: i see, we should clarify that. that was started mostly for shell access requests etc but of course it would also apply for getting RT access itself. [19:04:52] the role requestor is supposed to have perms on the ticket though [19:05:06] writes a note [19:07:02] !log reedy updated /a/common to {{Gerrit|I9b48be4c0}}: try to reduce "no working slave" notice flood in pmtpa dberror.log (attempt 2) [19:07:07] (03PS1) 10Reedy: Non wikipedias to 1.23wmf4 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96297 [19:07:16] Logged the message, Master [19:07:30] (03CR) 10Reedy: [C: 032] Non wikipedias to 1.23wmf4 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96297 (owner: 10Reedy) [19:07:49] (03Merged) 10jenkins-bot: Non wikipedias to 1.23wmf4 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96297 (owner: 10Reedy) [19:08:50] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: https://gerrit.wikimedia.org/r/96297 [19:09:06] Logged the message, Master [19:18:21] (03PS6) 10Reedy: Enable MassMessage on all wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/91344 (owner: 10Legoktm) [19:19:45] (03PS7) 10Reedy: Enable MassMessage on all wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/91344 (owner: 10Legoktm) [19:20:01] (03CR) 10Reedy: [C: 032] Enable MassMessage on all wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/91344 (owner: 10Legoktm) [19:20:11] (03Merged) 10jenkins-bot: Enable MassMessage on all wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/91344 (owner: 10Legoktm) [19:22:09] !log reedy synchronized wmf-config/ 'Enable MassMessage Ib29cb042e739e26fc5ef56f41317454ea690d0cb' [19:22:22] Logged the message, Master [19:22:28] legoktm: ^^ [19:22:38] omgspam. [19:22:51] Reedy: <3 Love it [19:23:05] Can we take bets now? [19:23:21] What are we betting on? 
:p [19:27:28] PROBLEM - Puppet freshness on analytics1021 is CRITICAL: No successful Puppet run in the last 3 hours [19:30:48] Reedy: tell me the bet :( [19:37:53] !log ori synchronized php-1.23wmf3/extensions/WikimediaEvents 'Update WikimediaEvents to 4aa9c629e5: Controlled experiment to assess performance of module storage' [19:37:57] (03PS1) 10Manybubbles: Install curl when installing Elasticsearch [operations/puppet] - 10https://gerrit.wikimedia.org/r/96305 [19:38:09] Logged the message, Master [19:38:55] !log ori synchronized php-1.23wmf4/extensions/WikimediaEvents 'Update WikimediaEvents to 4aa9c629e5: Controlled experiment to assess performance of module storage' [19:39:10] Logged the message, Master [19:39:37] Reedy: done thanks [19:39:46] cool [19:39:50] * Reedy looks at mediawiki-config [19:40:31] (03PS2) 10Reedy: (bug 57042) Update $wgUploadNavigationUrl on tewiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95459 (owner: 10Odder) [19:40:37] (03CR) 10Reedy: [C: 032] (bug 57042) Update $wgUploadNavigationUrl on tewiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95459 (owner: 10Odder) [19:44:27] (03Merged) 10jenkins-bot: (bug 57042) Update $wgUploadNavigationUrl on tewiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95459 (owner: 10Odder) [19:44:50] (03PS3) 10Reedy: (bug 56334) Namespace l10n for angwiki and angwiktionary [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95846 (owner: 10Odder) [19:44:56] (03CR) 10Reedy: [C: 032] (bug 56334) Namespace l10n for angwiki and angwiktionary [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95846 (owner: 10Odder) [19:49:11] !log reedy synchronized php-1.23wmf3/extensions/MassMessage 'Update to master' [19:49:25] Logged the message, Master [19:52:01] (03PS2) 10Dzahn: retab misc/pdf and role/pdf [operations/puppet] - 10https://gerrit.wikimedia.org/r/96190 [19:52:13] (03CR) 10Dzahn: retab misc/pdf and role/pdf (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/96190 (owner: 10Dzahn) [19:53:48] (03Merged) 10jenkins-bot: (bug 56334) Namespace l10n for angwiki and angwiktionary [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95846 (owner: 10Odder) [19:58:07] (03PS3) 10Reedy: Change numerous global functions for anonymous ones [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95479 [19:58:12] (03CR) 10Reedy: [C: 032] Change numerous global functions for anonymous ones [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95479 (owner: 10Reedy) [20:00:44] !log ori synchronized php-1.23wmf4/extensions/WikimediaEvents 'Update WikimediaEvents to d4f2a43' [20:01:00] Logged the message, Master [20:01:32] !log ori synchronized php-1.23wmf3/extensions/WikimediaEvents 'Update WikimediaEvents to d4f2a43' [20:01:33] Reedy: ok, done-done [20:01:39] heh [20:01:40] ook [20:01:46] Logged the message, Master [20:02:29] * ori-l facepalms [20:02:39] no, i didn't update the submodule [20:02:50] xD [20:03:54] * bd808 notes that if you put your had palm up on your desk and then place your forehead on your palm you can facepalm and headdesk simultaneously. 
[20:04:03] *hand [20:04:21] !log ori synchronized php-1.23wmf3/extensions/WikimediaEvents 'Update WikimediaEvents to d4f2a43' [20:04:35] Logged the message, Master [20:04:43] !log ori synchronized php-1.23wmf4/extensions/WikimediaEvents 'Update WikimediaEvents to d4f2a43' [20:04:44] PROBLEM - Apache HTTP on mw1152 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:04:46] Reedy: done-done-done. [20:04:58] Logged the message, Master [20:05:02] um. [20:05:16] mw1152? [20:05:34] PROBLEM - Apache HTTP on mw1151 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:05:39] Fatal error: Call to undefined method TableDiffFormatterFullContext::_start_diff() in /usr/local/apache/common-local/php-1.23wmf4/extensions/AbuseFilter/Views/AbuseFilterViewDiff.php on line 30 [20:05:44] RECOVERY - Apache HTTP on mw1152 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.026 second response time [20:06:11] * MatmaRex hides [20:06:11] not related to my change [20:06:14] PROBLEM - Apache HTTP on mw1150 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:06:25] RECOVERY - Apache HTTP on mw1151 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.833 second response time [20:06:30] that fatal may or may not be my fault. i used some technically private core class in that AF code [20:06:40] to render pretty diffs [20:06:59] siebrand probably broke it in his documentation and cleanup spree [20:08:00] _start_diff, _block and _end_diff [20:08:06] mw1152 is bits [20:08:09] (03CR) 10Dzahn: [C: 032] "done per hashar's comment." [operations/puppet] - 10https://gerrit.wikimedia.org/r/96190 (owner: 10Dzahn) [20:08:41] Looks like the _ just need to go away [20:10:14] RECOVERY - Apache HTTP on mw1150 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.381 second response time [20:10:38] What's up with jenkins? [20:11:48] (03CR) 10Reedy: "recheck" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95479 (owner: 10Reedy) [20:11:55] Reedy: Did you just accuse me of breaking ur Wikapeadia without proof? [20:12:06] siebrand: Nope, MatmaRex did [20:12:18] Reedy: Phew :) Might as well have been me :) [20:12:24] shush ;) [20:12:39] easily enough fixed [20:13:12] (03Merged) 10jenkins-bot: Change numerous global functions for anonymous ones [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95479 (owner: 10Reedy) [20:13:25] https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&s=by+name&c=Bits%2520application%2520servers%2520eqiad&tab=m&vn=&hide-hf=false [20:13:29] Hmm [20:13:34] the bits caches too [20:13:37] Load spike and then a big network I/O drop [20:13:38] siebrand: well, technically it was you, but that was because i used some private-but-not-marked-as-such functions in AbuseFilter code [20:13:49] ... [20:13:53] so i guess the blame lays with me [20:14:00] Back to drinks in hotel room after midnight.... [20:14:02] I pushed a module update [20:14:10] or maybe MaxSem, he poked around differenceengine and things recently too [20:14:27] * MaxSem hides out of shame [20:14:37] [20:13:38] (CR) Hoo man: [C: 2] "I wonder how that went through code review... 
(might be my fault, though)" [extensions/AbuseFilter] - https://gerrit.wikimedia.org/r/96318 (owner: Reedy) [20:14:47] hoo is also admitting possible blame [20:14:50] busy threads is highhttps://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=Bits+application+servers+eqiad&h=mw1152.eqiad.wmnet&jr=&js=&v=40&m=ap_busy_workers&vl=threads&ti=Busy+Threads [20:14:55] looks like a lot of my submodules were outdated [20:15:14] PROBLEM - Apache HTTP on mw1150 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:15:29] (03PS3) 10Reedy: (bug 44629) Clean up $wgMetaNamespace, $wgMetaNamespaceTalk [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95796 (owner: 10Odder) [20:15:42] (03CR) 10Reedy: [C: 032] (bug 44629) Clean up $wgMetaNamespace, $wgMetaNamespaceTalk [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95796 (owner: 10Odder) [20:16:08] geez, I hope this doesn't break anything :-) [20:16:14] RECOVERY - Apache HTTP on mw1150 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.699 second response time [20:16:46] ori-l: I've got a probable cause, but apparently a lot more lag this week than last monday [20:16:54] what's the probable cause? [20:17:11] Switching all those wikis to 1.23wmf4 in one go [20:17:22] We had a similar problem last week, but it was more more quickly after [20:17:29] ie not over an hour later [20:18:08] doesn't seem probably, esp since the increase here is sudden: https://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=Bits+application+servers+eqiad&h=mw1152.eqiad.wmnet&jr=&js=&v=40&m=ap_busy_workers&vl=threads&ti=Busy+Threads [20:18:32] probable [20:18:53] Your fault then? :P [20:20:10] ori-l, https://graphite.wikimedia.org/dashboard/temporary-36 [20:20:12] the last time this happened, the incorrect symlink to the autonym font was causing lots of 404 requests [20:20:50] but this doesn't appear to be the case, i don't see new requests or failed requests on en, mw.o, or sv.o [20:20:53] or this is usual RL post-scap pertrubance? [20:21:00] looks most likely so far [20:21:18] esp. since it appears to be dropping [20:21:27] No one ran scap ;) [20:22:10] cache bust due to module perturbance [20:22:19] yeah, back to normal now [20:23:06] well Reedy, wikiversion changes are a good approximation:P [20:26:35] Traffic levels are back up [20:31:48] (03Merged) 10jenkins-bot: (bug 44629) Clean up $wgMetaNamespace, $wgMetaNamespaceTalk [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95796 (owner: 10Odder) [20:34:26] (03PS2) 10Reedy: Make current $wgNoFollowLinks = true explicit [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94508 (owner: 10Nemo bis) [20:34:31] (03CR) 10Reedy: [C: 032] Make current $wgNoFollowLinks = true explicit [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94508 (owner: 10Nemo bis) [20:35:31] (03PS1) 10Dzahn: tabbing, quoting & aligning [operations/puppet] - 10https://gerrit.wikimedia.org/r/96354 [20:36:44] PROBLEM - Host analytics1021 is DOWN: PING CRITICAL - Packet loss = 100% [20:40:26] greg-g: Confirming that we (Jon and I) are deploying VectorBeta in 20 minutes [20:43:45] no more traffic issues with bits? 
[20:43:58] * marktraceur doesn't know [20:44:26] Back to normal [20:45:21] looks ok [20:45:26] yeah, dangit Reedy [20:45:30] :) [20:45:36] marktraceur: go forth and break the cluster [20:45:41] I mean [20:45:45] WOO [20:45:54] * marktraceur rm -rf /a/common [20:46:42] i've still got a couple of things to deploy [20:46:49] waiting on jenkins currently... [20:47:13] Reedy: Poke me when you're done? I can do prep until then and standby 'til you're ready. [20:47:32] Oook [20:48:02] (03Merged) 10jenkins-bot: Make current $wgNoFollowLinks = true explicit [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94508 (owner: 10Nemo bis) [20:48:11] 1 to go [20:48:29] (03PS2) 10Reedy: Set zero load on snapshot hosts [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95957 (owner: 10Tim Starling) [20:48:35] (03CR) 10Reedy: [C: 032] Set zero load on snapshot hosts [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95957 (owner: 10Tim Starling) [20:50:03] (03PS2) 10Ottomata: Install curl when installing Elasticsearch [operations/puppet] - 10https://gerrit.wikimedia.org/r/96305 (owner: 10Manybubbles) [20:50:27] (03PS2) 10MarkTraceur: Enable VectorBeta on group0 wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95486 [20:50:53] (03CR) 10Ottomata: [C: 032 V: 032] Install curl when installing Elasticsearch [operations/puppet] - 10https://gerrit.wikimedia.org/r/96305 (owner: 10Manybubbles) [20:55:08] c'mon jenkins! [20:55:14] (03Merged) 10jenkins-bot: Set zero load on snapshot hosts [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95957 (owner: 10Tim Starling) [20:55:20] heh [20:55:51] !log reedy synchronized wmf-config/ [20:55:58] marktraceur: All good to go [20:56:04] Just be aware Jenkins is being very slow today [20:56:07] Coolio [20:56:08] Logged the message, Master [20:56:15] * marktraceur bats eyelashes at greg-g [21:00:16] * marktraceur stops flirting with greg-g and starts deploying [21:01:47] (03CR) 10MarkTraceur: [C: 032 V: 032] Enable VectorBeta on group0 wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95486 (owner: 10MarkTraceur) [21:01:51] (03PS1) 10Dzahn: retab, quoting, linting of ishmael.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/96362 [21:02:18] Reedy: is there a specific reason why you're not deploying https://gerrit.wikimedia.org/r/#/c/94607/ ? [21:02:38] Not really [21:02:44] I haven't checked in with springle-afk about it [21:02:55] after the last set of issues (was it fr?) [21:03:09] iswiki is much smaller than fr [21:03:12] RECOVERY - Host analytics1021 is UP: PING OK - Packet loss = 0%, RTA = 1.06 ms [21:03:24] (03PS2) 10Dzahn: retab, quoting, linting of ishmael.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/96362 [21:03:38] The Icelandic Wikipedia went live on 5 December 2003 and now contains 36,643 articles.
[21:08:07] !log mholmquist updated /a/common to {{Gerrit|Ic3a64b434}}: Enable VectorBeta on group0 wikis [21:08:19] Logged the message, Master [21:09:32] !log mholmquist synchronized wmf-config/CommonSettings.php 'Add wmgUseVectorBeta' [21:09:48] Logged the message, Master [21:10:17] !log mholmquist synchronized wmf-config/InitialiseSettings.php 'Set wmgUseVectorBeta = true for phase0 wikis' [21:10:32] Logged the message, Master [21:12:04] Scapping now for the VectorBeta deploy, things look like they're working [21:12:08] Hold on to your hats [21:15:23] (03CR) 10jenkins-bot: [V: 04-1] retab, quoting, linting of ishmael.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/96362 (owner: 10Dzahn) [21:16:49] Argh, config change sync broke mediawiki.org because the code isn't there yet [21:16:56] Plz be patient with me :) [21:17:35] !log mholmquist Started syncing Wikimedia installation... : Adding VectorBeta [21:17:51] Logged the message, Master [21:18:46] Scap isn't the best way to fix this [21:19:06] Revert the config and/or just sync-dir the folder [21:19:34] Reedy: Should I stop the scap or do this in parallel? [21:20:27] Should be fine in parallel [21:20:42] (we know mw.org is down) [21:20:45] Yarp [21:21:08] Syncing [21:21:22] !log mholmquist synchronized php-1.23wmf4/extensions/VectorBeta/ [21:21:29] Fixed! [21:21:31] Thanks Reedy [21:21:37] Logged the message, Master [21:21:43] (03CR) 10Legoktm: "It also needs to be added to wmf-config/extension-list." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95486 (owner: 10MarkTraceur) [21:22:39] legoktm: Does it? I've heard conflicting advice on that front. [21:22:52] If you want any localisation it does [21:23:15] Christ [21:23:22] heheheh [21:23:32] indeed, and missing right now :P [21:24:02] Well that's because the scap isn't done [21:24:02] RECOVERY - Puppet freshness on analytics1021 is OK: puppet ran at Tue Nov 19 21:23:59 UTC 2013 [21:24:07] Or so I thought [21:24:14] My other extension config changes haven't done that [21:24:19] wow, Vectorbeta crops the logo on mediawiki.org :) [21:24:27] In fact, I *added* it, and someone told me to take it out [21:24:37] (not for VB, for MMV) [21:24:47] Probably me [21:24:49] Temporarily [21:24:51] jdlrobson: See Eloquence's note [21:24:55] kk [21:24:56] AAaaaaaghhhh [21:25:01] it can cause issues if its not in both wmf branches iirc [21:25:05] (the extension) [21:25:07] can/will [21:25:34] Luckily I'm only 24 minutes into my deploy window [21:25:38] Plennya time [21:26:14] aha! [21:26:33] so the big square next to section titles in "Beta" is there to tick it! [21:26:40] eureka! [21:27:13] Eloquence: seems like we've revealed a core bug :) the container holding the image settings a width of 10em but doesn't specify a background size mm [21:27:40] (03PS1) 10MarkTraceur: Add VectorBeta to extension-list [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96369 [21:28:13] twkozlowski: You're not the first person to miss the "that's a checkbox" eureka moment :) [21:29:11] marktraceur: then maybe that's not that obvious to people, I suppose... [21:29:22] Yeah, we know [21:29:28] It's an open design issue [21:29:40] design issues later, broke wikis now :) [21:29:41] Personally I blame RoanKattouw_away [21:29:45] greg-g: They're fixed! [21:29:47] oh [21:29:48] good! 
[21:29:50] Besides you told me to break the cluster [21:29:53] I couldn't let you down [21:29:53] design issues now then [21:29:59] ;-) [21:30:00] marktraceur: last time I joke with you [21:30:14] greg-g: Breaking the cluster is srs bsnss [21:30:40] YuviPanda: By the way, this time I totally accept responsibility [21:30:58] Where's my barnstar [21:31:22] careful .. those things are sharp and pointy [21:31:50] Eloquence: Not planning on going through security in Tokyo with 'em [21:37:33] Sometimes when I watch scap I wonder if I've seen a server name before and it's just randomly printing (mw|srv)\d\d\d?\d? to confuse me [21:38:38] marktraceur: it sure seems that way, eh? [21:38:51] Lil bit [21:39:05] it's how we make sure all deployers go insane [21:39:22] greg-g: But you'd have to be insane to be a deployer [21:39:22] see also: Reedy [21:39:25] It's a srv22 [21:39:35] marktraceur: good try [21:41:39] jdlrobson: https://wikitech.wikimedia.org/wiki/How_to_deploy_code [21:41:50] !log mholmquist Finished syncing Wikimedia installation... : Adding VectorBeta [21:42:06] Logged the message, Master [21:42:42] (03CR) 10MarkTraceur: [C: 032] Add VectorBeta to extension-list [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96369 (owner: 10MarkTraceur) [21:42:51] (03CR) 10MarkTraceur: [V: 032] Add VectorBeta to extension-list [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96369 (owner: 10MarkTraceur) [21:43:55] !log mholmquist updated /a/common to {{Gerrit|Ib532e5558}}: Add VectorBeta to extension-list [21:44:11] Logged the message, Master [21:44:44] "I'm getting intermittent style load failures again today" [21:44:54] says who? [21:45:00] the sync causing 50# again ? [21:45:05] on VP/T [21:45:06] !log mholmquist synchronized wmf-config/extension-list 'Add VectorBeta to extension-list' [21:45:07] ugh [21:45:11] thedj: on what wiki? [21:45:14] en.wp [21:45:21] now that's a full report :) [21:45:23] Logged the message, Master [21:45:52] So now I'm just gonna do mw-update-l10n and sync-l10nupdate-1 1.23wmf4 because there's no code changes [21:46:09] Just...fyi [21:46:29] thedj: that was an hour ago or so [21:46:35] when something else allegedly broke [21:46:45] Reedy probably knows [21:48:32] greg-g: wmf3 l10n update failed because VectorBeta isn't in it, but that shouldn't matter, should it? [21:49:04] ah right, i see it in the irc log [21:49:58] voiced answer [21:50:14] Reedy: greg-g says it does so I'm going to add VectorBeta to wmf3 and sync-dir it, but he said to double-check with you [21:50:34] It's weird because I'm not deploying VB to any wmf3 wikis, ever [21:50:47] marktraceur: Yeah [21:50:57] In theory it shouldn't, but scap gets pissed off [21:51:01] right [21:51:07] So my habit has been to just stage it in the previous [21:51:13] Sigh. [21:51:15] OK [21:51:17] On it [21:51:29] greg-g: May need like...10-20 extra minutes, but I see nobody else is scheduled [21:52:10] marktraceur: yeah, you're good [21:53:46] but that issue from 1,5 hour ago. how did that get to affect bits for en.wp ? [21:54:33] PROBLEM - Backend Squid HTTP on sq37 is CRITICAL: Connection refused [21:55:13] fatal in abusefilter it was ? [21:56:44] nah [21:56:48] That hasn't 'been fixed on the cluster [21:59:13] !log mholmquist synchronized php-1.23wmf3/extensions/VectorBeta/ 'Sync VectorBeta code to wikis that will never use it' [21:59:27] Logged the message, Master [21:59:43] * marktraceur - helpful log messages are his specialty [22:02:11] So now sync-l10nupdate-1 on both branches? 
This is totally surreal to me, updating an l10n cache that doesn't need to be updated. [22:02:22] Reedy: ^ [22:02:35] Nope [22:02:43] just do wmf4 [22:02:49] 'kaaaaay [22:02:59] marktraceur is having a break down over here [22:03:48] I'm pretty sure there's a bug for this specific issue ;) [22:03:55] what would be the 2 sentence description of this craziness and solution? :) [22:04:03] Well [22:04:17] ie: I'm looking for bug fodder for tech debt list [22:04:19] extensions deployed on only one branch have to have source tree on all active branches [22:04:30] Differentiating between "extension is in wmf4" and "extension is not in wmf3" would be good, and then having the scripts be aware of that. [22:04:35] otherwise l10nupdate breaks [22:04:44] so, the issue is really in l10n? [22:04:53] Which is why we added that "feature" for labs [22:04:57] right [22:04:58] So they can have their own extensions [22:05:07] should we mimic that feature for production? [22:05:20] if ( file_exists( "$wmfConfigDir/extension-list-$wmfExtendedVersionNumber" ) ) { [22:05:20] $wgExtensionEntryPointListFiles[] = "$wmfConfigDir/extension-list-$wmfExtendedVersionNumber"; [22:05:20] } [22:05:43] I did, but it add some amount of maintenance load [22:05:49] right, that thing you hacked in that day I wa sworried broke things :) [22:06:02] (even though it didn't, apparently) ;) [22:06:10] The file needs to be there for the new deploy and the current [22:06:26] Which can sort of only be done after the oldest one has been removed [22:06:55] I guess it needs to be made almost conditional [22:06:57] this sounds like an imprecise bug report :) [22:07:16] Almost imprecise [22:07:26] git submodule add && sync-dir is less work [22:07:50] if people realize they need to [22:07:56] that's the issue, I believe [22:08:00] yeah [22:08:07] something should either warn them, or deal with it [22:08:11] Scap should go lolno and not let them continue [22:08:17] right [22:08:22] E_LOLNO [22:08:35] In theory we shouldn't care that X isn't in branch A, but is in branch B [22:08:48] but apparently we have to :) [22:08:49] but has the potential to be a bug [22:11:03] the question being quite how we handle this issue [22:11:33] Making extension-list a bit less simple might work [22:12:12] having "$IP/path/to/Extension/Extension.php" => array ( '1.23wmf4+' ) [22:12:22] or something to denote from X version forward [22:12:59] * greg-g nods [22:13:02] I like that [22:13:06] file a bug! [22:13:54] I'm trying to remember why Aaron added the " *" after the entries in wikiversions.dat [22:15:07] OK, now that there are messages at https://www.mediawiki.org/wiki/Special:Preferences#mw-prefsection-betafeatures I think we can close my window [22:15:08] list( $dbName, $version, $extVersion ) = $items; [22:15:22] marktraceur: good, it was getting cold in here [22:15:39] HAH [22:15:42] I say again [22:15:43] HAH [22:16:03] oh, I forgot i wasn't going to joke with you anymore [22:16:48] Aaron|home: AaronSchulz What was the reason for adding the asterix at the end of the wikiversions.dat lines? [22:18:06] to use beta versions of extensions on certain wikis (which requires splitting some caches) [22:18:15] nobody uses that though... [22:20:15] marktraceur: everything ok? [22:20:26] (meant to hit enter a little bit ago, forgot) [22:21:15] Seems like, yes. 
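A sketch of the semantics behind the "1.23wmf4+" idea floated above for extension-list entries. This is illustrative only; the real extension-list and multiversion tooling are PHP, and none of the helper names below exist there:

# Toy model: each extension-list entry can say from which wmf branch the
# extension exists, so tools like l10nupdate skip it on older branches
# instead of failing. Paths and names are examples, not the real file.
import re

EXTENSIONS = {
    "$IP/extensions/VectorBeta/VectorBeta.php": "1.23wmf4+",  # only from wmf4 onwards
    "$IP/extensions/AbuseFilter/AbuseFilter.php": None,       # present on every branch
}

def branch_key(version):
    # turn "1.23wmf4" into a sortable tuple (1, 23, 4)
    major, minor, wmf = re.match(r"(\d+)\.(\d+)wmf(\d+)$", version).groups()
    return (int(major), int(minor), int(wmf))

def active_extensions(branch):
    for path, spec in sorted(EXTENSIONS.items()):
        if spec is None or branch_key(branch) >= branch_key(spec.rstrip("+")):
            yield path

print(list(active_extensions("1.23wmf3")))  # VectorBeta filtered out
print(list(active_extensions("1.23wmf4")))  # both listed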
[22:21:23] With the deploy anyway [22:21:49] yeah, other life things we can talk about in another channel, but for here, good [22:53:23] !log reedy synchronized php-1.23wmf4/extensions/AbuseFilter/ 'bug 57268' [22:53:37] Logged the message, Master [22:55:40] (03PS1) 10Bsitu: Remove html tag from echo email footer address [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96392 [22:56:37] (03CR) 10Bsitu: [C: 04-2] Remove html tag from echo email footer address [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96392 (owner: 10Bsitu) [23:03:19] RECOVERY - RAID on analytics1009 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [23:03:20] RECOVERY - Disk space on analytics1013 is OK: DISK OK [23:03:29] RECOVERY - SSH on analytics1009 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [23:03:29] RECOVERY - puppet disabled on analytics1009 is OK: OK [23:03:29] RECOVERY - SSH on analytics1013 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [23:03:29] RECOVERY - puppet disabled on analytics1013 is OK: OK [23:03:29] RECOVERY - DPKG on analytics1009 is OK: All packages OK [23:03:30] RECOVERY - Disk space on analytics1009 is OK: DISK OK [23:03:30] RECOVERY - DPKG on analytics1013 is OK: All packages OK [23:03:31] RECOVERY - RAID on analytics1013 is OK: OK: no disks configured for RAID [23:03:31] RECOVERY - Host analytics1009 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [23:03:32] RECOVERY - Host analytics1013 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [23:09:15] (03PS1) 10Jdlrobson: Disable infobox experiment [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96394 [23:10:56] (03CR) 10Ori.livneh: [C: 031] "Yeah." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96394 (owner: 10Jdlrobson) [23:12:51] (03CR) 10Ori.livneh: [C: 032] Disable infobox experiment [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96394 (owner: 10Jdlrobson) [23:13:20] !log ori updated /a/common to {{Gerrit|Iee7332c97}}: Disable infobox experiment [23:13:35] Logged the message, Master [23:14:15] !log ori synchronized wmf-config/InitialiseSettings.php 'Iee7332c97: Disable infobox experiment' [23:14:30] Logged the message, Master [23:23:38] gwicke: we shouldn't be advertising URLs with hostnames specific to a DC. if that's truly going to be a supported public facing service (a la the API) then we can get it a decent hostname [23:24:02] IMO [23:25:41] PROBLEM - DPKG on ms-be1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:25:41] PROBLEM - swift-object-replicator on ms-be1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:25:41] PROBLEM - swift-account-server on ms-be1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:25:41] PROBLEM - swift-container-server on ms-be1001 is CRITICAL: Timeout while attempting connection [23:25:49] PROBLEM - RAID on ms-be1001 is CRITICAL: Timeout while attempting connection [23:25:49] PROBLEM - swift-account-auditor on ms-be1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:25:49] PROBLEM - swift-account-replicator on ms-be1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:25:49] PROBLEM - swift-object-updater on ms-be1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:25:49] PROBLEM - swift-object-server on ms-be1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:25:49] PROBLEM - swift-object-auditor on ms-be1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[23:25:49] PROBLEM - Disk space on ms-be1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:25:50] PROBLEM - puppet disabled on ms-be1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:25:50] PROBLEM - swift-container-replicator on ms-be1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:25:59] PROBLEM - swift-account-reaper on ms-be1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:25:59] PROBLEM - swift-container-updater on ms-be1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:26:09] PROBLEM - swift-container-auditor on ms-be1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:26:12] ohai icinga-wm [23:27:10] eh? [23:27:21] greg-g: eh! [23:27:33] wth [23:28:05] * jeremyb tries to divine what greg-g is responding to [23:28:17] jeremyb: icinga explosion [23:28:18] maybe i need a divining rod for that [23:28:24] dowsing [23:28:27] greg-g: it's all one box. it's not an spof afaik [23:28:37] still [23:29:07] greg-g: just needs some service dependencies so it knows that if the box is down then it shouldn't complain about other stuff on that box. [23:29:10] famous last words [23:32:15] jeremyb: this will be an API for a few months while we are getting the content API ready [23:32:26] i don't follow [23:32:27] I don't expect us to move away from equiad in that timeframe [23:32:34] grrrrr [23:32:37] i don't care [23:32:54] http://www.w3.org/Provider/Style/URI.html [23:32:56] !! [23:33:00] and in any case, only you know that equiad is a DC [23:33:33] plusone to a real uri [23:33:34] :) [23:33:40] changing uris sucks [23:33:46] * gwicke tries to teach his spellcheck that eqiad is speled eqiad [23:34:00] heh, I've given up with spellcheck :) [23:34:09] greg-g: that URL will cease to work a few months from now [23:34:13] and this is what ops came up with [23:34:37] gwicke: is there a ticket or something where ops came up with that? [23:34:38] it's not my idea [23:35:00] rt 6107 [23:35:34] Roan proposed the domain, to be precise [23:36:10] I'd care more if this was something for the longer term [23:36:17] gwicke: maybe that's ok for some uses. it's not ok if you're advertising on a mailing list [23:36:22] roan said "or whatever will suffice." [23:36:43] (again, IMO) [23:36:43] jeremyb: lobby for another domain if you care [23:36:49] I don't [23:36:50] gwicke: i will do just that :) [23:38:51] hrmmmmmmmmmm [23:39:20] well don't have to tackle localssl right now at least because it's eqiad only [23:39:24] for now [23:39:27] \o/ [23:40:50] gwicke: anyway, what about the gauranteed to be around forever part. is this service going to be permanent? [23:41:02] jeremyb: no, as I said in the mail [23:41:05] (i.e. should we maybe use a hostname with something like -test. in it) [23:41:40] it is production (not testing), but a temporary API [23:44:10] gwicke: ok, i think i don't understand entirely. the new API will be permanent and also a superset of the just announced one? will it be trivial to redirect all requests for just announced over to the new one? (how will you map URLs from one to the other?) 
[23:45:03] the new API will be a REST content API [23:45:21] the just announced one looks approximately like rest [23:45:22] redirecting should be possible, but I don't see a need for it [23:45:33] it is not general though [23:45:52] no wikitext, metadata, point-in-time query etc [23:46:19] https://www.mediawiki.org/wiki/User:GWicke/Notes/Storage has ideas for the content API [23:49:37] gwicke: the basic GETs you support now (/enwiki/Main_Page && /enwiki/Main_Page?oldid=...) will be easily mapped to some new URL? [23:49:44] i care less about the POSTs [23:50:39] jeremyb: since we know all major users of this temporary API I don't plan to bend over backwards too much [23:51:23] we can keep the old URL online for a bit to give people time to change their config, but then we can just switch off the old one [23:52:08] ok, let me look at this another way: [23:53:47] if we have 2 primary DCs and eqiads offline and parsoid's still only in eqiad is it ok for you to have an extended outage of this new service? or is that impossible because e.g. visual editor wouldn't tolerate such an extended outage? [23:54:15] jeremyb: we'll have two primary DCs within the next two months? [23:54:18] that's news to me [23:54:43] !log ori synchronized php-1.23wmf3/resources/startup.js 'touch' [23:54:58] Logged the message, Master [23:55:02] gwicke: i guess it's highly unlikely. but even if we had the space we wouldn't have the hardware all ready [23:55:11] we plan to have the replacement API ready before January [23:55:12] gwicke: i didn't specify a timeframe :) [23:55:32] oh, wow, is half of november gone already? huh [23:57:57] gwicke: so i think at the very least we should be either redirecting requests to the old place the root of the new place or some page with info about the new place or serve a static page in place with info about the new place. even if there's not a mapping of new path to old path at least they don't just get host not found, etc. [23:58:10] there will be eventlogging alerts in a moment [23:58:13] i'm on top of it [23:58:41] !log Restarting EventLogging jobs on vanadium [23:58:57] Logged the message, Master [23:59:22] jeremyb: sure, we can leave behind a message for a bit [23:59:58] there are only a handful external users though, and we are in contact with them
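A minimal sketch of the "leave behind a message for a bit" idea: a throwaway placeholder that answers requests to the retired DC-specific hostname with a redirect and a short note, so clients do not simply get a resolution failure. The successor base URL below is a made-up placeholder; nothing has been decided:

# Tiny stand-in for a retired API hostname: answer every GET with a 301 to a
# successor base URL plus a one-line explanation.
try:
    from http.server import BaseHTTPRequestHandler, HTTPServer      # Python 3
except ImportError:
    from BaseHTTPServer import BaseHTTPRequestHandler, HTTPServer   # Python 2

NEW_BASE = "https://rest.example.org"  # placeholder; the real successor URL is undecided

class MovedHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        target = NEW_BASE + self.path
        body = ("This endpoint has moved; see %s\n" % target).encode("utf-8")
        self.send_response(301)
        self.send_header("Location", target)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 8080), MovedHandler).serve_forever()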