[00:00:54] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 3 hours [00:04:16] I'm seeing occasional 503s in one of the Parsoid backend varnishes (cp1058). We don't emit those from Parsoid, so I am wondering whether that could be a varnish or LVS issue. [00:07:44] anybody around with varnish / LVS knowledge and the rights? [00:12:20] according to https://wikitech.wikimedia.org/wiki/Parsoid, this might provide clues on parsoid.svc.eqiad.wmnet: tail -f /var/log/pybal.log | grep parsoid [00:13:00] ottomata seems to be offline [00:13:04] any roots around? [00:14:11] what's up? [00:14:56] see above [00:15:08] do you have the output of such a 503? [00:15:23] cp1058.eqiad.wmnet 701 2013-11-19T00:06:20 0.000141859 10.64.32.97 miss/503 419 GET http://parsoid/enwiki/Princess_Theatre,_Torquay?oldid=527921922 - - - 10.64.0.32, 10.64.0.32 - [00:15:51] the page itself is working: http://parsoid-lb.eqiad.wikimedia.org/enwiki/Princess_Theatre,_Torquay?oldid=527921922 [00:16:28] the load on the backends and the parsoid logs look normal [00:16:36] is it just cp1058? [00:16:40] yes, oddly [00:16:46] cp1058 backend [00:17:02] I did varnishncsa | egrep '/(4|5)' | grep -v 'miss/412' [00:17:03] 670 VCL_call c miss fetch [00:17:04] 670 FetchError c no backend connection [00:17:04] 670 VCL_call c error deliver [00:18:18] 5886 tcp connections open [00:18:54] slightly more than cp1045 [00:20:05] should be too low to be an issue in itself [00:21:02] paravoid: could you restart varnish to see if that makes a difference? [00:21:05] <^demon|busy> mwalker: I'm gonna set them up now. Is collectoid a repo as well? [00:21:18] yep [00:21:28] it'll be a submodule of Collection [00:21:45] <^demon|busy> And then the other 3 will be submodules of that? [00:21:57] yep [00:22:24] deployment is going to be a pita [00:22:43] <^demon|busy> The other option is not using submodules then ;-) [00:22:48] <^demon|busy> Then deployment isn't a pita. [00:23:08] not sure that's so true; parsoid is taking the no submodules approach [00:23:18] we'll probably end up somewhere in the middle eventually [00:23:40] we are moving towards submodules [00:23:57] gwicke: do you have a ping setup for parsoid? [00:24:07] or are you just omnicient [00:24:09] <^demon|busy> Also, as I do anytime this happens, I'm lodging my single objection towards the word -oid, as I think it's completely lame. [00:24:23] mwalker: I do have a ping, but am on this channel currently anyway [00:24:26] if you have a better name; please for the love of god use it :) [00:24:31] debugging a varnish issue [00:24:31] * ^demon|busy knows his complaints go ignored, and will just get linked to the wiktionary article on -oid, so he makes the repo anyway [00:25:06] ^demon|busy: no! I dislike collectoid; it's lame and generic [00:25:16] ori-l: quick! we need an awesome name! [00:25:27] <^demon|busy> I dislike -oid because I think -oid always makes it sound like a 2-bit android app. [00:26:25] <^demon|busy> "Let's make a sudoku app" "What shall we call it?" "Well, it's for android, let's call it sudokoid" [00:27:05] the suffix principle is a good one IMO, as the names are fairly self-explanatory [00:27:23] do you have a better suffix? [00:27:32] <^demon|busy> Yes but if you have to explain the suffix to people then you've missed the point. [00:27:51] many android apps seem to be called AndFoo [00:28:06] error 420: boring suffix [00:28:08] <^demon|busy> AndFoo or Baroid. 
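A rough Python equivalent of the egrep pipeline gwicke ran above, for turning varnishncsa output into per-status error counts rather than raw lines. It assumes the same whitespace-separated field layout as the sample request line quoted at 00:15:23, with the cache/status value such as "miss/503" in the sixth field; any other log format needs the index adjusted.

    #!/usr/bin/env python
    """Tally 4xx/5xx responses from varnishncsa-style lines on stdin,
    skipping miss/412 just like the egrep pipeline quoted above."""
    import sys
    from collections import Counter

    counts = Counter()
    for line in sys.stdin:
        fields = line.split()
        if len(fields) < 6:
            continue
        cache_status = fields[5]                 # e.g. "miss/503"
        try:
            status = int(cache_status.split("/")[-1])
        except ValueError:
            continue
        if status >= 400 and status != 412:
            counts[cache_status] += 1

    for cache_status, n in counts.most_common():
        print("%6d  %s" % (n, cache_status))

Piped from varnishncsa on cp1058, something like this would have shown the miss/503s accumulating without the surrounding request noise.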
[00:28:21] something like rashomon is less informative to most [00:28:28] do we have Foooid already? [00:28:50] I like Erik's quip 'avoid the oid' [00:29:26] the apple folks seem to use *Kit [00:29:52] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Tue Nov 19 00:29:43 UTC 2013 [00:29:53] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 3 hours [00:30:21] paravoid: ping [00:30:30] pong [00:30:31] found it [00:30:38] backend ipv4_10_2_2_28 { [00:30:42] .max_connections = 600; [00:30:42] } [00:30:52] ahh [00:31:02] root@cp1058:~# netstat -n |grep 10.2.2.28 |wc -l [00:31:02] 600 [00:31:11] ^demon|busy: I guess just create the repos as collectoid [00:31:13] root@cp1045:~# netstat -n |grep 10.2.2.28 |wc -l [00:31:13] 594 [00:31:15] it's not far off [00:31:35] ^demon|busy: I don't know how hard it is to rename things once created; but I do need the workspace [00:31:47] ^demon|busy: so the status quo shall be retained :'( [00:31:50] <^demon|busy> It's basically not worth it to ever rename anything. [00:31:51] <^demon|busy> :p [00:32:13] <^demon|busy> I shall create collectoid, but I'm going to grumble and not like it :) [00:32:20] heh [00:32:29] in the absence of fancy names, we can always go for descriptive ones :) [00:32:47] <^demon|busy> Or go for completely opaque but definitely unused. [00:33:13] mwalker, how would you describe the contents of the collectoid project? [00:33:15] yep; Jeff has named the puppet group 'offline content generator' [00:33:29] OfflineContentGenerator sounds like a totally suitable repo name to me :) [00:33:50] paravoid: can you restart the backend varnish to fix it in the short term? [00:34:01] I wonder what a good value would be [00:34:01] <^demon|busy> Or just see how many more things with can name via permutations of /[mediawk]+/ ;-) [00:34:06] 1000 sound okay? [00:34:12] paravoid: yes [00:34:13] very unscientific [00:34:51] <^demon|busy> Sorry, /[mediawkp]+/ [00:34:52] the backends take many more connections, but they shouldn't be needed in theory [00:34:54] (03PS1) 10Faidon Liambotis: varnish: bump parsoid's max_connections to 1000 [operations/puppet] - 10https://gerrit.wikimedia.org/r/96176 [00:35:19] (03CR) 10GWicke: [C: 031] varnish: bump parsoid's max_connections to 1000 [operations/puppet] - 10https://gerrit.wikimedia.org/r/96176 (owner: 10Faidon Liambotis) [00:35:20] ^demon|busy: heh; ok, per Eloquence: /mediawiki/extension/Collection/OfflineContentGenerator/* [00:35:30] (03CR) 10Faidon Liambotis: [C: 032] varnish: bump parsoid's max_connections to 1000 [operations/puppet] - 10https://gerrit.wikimedia.org/r/96176 (owner: 10Faidon Liambotis) [00:35:36] and thank god for tab complete [00:35:36] (03CR) 10Faidon Liambotis: [V: 032] varnish: bump parsoid's max_connections to 1000 [operations/puppet] - 10https://gerrit.wikimedia.org/r/96176 (owner: 10Faidon Liambotis) [00:36:10] (03PS1) 10Dzahn: add ferm rule to only allow nrpe/5666 from intern [operations/puppet] - 10https://gerrit.wikimedia.org/r/96177 [00:38:24] (03PS2) 10Dzahn: add ferm rule to only allow nrpe/5666 from intern [operations/puppet] - 10https://gerrit.wikimedia.org/r/96177 [00:39:29] gwicke: it's fixed [00:39:49] 619 connections open now, no TxStatus:503 that I can see of [00:40:16] paravoid: thanks! [00:40:29] <^demon|busy> mwalker: You're ready for cloning. 
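In other words: the backend stanza capped varnish at 600 outbound connections to the parsoid service IP, netstat showed exactly 600 sockets open, and every further fetch failed with "no backend connection", which varnish surfaced as the 503s seen on cp1058. A minimal sketch of the same check in Python, with the 10.2.2.28 service IP and the 600/1000 limits taken from the log above (illustrative only, not how the check was actually run):

    #!/usr/bin/env python
    """Count open connections to the parsoid LVS service IP and compare
    against the max_connections ceiling from the VCL backend stanza."""
    import subprocess

    SERVICE_IP = "10.2.2.28"
    MAX_CONNECTIONS = 600        # raised to 1000 in Gerrit change 96176

    netstat = subprocess.check_output(["netstat", "-n"]).decode("utf-8", "replace")
    open_conns = sum(1 for line in netstat.splitlines() if SERVICE_IP in line)

    print("%d connections to %s (ceiling %d)" % (open_conns, SERVICE_IP, MAX_CONNECTIONS))
    if open_conns >= MAX_CONNECTIONS:
        print("at the ceiling: new fetches will fail with 'no backend connection'")

Restarting the backend varnish or raising the ceiling both clear the symptom; the bump to 1000 merged above does the latter.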
[00:40:38] am still wondering if all those connections are actually active, or if many are actually just lingering [00:40:46] ^demon|busy: thanks kindly [00:40:49] <^demon|busy> yw [00:41:30] <^demon|busy> That actually sounds kind of creepy. Step into your cloning device, you're ready for cloning. [00:46:47] sorta sounds like the transmorgifier from calvin and hobbes [00:59:53] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Tue Nov 19 00:59:51 UTC 2013 [01:00:52] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 3 hours [01:04:52] ^demon|busy: we're going to stage gerrit development in a branch so ignore the .gitreview file but... is this the correct way we want to add submodules to extensions? e.g. doing it this way isn't going to cause me trouble later on? https://gerrit.wikimedia.org/r/#/c/96181/ [01:06:47] mwalker: That should be fine [01:07:10] ok; so git isn't going to give me crap about having .gitmodules files not in the root [01:07:11] We've been doing a recursive submodule checkout for deployment for quite a while now [01:07:48] not in the root? [01:08:06] not in mediawiki/core or something [01:08:28] shouldn't do... [01:08:39] git submodule update --init --recursive extensions/Collection [01:08:42] *thumbs up* -- that jives with what I just experimented with [01:09:02] so at least two people think it behaves rationally [01:23:55] (03PS3) 10Dzahn: add ferm rule to only allow nrpe/5666 from intern [operations/puppet] - 10https://gerrit.wikimedia.org/r/96177 [01:24:57] mutante: thanks :-] [01:25:31] (03PS4) 10Dzahn: add ferm rule to only allow nrpe/5666 from intern [operations/puppet] - 10https://gerrit.wikimedia.org/r/96177 [01:25:45] arr, that also needs retabbing, it never stops [01:26:30] (03PS1) 10Dzahn: retab role/gitblit to spaces [operations/puppet] - 10https://gerrit.wikimedia.org/r/96186 [01:27:06] whenever tab got killed off puppet manifest, we can add a step in the lint check to bail out whenever a .pp contains a leading tab [01:30:01] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Tue Nov 19 01:29:55 UTC 2013 [01:30:51] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 3 hours [01:38:46] (03PS1) 10Reedy: Tabs to spaces [operations/dns] - 10https://gerrit.wikimedia.org/r/96187 [01:38:48] (03CR) 10jenkins-bot: [V: 04-1] Tabs to spaces [operations/dns] - 10https://gerrit.wikimedia.org/r/96187 (owner: 10Reedy) [01:39:05] (03Abandoned) 10Reedy: Tabs to spaces [operations/dns] - 10https://gerrit.wikimedia.org/r/96187 (owner: 10Reedy) [01:40:08] (03PS1) 10Reedy: Tabs to spaces [operations/dns] - 10https://gerrit.wikimedia.org/r/96188 [01:40:22] all of the merge conflicts [01:40:42] you don't have to abandon to rebase [01:40:47] just use the same change-id [01:40:54] even if you start from scratch [01:41:01] Yeah [01:41:05] Mostly laziness [01:41:22] Though it's probably more steps to abandon and do again... [01:46:53] hashar: lol, 2 people, same thought https://wikitech.wikimedia.org/wiki/Talk:Puppet [01:48:23] mutante: very easily fixed... [01:50:14] Reedy: yea. https://wikitech.wikimedia.org/wiki/Puppet_coding#tab_character_found_on_line_.. 
but we don't want huge "retab them all" patches either [01:50:25] awwwwww [02:00:00] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Tue Nov 19 01:59:51 UTC 2013 [02:00:42] (03PS1) 10Dzahn: retab misc/pdf and role/pdf [operations/puppet] - 10https://gerrit.wikimedia.org/r/96190 [02:00:50] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 3 hours [02:01:28] (03CR) 10Dzahn: "btw, did the pdf sprint come up with things to add here?" [operations/puppet] - 10https://gerrit.wikimedia.org/r/96190 (owner: 10Dzahn) [02:04:53] (03PS1) 10Springle: try to reduce "no working slave" notice flood in pmtpa dberror.log [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96191 [02:05:33] (03CR) 10Springle: [C: 032] try to reduce "no working slave" notice flood in pmtpa dberror.log [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96191 (owner: 10Springle) [02:06:31] !log springle synchronized wmf-config/db-pmtpa.php [02:06:48] Logged the message, Master [02:10:34] !log LocalisationUpdate completed (1.23wmf3) at Tue Nov 19 02:10:34 UTC 2013 [02:10:47] Logged the message, Master [02:10:54] (03PS1) 10Springle: try to reduce "no working slave" notice flood in pmtpa dberror.log (attempt 2) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96192 [02:11:28] (03CR) 10Springle: [C: 032] try to reduce "no working slave" notice flood in pmtpa dberror.log (attempt 2) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96192 (owner: 10Springle) [02:12:11] !log springle synchronized wmf-config/db-pmtpa.php [02:12:23] Logged the message, Master [02:19:12] !log LocalisationUpdate completed (1.23wmf4) at Tue Nov 19 02:19:11 UTC 2013 [02:19:24] Logged the message, Master [02:29:54] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Tue Nov 19 02:29:49 UTC 2013 [02:30:53] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 3 hours [02:50:35] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue Nov 19 02:50:34 UTC 2013 [02:50:47] Logged the message, Master [03:00:03] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Tue Nov 19 02:59:53 UTC 2013 [03:00:53] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 3 hours [03:27:52] (03PS1) 10Tim Starling: Updated wikipedia.org etc. A/AAAA records to point to eqiad [operations/dns] - 10https://gerrit.wikimedia.org/r/96198 [03:28:00] (03CR) 10jenkins-bot: [V: 04-1] Updated wikipedia.org etc. A/AAAA records to point to eqiad [operations/dns] - 10https://gerrit.wikimedia.org/r/96198 (owner: 10Tim Starling) [03:29:57] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Tue Nov 19 03:29:48 UTC 2013 [03:30:47] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 3 hours [03:59:57] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Tue Nov 19 03:59:50 UTC 2013 [04:00:47] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 3 hours [04:24:23] (03PS2) 10Tim Starling: Updated wikipedia.org etc. A/AAAA records to point to eqiad [operations/dns] - 10https://gerrit.wikimedia.org/r/96198 [04:24:31] (03CR) 10jenkins-bot: [V: 04-1] Updated wikipedia.org etc. A/AAAA records to point to eqiad [operations/dns] - 10https://gerrit.wikimedia.org/r/96198 (owner: 10Tim Starling) [04:26:06] (03PS3) 10Tim Starling: Updated wikipedia.org etc. 
A/AAAA records to point to eqiad [operations/dns] - 10https://gerrit.wikimedia.org/r/96198 [04:30:19] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Tue Nov 19 04:30:10 UTC 2013 [04:30:49] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 3 hours [04:32:30] /away afk [04:49:12] ori-l: you know https://github.com/crucially/timesplicedb ? [04:49:30] sounds vaguely familiar but i'm not certain if i've seen it before [04:49:41] no, looks interesting [04:49:57] hrmmmm, on second look the commits are really old [04:55:27] (03CR) 10Tim Starling: "* PS2: tried adding "0" after "::" to fix test failure" [operations/dns] - 10https://gerrit.wikimedia.org/r/96198 (owner: 10Tim Starling) [04:58:54] ori-l: http://dpaste.com/1472454/plain/ [04:59:33] (03CR) 10Tim Starling: [C: 032] Updated wikipedia.org etc. A/AAAA records to point to eqiad [operations/dns] - 10https://gerrit.wikimedia.org/r/96198 (owner: 10Tim Starling) [05:00:27] !log updating DNS to I3adffd88 [05:00:41] Logged the message, Master [05:03:39] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 3 hours [05:06:46] heh [05:25:55] stupid bugzilla [05:26:46] want to add a whiteboard entry to more than one bug, see the "Change several bugs at once" option "YAY!"... in the whiteboard field "--do_not_change--" :( :( [05:30:09] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Tue Nov 19 05:29:59 UTC 2013 [05:30:39] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 3 hours [05:36:54] I just don't even [05:38:36] my favorite aspect of BZ is how it shouts your mistakes from the mountaintop [05:38:48] like, you file a bug, and then: "oh yeah, forgot to CC so-and-so" [05:39:10] so you add a CC. quoth bugzilla: "I just told these 40 people that you screwed up" [05:39:23] "I'm sure they appreciate the spam." [05:39:31] :) [05:46:40] I think most people have CC notifications disabled. [05:46:54] The default got flipped at some point. [05:49:09] (03PS1) 10Springle: not 5.1.53 compatible (5.1.56+) [operations/puppet] - 10https://gerrit.wikimedia.org/r/96204 [05:50:14] (03CR) 10Springle: [C: 032] not 5.1.53 compatible (5.1.56+) [operations/puppet] - 10https://gerrit.wikimedia.org/r/96204 (owner: 10Springle) [05:59:49] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Tue Nov 19 05:59:47 UTC 2013 [06:00:34] TimStarling: if the udpprofiler collector is choking on the volume of stats, is there any reason not to halve the % of requests that are randomly sampled for profiling, changing it from 2% to 1%? [06:00:39] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 3 hours [06:01:04] it's been choking on the volume of stats for a couple of years, I think [06:01:15] it needs to be rewritten [06:01:33] I'm in the process of doing that [06:01:50] stats aren't sampled, only profiling is sampled [06:03:20] I'm not sure what you mean. A profiling data is collected from a random sample of web requests [06:03:47] by the 'mt_rand() % 50 ) == 0' check in StartProfiler.php [06:03:54] yes, and the collector also collects packets sent by wfIncrStats() [06:04:06] stats and profiling, the two sources of data [06:04:08] oh, that's what you meant [06:04:08] right [06:05:11] what causes most of the UDP traffic at the collector at present? [06:05:28] let me check. [06:06:51] Special:Random uses mt_rand(). [06:08:24] Elsie: thank you for that pearl of wisdom [06:08:33] No problem. 
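The gate Tim and ori are discussing is PHP (the mt_rand() % 50 check in StartProfiler.php); purely to make the arithmetic concrete, here is the same 1-in-N sampling as a short Python simulation: a modulus of 50 profiles roughly 2% of requests, and doubling it to 100 halves that to 1%.

    #!/usr/bin/env python
    """Simulate the 1-in-N profiling gate: modulus 50 ~= 2% of requests,
    modulus 100 ~= 1%. The production check itself lives in PHP."""
    import random

    def sampled(modulus):
        """True for roughly one out of every `modulus` requests."""
        return random.randint(0, modulus - 1) == 0

    trials = 1000000
    for modulus in (50, 100):
        hits = sum(1 for _ in range(trials) if sampled(modulus))
        print("modulus %3d -> %.2f%% of requests profiled" % (modulus, 100.0 * hits / trials))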
[06:08:38] I learned from you. [06:08:58] https://en.wikipedia.org/wiki/Wikipedia:FAQ/Technical#Is_the_.22random_article.22_feature_really_random.3F [06:09:05] TimStarling: stats are about 30% [06:10:20] what limits the performance of the collector? [06:13:35] TimStarling: I'm not sure; probably the fact that for each sample it has to look up the entry in the bdb file and then write it back [06:14:25] I don't think persistence is actually required [06:14:35] it wipes all its data with every clear command anyway [06:15:13] so presumably you could replace that with a hash_map or something [06:15:24] but that wouldn't be multithreaded [06:15:29] who/what sends the clear command? [06:15:51] there's a script called clear-profile which is run occasionally [06:16:15] it is useful if you want something other than aggregate statistics out of the old web interface [06:16:47] i.e. if you want to know what is going on now, rather than over the last 6 months or whatever [06:18:12] any reason not to get that from graphite? [06:18:18] I mean, it doesn't provide an ordering by default [06:18:39] but one could presumably write something that queries its API for keys in a certain keyspace [06:18:59] and then queries each key for the mean (or whatever) for a given time range [06:19:11] it is difficult to get numbers out of graphite [06:20:04] and it only shows you a time series, if you want any other sort of data, graphite is not the best solution [06:20:41] it's basically not profiling at all [06:21:56] i've been stuffing them into redis [06:21:57] https://dpaste.de/YYts/raw/ [06:22:01] as an experiment [06:22:18] but i set out to reproduce exactly the functionality of collector [06:22:37] redis is not multithreaded either [06:22:49] no, i was going to shard the load over multiple instances [06:22:54] on the same host [06:23:34] but do you have other recommendations? [06:23:37] i was looking at leveldb [06:24:04] i guess just a hash-map in memory? [06:24:17] if it doesn't crash i guess that's all the persistence you need [06:24:20] well, I would start out by replacing BDB with hash_map or judy or something and see if that is fast enough [06:25:17] OK, sounds like a plan. anything else on the collector wishlist, while I'm at it? [06:25:36] (that wasn't a sarcastic offer, in case it sounded like one) [06:26:25] I think the main problem with just using a different hashtable would be packet loss due to stalls during queries [06:26:48] i.e. queries from graphite [06:27:53] PROBLEM - udp2log log age for lucene on oxygen is CRITICAL: CRITICAL: log files /a/log/lucene/lucene.log, have not been written in a critical amount of time. For most logs, this is 4 hours. For slow logs, this is 4 days. 
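A sketch of the replacement Tim suggests above: since the collector discards everything on each "clear" anyway, per-event aggregates can live in an in-memory map instead of being looked up in and written back to BerkeleyDB for every sample. The real collector is not Python and none of these names come from it; this only shows the shape of the idea.

    #!/usr/bin/env python
    """Keep per-event aggregates in an in-memory map instead of a BDB
    file. Names and the event format are invented for illustration."""
    from collections import defaultdict

    class ProfileAggregator(object):
        def __init__(self):
            # key -> [call count, total elapsed time in seconds]
            self.stats = defaultdict(lambda: [0, 0.0])

        def record(self, key, elapsed):
            entry = self.stats[key]
            entry[0] += 1
            entry[1] += elapsed

        def clear(self):
            # mirrors the existing "clear" command: all data is thrown
            # away, so durable storage buys nothing here
            self.stats.clear()

        def report(self):
            for key, (count, total) in sorted(self.stats.items()):
                print("%-40s calls=%d total=%.6fs mean=%.6fs" %
                      (key, count, total, total / count))

    agg = ProfileAggregator()
    agg.record("Parser::parse", 0.123)
    agg.record("Parser::parse", 0.098)
    agg.record("wfIncrStats", 0.001)
    agg.report()

The concern raised above still applies: while the process is answering a query for graphite or the old web UI it has to keep draining the UDP socket, so some buffering or a separate reader thread would sit in front of this.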
[06:27:59] heh [06:29:06] maybe it just needs a buffer thread [06:29:47] one very simple solution that I used in eventlogging is basically that -- having a separate standalone executable that reads from a UDP socket and publishes it over a ZeroMQ PUB socket [06:29:53] RECOVERY - udp2log log age for lucene on oxygen is OK: OK: all log files active [06:30:03] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Tue Nov 19 06:29:53 UTC 2013 [06:30:08] which gives you the ability to have an arbitrary number of subscribers and configurable per-subscriber buffering on the publisher side [06:30:33] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 3 hours [06:31:57] then you could have a separate consumer process the data for graphite [06:33:59] yes, that would work [06:34:08] you would still need a collector for the profiling interface though [06:35:31] yeah, and it would still need to not stall while serving aggregate figures for the web interface [06:36:36] but that's not too hard [06:39:49] zeromq subscribers also handle I/O in a background thread [06:40:44] but the API hides that from you, you're just dealing with a socket [06:41:30] http://zeromq.org/topics:omq-is-just-sockets [06:58:56] (03CR) 10ArielGlenn: [C: 032] pass pep8 E128 (continuation lines under-indented) [operations/software] - 10https://gerrit.wikimedia.org/r/95587 (owner: 10Hashar) [06:59:53] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Tue Nov 19 06:59:46 UTC 2013 [07:00:08] (03CR) 10ArielGlenn: [C: 032] pep8: ignore E128 [operations/software] - 10https://gerrit.wikimedia.org/r/95588 (owner: 10Hashar) [07:00:33] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 3 hours [07:03:54] (03CR) 10ArielGlenn: [C: 032] checkhost.py: report hosts in various manifests and lists [operations/software] - 10https://gerrit.wikimedia.org/r/95586 (owner: 10ArielGlenn) [07:18:46] (03CR) 10Hashar: "Lacking time to review, sorry." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95996 (owner: 10Aude) [07:29:59] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Tue Nov 19 07:29:54 UTC 2013 [07:35:08] PROBLEM - SSH on amslvs1 is CRITICAL: Server answer: [07:36:09] RECOVERY - SSH on amslvs1 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [07:51:12] (03PS1) 10ArielGlenn: remove hardcoded salt command timeout, replace with option [operations/software] - 10https://gerrit.wikimedia.org/r/96214 [07:53:16] (03CR) 10ArielGlenn: [C: 032] remove hardcoded salt command timeout, replace with option [operations/software] - 10https://gerrit.wikimedia.org/r/96214 (owner: 10ArielGlenn) [08:01:51] (03PS1) 10ArielGlenn: remove dysprosium from decom temporarily (to be reclaimed) [operations/puppet] - 10https://gerrit.wikimedia.org/r/96216 [08:03:30] (03CR) 10ArielGlenn: [C: 032] remove dysprosium from decom temporarily (to be reclaimed) [operations/puppet] - 10https://gerrit.wikimedia.org/r/96216 (owner: 10ArielGlenn) [08:05:11] PROBLEM - SSH on amslvs1 is CRITICAL: Server answer: [08:06:11] RECOVERY - SSH on amslvs1 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [08:46:05] (03CR) 10Hashar: [C: 031] "Code is good, might want to make the list of fonts one element per line." 
(031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/96190 (owner: 10Dzahn) [08:46:39] (03CR) 10Hashar: [C: 031] retab role/gitblit to spaces [operations/puppet] - 10https://gerrit.wikimedia.org/r/96186 (owner: 10Dzahn) [08:54:51] (03PS1) 10ArielGlenn: qualify vars planet_domain_name, planet_languages [operations/puppet] - 10https://gerrit.wikimedia.org/r/96225 [09:01:17] hashar: I am looking at the ferm policy thing. Anything else needed before tomorrow's zuul upgrade ? [09:01:49] (03CR) 10Akosiaris: [C: 032] retab role/gitblit to spaces [operations/puppet] - 10https://gerrit.wikimedia.org/r/96186 (owner: 10Dzahn) [09:26:00] (03PS2) 10Ori.livneh: Use log scale for 5xx errors in "(cdn) HTTP Error Rate" [operations/puppet] - 10https://gerrit.wikimedia.org/r/95064 (owner: 10Nemo bis) [09:26:24] (03PS2) 10Ori.livneh: Also add 2 months and 1 year graphs in "(cdn) HTTP Error Rate" [operations/puppet] - 10https://gerrit.wikimedia.org/r/95068 (owner: 10Nemo bis) [09:30:52] (03CR) 10Ori.livneh: [C: 032] Use log scale for 5xx errors in "(cdn) HTTP Error Rate" [operations/puppet] - 10https://gerrit.wikimedia.org/r/95064 (owner: 10Nemo bis) [09:31:00] (03CR) 10Ori.livneh: [C: 032] Also add 2 months and 1 year graphs in "(cdn) HTTP Error Rate" [operations/puppet] - 10https://gerrit.wikimedia.org/r/95068 (owner: 10Nemo bis) [09:39:39] (03PS1) 10Akosiaris: ferm rule for bacula director [operations/puppet] - 10https://gerrit.wikimedia.org/r/96226 [09:42:25] (03CR) 10Akosiaris: [C: 04-2] "There is a bug in ferm with hostname resolution and IPv4/IPv6 in case a hostname does not resolv for both IPv4 and IPv6. Will update this " [operations/puppet] - 10https://gerrit.wikimedia.org/r/96226 (owner: 10Akosiaris) [09:53:43] (03CR) 10Akosiaris: [C: 032] "I like Antoine's suggestion, so please do what he suggests. Other than that, LGTM" [operations/puppet] - 10https://gerrit.wikimedia.org/r/96190 (owner: 10Dzahn) [10:08:44] (03PS1) 10ArielGlenn: fix case of multiple filters broken in earlier refactor [operations/software] - 10https://gerrit.wikimedia.org/r/96231 [10:09:53] (03CR) 10ArielGlenn: [C: 032] fix case of multiple filters broken in earlier refactor [operations/software] - 10https://gerrit.wikimedia.org/r/96231 (owner: 10ArielGlenn) [10:19:58] akosiaris: the serial number in /var/lib/puppet/server/ssl/ca/serial is wrong (there are the elasticsearch hosts with certs... only on sockpuppet sadly... created afterwords) [10:20:22] not sure what else might be out of date, I can copy over certs and the new serial file but what else might not be current? [10:21:10] strontium is the other missing cert [10:21:17] lol ? [10:21:28] so let me get this straight [10:21:57] sockpuppet has more certificates than palladium in its store ? [10:22:02] oh yes [10:22:08] so these other certs were created [10:22:11] thankfully that is solvable [10:22:25] nov 11 (stronitum) [10:22:38] nov 17 & 18 (elasticxxxx) [10:22:57] palladium does not fortunately have new certs after nov 5 when the serial file went over [10:23:08] thank god [10:23:17] so yes, it's solvable realtively easy, I'm jut not sure what other files besides those (certs + serial) we need [10:23:28] ther emay be some other lists [10:23:34] just tar the entire dir [10:23:52] and it should work [10:24:09] actually let me just double check one thing [10:24:09] but we also need to stop this from happening again.... [10:24:19] hold on a sec [10:24:40] probably purge all puppet/puppetmaster packages on sockpuppet [10:24:41] ? 
[10:25:04] yeah the formey cert is the most recent on palladium and it's also on sockpuppet so we are good there [10:25:19] ok... i 'll fix then [10:25:21] I changed all the docs yesterday and let people know but [10:25:28] (to use palladium as ca) [10:25:39] but I didn't know then people had already been doing installs :-D [10:25:48] well i suppose we need to disable all puppet stuff on stafford/sockpuppet [10:25:53] disable, yes [10:25:58] I'm still reluctant to purge it [10:26:04] i had disable puppet-merge and commits [10:26:09] cause eg today here we are copying things right? [10:26:14] i had not thought of puppetca unfortunately [10:26:20] it's hard to get them all [10:26:44] yeah no purge... just uninstall the packages i think [10:26:54] and keep a backup too... [10:27:08] ok I 'll fix those. Thanks for reporting [10:27:20] yep, yay for daily cleanup scripts [10:27:37] I should have seen this yesterday but I had a bug in them from refactoring over the weekend :-/ [10:27:39] so... some certs we revoked/cleaned yesterday ? [10:27:50] yes [10:27:54] those have not happenned on sockpuppet [10:28:02] so we need to do them again [10:28:04] not sure about those [10:28:11] that's fine: just tar over [10:28:18] ok [10:28:28] I'll run my report and clean em again, for good this time [10:28:42] can you do a favor [10:28:52] ? [10:29:00] can we see the list of things revoked in the last 4 days on palladium [10:29:19] because I may not have been the only person doing cleanup [10:29:24] *maybe* [10:30:45] I guess not, I can just keep a copy of the whole crl list, lemme do that [10:35:14] (03PS1) 10Akosiaris: Remove puppetmaster from sockpuppet/stafford [operations/puppet] - 10https://gerrit.wikimedia.org/r/96234 [10:36:27] apergos: so after submitting ^ i will aptitude remove puppetmaster packages on sockpuppet and stafford manually [10:36:39] looking [10:37:15] ok [10:37:24] that's good [10:40:52] (03CR) 10Akosiaris: [C: 032] Remove puppetmaster from sockpuppet/stafford [operations/puppet] - 10https://gerrit.wikimedia.org/r/96234 (owner: 10Akosiaris) [10:50:57] heh 16 of em [10:58:46] (03PS1) 10Akosiaris: ferm rule for bacula connections from internal [operations/puppet] - 10https://gerrit.wikimedia.org/r/96237 [11:00:00] (03CR) 10Akosiaris: [C: 032] ferm rule for bacula connections from internal [operations/puppet] - 10https://gerrit.wikimedia.org/r/96237 (owner: 10Akosiaris) [11:01:22] is report.py documented or even just mentioned anywhere?? [11:01:34] (the wikitech page where I just linked it doesn't count ;) ) [11:05:55] so there is /var/lib/puppet/server/ssl/ca.new waiting to be moved into place; can I just move the old one out of the way and the new one into place and not disrupt things too much? [11:05:58] akosiaris: [11:06:11] ca.new ? [11:06:21] I just lost you [11:06:29] I took a tarball from sockpuppet [11:06:41] untarred it into ca.new on palladium [11:06:43] huh you did too ? [11:06:46] oh? [11:06:48] ok [11:06:52] i was about to deploy it [11:06:56] so no harm done [11:06:57] hahaha [11:07:06] so ... you must restart apache [11:07:11] but other than that yes [11:07:14] just swap them [11:07:19] ok great [11:07:26] sorry, didn't realize you were doing that bit too :-D [11:07:49] no worries. 
As long as we didn't cause each other problems it is cool [11:07:55] heh [11:08:06] and now let's make sure some run is ok [11:09:24] yes, good [11:09:39] so stafford is clean now [11:09:51] no longer running apache or puppetmaster [11:10:15] I 've kept /var/lib/puppet just in case [11:10:18] good [11:10:19] moving to sockpuppet [11:10:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: Connection refused [11:10:35] I'm going to do my revocations again on palladium [11:10:39] huh [11:10:56] hmmm shouldn't neon be updated already ? [11:11:01] let's run puppet manually [11:12:42] so palladium/strontium have spiky load but are holding up [11:13:06] maybe we could consider lowering the interval to 20 ? [11:13:37] I dn't think so, there were spikes to 100% cpu from time to time [11:13:51] yes spikes... not sustained load [11:14:16] stafford had sustained load 100% like forever before we had problems [11:14:26] we had problems for a long time with stafford [11:14:41] hmmm [11:15:04] we just lived with it in that state for a long time, not really sure why [11:15:09] well it is not like we can't revert... [11:15:12] that's true [11:15:14] we can just try it [11:15:42] I 'll finish up first with the other thingies and then just do it for test [11:15:56] worse case scenario? revert [11:15:59] :-D [11:16:05] you're itching to get it down to 20 eh? [11:16:26] which is a 50% increase btw [11:16:42] I kind of left 15 (100% increase) is a bit too much [11:16:46] felt* [11:16:58] you think :-D [11:17:12] I don't want our runs to start getting really slow [11:17:42] there is that too... [11:18:52] !log remove puppetmaster and apache packages from stafford. It no longer is a puppetmaster [11:19:06] yay [11:19:07] Logged the message, Master [11:22:50] (03CR) 10Ori.livneh: "> We were setting Receive Packet Steering as "ff" (i.e. all CPUs, up to 16" [operations/puppet] - 10https://gerrit.wikimedia.org/r/95963 (owner: 10Faidon Liambotis) [11:25:06] !log remove puppetmaster and apache packages from sockpuppet. It no longer is a puppetmaster [11:25:19] Logged the message, Master [11:33:13] PROBLEM - SSH on amslvs1 is CRITICAL: Server answer: [11:33:51] damn that is not good [11:34:12] RECOVERY - SSH on amslvs1 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [11:34:48] phew.... heavy load but it looks ok [11:37:27] what's our plan b if amslvs1 falls over? [11:37:31] or maybe i shouldn't ask that [11:37:47] amslvs3 ? [11:37:57] and then start moving traffic to eqiad ? [11:38:02] ergh [11:38:17] eh I mean amslvs2 [11:38:29] amslvs3/4 server other stuff right ? [11:38:49] lemme check because I don't remember which has which in esams [11:38:49] well it is not in peak yet... though rising [11:39:00] yeah that's what's iffy [11:39:30] 1,3 are redundant and 2,4 are redundant [11:43:10] palladium is ready to become a second puppetmaster; that is, it is already, and now I just need to tell the minions about it [11:43:15] (tested one manually, works fine) [11:52:52] ?? [11:53:00] second saltmaster you mean [11:53:03] apergos: ^ [11:53:06] er yes [11:53:09] :-D :-D [11:53:23] PROBLEM - swift-container-server on ms-be1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:53:23] PROBLEM - swift-account-replicator on ms-be1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:53:23] PROBLEM - RAID on ms-be1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:53:23] PROBLEM - swift-account-server on ms-be1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[11:53:32] PROBLEM - DPKG on ms-be1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:53:33] PROBLEM - swift-container-auditor on ms-be1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:53:33] PROBLEM - swift-container-replicator on ms-be1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:53:33] PROBLEM - puppet disabled on ms-be1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:53:38] er? [11:53:42] PROBLEM - swift-object-replicator on ms-be1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:53:42] PROBLEM - swift-container-updater on ms-be1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:53:42] PROBLEM - swift-account-auditor on ms-be1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:53:42] PROBLEM - swift-object-auditor on ms-be1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:53:42] PROBLEM - swift-account-reaper on ms-be1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:53:46] ok ok [11:53:52] PROBLEM - swift-object-updater on ms-be1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:53:52] PROBLEM - swift-object-server on ms-be1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:54:22] PROBLEM - SSH on ms-be1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:55:10] ah, that's gonna be a power cycle. yuck [11:55:18] it's had a rough day: http://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&h=ms-be1002.eqiad.wmnet&m=cpu_report&s=by+name&mc=2&g=load_report&c=Swift+eqiad [11:55:42] pile of soft lockup, can't get in [11:56:26] !log powercycled ms-be1002, couldn't get in (lots of BUG: soft lockup - CPU#18 stuck for 22s etc on console) [11:56:38] Logged the message, Master [11:58:02] PROBLEM - Host ms-be1002 is DOWN: PING CRITICAL - Packet loss = 100% [11:58:24] RECOVERY - puppet disabled on ms-be1002 is OK: OK [11:58:32] RECOVERY - swift-object-replicator on ms-be1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [11:58:33] RECOVERY - swift-container-updater on ms-be1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [11:58:33] RECOVERY - swift-account-auditor on ms-be1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [11:58:33] RECOVERY - swift-object-auditor on ms-be1002 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [11:58:33] RECOVERY - Host ms-be1002 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [11:58:33] RECOVERY - swift-account-reaper on ms-be1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [11:58:42] RECOVERY - swift-object-server on ms-be1002 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [11:58:42] RECOVERY - swift-object-updater on ms-be1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [11:59:02] nice..... 
[11:59:11] looks ok [11:59:12] RECOVERY - SSH on ms-be1002 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [11:59:13] RECOVERY - swift-account-server on ms-be1002 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [11:59:13] RECOVERY - swift-container-server on ms-be1002 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [11:59:13] RECOVERY - swift-account-replicator on ms-be1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [11:59:13] RECOVERY - RAID on ms-be1002 is OK: OK: optimal, 14 logical, 14 physical [11:59:23] RECOVERY - DPKG on ms-be1002 is OK: All packages OK [11:59:23] RECOVERY - swift-container-auditor on ms-be1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [11:59:23] RECOVERY - swift-container-replicator on ms-be1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [11:59:31] nothing weird during the boot so [12:19:25] hey [12:19:27] what's up? [12:20:13] quiet righ now [12:21:50] (03PS1) 10ArielGlenn: set multiple salt masters for minions [operations/puppet] - 10https://gerrit.wikimedia.org/r/96242 [12:25:49] (03PS2) 10ArielGlenn: set multiple salt masters for minions [operations/puppet] - 10https://gerrit.wikimedia.org/r/96242 [12:26:42] (03PS3) 10ArielGlenn: set multiple salt masters for minions [operations/puppet] - 10https://gerrit.wikimedia.org/r/96242 [12:26:43] better go eat lunch or something [12:36:25] (03PS1) 10Akosiaris: Change the way cron run times are calculated [operations/puppet] - 10https://gerrit.wikimedia.org/r/96247 [12:44:53] (03PS1) 10Akosiaris: Allow HTTP(S) access in gitblit [operations/puppet] - 10https://gerrit.wikimedia.org/r/96250 [12:48:30] akosiaris: nope [12:48:47] ? [12:49:13] (03CR) 10Faidon Liambotis: [C: 04-2] "gitblit is behind the misc-lb Varnish cluster, it doesn't need to have HTTP exposed to users directly." [operations/puppet] - 10https://gerrit.wikimedia.org/r/96250 (owner: 10Akosiaris) [12:49:43] yeah i thought about that [12:50:04] we are sure we don't want to have it open anyway ? [12:51:45] yes [12:53:17] gitblit is very easy to overload [12:53:37] having it exposed directly might be an easy way to DoS it, inadvertedly or not [12:57:18] ok abandoning then. So... that closes the things that need to be changed to have a default DROP policy in ferm [12:57:23] :-) [12:58:12] (03Abandoned) 10Akosiaris: Allow HTTP(S) access in gitblit [operations/puppet] - 10https://gerrit.wikimedia.org/r/96250 (owner: 10Akosiaris) [12:58:36] (03PS1) 10Faidon Liambotis: Varnish: sync mobile UA redirects with squid [operations/puppet] - 10https://gerrit.wikimedia.org/r/96251 [12:59:13] ori-l: ^^ [13:05:55] (03CR) 10Ori.livneh: [C: 031] "LGTM, thanks" [operations/puppet] - 10https://gerrit.wikimedia.org/r/96251 (owner: 10Faidon Liambotis) [13:06:25] (03CR) 10Faidon Liambotis: [C: 032] Varnish: sync mobile UA redirects with squid [operations/puppet] - 10https://gerrit.wikimedia.org/r/96251 (owner: 10Faidon Liambotis) [13:15:12] (03PS4) 10ArielGlenn: set multiple salt masters for minions [operations/puppet] - 10https://gerrit.wikimedia.org/r/96242 [13:18:19] (03CR) 10ArielGlenn: [C: 032] set multiple salt masters for minions [operations/puppet] - 10https://gerrit.wikimedia.org/r/96242 (owner: 10ArielGlenn) [13:40:56] huh [13:41:05] grrrit-wm: doesn't show C+0 comments? 
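For the salt change merged above: with palladium joining sockpuppet as a master, each minion's /etc/salt/minion needs both masters listed. The minion config is YAML, and assuming salt's multi-master syntax (a list under the master: key) and the FQDNs of the two hosts, the stanza would look roughly like what this fragment prints; the syntax and hostnames here are assumptions, not taken from the merged puppet change.

    #!/usr/bin/env python
    """Print a multi-master stanza for /etc/salt/minion. The master:
    list syntax and both FQDNs are assumptions for illustration."""
    import yaml

    minion_config = {
        "master": [
            "palladium.eqiad.wmnet",    # new eqiad salt master (assumed FQDN)
            "sockpuppet.pmtpa.wmnet",   # existing pmtpa master (assumed FQDN)
        ],
    }

    print(yaml.safe_dump(minion_config, default_flow_style=False))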
[13:59:49] paravoid: it does, there is one by you a bit above (2 hours ago) [14:00:35] but there are some exceptions, so it's not always obvious what made an event be relayed (or not) [14:17:05] (03PS5) 10Aude: Enable Wikidata build on beta labs [WIP] [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95996 [14:17:32] (03CR) 10Aude: [C: 04-1] "(rebased)" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95996 (owner: 10Aude) [14:26:55] (03PS2) 10Akosiaris: misc::contint merged in role::ci::master [operations/puppet] - 10https://gerrit.wikimedia.org/r/95706 (owner: 10Hashar) [14:28:52] (03CR) 10Akosiaris: [C: 032] misc::contint merged in role::ci::master [operations/puppet] - 10https://gerrit.wikimedia.org/r/95706 (owner: 10Hashar) [14:33:39] (03PS1) 10ArielGlenn: put salt minion multiple masters in yaml format [operations/puppet] - 10https://gerrit.wikimedia.org/r/96260 [14:34:41] akosiaris: thank you :) [14:34:57] (03CR) 10ArielGlenn: [C: 032] put salt minion multiple masters in yaml format [operations/puppet] - 10https://gerrit.wikimedia.org/r/96260 (owner: 10ArielGlenn) [14:35:20] (03CR) 10Akosiaris: [C: 032] deployment: integration/jenkins for Jenkins CI slaves [operations/puppet] - 10https://gerrit.wikimedia.org/r/95705 (owner: 10Hashar) [14:37:48] hashar: :-) [14:41:00] (03CR) 10MaxSem: "Eh, the UA removals were intentional as they are covered by other strings such as midp." [operations/puppet] - 10https://gerrit.wikimedia.org/r/96251 (owner: 10Faidon Liambotis) [14:41:08] paravoid, ^^^ [14:41:11] hey [14:41:22] this wasn't obvious, I git blame'd [14:41:46] I doubt the second part (lg/nec -> case insensitive) was intentional though [14:42:50] MaxSem: my impression is that most of these UAs are redundant with "mobi" [14:42:57] nope [14:43:22] a lot of crap refuses to follow this convention [14:43:47] I said "most" :) [14:44:26] I tried really hard to simplify the regex, but that's the only removals I could come up with [14:44:42] really?? [14:47:22] anyway, I have tests that confirm that these removals dont break anything;) [14:47:56] I wish we had some vcl tests in the repo :D [14:48:58] 15:44 < ori-l> in october, 53652 events out of 7567429 had more than one redirect (0.7%) [14:49:01] 15:45 < ori-l> in november it's 61151 / 4602964, or 1.3%, nearly double [14:49:11] we were investigating http://ganglia.wikimedia.org/latest/graph.php?r=year&z=xlarge&title=Mobile+Sending+%28navStart+to+fetchStart%29+and+Redirecting&vl=&x=&n=&hreg[]=client-side&mreg[]=browser.%28redirecting%7Csending%29.mobile_median>ype=line&glegend=show&aggregate=1&embed=1&_=1384864168557 [14:50:48] what does it measure, exactly? [14:52:54] (03CR) 10Hashar: [C: 04-1] "There is an Icinga check that might use that value as well, I am referring to the one that complains from puppet freshness." [operations/puppet] - 10https://gerrit.wikimedia.org/r/96247 (owner: 10Akosiaris) [14:57:22] akosiaris: mind merging in the Zuul configuration for gearman?it is back compatible with the current version (tested on labs) https://gerrit.wikimedia.org/r/#/c/93457/ :D [14:57:26] that will be one less step for tomorrow [14:57:59] hashar: how do you figure that? 
I thought it just kept track of when the last smpttrap 'ok' arrived and if the time was too long then we got a whine [14:58:37] apergos: haven't looked at the freshness check in puppet, but IIRC the message is something like: "puppet hasn't run for 10 hours" or something like that [14:58:42] hashar paravoid I am seeing intermittent 503s on beta labs, is that known? [14:58:48] yes, the 10 or 3 is hardcoded [14:59:13] chrismcmahon: not afaik [14:59:41] apergos: so the icinga check could use a multiple of the puppet cron frequency [14:59:44] paravoid: thanks, I do BZ [14:59:57] it could but it does not afaik [15:00:16] (03PS1) 10Akosiaris: Cleanup puppetmaster's apache Listen ports [operations/puppet] - 10https://gerrit.wikimedia.org/r/96262 [15:00:33] I mean if we changed the cron (which we have done in the past) to be a half hour then should the whine occur in 5 hours instead of 10? (or 1.5 instead of three, which we have now) [15:02:13] apergos: I guess so [15:02:17] aka after X occurences [15:02:19] hashar: so you want to link the interval to the puppet freshness whine period ? [15:02:26] but maybe you guys would prefer after 12 hours (aka twice per day) [15:02:38] well you weren't there for the discussion but [15:02:48] na I don't want anything, just pointing that the cron interval is linked to the icinga check freshness [15:03:00] I wanted puppet freshnes to be long enough not to give a ton of false positives [15:03:02] in case akosiaris forgot about it :] [15:03:15] but short enough that the person who broke something might see it and try to fix it [15:03:17] I never even thought about linking them [15:03:21] 3 hours seemed like worth trying [15:03:26] but it might make sense.... [15:03:47] we could say.... 3*interval ? [15:04:03] or you can have the freshness delay to be a variable that would be defined next to the one in cron [15:04:09] so you can easily tweak them [15:04:46] 1.5 hours is pretty short [15:05:00] (and if we get runs every 20 minutes, whines after an hour are really really short) [15:05:04] don't you want to be warned early whenever puppet dies / don't run ? [15:05:11] i just touk 3 out of an RNG... we can have whatever we feel like [15:05:37] I want to be warned early enough that the person who knows what they touched has a chance to fix it [15:05:54] but not so early that I know I"m in the middle of fixing it and I get the pile of icinga [15:06:27] akosiaris: are you still working on ferm? [15:06:41] if yes, try doing blog next [15:06:44] (holmium) [15:06:58] paravoid: I just started looking at changing the default policy [15:07:03] okay [15:07:03] blog ? what blog ? [15:07:15] wasn't this given away ? (outsourced.....) [15:07:16] the blog.wm.org host [15:07:19] it wasn't, no [15:07:28] who knows [15:07:42] sigh... [15:09:31] we have a memcached alert broken for quite some time [15:09:48] because we run memcached on localhost and we have no nrpe there since it's public [15:09:51] annoying [15:16:08] (03PS5) 10Akosiaris: add ferm rule to only allow nrpe/5666 from intern [operations/puppet] - 10https://gerrit.wikimedia.org/r/96177 (owner: 10Dzahn) [15:16:57] damn ... i had to manually rebase this one [15:18:58] (03CR) 10Akosiaris: [C: 032] "Had to manually rebase this one due to tab vs spaces change in" [operations/puppet] - 10https://gerrit.wikimedia.org/r/96177 (owner: 10Dzahn) [15:19:03] \O/ [15:20:05] (03CR) 10Hashar: "I think I based that change out of the varnish role class which use inheritance." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/77034 (owner: 10Hashar) [15:21:16] do we have any infrastructure to run recurring jobs beside using crontabs ? [15:21:37] why beside crontabs ? [15:22:00] even things like celery use crontab like behaviour [15:25:28] I guess I should use puppet / crontab indeed [15:25:41] or maybe I could just use Jenkins *evil* [15:25:57] I have to write down some reporting tasks for ci [15:26:26] * akosiaris feels sorry for ya [15:27:49] (03PS1) 10Akosiaris: Change default ferm policy to DROP [operations/puppet] - 10https://gerrit.wikimedia.org/r/96265 [15:28:36] PROBLEM - jenkins_service_running on gallium is CRITICAL: Timeout while attempting connection [15:29:06] PROBLEM - zuul_service_running on gallium is CRITICAL: Timeout while attempting connection [15:42:57] ahhh [15:43:52] might be nrpe not being able to reach gallium [15:44:22] akosiaris: nrpe on gallium (port 5666) is being dropped, the accept rule is for 10.0.0.0/8 [15:44:40] maybe nrpe reach that machine using some public IP [15:44:52] huh ? [15:45:11] huh... so [15:45:26] neon has a public ip and gallium has a public ip as well [15:45:32] so that makes sense [15:47:13] poor modules/base/files/firewall/defs.production doesn't have the wikimedia public networks though :/ [15:47:29] yes it doesn't [15:47:46] i have a pending change waiting for us to populate that as well [15:48:40] if you can fix the check by hardocding neon IP in nrpe ferm rule, that would be nice. [15:48:45] (03PS1) 10Manybubbles: Logging changes for elasticsearch [operations/puppet] - 10https://gerrit.wikimedia.org/r/96266 [15:48:48] the checks will be handy for tomorrow [15:49:13] I can do that in a temporary way but we really need to fix it correctly [15:49:41] if I get some time tomorrow it would be nice [15:50:11] temp fix with a # FIXME HACK [15:54:10] (03PS1) 10Hashar: nrpe: iptables accept neon public IP address [operations/puppet] - 10https://gerrit.wikimedia.org/r/96267 [15:54:16] akosiaris: ^^^ [15:56:22] akosiaris: btw, I left you a comment on that bacula/ferm thing [15:56:29] about using ferm's @resolve [15:56:37] which I then realized it's broken for dual-stack [15:56:44] the fix isn't hard but isn't trivial either [15:56:51] I'll work on it when we start needing it [16:01:03] (03CR) 10Hashar: "If you want to test it out on gallium (Jenkins server) during European morning, I am volunteering the box. For an internal box we can u" [operations/puppet] - 10https://gerrit.wikimedia.org/r/96265 (owner: 10Akosiaris) [16:01:33] ori-l: would you have a few minutes to look at a zmq issue on vanadium for me (or tell me who would know about this)? [16:01:47] s/for/with/ [16:01:52] (03CR) 10Faidon Liambotis: [C: 04-1] "We still have files/firewall/main-input-default-drop.conf which is essentially that. This was missed when the stanzas were never moved to " [operations/puppet] - 10https://gerrit.wikimedia.org/r/96265 (owner: 10Akosiaris) [16:02:10] ori-l: vanadium is analytics now [16:02:12] er [16:02:17] apergos: vanadium is analytics now [16:03:04] RECOVERY - puppet disabled on analytics1021 is OK: OK [16:03:04] RECOVERY - SSH on analytics1021 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [16:03:04] RECOVERY - Disk space on analytics1021 is OK: DISK OK [16:03:04] RECOVERY - RAID on analytics1021 is OK: OK: no disks configured for RAID [16:03:09] who would be a good person to ask about libraries over there? otto mata? 
[16:03:13] RECOVERY - Host analytics1021 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [16:03:14] (03CR) 10Akosiaris: [C: 04-1] "https://gerrit.wikimedia.org/r/#/c/96177/" (036 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/75777 (owner: 10Chad) [16:03:20] ori-l has made himself available if needed, but the first point of contact should be ops & analytics, i.e. otto [16:03:23] RECOVERY - DPKG on analytics1021 is OK: All packages OK [16:03:37] great, I'll check in with him [16:04:15] ottomata: would you have a few minutes to look at a zmq issue on vanadium with me? [16:04:48] that might have to be asked again later closer to sf morning [16:06:34] paravoid: ok thanx I just read the comment on ferm and @resolve [16:06:58] would love to, in meeting atm, almost out [16:07:22] I'm here for a good while yet so whenever's good, thanks [16:09:22] akosiaris: I just left you another one too :) [16:10:52] argh: ETOOMANYINTERRUPTS [16:18:53] (03CR) 10Faidon Liambotis: "It's a bitmask, so ff is actually for 8 CPUs." [operations/puppet] - 10https://gerrit.wikimedia.org/r/95963 (owner: 10Faidon Liambotis) [16:24:13] (03CR) 10Akosiaris: [C: 032] nrpe: iptables accept neon public IP address [operations/puppet] - 10https://gerrit.wikimedia.org/r/96267 (owner: 10Hashar) [16:26:33] RECOVERY - jenkins_service_running on gallium is OK: PROCS OK: 1 process with regex args ^/usr/bin/java .*-jar /usr/share/jenkins/jenkins.war [16:26:43] PROBLEM - Puppet freshness on analytics1021 is CRITICAL: No successful Puppet run in the last 3 hours [16:26:53] RECOVERY - zuul_service_running on gallium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/local/bin/zuul-server [16:33:23] apergos: what's up? [16:33:30] i need to run to city to meet yuri to check out a cowork space soon [16:33:33] but I think I can help for a bit [16:33:38] (03PS2) 10Ottomata: Logging changes for elasticsearch [operations/puppet] - 10https://gerrit.wikimedia.org/r/96266 (owner: 10Manybubbles) [16:33:43] (03CR) 10Ottomata: [C: 032 V: 032] Logging changes for elasticsearch [operations/puppet] - 10https://gerrit.wikimedia.org/r/96266 (owner: 10Manybubbles) [16:34:12] so I'm trying to get the salt clint to be happy over there, apparently it's been unhappy since oct 23 [16:34:31] and it's unhappy because it thinks the version of libzmq over ther eis < 3.2 [16:34:55] which is true indeed for the version in /usr/local/lib/python2.7/dist-packages/zmq [16:35:57] but not true for the version in [16:36:06] sorry I have already forgotten where the system one is [16:36:08] sec [16:37:07] /usr/lib/x86_64-linux-gnu/ [16:37:23] hmm, i think i see both [16:37:26] so I am wondering if that copy in /usr/local/lib/blah is really necessary [16:37:27] libzmq1 and libzmq3 installed [16:38:40] I actually did an strace of the check it does, to see what it looks at first, and /usr/local/lib/python2.7/dist-packages/zmq gets hit first, sadly [16:38:43] a lib in /usr/local/ ? [16:38:52] why ? why ? why ? [16:39:03] hm [16:39:14] that is not in the python-zmq package [16:39:28] the /usr/local/lib bit [16:39:32] is it in a package ? please don't tell me someone installed it with pip .... [16:39:35] argh [16:39:37] snif [16:39:54] well I was hoping you might know the packages over there [16:39:57] since I have no idea [16:40:12] you are not alone [16:40:12] i don't, ori-l would know better, i'm looking though [16:40:22] ah [16:41:17] heh [16:41:28] it has an entire python installation in /usr/local/ [16:42:38] /usr/local/cellar [16:42:39] ?????? 
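What the strace above is showing: on an Ubuntu box, /usr/local/lib/python2.7/dist-packages typically sorts ahead of the packaged dist-packages on sys.path, so the stale pyzmq left under /usr/local shadows the python-zmq package even though libzmq3 is installed. A quick way to confirm which copy the interpreter loads (only pyzmq's own version helpers are used here):

    #!/usr/bin/env python
    """Report which pyzmq module gets imported and which libzmq it is
    bound against; the salt minion's transport wants libzmq >= 3.2."""
    import sys
    import zmq

    print("module loaded from : %s" % zmq.__file__)
    print("pyzmq version      : %s" % zmq.pyzmq_version())
    print("libzmq version     : %s" % zmq.zmq_version())

    major, minor = [int(part) for part in zmq.zmq_version().split(".")[:2]]
    if (major, minor) < (3, 2):
        print("libzmq too old for the salt minion; sys.path order is:")
        for path in sys.path:
            print("    %s" % path)

If the copy under /usr/local is the one reported, moving that directory aside lets the packaged pyzmq and libzmq3 take over.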
[16:42:47] well I think that wasn't put there by apt/dpkg [16:42:53] soooooo [16:43:04] I think i will call it a day ... [16:43:14] celler? isn't that mac homebrew? [16:43:15] hahaha [16:43:19] I will go to bed tonight... and NOT think about it [16:43:25] I dunno (not mac person) [16:43:33] go to bed, ottomata and I are looking at it [16:43:40] tomorrow I will find a big flamethrower and figure this out [16:43:45] well I mean, go to bed when it is time to go to bed [16:43:48] sooooo ok apergosthe problem is that zmq is the wrong version? [16:43:49] lol [16:43:50] and do other things til then [16:43:56] the /usr/loca/lib/ one is 2.2.0.1 [16:44:00] yes, the version in /usr/local is crap [16:44:02] probably /usr/local gets loaded in preference [16:44:07] yep [16:44:20] what if we just (temp?) move the /usr/local/lib/pythonx/zmq out of the way and try? [16:44:24] i'm not sure how you are testing [16:44:27] well I wonder what we break there [16:44:44] and that is why I asked you to look in on it [16:44:45] that is probably because it has a python3.3 in there [16:44:45] yeah, you might want to get ori-l on this [16:44:52] cause otherwise I would be fine with nuking it or whatever [16:45:03] this is eventlogging server [16:45:05] i know very little about it [16:45:07] nuke it nuke it!!!! [16:45:10] hahaha [16:45:11] maybe [16:45:14] i do know that [16:45:21] originally eventlogging was not well puppetized [16:45:26] hold yer horses, we will nuke when the time is right :-D [16:45:28] and then ori spent a lot of time productionizing and puppetizing [16:45:33] so its possible this is leftover cruft [16:45:34] ok so I will ping the heck out of ori-l [16:45:46] and see what he has to say about cruft removal [16:45:49] k [16:45:58] in the meantime real quick before you disappear [16:46:07] the analytics boxen *.eqiad.wmnet [16:46:15] what is the story with network in / out? [16:46:31] oh that should be fixed [16:46:39] 014188 4 -rw-r--r-- 1 root staff 379 Aug 27 00:12 ./lib/python2.7/dist-packages/easy-install.pth [16:46:40] I ask because I have the weirdest symptom ever with them: can't get new salt key from those clients to palladium [16:46:46] puppet runs, so how does that work? [16:46:55] 1851410 8 -rw-r--r-- 1 root staff 7550 Aug 27 00:12 ./lib/python2.7/dist-packages/eventlogging-0.6_20130827-py2.7.egg/eventlogging/jrm.pyc [16:46:56] nonetheless... they and vanadium are about the only issue [16:46:56] s [16:46:58] what port is salt on ? any new ports ? [16:47:04] good q [16:47:13] so this looks like eventlogging change indeed [16:48:46] http://docs.saltstack.com/topics/tutorials/firewall.html according to this, 4505 and 4506 [16:48:51] on the master [16:49:56] Snaps_: so, varnish people rewrote varnishncsa, this just landed in master... :/ [16:50:26] and they are very vague about ports on the client side so maybe no fixed set there [16:50:31] (03PS1) 10Hashar: rename misc::parsoid to role::parsoid::production [operations/puppet] - 10https://gerrit.wikimedia.org/r/96270 [16:50:37] (03PS1) 10Hashar: lint role/parsoid [operations/puppet] - 10https://gerrit.wikimedia.org/r/96271 [16:51:47] (03CR) 10Hashar: "I have no clue who from ops is the parsoid referent." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/96270 (owner: 10Hashar) [16:51:59] apergos: sorry, yeah the networking seems fixed, but there are very specific network ACL firewall rules in for how the analytics subnets are allowed to talk to the rest of the cluster [16:52:04] (03CR) 10Hashar: "I have no clue who from ops is the parsoid referent." [operations/puppet] - 10https://gerrit.wikimedia.org/r/96271 (owner: 10Hashar) [16:52:12] is the salt stuff new? [16:52:23] e.g. after februaryish? [16:52:24] palladium being an eqiad saltmaster is new [16:52:32] it's been sockpuppet (pmtpa) til now [16:52:49] ah hm ok [16:52:59] so yeah if I need to open an rt ticket or something that's fine [16:53:00] I did not specify any salt stuff in the orignal ACL ticket: https://rt.wikimedia.org/Ticket/Display.html?id=4433 [16:53:15] LeslieCarr: can confirm whether or not salt to palladium is allowed [16:53:21] ok [16:53:47] paravoid: good for them :) [16:54:58] paravoid: hi, any progress on https://gerrit.wikimedia.org/r/#/c/88261/ ? Its really a one liner change, nothing surprising there :) [16:54:58] then I think anyways you are off the hook no matter what ottomata, than you [16:55:00] *thank [16:55:54] yurik_: no it's not [16:56:04] paravoid: what do you mean? [16:56:05] and stop pinging me every day twice, one from adam and one from you [16:56:21] paravoid: adam is not here atm, didn't know he pinged you [16:56:30] I'll get to it when I get the chance [16:56:37] paravoid: so what do you mean its not ? [16:57:54] it unsets a header (which we never had before - so a noop), and it sets that header in one of the cases. Btw, could bblack assist with it? It would make all zero users very happy :) [16:58:15] no, I'll do it [16:59:59] I'm still waiting for that homepage fix for a while [17:00:06] maybe I should start pinging you twice per day too :) [17:00:36] paravoid: please don't compare a two line code change in varnish with a significant + code rewrite [17:00:43] a significant virtualhost? [17:00:49] it's like 10 lines of virtualhost at most [17:00:59] and I've commited of doing that if you fix the mediawiki side [17:01:06] and we've been saying that for 6 months [17:01:38] yes, 10 lines of that plus the change to our side of code. It still amounts to much more than the 2 line varnish change. Regardless, I am working on it right now :) [17:02:25] and btw, I have been very busy trying to get the ESI that mark asked for, only to find out that the ESI is broken on the varnish side and it is now stalled :( [17:03:14] everybody is busy and everything is broken, wooooooo! [17:03:15] :) [17:03:16] because ops were complaining a lot about cache fragmentation :) [17:03:37] ottomata, that's called business as usual :) [17:04:33] yurik_, i'm leaving in here in 20 mins, will come say hi, then have scrum of scrums at 1:30, probably don't have time to go to lunch with y'all today [17:04:37] i will grab somethjing on my way in [17:05:24] but i'll be around and hanging and working for the whole afternoon [17:05:34] hmm, maybe I hsould just wait til after scrum of scrums [17:05:34] hmmm [17:05:42] ottomata: no worries. The place is cool, adam just got it [17:05:44] in [17:05:46] awesome [17:05:52] its ok if I show up around 2:30 then? 
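On the salt master ports noted above (4505 and 4506): a small illustrative probe, not an official check, for testing from an analytics host whether those ports on the new eqiad master are reachable through the ACLs. The fully qualified hostname used below is an assumption:

# Probe the salt master's ZeroMQ ports (4505 = publish, 4506 = return) from a
# minion host, to tell an ACL/firewall block apart from a salt-side problem.
import socket

MASTER = "palladium.eqiad.wmnet"  # new eqiad salt master named above; FQDN assumed

for port in (4505, 4506):
    try:
        conn = socket.create_connection((MASTER, port), timeout=5)
        conn.close()
        print("%s:%d reachable" % (MASTER, port))
    except (socket.timeout, socket.error) as exc:
        print("%s:%d blocked or unreachable: %s" % (MASTER, port, exc))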
[17:06:07] up to you, but there are lots of good spots here to have avideo conf [17:06:18] hmmmm [17:06:20] ok [17:06:20] yurik_: I think that ESI bug got fixed a while ago [17:06:21] might as well come and use the place to see if it works ;) [17:06:26] yeah [17:06:28] true [17:06:34] yurik_: also, simplifying the VCL config was way way higher in priority than ESI. [17:07:39] paravoid: nope, bblack helped to patch it, but then it turned out that the patch is much bigger than what we applied, and that patch touches the frequently used zip code, so its on a backburner, especially considering that mark wants to migrate to v4 soon [17:08:30] v4….varnish? [17:08:35] yep [17:08:38] 3.0.4 i think [17:08:41] oh phew [17:08:47] not 4.x [17:08:48] ok [17:08:56] is there 4.x? checking... [17:08:58] no [17:08:59] ha [17:09:06] not stable [17:09:14] ottomata, you're adorable [17:09:33] uhhhhhh, not sure why but thanks? [17:09:35] than again, i am never sure with mark, i am sure he would love to get us to v4 ;) [17:10:08] bleeding edge is coool! [17:11:02] no, i just had some discussions with ma rk and fa idon recently about 4.x and how far off it is, so it was on my mind, and when you said v4 i was like uhhhhh [17:11:03] cool. [17:11:36] ori-l: would you have time later today to look at a zmq library issue on vanadium with me? [17:12:30] sure, what's up? [17:12:49] well the stuff in /usr/local/ is breaking salt-client [17:13:02] salt-minion [17:13:03] gah [17:13:31] anyways, is the stuff in /usr/local/lib/python2.7/dist-packages/zmq really necessary? [17:13:37] can probably just delete it [17:13:39] I couldn't actually even see what installed that tbh [17:13:47] ottomata: there is a shake shack downstairs! :) [17:13:52] this is from way back in the day [17:13:55] ohhhHHHh that would be eeasy [17:13:56] HMM [17:13:56] i got a special amnesty from mark [17:14:03] before i knew any puppet [17:14:09] ok, I'm not gonna break something by tossing that right? [17:14:10] :-D [17:14:16] yurik_, if I left like right now, got there by 1, would we have time to scarf and then get me to my meeting in time? [17:14:27] ottomata: yep [17:14:30] apergos: probably not, [17:14:30] ok on it, be there asap [17:14:37] if so it would be my fault and i'll fix it [17:14:37] yep yep [17:14:54] it's fully puppetized, should not depend on random bits [17:15:04] ok [17:15:15] I'll move it for now, so that just in case, etc. [17:15:20] thanks! [17:15:23] caution schmaution [17:15:26] thank you [17:15:26] ottomata: sorry, misleading, witchcraft sandwithes is downstairs [17:15:37] the shackshack is in madison sq park :( [17:15:43] chelsy market is nearbyish [17:15:54] sorry for channel spam :) [17:16:04] witchcraft in tribeca? [17:16:14] i used to work right by there [17:16:21] we are trying out a coworking space at 26 & west side highway [17:16:27] (03CR) 10Faidon Liambotis: [C: 04-1] "I have multiple reservations about this." [operations/puppet] - 10https://gerrit.wikimedia.org/r/88261 (owner: 10Dr0ptp4kt) [17:16:33] there you go yurik_ [17:19:31] paravoid: thanks, those are important points, but the current burning issue is the fact that many carriers who do not support opera are getting free banners incorrectly, causing massive mis-advertising for their users - seeing it as free when they are not. 
I agree that we should have a more thorough solution to proxies, but that would take much longer to solve [17:19:36] dr0ptp4kt: ^ [17:19:58] we are talking about millions of users btw [17:20:02] yurik_ thx [17:20:10] ori-l, still not sure what you mean, but perhaps you are advising me to wear my helmet! [17:20:11] ok! [17:20:13] why do ALL of your requests come as "urgent we need this now"? [17:20:24] back in a bit [17:20:29] sorry, your emergency is not my urgency [17:20:43] reminder for room! send me your updates for scrum of scrums, need them within an hour [17:20:45] :) [17:20:49] this is wrong, it has the potential of destroying our caches and our appservers currently don't have much leeway [17:21:59] paravoid: this patch has been in the gerrit for the past month and a half, and Zero has been on our cases for a while. If cache fragmentation was a concern, mark should have gotten the ESI in. He said two weeks ago that ESI is NOT his priority. [17:22:11] see logs [17:23:23] month and a half is definitly NOT burning, its just that we have had received a lot of heat over this from partners [17:23:46] we had the text-eqiad migration ongoing, so this couldn't get attention [17:24:12] plus, you had two patches to the same effect for a while, one from you and one from adam, so we were waiting for you to figure it out between each other first [17:24:29] anyway, I prioritized it and gave it a review as you requested [17:24:52] the review is factually correct, which has nothing to do with how long that patch has been sitting in gerrit :) [17:25:21] paravoid: we need a solution, not a review. having valid concerns do not solve issues [17:25:42] so as i said - if cache fragmentation is a concern, lets get ESI fixed [17:25:56] that would go from N carriers to 1 [17:26:03] that review is orthogonal to ESI [17:26:17] guys, let me respond in bugzilla. i'm not *too* worried about cache object proliferation here, as opera is really the only X-Forwarded-By field that will create measurably more objects...and in reality it's a core set of languages where this would occur. [17:26:29] getting to the blame game.... [17:26:35] let's face it, everyone is swamped. [17:26:50] esi could have gone in earlier if staffing hadn't drained down. [17:27:00] but there's nothing we can do about that. [17:27:19] paravoid: your core argument is cache fragmentation, and ESI was proposed as the solution, but later mark said it wasn't high on priorities, hence you can't use it as an argument :) [17:27:59] I'm not sure what you want as an outcome [17:28:08] i would have appreciated if the concerns were addressed in earlier comments, but paravoid i understand you weren't watching it too closely because it wasn't centrally on your plate. the thing i'm worried about here is the continued presentation of the inaccurate banners. i would like to move forward with this as a short-term solution, then plan for medium and longer term. can we live with that? [17:28:45] lemme respond in bugzilla, though... [17:29:08] there's no Bug header in that changeset, so please either point me to it or add me to Cc [17:29:16] (and thanks for being reasonable :) [17:29:34] paravoid: the outcome is that we should stop displaying banners to people who are not getting it for free [17:29:58] as they complain to the carriers and we get the blame. I understand your concerns - that's why i was trying to get ESI stuff in as fast as we could [17:30:56] and being severely understaffed and swamped is totally understood. 
I don't mean to blame anyone with my posts, just trying to get what's best for everyone [17:32:13] !log FPL link CV71028 flapped 20 minutes ago [17:32:27] Logged the message, Mistress of the network gear. [17:33:28] yurik_: my point is, I don't have the time to fix this and I'm not going to merge something that I know is wrong [17:33:53] I'm fine with finding a short/mid term fix, though, as dr0ptp4kt proposed. [17:34:42] !log CV71028 flapped again [17:34:57] Logged the message, Mistress of the network gear. [17:35:12] !log added palladium as a salt master for common-infrastructure4 firewall [17:35:27] Logged the message, Mistress of the network gear. [17:37:08] (03PS2) 10Umherirrender: enable Echo on all beta.wmflabs.org-wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95450 [17:39:44] paravoid: this header IS the short term solution :) (Adam is here now, we are discussing it together) For each carrier, we have two types of users - those that come directly, and those going through Opera. At least 7 carriers (one having 60+mil users) support opera, the rest do not. Those who support it need to be warned of navigating outside of Zero and need proper attributions, those who... [17:39:46] ...do not should not. [17:40:03] We could disable all of opera detection [17:40:12] in that case we are under-reporting banners [17:43:03] all of opera users will be treated as non-free, and if that carrier is "xx.zero.*" only, they won't be able to even access zero. site because it would show red banner and redirect to m. which is not free [17:43:10] (03CR) 10Ori.livneh: "@Faidon: Sure, sounds reasonable to me. I don't really have any claim to competence in this domain." [operations/puppet] - 10https://gerrit.wikimedia.org/r/95963 (owner: 10Faidon Liambotis) [17:47:55] paravoid: dr0ptp4kt how about we discuss this in a bit over hangout - this way we can all get to the same page instead of posting here and letting the issue drag out? This is a burning issue for us, and as time goes by we have more and more negativity from half of the carriers and users [17:49:23] guys, i said "bugzilla", but i meant gerrit for my notes. will post those soon enough. [17:50:45] back in a little while (food)... and then mostly not here for the evening (long day is long)... [17:54:39] (03CR) 10Faidon Liambotis: [C: 032] rename misc::parsoid to role::parsoid::production [operations/puppet] - 10https://gerrit.wikimedia.org/r/96270 (owner: 10Hashar) [17:55:10] Bit of a query courtesy of myself and tgr: Would testcommons.wikimedia.org, i.e. a stage0 wiki that acts as a foreign repo for testwikis, be a possibility at some point? [17:56:16] We'd mostly like it for testing things that we can't test locally or on the beta cluster - load balanced shared-db foreign repos primarily [17:57:30] (03PS2) 10Faidon Liambotis: Indent & format role/parsoid.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/96271 (owner: 10Hashar) [17:58:11] paravoid, dr0ptp4kt, I just sent an invite for a hangout at 15:00 EST/12:00 PST, lets get it sorted out [18:01:32] (03CR) 10Dr0ptp4kt: ""First of all, this is a wrong approach for doing what you are trying to do (treating opera as a carrier? lookups in the zero database? X-" [operations/puppet] - 10https://gerrit.wikimedia.org/r/88261 (owner: 10Dr0ptp4kt) [18:05:55] (03CR) 10Dr0ptp4kt: "Two post-notes. I meant "knew" instead of "new"." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/88261 (owner: 10Dr0ptp4kt) [18:14:42] (03CR) 10Faidon Liambotis: "> Who can do this, and how long will it take? I'm truly concerned we're going to be doubling the existing wait :(" [operations/puppet] - 10https://gerrit.wikimedia.org/r/88261 (owner: 10Dr0ptp4kt) [18:18:37] (03CR) 10Faidon Liambotis: [C: 032] Indent & format role/parsoid.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/96271 (owner: 10Hashar) [18:25:17] (03CR) 10Dr0ptp4kt: "On the MediaWiki side are you referring to setting up a separate namespace like NS_PROXIES? Or just having a set of well-defined URLs usin" [operations/puppet] - 10https://gerrit.wikimedia.org/r/88261 (owner: 10Dr0ptp4kt) [18:26:15] ottomata: going well. not like it matters as much but now that we've fixed all the issues I can do these rolling restarts while keeping the cluster green [18:28:36] paravoid: would it be ok with you if i take https://rt.wikimedia.org/Ticket/Display.html?id=6344 ? [18:31:38] !log ori synchronized php-1.23wmf3/resources/mediawiki/mediawiki.js 'Id2835eca4: Enable module storage for 0.05% of visitors w/storage-capable browsers' [18:31:52] Logged the message, Master [18:32:24] !log ori synchronized php-1.23wmf4/resources/mediawiki/mediawiki.js 'Id2835eca4: Enable module storage for 0.05% of visitors w/storage-capable browsers' [18:32:39] Logged the message, Master [18:38:56] apergos: fyi for later bookmark: https://wikitech.wikimedia.org/wiki/Nova_Resource:Planet [18:39:18] (there's instance venus and mars, afair one of them was puppetmaster::self and the other was regular) [18:39:20] thanks [18:39:29] for testing either way.. yw [18:40:05] qchris: *waves* when you have a moment; it appears like the extension-Collection group does not exist in gerrit -- I was hoping you could create it; re-add it to the extension/Collection repo; and add me to it [18:41:15] mwalker: Let me take a look ... [18:42:42] mwalker: The group exists ... but it's not publicly visible. [18:42:55] ah; ok; that sort of makes sense [18:43:10] in that case; can you add me, MaxSem, and cscott to it [18:43:11] Isn't this the repo that has been created a few days ago by ^d [18:43:23] Oh ... ^d is not around. [18:43:25] extension/Collection has been in existence for a while [18:43:33] chad created all the submodules for me [18:44:00] *yesterday he created them [18:44:12] Let me look at the setup more closely. [18:47:20] mwalker: Done. [18:47:27] qchris: thanks :) [18:48:23] !log kaulen overloaded at 100 % CPU since 5 min ago, bugzilla almost unusable [18:48:37] Logged the message, Master [18:49:07] mutante: wanna kick it? [18:51:22] uh it's better [18:52:48] Nemo_bis: eh, was gone to get food. yea, looks ok now [18:53:09] mutante: can you help qchris get RT access? [18:53:20] oh, sorry [18:53:24] he's already pinged you [18:53:31] !log but now ok [18:53:46] Logged the message, Master [18:54:16] ottomata: has he? [18:54:21] qchris?^ [18:54:23] qchris: need RT? [18:54:32] mutante: Yes please :-) [18:54:37] mutante: That sounds like a meme [18:56:12] qchris: please send a quick mail to ops-requests@rt.wikimedia.org (that autocreates a user if you dont already have one and i can just give it the permissions) [18:57:25] mutante: Do I have to send from a @wikimedia.org address? [18:58:21] qchris: i found your personal mail and forwarding it, so dont worry [18:58:28] mutante: Ok. Thanks. 
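Returning to the cache-fragmentation concern raised about change 88261: a rough sketch, with made-up numbers, of why varying cached pages on an extra carrier/proxy header worries ops, and why an ESI split (page body cached once, only the banner fragment varying per carrier) would shrink the problem from "N carriers" to roughly one:

# Made-up numbers, purely to show the shape of the problem.
urls = 1000000      # hypothetical count of cacheable mobile page URLs
carriers = 20       # hypothetical count of distinct carrier/header values

vary_whole_page = urls * carriers   # one full page object per (URL, carrier) pair
esi_fragments = urls + carriers     # body cached once, plus one banner per carrier
print("header-varied pages: %d" % vary_whole_page)
print("ESI objects:         %d" % esi_fragments)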
[18:58:59] (it's better to mail RT instead of people but i got it, will get back to you shortly via ticket) [18:59:37] mutante: I tried, but it told me that I lack permission to ask for permission :-( [19:00:13] hmm.. ops-requests@ should be unrestricted, we use it all the time [19:00:46] but yea, you'll get a login user [19:01:04] mutante: https://wikitech.wikimedia.org/wiki/RT <-- tells to use access-requests@... for all access requests [19:04:20] qchris: i see, we should clarify that. that was started mostly for shell access requests etc but of course it would also apply for getting RT access itself. [19:04:52] the role requestor is supposed to have perms on the ticket though [19:05:06] writes a note [19:07:02] !log reedy updated /a/common to {{Gerrit|I9b48be4c0}}: try to reduce "no working slave" notice flood in pmtpa dberror.log (attempt 2) [19:07:07] (03PS1) 10Reedy: Non wikipedias to 1.23wmf4 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96297 [19:07:16] Logged the message, Master [19:07:30] (03CR) 10Reedy: [C: 032] Non wikipedias to 1.23wmf4 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96297 (owner: 10Reedy) [19:07:49] (03Merged) 10jenkins-bot: Non wikipedias to 1.23wmf4 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96297 (owner: 10Reedy) [19:08:50] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: https://gerrit.wikimedia.org/r/96297 [19:09:06] Logged the message, Master [19:18:21] (03PS6) 10Reedy: Enable MassMessage on all wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/91344 (owner: 10Legoktm) [19:19:45] (03PS7) 10Reedy: Enable MassMessage on all wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/91344 (owner: 10Legoktm) [19:20:01] (03CR) 10Reedy: [C: 032] Enable MassMessage on all wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/91344 (owner: 10Legoktm) [19:20:11] (03Merged) 10jenkins-bot: Enable MassMessage on all wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/91344 (owner: 10Legoktm) [19:22:09] !log reedy synchronized wmf-config/ 'Enable MassMessage Ib29cb042e739e26fc5ef56f41317454ea690d0cb' [19:22:22] Logged the message, Master [19:22:28] legoktm: ^^ [19:22:38] omgspam. [19:22:51] Reedy: <3 Love it [19:23:05] Can we take bets now? [19:23:21] What are we betting on? 
:p [19:27:28] PROBLEM - Puppet freshness on analytics1021 is CRITICAL: No successful Puppet run in the last 3 hours [19:30:48] Reedy: tell me the bet :( [19:37:53] !log ori synchronized php-1.23wmf3/extensions/WikimediaEvents 'Update WikimediaEvents to 4aa9c629e5: Controlled experiment to assess performance of module storage' [19:37:57] (03PS1) 10Manybubbles: Install curl when installing Elasticsearch [operations/puppet] - 10https://gerrit.wikimedia.org/r/96305 [19:38:09] Logged the message, Master [19:38:55] !log ori synchronized php-1.23wmf4/extensions/WikimediaEvents 'Update WikimediaEvents to 4aa9c629e5: Controlled experiment to assess performance of module storage' [19:39:10] Logged the message, Master [19:39:37] Reedy: done thanks [19:39:46] cool [19:39:50] * Reedy looks at mediawiki-config [19:40:31] (03PS2) 10Reedy: (bug 57042) Update $wgUploadNavigationUrl on tewiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95459 (owner: 10Odder) [19:40:37] (03CR) 10Reedy: [C: 032] (bug 57042) Update $wgUploadNavigationUrl on tewiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95459 (owner: 10Odder) [19:44:27] (03Merged) 10jenkins-bot: (bug 57042) Update $wgUploadNavigationUrl on tewiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95459 (owner: 10Odder) [19:44:50] (03PS3) 10Reedy: (bug 56334) Namespace l10n for angwiki and angwiktionary [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95846 (owner: 10Odder) [19:44:56] (03CR) 10Reedy: [C: 032] (bug 56334) Namespace l10n for angwiki and angwiktionary [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95846 (owner: 10Odder) [19:49:11] !log reedy synchronized php-1.23wmf3/extensions/MassMessage 'Update to master' [19:49:25] Logged the message, Master [19:52:01] (03PS2) 10Dzahn: retab misc/pdf and role/pdf [operations/puppet] - 10https://gerrit.wikimedia.org/r/96190 [19:52:13] (03CR) 10Dzahn: retab misc/pdf and role/pdf (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/96190 (owner: 10Dzahn) [19:53:48] (03Merged) 10jenkins-bot: (bug 56334) Namespace l10n for angwiki and angwiktionary [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95846 (owner: 10Odder) [19:58:07] (03PS3) 10Reedy: Change numerous global functions for anonymous ones [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95479 [19:58:12] (03CR) 10Reedy: [C: 032] Change numerous global functions for anonymous ones [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95479 (owner: 10Reedy) [20:00:44] !log ori synchronized php-1.23wmf4/extensions/WikimediaEvents 'Update WikimediaEvents to d4f2a43' [20:01:00] Logged the message, Master [20:01:32] !log ori synchronized php-1.23wmf3/extensions/WikimediaEvents 'Update WikimediaEvents to d4f2a43' [20:01:33] Reedy: ok, done-done [20:01:39] heh [20:01:40] ook [20:01:46] Logged the message, Master [20:02:29] * ori-l facepalms [20:02:39] no, i didn't update the submodule [20:02:50] xD [20:03:54] * bd808 notes that if you put your had palm up on your desk and then place your forehead on your palm you can facepalm and headdesk simultaneously. 
[20:04:03] *hand [20:04:21] !log ori synchronized php-1.23wmf3/extensions/WikimediaEvents 'Update WikimediaEvents to d4f2a43' [20:04:35] Logged the message, Master [20:04:43] !log ori synchronized php-1.23wmf4/extensions/WikimediaEvents 'Update WikimediaEvents to d4f2a43' [20:04:44] PROBLEM - Apache HTTP on mw1152 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:04:46] Reedy: done-done-done. [20:04:58] Logged the message, Master [20:05:02] um. [20:05:16] mw1152? [20:05:34] PROBLEM - Apache HTTP on mw1151 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:05:39] Fatal error: Call to undefined method TableDiffFormatterFullContext::_start_diff() in /usr/local/apache/common-local/php-1.23wmf4/extensions/AbuseFilter/Views/AbuseFilterViewDiff.php on line 30 [20:05:44] RECOVERY - Apache HTTP on mw1152 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.026 second response time [20:06:11] * MatmaRex hides [20:06:11] not related to my change [20:06:14] PROBLEM - Apache HTTP on mw1150 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:06:25] RECOVERY - Apache HTTP on mw1151 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.833 second response time [20:06:30] that fatal may or may not be my fault. i used some technically private core class in that AF code [20:06:40] to render pretty diffs [20:06:59] siebrand probably broke it in his documentation and cleanup spree [20:08:00] _start_diff, _block and _end_diff [20:08:06] mw1152 is bits [20:08:09] (03CR) 10Dzahn: [C: 032] "done per hashar's comment." [operations/puppet] - 10https://gerrit.wikimedia.org/r/96190 (owner: 10Dzahn) [20:08:41] Looks like the _ just need to go away [20:10:14] RECOVERY - Apache HTTP on mw1150 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.381 second response time [20:10:38] What's up with jenkins? [20:11:48] (03CR) 10Reedy: "recheck" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95479 (owner: 10Reedy) [20:11:55] Reedy: Did you just accuse me of breaking ur Wikapeadia without proof? [20:12:06] siebrand: Nope, MatmaRex did [20:12:18] Reedy: Phew :) Might as well have been me :) [20:12:24] shush ;) [20:12:39] easily enough fixed [20:13:12] (03Merged) 10jenkins-bot: Change numerous global functions for anonymous ones [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95479 (owner: 10Reedy) [20:13:25] https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&s=by+name&c=Bits%2520application%2520servers%2520eqiad&tab=m&vn=&hide-hf=false [20:13:29] Hmm [20:13:34] the bits caches too [20:13:37] Load spike and then a big network I/O drop [20:13:38] siebrand: well, technically it was you, but that was because i used some private-but-not-marked-as-such functions in AbuseFilter code [20:13:49] ... [20:13:53] so i guess the blame lays with me [20:14:00] Back to drinks in hotel room after midnight.... [20:14:02] I pushed a module update [20:14:10] or maybe MaxSem, he poked around differenceengine and things recently too [20:14:27] * MaxSem hides out of shame [20:14:37] [20:13:38] (CR) Hoo man: [C: 2] "I wonder how that went through code review... 
(might be my fault, though)" [extensions/AbuseFilter] - https://gerrit.wikimedia.org/r/96318 (owner: Reedy) [20:14:47] hoo is also admitting possible blame [20:14:50] busy threads is highhttps://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=Bits+application+servers+eqiad&h=mw1152.eqiad.wmnet&jr=&js=&v=40&m=ap_busy_workers&vl=threads&ti=Busy+Threads [20:14:55] looks like a lot of my submodules were outdated [20:15:14] PROBLEM - Apache HTTP on mw1150 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:15:29] (03PS3) 10Reedy: (bug 44629) Clean up $wgMetaNamespace, $wgMetaNamespaceTalk [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95796 (owner: 10Odder) [20:15:42] (03CR) 10Reedy: [C: 032] (bug 44629) Clean up $wgMetaNamespace, $wgMetaNamespaceTalk [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95796 (owner: 10Odder) [20:16:08] geez, I hope this doesn't break anything :-) [20:16:14] RECOVERY - Apache HTTP on mw1150 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.699 second response time [20:16:46] ori-l: I've got a probable cause, but apparently a lot more lag this week than last monday [20:16:54] what's the probable cause? [20:17:11] Switching all those wikis to 1.23wmf4 in one go [20:17:22] We had a similar problem last week, but it was more more quickly after [20:17:29] ie not over an hour later [20:18:08] doesn't seem probably, esp since the increase here is sudden: https://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=Bits+application+servers+eqiad&h=mw1152.eqiad.wmnet&jr=&js=&v=40&m=ap_busy_workers&vl=threads&ti=Busy+Threads [20:18:32] probable [20:18:53] Your fault then? :P [20:20:10] ori-l, https://graphite.wikimedia.org/dashboard/temporary-36 [20:20:12] the last time this happened, the incorrect symlink to the autonym font was causing lots of 404 requests [20:20:50] but this doesn't appear to be the case, i don't see new requests or failed requests on en, mw.o, or sv.o [20:20:53] or this is usual RL post-scap pertrubance? [20:21:00] looks most likely so far [20:21:18] esp. since it appears to be dropping [20:21:27] No one ran scap ;) [20:22:10] cache bust due to module perturbance [20:22:19] yeah, back to normal now [20:23:06] well Reedy, wikiversion changes are a good approximation:P [20:26:35] Traffic levels are back up [20:31:48] (03Merged) 10jenkins-bot: (bug 44629) Clean up $wgMetaNamespace, $wgMetaNamespaceTalk [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95796 (owner: 10Odder) [20:34:26] (03PS2) 10Reedy: Make current $wgNoFollowLinks = true explicit [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94508 (owner: 10Nemo bis) [20:34:31] (03CR) 10Reedy: [C: 032] Make current $wgNoFollowLinks = true explicit [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94508 (owner: 10Nemo bis) [20:35:31] (03PS1) 10Dzahn: tabbing, quoting & aligning [operations/puppet] - 10https://gerrit.wikimedia.org/r/96354 [20:36:44] PROBLEM - Host analytics1021 is DOWN: PING CRITICAL - Packet loss = 100% [20:40:26] greg-g: Confirming that we (Jon and I) are deploying VectorBeta in 20 minutes [20:43:45] no more traffic issues with bits? 
[20:43:58] * marktraceur doesn't know [20:44:26] Back to normal [20:45:21] looks ok [20:45:26] yeah, dangit Reedy [20:45:30] :) [20:45:36] marktraceur: go forth and break the cluster [20:45:41] I mean [20:45:45] WOO [20:45:54] * marktraceur rm -rf /a/common [20:46:42] i've still got a couple of things to deploy [20:46:49] waiting on jenkins currently... [20:47:13] Reedy: Poke me when you're done? I can do prep until then and standby 'til you're ready. [20:47:32] Oook [20:48:02] (03Merged) 10jenkins-bot: Make current $wgNoFollowLinks = true explicit [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94508 (owner: 10Nemo bis) [20:48:11] 1 to go [20:48:29] (03PS2) 10Reedy: Set zero load on snapshot hosts [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95957 (owner: 10Tim Starling) [20:48:35] (03CR) 10Reedy: [C: 032] Set zero load on snapshot hosts [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95957 (owner: 10Tim Starling) [20:50:03] (03PS2) 10Ottomata: Install curl when installing Elasticsearch [operations/puppet] - 10https://gerrit.wikimedia.org/r/96305 (owner: 10Manybubbles) [20:50:27] (03PS2) 10MarkTraceur: Enable VectorBeta on group0 wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95486 [20:50:53] (03CR) 10Ottomata: [C: 032 V: 032] Install curl when installing Elasticsearch [operations/puppet] - 10https://gerrit.wikimedia.org/r/96305 (owner: 10Manybubbles) [20:55:08] c'mon jenkins! [20:55:14] (03Merged) 10jenkins-bot: Set zero load on snapshot hosts [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95957 (owner: 10Tim Starling) [20:55:20] heh [20:55:51] !log reedy synchronized wmf-config/ [20:55:58] marktraceur: All good to go [20:56:04] Just be aware Jenkins is being very slow today [20:56:07] Coolio [20:56:08] Logged the message, Master [20:56:15] * marktraceur bats eyelashes at greg-g [21:00:16] * marktraceur stops flirting with greg-g and starts deploying [21:01:47] (03CR) 10MarkTraceur: [C: 032 V: 032] Enable VectorBeta on group0 wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95486 (owner: 10MarkTraceur) [21:01:51] (03PS1) 10Dzahn: retab, quoting, linting of ishmael.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/96362 [21:02:18] Reedy: is there a specific reason why you're not deploying https://gerrit.wikimedia.org/r/#/c/94607/ ? [21:02:38] Not really [21:02:44] I haven't checked in with springle-afk about it [21:02:55] after the last set of issues (was it fr?) [21:03:09] iswiki is much smaller than fr [21:03:12] RECOVERY - Host analytics1021 is UP: PING OK - Packet loss = 0%, RTA = 1.06 ms [21:03:24] (03PS2) 10Dzahn: retab, quoting, linting of ishmael.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/96362 [21:03:38] The Icelandic Wikipedia went live on 5 December 2003 and now contains 36,643 articles.
[21:08:07] !log mholmquist updated /a/common to {{Gerrit|Ic3a64b434}}: Enable VectorBeta on group0 wikis [21:08:19] Logged the message, Master [21:09:32] !log mholmquist synchronized wmf-config/CommonSettings.php 'Add wmgUseVectorBeta' [21:09:48] Logged the message, Master [21:10:17] !log mholmquist synchronized wmf-config/InitialiseSettings.php 'Set wmgUseVectorBeta = true for phase0 wikis' [21:10:32] Logged the message, Master [21:12:04] Scapping now for the VectorBeta deploy, things look like they're working [21:12:08] Hold on to your hats [21:15:23] (03CR) 10jenkins-bot: [V: 04-1] retab, quoting, linting of ishmael.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/96362 (owner: 10Dzahn) [21:16:49] Argh, config change sync broke mediawiki.org because the code isn't there yet [21:16:56] Plz be patient with me :) [21:17:35] !log mholmquist Started syncing Wikimedia installation... : Adding VectorBeta [21:17:51] Logged the message, Master [21:18:46] Scap isn't the best way to fix this [21:19:06] Revert the config and/or just sync-dir the folder [21:19:34] Reedy: Should I stop the scap or do this in parallel? [21:20:27] Should be fine in parallel [21:20:42] (we know mw.org is down) [21:20:45] Yarp [21:21:08] Syncing [21:21:22] !log mholmquist synchronized php-1.23wmf4/extensions/VectorBeta/ [21:21:29] Fixed! [21:21:31] Thanks Reedy [21:21:37] Logged the message, Master [21:21:43] (03CR) 10Legoktm: "It also needs to be added to wmf-config/extension-list." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95486 (owner: 10MarkTraceur) [21:22:39] legoktm: Does it? I've heard conflicting advice on that front. [21:22:52] If you want any localisation it does [21:23:15] Christ [21:23:22] heheheh [21:23:32] indeed, and missing right now :P [21:24:02] Well that's because the scap isn't done [21:24:02] RECOVERY - Puppet freshness on analytics1021 is OK: puppet ran at Tue Nov 19 21:23:59 UTC 2013 [21:24:07] Or so I thought [21:24:14] My other extension config changes haven't done that [21:24:19] wow, Vectorbeta crops the logo on mediawiki.org :) [21:24:27] In fact, I *added* it, and someone told me to take it out [21:24:37] (not for VB, for MMV) [21:24:47] Probably me [21:24:49] Temporarily [21:24:51] jdlrobson: See Eloquence's note [21:24:55] kk [21:24:56] AAaaaaaghhhh [21:25:01] it can cause issues if its not in both wmf branches iirc [21:25:05] (the extension) [21:25:07] can/will [21:25:34] Luckily I'm only 24 minutes into my deploy window [21:25:38] Plennya time [21:26:14] aha! [21:26:33] so the big square next to section titles in "Beta" is there to tick it! [21:26:40] eureka! [21:27:13] Eloquence: seems like we've revealed a core bug :) the container holding the image settings a width of 10em but doesn't specify a background size mm [21:27:40] (03PS1) 10MarkTraceur: Add VectorBeta to extension-list [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96369 [21:28:13] twkozlowski: You're not the first person to miss the "that's a checkbox" eureka moment :) [21:29:11] marktraceur: then maybe that's not that obvious to people, I suppose... [21:29:22] Yeah, we know [21:29:28] It's an open design issue [21:29:40] design issues later, broke wikis now :) [21:29:41] Personally I blame RoanKattouw_away [21:29:45] greg-g: They're fixed! [21:29:47] oh [21:29:48] good! 
[21:29:50] Besides you told me to break the cluster [21:29:53] I couldn't let you down [21:29:53] design issues now then [21:29:59] ;-) [21:30:00] marktraceur: last time I joke with you [21:30:14] greg-g: Breaking the cluster is srs bsnss [21:30:40] YuviPanda: By the way, this time I totally accept responsibility [21:30:58] Where's my barnstar [21:31:22] careful .. those things are sharp and pointy [21:31:50] Eloquence: Not planning on going through security in Tokyo with 'em [21:37:33] Sometimes when I watch scap I wonder if I've seen a server name before and it's just randomly printing (mw|srv)\d\d\d?\d? to confuse me [21:38:38] marktraceur: it sure seems that way, eh? [21:38:51] Lil bit [21:39:05] it's how we make sure all deployers go insane [21:39:22] greg-g: But you'd have to be insane to be a deployer [21:39:22] see also: Reedy [21:39:25] It's a srv22 [21:39:35] marktraceur: good try [21:41:39] jdlrobson: https://wikitech.wikimedia.org/wiki/How_to_deploy_code [21:41:50] !log mholmquist Finished syncing Wikimedia installation... : Adding VectorBeta [21:42:06] Logged the message, Master [21:42:42] (03CR) 10MarkTraceur: [C: 032] Add VectorBeta to extension-list [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96369 (owner: 10MarkTraceur) [21:42:51] (03CR) 10MarkTraceur: [V: 032] Add VectorBeta to extension-list [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96369 (owner: 10MarkTraceur) [21:43:55] !log mholmquist updated /a/common to {{Gerrit|Ib532e5558}}: Add VectorBeta to extension-list [21:44:11] Logged the message, Master [21:44:44] "I'm getting intermittent style load failures again today" [21:44:54] says who? [21:45:00] the sync causing 50# again ? [21:45:05] on VP/T [21:45:06] !log mholmquist synchronized wmf-config/extension-list 'Add VectorBeta to extension-list' [21:45:07] ugh [21:45:11] thedj: on what wiki? [21:45:14] en.wp [21:45:21] now that's a full report :) [21:45:23] Logged the message, Master [21:45:52] So now I'm just gonna do mw-update-l10n and sync-l10nupdate-1 1.23wmf4 because there's no code changes [21:46:09] Just...fyi [21:46:29] thedj: that was an hour ago or so [21:46:35] when something else allegedly broke [21:46:45] Reedy probably knows [21:48:32] greg-g: wmf3 l10n update failed because VectorBeta isn't in it, but that shouldn't matter, should it? [21:49:04] ah right, i see it in the irc log [21:49:58] voiced answer [21:50:14] Reedy: greg-g says it does so I'm going to add VectorBeta to wmf3 and sync-dir it, but he said to double-check with you [21:50:34] It's weird because I'm not deploying VB to any wmf3 wikis, ever [21:50:47] marktraceur: Yeah [21:50:57] In theory it shouldn't, but scap gets pissed off [21:51:01] right [21:51:07] So my habit has been to just stage it in the previous [21:51:13] Sigh. [21:51:15] OK [21:51:17] On it [21:51:29] greg-g: May need like...10-20 extra minutes, but I see nobody else is scheduled [21:52:10] marktraceur: yeah, you're good [21:53:46] but that issue from 1,5 hour ago. how did that get to affect bits for en.wp ? [21:54:33] PROBLEM - Backend Squid HTTP on sq37 is CRITICAL: Connection refused [21:55:13] fatal in abusefilter it was ? [21:56:44] nah [21:56:48] That hasn't 'been fixed on the cluster [21:59:13] !log mholmquist synchronized php-1.23wmf3/extensions/VectorBeta/ 'Sync VectorBeta code to wikis that will never use it' [21:59:27] Logged the message, Master [21:59:43] * marktraceur - helpful log messages are his specialty [22:02:11] So now sync-l10nupdate-1 on both branches? 
This is totally surreal to me, updating an l10n cache that doesn't need to be updated. [22:02:22] Reedy: ^ [22:02:35] Nope [22:02:43] just do wmf4 [22:02:49] 'kaaaaay [22:02:59] marktraceur is having a break down over here [22:03:48] I'm pretty sure there's a bug for this specific issue ;) [22:03:55] what would be the 2 sentence description of this craziness and solution? :) [22:04:03] Well [22:04:17] ie: I'm looking for bug fodder for tech debt list [22:04:19] extensions deployed on only one branch have to have source tree on all active branches [22:04:30] Differentiating between "extension is in wmf4" and "extension is not in wmf3" would be good, and then having the scripts be aware of that. [22:04:35] otherwise l10nupdate breaks [22:04:44] so, the issue is really in l10n? [22:04:53] Which is why we added that "feature" for labs [22:04:57] right [22:04:58] So they can have their own extensions [22:05:07] should we mimic that feature for production? [22:05:20] if ( file_exists( "$wmfConfigDir/extension-list-$wmfExtendedVersionNumber" ) ) { [22:05:20] $wgExtensionEntryPointListFiles[] = "$wmfConfigDir/extension-list-$wmfExtendedVersionNumber"; [22:05:20] } [22:05:43] I did, but it add some amount of maintenance load [22:05:49] right, that thing you hacked in that day I wa sworried broke things :) [22:06:02] (even though it didn't, apparently) ;) [22:06:10] The file needs to be there for the new deploy and the current [22:06:26] Which can sort of only be done after the oldest one has been removed [22:06:55] I guess it needs to be made almost conditional [22:06:57] this sounds like an imprecise bug report :) [22:07:16] Almost imprecise [22:07:26] git submodule add && sync-dir is less work [22:07:50] if people realize they need to [22:07:56] that's the issue, I believe [22:08:00] yeah [22:08:07] something should either warn them, or deal with it [22:08:11] Scap should go lolno and not let them continue [22:08:17] right [22:08:22] E_LOLNO [22:08:35] In theory we shouldn't care that X isn't in branch A, but is in branch B [22:08:48] but apparently we have to :) [22:08:49] but has the potential to be a bug [22:11:03] the question being quite how we handle this issue [22:11:33] Making extension-list a bit less simple might work [22:12:12] having "$IP/path/to/Extension/Extension.php" => array ( '1.23wmf4+' ) [22:12:22] or something to denote from X version forward [22:12:59] * greg-g nods [22:13:02] I like that [22:13:06] file a bug! [22:13:54] I'm trying to remember why Aaron added the " *" after the entries in wikiversions.dat [22:15:07] OK, now that there are messages at https://www.mediawiki.org/wiki/Special:Preferences#mw-prefsection-betafeatures I think we can close my window [22:15:08] list( $dbName, $version, $extVersion ) = $items; [22:15:22] marktraceur: good, it was getting cold in here [22:15:39] HAH [22:15:42] I say again [22:15:43] HAH [22:16:03] oh, I forgot i wasn't going to joke with you anymore [22:16:48] Aaron|home: AaronSchulz What was the reason for adding the asterix at the end of the wikiversions.dat lines? [22:18:06] to use beta versions of extensions on certain wikis (which requires splitting some caches) [22:18:15] nobody uses that though... [22:20:15] marktraceur: everything ok? [22:20:26] (meant to hit enter a little bit ago, forgot) [22:21:15] Seems like, yes. 
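A sketch of the semantics behind the "1.23wmf4+" idea floated above for extension-list entries. This is illustrative only; the real extension-list and multiversion tooling are PHP, and none of the helper names below exist there:

# Toy model: each extension-list entry can say from which wmf branch the
# extension exists, so tools like l10nupdate skip it on older branches
# instead of failing. Paths and names are examples, not the real file.
import re

EXTENSIONS = {
    "$IP/extensions/VectorBeta/VectorBeta.php": "1.23wmf4+",  # only from wmf4 onwards
    "$IP/extensions/AbuseFilter/AbuseFilter.php": None,       # present on every branch
}

def branch_key(version):
    # turn "1.23wmf4" into a sortable tuple (1, 23, 4)
    major, minor, wmf = re.match(r"(\d+)\.(\d+)wmf(\d+)$", version).groups()
    return (int(major), int(minor), int(wmf))

def active_extensions(branch):
    for path, spec in sorted(EXTENSIONS.items()):
        if spec is None or branch_key(branch) >= branch_key(spec.rstrip("+")):
            yield path

print(list(active_extensions("1.23wmf3")))  # VectorBeta filtered out
print(list(active_extensions("1.23wmf4")))  # both listed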
[22:21:23] With the deploy anyway [22:21:49] yeah, other life things we can talk about in another channel, but for here, good [22:53:23] !log reedy synchronized php-1.23wmf4/extensions/AbuseFilter/ 'bug 57268' [22:53:37] Logged the message, Master [22:55:40] (03PS1) 10Bsitu: Remove html tag from echo email footer address [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96392 [22:56:37] (03CR) 10Bsitu: [C: 04-2] Remove html tag from echo email footer address [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96392 (owner: 10Bsitu) [23:03:19] RECOVERY - RAID on analytics1009 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [23:03:20] RECOVERY - Disk space on analytics1013 is OK: DISK OK [23:03:29] RECOVERY - SSH on analytics1009 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [23:03:29] RECOVERY - puppet disabled on analytics1009 is OK: OK [23:03:29] RECOVERY - SSH on analytics1013 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [23:03:29] RECOVERY - puppet disabled on analytics1013 is OK: OK [23:03:29] RECOVERY - DPKG on analytics1009 is OK: All packages OK [23:03:30] RECOVERY - Disk space on analytics1009 is OK: DISK OK [23:03:30] RECOVERY - DPKG on analytics1013 is OK: All packages OK [23:03:31] RECOVERY - RAID on analytics1013 is OK: OK: no disks configured for RAID [23:03:31] RECOVERY - Host analytics1009 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [23:03:32] RECOVERY - Host analytics1013 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [23:09:15] (03PS1) 10Jdlrobson: Disable infobox experiment [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96394 [23:10:56] (03CR) 10Ori.livneh: [C: 031] "Yeah." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96394 (owner: 10Jdlrobson) [23:12:51] (03CR) 10Ori.livneh: [C: 032] Disable infobox experiment [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96394 (owner: 10Jdlrobson) [23:13:20] !log ori updated /a/common to {{Gerrit|Iee7332c97}}: Disable infobox experiment [23:13:35] Logged the message, Master [23:14:15] !log ori synchronized wmf-config/InitialiseSettings.php 'Iee7332c97: Disable infobox experiment' [23:14:30] Logged the message, Master [23:23:38] gwicke: we shouldn't be advertising URLs with hostnames specific to a DC. if that's truly going to be a supported public facing service (a la the API) then we can get it a decent hostname [23:24:02] IMO [23:25:41] PROBLEM - DPKG on ms-be1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:25:41] PROBLEM - swift-object-replicator on ms-be1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:25:41] PROBLEM - swift-account-server on ms-be1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:25:41] PROBLEM - swift-container-server on ms-be1001 is CRITICAL: Timeout while attempting connection [23:25:49] PROBLEM - RAID on ms-be1001 is CRITICAL: Timeout while attempting connection [23:25:49] PROBLEM - swift-account-auditor on ms-be1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:25:49] PROBLEM - swift-account-replicator on ms-be1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:25:49] PROBLEM - swift-object-updater on ms-be1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:25:49] PROBLEM - swift-object-server on ms-be1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:25:49] PROBLEM - swift-object-auditor on ms-be1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[23:25:49] PROBLEM - Disk space on ms-be1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:25:50] PROBLEM - puppet disabled on ms-be1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:25:50] PROBLEM - swift-container-replicator on ms-be1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:25:59] PROBLEM - swift-account-reaper on ms-be1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:25:59] PROBLEM - swift-container-updater on ms-be1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:26:09] PROBLEM - swift-container-auditor on ms-be1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:26:12] ohai icinga-wm [23:27:10] eh? [23:27:21] greg-g: eh! [23:27:33] wth [23:28:05] * jeremyb tries to divine what greg-g is responding to [23:28:17] jeremyb: icinga explosion [23:28:18] maybe i need a divining rod for that [23:28:24] dowsing [23:28:27] greg-g: it's all one box. it's not an spof afaik [23:28:37] still [23:29:07] greg-g: just needs some service dependencies so it knows that if the box is down then it shouldn't complain about other stuff on that box. [23:29:10] famous last words [23:32:15] jeremyb: this will be an API for a few months while we are getting the content API ready [23:32:26] i don't follow [23:32:27] I don't expect us to move away from equiad in that timeframe [23:32:34] grrrrr [23:32:37] i don't care [23:32:54] http://www.w3.org/Provider/Style/URI.html [23:32:56] !! [23:33:00] and in any case, only you know that equiad is a DC [23:33:33] plusone to a real uri [23:33:34] :) [23:33:40] changing uris sucks [23:33:46] * gwicke tries to teach his spellcheck that eqiad is speled eqiad [23:34:00] heh, I've given up with spellcheck :) [23:34:09] greg-g: that URL will cease to work a few months from now [23:34:13] and this is what ops came up with [23:34:37] gwicke: is there a ticket or something where ops came up with that? [23:34:38] it's not my idea [23:35:00] rt 6107 [23:35:34] Roan proposed the domain, to be precise [23:36:10] I'd care more if this was something for the longer term [23:36:17] gwicke: maybe that's ok for some uses. it's not ok if you're advertising on a mailing list [23:36:22] roan said "or whatever will suffice." [23:36:43] (again, IMO) [23:36:43] jeremyb: lobby for another domain if you care [23:36:49] I don't [23:36:50] gwicke: i will do just that :) [23:38:51] hrmmmmmmmmmm [23:39:20] well don't have to tackle localssl right now at least because it's eqiad only [23:39:24] for now [23:39:27] \o/ [23:40:50] gwicke: anyway, what about the gauranteed to be around forever part. is this service going to be permanent? [23:41:02] jeremyb: no, as I said in the mail [23:41:05] (i.e. should we maybe use a hostname with something like -test. in it) [23:41:40] it is production (not testing), but a temporary API [23:44:10] gwicke: ok, i think i don't understand entirely. the new API will be permanent and also a superset of the just announced one? will it be trivial to redirect all requests for just announced over to the new one? (how will you map URLs from one to the other?) 
[23:45:03] the new API will be a REST content API [23:45:21] the just announced one looks approximately like rest [23:45:22] redirecting should be possible, but I don't see a need for it [23:45:33] it is not general though [23:45:52] no wikitext, metadata, point-in-time query etc [23:46:19] https://www.mediawiki.org/wiki/User:GWicke/Notes/Storage has ideas for the content API [23:49:37] gwicke: the basic GETs you support now (/enwiki/Main_Page && /enwiki/Main_Page?oldid=...) will be easily mapped to some new URL? [23:49:44] i care less about the POSTs [23:50:39] jeremyb: since we know all major users of this temporary API I don't plan to bend over backwards too much [23:51:23] we can keep the old URL online for a bit to give people time to change their config, but then we can just switch off the old one [23:52:08] ok, let me look at this another way: [23:53:47] if we have 2 primary DCs and eqiads offline and parsoid's still only in eqiad is it ok for you to have an extended outage of this new service? or is that impossible because e.g. visual editor wouldn't tolerate such an extended outage? [23:54:15] jeremyb: we'll have two primary DCs within the next two months? [23:54:18] that's news to me [23:54:43] !log ori synchronized php-1.23wmf3/resources/startup.js 'touch' [23:54:58] Logged the message, Master [23:55:02] gwicke: i guess it's highly unlikely. but even if we had the space we wouldn't have the hardware all ready [23:55:11] we plan to have the replacement API ready before January [23:55:12] gwicke: i didn't specify a timeframe :) [23:55:32] oh, wow, is half of november gone already? huh [23:57:57] gwicke: so i think at the very least we should be either redirecting requests to the old place the root of the new place or some page with info about the new place or serve a static page in place with info about the new place. even if there's not a mapping of new path to old path at least they don't just get host not found, etc. [23:58:10] there will be eventlogging alerts in a moment [23:58:13] i'm on top of it [23:58:41] !log Restarting EventLogging jobs on vanadium [23:58:57] Logged the message, Master [23:59:22] jeremyb: sure, we can leave behind a message for a bit [23:59:58] there are only a handful external users though, and we are in contact with them
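A minimal sketch of the "leave behind a message for a bit" idea: a throwaway placeholder that answers requests to the retired DC-specific hostname with a redirect and a short note, so clients do not simply get a resolution failure. The successor base URL below is a made-up placeholder; nothing has been decided:

# Tiny stand-in for a retired API hostname: answer every GET with a 301 to a
# successor base URL plus a one-line explanation.
try:
    from http.server import BaseHTTPRequestHandler, HTTPServer      # Python 3
except ImportError:
    from BaseHTTPServer import BaseHTTPRequestHandler, HTTPServer   # Python 2

NEW_BASE = "https://rest.example.org"  # placeholder; the real successor URL is undecided

class MovedHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        target = NEW_BASE + self.path
        body = ("This endpoint has moved; see %s\n" % target).encode("utf-8")
        self.send_response(301)
        self.send_header("Location", target)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 8080), MovedHandler).serve_forever()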