[00:16:26] (03PS1) 10Ottomata: Writing Kafka stats to jmxtrans outfile at /var/log/kafka/kafka-jmx.log [operations/puppet] - 10https://gerrit.wikimedia.org/r/93897 [00:17:29] (03CR) 10Ottomata: [C: 032 V: 032] Writing Kafka stats to jmxtrans outfile at /var/log/kafka/kafka-jmx.log [operations/puppet] - 10https://gerrit.wikimedia.org/r/93897 (owner: 10Ottomata) [00:25:10] every single custom ganglia view is empty at the moment [00:25:13] at least for me [00:25:38] ori-l: same [00:25:43] well, of the two I just tested [00:25:52] this sometimes happens when the front-end gives up on waiting for the back end service to return the XML document that describes all the metrics [00:25:59] I can click more, but I think it's not needed [00:26:06] (03PS1) 10Eloquence: Increase upload size limit for chunked and URL uploads to 1000MB. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93900 [00:26:17] ottomata: is it possible that you are adding a lot metrics and that it is overwhelming ganglia? [00:27:21] hm, ori-l, i doubt it, i'm actually reducing the number of metrics compared to what was (sortof) there last week [00:27:40] reducing how? just reporting fewer metrics? [00:28:06] yea [00:28:24] i have been restarting gmetad occaisionally on nickel as I blast away old .rrds there [00:28:28] i'm bascialy done though [00:29:06] i wish i understood why this happens [00:29:09] i've seen it before [00:29:29] it has fixed itself in the past, i think for now i'll just wait before poking nickel [00:29:35] !log aaron synchronized php-1.23wmf1/includes/clientpool '313457c3ed8d58d7193d806e0155daea59adada4' [00:29:52] Logged the message, Master [00:30:49] !log aaron synchronized php-1.23wmf2/includes/clientpool 'f1df92c9308c3350098dc9ca9e7ac1a221810cd7' [00:31:05] Logged the message, Master [00:35:18] paravoid: has any of that temp data cleared? [00:39:07] (03PS1) 10Ottomata: Giving analytics102{1,2} a public address. [operations/dns] - 10https://gerrit.wikimedia.org/r/93903 [00:40:03] (03CR) 10Ottomata: "Also, I noticed that analytics1003 and analytics1004 (which currently have public IPs), did not have their .eqiad internal addresses remov" [operations/dns] - 10https://gerrit.wikimedia.org/r/93903 (owner: 10Ottomata) [00:43:21] (03PS1) 10Ottomata: analytics102{1,2} are Kafka brokers and need public addresses. [operations/puppet] - 10https://gerrit.wikimedia.org/r/93904 [00:47:54] (03PS1) 10Ottomata: (WIP) Initial Debian version [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/93905 [00:48:30] (03Abandoned) 10Ottomata: (WIP) Initial Debian version [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/93905 (owner: 10Ottomata) [00:49:47] (03PS14) 10Ottomata: (WIP) Initial Debian version [operations/software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/78782 (owner: 10Faidon Liambotis) [00:59:43] ...and the graphs are back. [01:16:17] 0 5 */3 * * flock -n /var/lock/update-special-pages /usr/local/bin/update-special-pages > /var/log/updateSpecialPages.log 2>&1 [01:16:38] Can someone look for evidence of that cron job running on terbium as the apache user? I can't read the syslog [01:18:36] Reedy: confirmed it is in the cron of user apache, but that log file doesn't exist [01:19:32] mutante: Yup, as I had and also commented on https://bugzilla.wikimedia.org/show_bug.cgi?id=53227#c37 :P [01:19:52] 39 even [01:20:33] Could it be failing because it can't write to that log file? 
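The failure being circled here (01:20, confirmed at 01:42) sits in the shell redirection of the crontab quoted at 01:16, not in update-special-pages itself: cron's shell has to open the log file for the '>' redirect as the apache user before the script ever runs. A minimal sketch of the diagnosis and the fix, assuming the root-owned file left by the earlier manual run:

    # the log file was created root-owned by a previous manual run:
    ls -l /var/log/updateSpecialPages.log
    # -rw-r--r-- 1 root root 6.7K Nov  6 01:21 updateSpecialPages.log

    # apache cannot open it for writing, so cron's shell fails the
    # '>' redirection and exits before update-special-pages starts:
    sudo -u apache sh -c '> /var/log/updateSpecialPages.log' \
      || echo "redirection fails, just like the silent cron job"

    # fix (what "puppetize that logfile" at 01:47 amounts to):
    sudo chown apache /var/log/updateSpecialPages.log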
[01:21:41] Reedy: no, it works [01:22:09] it creates the logfile when i run it as apache [01:22:36] i stopped it right away, but; [01:22:42] Uncategorizedcategories got 24 rows in 0.01s [01:23:03] That log file should still be there after [01:23:20] It's not helped by it running on over 800 wikis, doing tens of jobs per wiki... [01:23:41] "It doesn't work" [01:23:45] It's fine on aawiki! [01:24:21] Reedy: sudo -u apache flock -n ... bla bla [01:24:35] Yeah [01:24:39] logfile ends up being [01:24:41] -rw-r--r-- 1 root root 6.7K Nov 6 01:21 updateSpecialPages.log [01:24:52] I believe I did something like that last time for the last manual run [01:25:31] i can run it in a screen, as apache [01:25:51] but why did it not run by itself [01:26:06] or is the logfile just removed since the last run [01:27:14] I don't think so [01:27:51] The following data is cached, and was last updated 05:57, 10 September 2013. A maximum of 2,000 results are available in the cache. [01:28:24] That's apparently the last run, but run by what exactly, I'm not sure [01:29:42] Hour 5, every 3 days? [01:42:39] Reedy: ok, pretty sure it's the permissions, i commented the actual exection in the script, so all it lists is db names [01:42:51] then touched the logfile, let apache write to it [01:42:58] change cronjob to "in a minute".. and works [01:47:42] Reedy: i say next run it will work and the root cause is having to puppetize that logfile.. commented on ticket.. ttyl [01:53:19] (03PS1) 10Yuvipanda: Labs: Rename confusing role name to more accurate role name [operations/puppet] - 10https://gerrit.wikimedia.org/r/93910 [01:53:32] woo, a patch to ops/puppet after so long! [02:00:55] YuviPanda, mutante RobH re the earlier ping - I don't think petan has access, it's just myself and jeremyb [02:02:16] Thehelpfulone: he has signed an NDA, though [02:02:27] Thehelpfulone: since he's admin on toollabs [02:03:10] yeah I knew that - I think a number of volunteers have signed NDAs in the past though for various reasons - and the newer ones specify exactly what the NDA covers AFAIK, so you'd want to double check with legal each time [02:03:23] ah, right [02:21:04] there isn't just one type of NDA... [02:21:07] afaict [02:21:52] !log LocalisationUpdate completed (1.23wmf2) at Wed Nov 6 02:21:52 UTC 2013 [02:22:10] Logged the message, Master [02:32:14] mutante, yeah, I meant the newer ones for volunteers [02:33:36] Thehelpfulone: when asked i said they should copy the one you have :p [02:33:44] or unify [02:33:59] see reply on bz on spam filter.. bbl [02:35:33] !log LocalisationUpdate completed (1.23wmf1) at Wed Nov 6 02:35:33 UTC 2013 [02:35:48] Logged the message, Master [03:23:59] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed Nov 6 03:23:59 UTC 2013 [03:24:16] Logged the message, Master [05:05:32] (03CR) 10Ori.livneh: [C: 032] udp2log: demux now filter groups stricly [operations/puppet] - 10https://gerrit.wikimedia.org/r/93864 (owner: 10Hashar) [05:09:17] !log on tin: running bug 53687 cleanup script (WikimediaMaintenance/bug-53687/fixOrphans.php) on 37 wikis [05:09:35] Logged the message, Master [06:57:55] (03PS1) 10Spage: Enable Flow on QA pages on beta labs. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93927 [07:01:03] (03CR) 10Hashar: [C: 032] "Go go go :]" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93927 (owner: 10Spage) [07:01:13] (03Merged) 10jenkins-bot: Enable Flow on QA pages on beta labs. 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93927 (owner: 10Spage) [07:40:54] (03PS1) 10ArielGlenn: mark stuff in decomm.pp with their rt tickets for easier tracking [operations/puppet] - 10https://gerrit.wikimedia.org/r/93930 [07:41:44] (03CR) 10ArielGlenn: "this is not really meant to get merged, it's just here to aid in eventual removal of this file" [operations/puppet] - 10https://gerrit.wikimedia.org/r/93930 (owner: 10ArielGlenn) [07:43:31] apergos: I wouldn't mind seeing it merged. [07:43:42] seems useful [07:44:10] well this is here mostly so robh can look at it, he was planning n getting rid of this file within days [07:44:30] however if there is a delay on that I'll go ahead [07:46:44] I guess palladium is still in process of being set up? (noticing puppet errors like 'no /var/lib/puppet/server/ssl/certs/puppet.pem') [07:46:58] akosiaris: ^ [07:47:28] ah, rain... [07:47:43] yeah [07:47:59] I was so tired yesterday that slept through the night without waking up at all from the storm [07:48:05] I just woke up and saw my balcony wet [07:48:21] paravoid: apergos: yes palladium is almost setup [07:48:28] I woke at one point, thunder etc, it was nice... not enough to be sleepless, just enough to be lulled back to sleep again [07:48:34] ok, I will ignore then, thanks [07:48:42] the only manual things left to do is copy private and /var/lib/puppet/server/ssl [07:48:49] cool! [07:49:12] private needs some cleanup too [07:49:27] let's move the canonical location in /var/lib/git or something (but maybe keep a checkout in ~) [07:49:38] the hook is buggy iirc [07:49:43] I just got derailed by strontium not having any *working* connection to the switch. [07:49:46] not sure if it uses the gitpuppet account nowadays [07:49:48] paravoid: yes i agree [07:49:55] cool [07:49:58] andrewbogott_afk actually made it better [07:50:13] because it was bad imho but it needs a little more love [07:50:35] i don't think though that cding to /var/lib/git/operations/private is the best way forward [07:50:51] nod [07:51:01] that's why I said "maybe keep a checkout in ~" [07:51:32] probably the repo in ~ needs to be the "gerrit" is this case [07:51:47] and /var/lib/git/operations/private in machines need to be... well those [07:51:47] for priate? [07:51:51] fuck no :) [07:52:00] where would you have it then ? [07:52:03] I don't trust gerrit for *all* of our passwords [07:52:16] it was "gerrit" for a reason [07:52:29] i meant it as the primary source of information [07:52:38] it'd still be stored there [07:52:45] the place where we do changes [07:52:56] gerrit is publically accessible [07:53:04] stop talking about gerrit [07:53:08] it was a metaphor [07:53:11] ah [07:53:27] and a bad one as it seems [07:53:34] :-D [07:53:42] so... reiterating [07:53:48] yeah, start over :) [07:54:11] I think that ~ is a good place for the primary repo to be in [07:54:17] the repo where we commit [07:54:25] and the /var/lib/etcetc repos [07:54:29] just pulling from that one [07:54:36] agreed [07:54:39] :-) [07:54:54] well, probably pushing the other way around, not pulling [07:54:57] but still, same thing [07:55:02] the process though still needs some love though [07:55:09] it does :) [07:55:12] some hooks to avoid commits in /var/lib/etc [07:55:29] some informative messages and a clear documentation [07:56:31] how does this differ from what we have now? [07:57:03] everyday use ? 
it does not [07:57:25] but it sure was not clear to me when i first joined [07:57:32] sure [07:57:48] and I have seen more than once the private dirs in sockpuppet and stafford being out of sync [07:58:05] maybe after I get 'done' with cleanup, I should go on a documentation spree, updating docs for all our processes [07:58:14] and that's what i want to avoid in the future [07:58:19] ah [07:59:04] I think I've been lucky (but I always check that my stuff synced too before leaving) [07:59:05] well ... many of our processes are not clearly documented so if you did that it would be great [07:59:23] I will think about it... it's another long thankless task. but maybe! [07:59:43] i can say thank you. As many times as you 'd like :-) [07:59:49] (recently the number of 'this isn't what the docs say' has gone up) [07:59:51] hahaha [08:00:09] we'll see :-D [08:06:01] (03PS1) 10ArielGlenn: horrible hack to fix up puppet on formey til the host goes away [operations/puppet] - 10https://gerrit.wikimedia.org/r/93932 [08:07:05] (03CR) 10ArielGlenn: [C: 032] horrible hack to fix up puppet on formey til the host goes away [operations/puppet] - 10https://gerrit.wikimedia.org/r/93932 (owner: 10ArielGlenn) [08:15:52] is there currently any ganglia work going on? [08:16:17] some hosts seem to have disappeared all of a sudden [08:16:34] for example cerium and xenon [08:17:50] not by me [08:18:33] was following the cassandra test at https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&tab=ch&vn=&hreg%5B%5D=%28cerium%7Cxenon%7Cpraseodymium%29 [08:23:52] Nov 6 06:31:19 neon kernel: [2556701.921304] EXT3-fs error (device md0): ext3_lookup: deleted inode referenced: 21526 [08:24:06] (sorry, on another track right now) [08:26:19] Nov 6 06:31:18 neon kernel: [2556701.046954] EXT3-fs error (device md0): ext3_lookup: deleted inode referenced: 21531 [08:26:19] Nov 6 06:31:18 neon kernel: [2556701.055236] Aborting journal on device md0. [08:26:19] Nov 6 06:31:18 neon kernel: [2556701.080431] EXT3-fs (md0): error: ext3_journal_start_sb: Detected aborted journal [08:26:20] joy [08:26:37] Nov 6 06:31:18 neon icinga: Warning: Unable to move file '/tmp/checkJELVaB' to check results queue. [08:27:28] (03CR) 10Akosiaris: "Any reason we haven't merged this yet ?" [operations/puppet] - 10https://gerrit.wikimedia.org/r/88441 (owner: 10Reedy) [08:28:24] need a little help here: least disruptive way to fsck root filesystem remotely? [08:29:25] apart from rebooting ? [08:29:46] well yeah.. it's neon, I'll do it if I have to but ugh [08:30:08] if that's the best way then ok [08:30:23] it is [08:30:26] bummer [08:30:30] * apergos gets on it [08:31:11] just touch /forcefsck [08:31:16] and it should it on reboot [08:31:28] k [08:31:47] I assume it would anyways, filesystem errors. 
but done [08:31:58] heh can't touch it cause [08:32:01] ro filesystem :-D [08:32:19] (03PS8) 10Faidon Liambotis: Update font packages to not use virtual packages [operations/puppet] - 10https://gerrit.wikimedia.org/r/88441 (owner: 10Reedy) [08:32:27] (03CR) 10Faidon Liambotis: [C: 032 V: 032] Update font packages to not use virtual packages [operations/puppet] - 10https://gerrit.wikimedia.org/r/88441 (owner: 10Reedy) [08:33:45] well let's see what happens on the reboot [08:35:15] !Log rebooting neon, ext3 errors on root filesystem [08:35:36] Logged the message, Master [08:36:37] yep it's checking *whew* [08:40:02] I wouldn't "whew" [08:40:04] its memory is bad [08:40:09] I've said so two months ago [08:40:15] there's an RT about it [08:40:19] it's going to keep corrupting [08:41:16] I'd rather have it recover now than die a horrible death though [08:42:47] had a bunch of these on reboot: [08:42:49] http://pastebin.com/4ptJMpN0 [08:43:11] and a bunch of [08:43:13] [ 134.306849] md/raid1:md0: redirecting sector 2047336 to other mirror: sda1 [08:43:14] ah, could be a disk failure then [08:43:15] for various sectors [08:43:26] I'll put it all into rt [08:43:52] first let's make sure that everything's running over there [08:50:43] (03CR) 10Akosiaris: "To be clear, parsoid is not a system group (per definition, system group => uid >= 1000, whereas parsoid gid == 1002 )." [operations/puppet] - 10https://gerrit.wikimedia.org/r/91043 (owner: 10Dzahn) [08:56:33] (03PS1) 10ArielGlenn: add empty pmtpa lucene frontends and indexers [operations/puppet] - 10https://gerrit.wikimedia.org/r/93937 [08:58:10] Aaron|home: hey [08:59:19] * Aaron|home lurks at 1AM [08:59:19] sorry, I wasn't going to ask you anything [08:59:19] (03CR) 10ArielGlenn: [C: 032] add empty pmtpa lucene frontends and indexers [operations/puppet] - 10https://gerrit.wikimedia.org/r/93937 (owner: 10ArielGlenn) [08:59:19] you just asked me a question while I was asleep and wanted to answer :) [08:59:19] but it can wait until (your) tomorrow, sure :) [09:04:59] gwicke_away: how's ganglia now? (well.. whenever you're around again) [09:07:15] PROBLEM - search indices - check lucene status page on search1003 is CRITICAL: Connection timed out [09:08:15] RECOVERY - search indices - check lucene status page on search1003 is OK: HTTP OK: HTTP/1.1 200 OK - 269 bytes in 0.002 second response time [09:10:55] (03PS2) 10Akosiaris: sudo -u parsoid access for parsoid admins [operations/puppet] - 10https://gerrit.wikimedia.org/r/91043 (owner: 10Dzahn) [09:15:05] (03PS1) 10Hashar: beta: auto update Parsoid dependencies using npm [operations/puppet] - 10https://gerrit.wikimedia.org/r/93939 [09:20:47] (03CR) 10Hashar: "Roan, Gabriel, that change would make the Jenkins job updating the code to use 'npm install' for Parsoid, thus automatically updating the " [operations/puppet] - 10https://gerrit.wikimedia.org/r/93939 (owner: 10Hashar) [09:21:15] (03CR) 10ArielGlenn: "not sure if 303 is the way to go, see comment on bug" [operations/apache-config] - 10https://gerrit.wikimedia.org/r/92925 (owner: 10Dzahn) [09:31:11] apergos: may you blindly merge in a beta change for me please? 
https://gerrit.wikimedia.org/r/#/c/93939/ :D [09:31:15] apergos: tested already [09:31:38] no, but I'll look at in a little while (doing other reviews right now) [09:31:45] thx :) [09:31:49] * apergos not interested in blind gerrit dates :-P [09:32:00] ahah [09:32:10] I will double check my change meanwhile [09:44:25] (03CR) 10Akosiaris: [C: 032] sudo -u parsoid access for parsoid admins [operations/puppet] - 10https://gerrit.wikimedia.org/r/91043 (owner: 10Dzahn) [09:50:15] PROBLEM - Varnish traffic logger on cp1046 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:50:25] PROBLEM - Varnish HTCP daemon on cp1046 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:50:35] PROBLEM - Varnish HTTP mobile-backend on cp1046 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:54:05] RECOVERY - Varnish traffic logger on cp1046 is OK: PROCS OK: 2 processes with command name varnishncsa [09:54:15] RECOVERY - Varnish HTCP daemon on cp1046 is OK: PROCS OK: 1 process with UID = 111 (vhtcpd), args vhtcpd [09:54:25] RECOVERY - Varnish HTTP mobile-backend on cp1046 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.001 second response time [09:58:43] (03PS2) 10ArielGlenn: toy scripts playing with long-range compression [operations/dumps] (ariel) - 10https://gerrit.wikimedia.org/r/63139 (owner: 10Twotwotwo) [10:03:30] (03CR) 10ArielGlenn: [C: 032] "Thanks! If you have more stuff like this hidden away, please get it in." [operations/dumps] (ariel) - 10https://gerrit.wikimedia.org/r/63139 (owner: 10Twotwotwo) [10:04:07] apergos: it was originally posted on the mailing list, good thing I convinced him to submit it as patch ;) [10:04:11] thanks for merging [10:05:20] paravoid: is there documention about why we use gerrit? I remember the debate to replace it, but can't find the outcome. [10:06:12] ok, before I start the next one let me look at yours hashar [10:08:27] matanya: see the history of https://www.mediawiki.org/wiki/Git/Conversion/issues and discussions linked from there, there's plenty [10:08:51] yes, and you can keep on convincing him and others to submit their tools [10:08:55] ideally they would do two things [10:09:03] 1) get themselves git accounts, ask for branches [10:09:12] Nemo_bis: i mean this: https://www.mediawiki.org/wiki/Git/Gerrit_evaluation [10:09:21] 2) put their tools in their branches and ask for merge into my branch when they like [10:09:33] matanya: yes, there's also the parent page of the page I linked etc. [10:09:52] but no result listed there [10:10:00] result? [10:10:15] I"m happy to have copies of their stuff in with the production dump etc stuff, but if they have their own branches they can commit at will, which is a plus [10:11:31] Nemo_bis: the tool of choice after the discussion [10:11:32] matanya: ah, you mean for gerrit vs. other similar tools evaluation; there was an email [10:11:46] yes Nemo_bis [10:11:47] wow really? npm install? :-( [10:13:12] Nemo_bis: and the content of it was "gerrit" ? [10:14:53] hashar: can some sort of git deploy scheme not work for this? [10:16:39] matanya: it wasn't hard to find with the information on that page (like at 5th line)... I copied it on the page though, thanks for heads up [10:18:23] thank you [10:19:42] (03CR) 10ArielGlenn: "I'll try an email ping, would like to get her active and committing." [operations/dumps] (ariel) - 10https://gerrit.wikimedia.org/r/63390 (owner: 10Sanja pavlovic) [10:23:01] has anyone played with forcing all traffic on mobile to HTTPS? 
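For the contributor workflow apergos sketches at 10:09 (own branch in operations/dumps, then a merge request into her branch), the steps would look roughly like this; USER-tools is a placeholder branch name and the Gerrit SSH URL follows the usual port-29418 convention:

    git clone ssh://USER@gerrit.wikimedia.org:29418/operations/dumps
    cd dumps
    git checkout -b USER-tools    # your own branch, commit at will
    git push origin USER-tools
    # then ask for a merge of USER-tools into the 'ariel' branch,
    # the same branch the long-range compression scripts above went into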
[10:23:01] apergos: git deploy does not work on beta / labs :( [10:23:08] apergos: but yeah, eventually we will migrate to git deploy [10:23:27] ah at all? well that sucks majorly [10:24:08] ok well I hate it but I'll merge it (untested) [10:24:51] mark, paravoid IMPORTANT: has there been any auto-redirect changes that force HTTPS for mobile/zero? [10:25:10] (03CR) 10ArielGlenn: [C: 032] beta: auto update Parsoid dependencies using npm [operations/puppet] - 10https://gerrit.wikimedia.org/r/93939 (owner: 10Hashar) [10:25:15] not that I'm aware of [10:25:25] not from the ops side for sure [10:25:45] paravoid: i'm trying to test zero, and i get bounced to HTTPS - which means ALL of zero right now might be broken [10:25:50] and everyone is getting charged [10:29:10] same, not that I'm aware of - but if anywhere, it's probably done by mediawiki ;) [10:33:18] so.. etherpad upgrade time [10:42:25] !log rebooted zirconium for kernel upgrade [10:42:42] !log upgraded etherpad-lite to 1.3.0 on zirconium [10:42:45] (03PS1) 10Mark Bergsma: Setup LVS monitoring of text-lb.esams [operations/puppet] - 10https://gerrit.wikimedia.org/r/93943 [10:42:45] Logged the message, Master [10:43:01] Logged the message, Master [10:43:57] PROBLEM - Host zirconium is DOWN: CRITICAL - Host Unreachable (208.80.154.41) [10:44:14] (03CR) 10Mark Bergsma: [C: 032] Setup LVS monitoring of text-lb.esams [operations/puppet] - 10https://gerrit.wikimedia.org/r/93943 (owner: 10Mark Bergsma) [10:45:27] RECOVERY - Host zirconium is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [10:46:44] mark: so, shall I merge the US West -> ulsfo? [10:46:49] yes [10:46:55] what does the new epl version have in it, anything nice? [10:46:58] mark: I don't think much would have been done in core or centralauth since the large https only push / yurik [10:47:07] although i don't follow development that much [10:47:24] (03PS2) 10Faidon Liambotis: Point US West Coast states to ulsfo [operations/dns] - 10https://gerrit.wikimedia.org/r/93752 [10:47:59] (03CR) 10Faidon Liambotis: [C: 032] Point US West Coast states to ulsfo [operations/dns] - 10https://gerrit.wikimedia.org/r/93752 (owner: 10Faidon Liambotis) [10:48:40] (03PS2) 10Faidon Liambotis: Add Canada and its provinces/territories [operations/dns] - 10https://gerrit.wikimedia.org/r/93756 [10:49:23] mark: canada? it's 4 million more [10:49:43] ok [10:49:57] can you watch the graphs by the time US/CA wake up? [10:50:02] I can [10:50:17] observium? [10:50:25] what are the graphs you're specifically monitoring? [10:50:35] (03CR) 10Faidon Liambotis: [C: 032] Add Canada and its provinces/territories [operations/dns] - 10https://gerrit.wikimedia.org/r/93756 (owner: 10Faidon Liambotis) [10:50:44] yeah, that tinet link [10:50:57] it shouldn't go above 60% or so [10:52:32] 529. xe-0/0/3.98 [10:52:33] Transit: cr1-ulsfo [10:52:43] right? [10:53:20] no, xe-0/0/3 [10:53:55] well yeah, the other one was the vlan [10:54:21] i know, that doesn't include the other vlan [10:54:32] right, I'm just looking at that [10:54:42] ok [11:03:51] PROBLEM - Host text-lb.esams.wikimedia.org is DOWN: CRITICAL - Network Unreachable (91.198.174.192) [11:04:12] hey [11:05:48] (03CR) 10ArielGlenn: [C: 031] "why inherit (from jobrunner::base) when you could include? is there inheritance you are using?" [operations/puppet] - 10https://gerrit.wikimedia.org/r/77034 (owner: 10Hashar) [11:11:02] p858snake|l: when was that push? [11:11:35] when we went https only months ago? 
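A side note on the HTTPS bounce yurik reports at 10:25: since (as it turns out at 11:40) the culprit was a stray forceHTTPS cookie in his browser, a cookie-free client is the quickest way to tell a server-side redirect from client state. A sketch, with the hostname illustrative and the cookie value assumed:

    # no cookies: if nothing forces HTTPS server-side, there is no bounce
    curl -sI http://en.zero.wikipedia.org/ | head -n 1

    # replay with the suspect cookie: the Location header shows the redirect
    curl -sI -b 'forceHTTPS=true' http://en.zero.wikipedia.org/ | grep -i '^Location'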
[11:11:39] have a look at the blog [11:12:47] akosiaris1: you perhaps don't even know you tweet, but you got a RT ;) https://twitter.com/EtherpadOrg/status/398042986308980736 [11:12:52] p858snake|l: when was that push? [11:13:01] sorry, dup [11:13:33] i tried the local carrier - they are fine [11:13:46] so it must be just non-carriers that get bounced somehow [11:15:20] NEW: Localisation updates from http://translatewiki.net. ohhhh that's nice! [11:15:32] yurik: https://blog.wikimedia.org/2013/09/10/https-by-default-beta-program/ [11:27:31] RECOVERY - Host text-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 95.62 ms [11:27:45] Nemo_bis: I also identi.ca don't I ? [11:28:08] akosiaris1: I doubt it, that was killed [11:28:13] thank god :-) [11:28:19] or whatever... [11:28:44] seems like we are a good user of etherpad foundation :-) [11:28:55] you can find its swan scream (?) here https://identi.ca/notice/101377209 [11:29:01] yes, they mention WMF in their homepage [11:29:24] yeah but they used to only mention the labs instance [11:29:33] let's see... did they update that info? [11:39:21] PROBLEM - LVS HTTP IPv4 on wikidata-lb.esams.wikimedia.org is CRITICAL: Connection timed out [11:39:31] PROBLEM - LVS HTTP IPv4 on text-lb.esams.wikimedia.org is CRITICAL: Connection timed out [11:40:59] mark & paravoid, & p858snake|l - thanks for your help - somehow on my browser i had forceHTTPS cookie set. Removing it fixed the issue, but i still don't know how i could have received it [11:44:12] (03PS1) 10Mark Bergsma: Add AAAA records & reverse DNS for the esams text Varnish servers [operations/dns] - 10https://gerrit.wikimedia.org/r/93945 [11:45:10] (03CR) 10Mark Bergsma: [C: 032] Add AAAA records & reverse DNS for the esams text Varnish servers [operations/dns] - 10https://gerrit.wikimedia.org/r/93945 (owner: 10Mark Bergsma) [11:46:44] akosiaris1: I think their info is more generic than that [11:51:22] dont appear to be able to get to http://www.wikidata.org/wiki/ :/ [11:52:48] looks like something somewhere is wrong for half or the world or so :P [12:00:45] huh [12:00:49] mark: ^ [12:00:53] ak^ [12:00:56] what? [12:01:02] can't access wikidata [12:01:07] 11:39 <+icinga-wm> PROBLEM - LVS HTTP IPv4 on wikidata-lb.esams.wikimedia.org is CRITICAL: Connection timed out [12:01:12] :< [12:01:14] from europe [12:01:20] seems to work on my proxy, via us [12:01:40] what IP do you see wikidata being resolved to? [12:01:43] can someone help? [12:01:56] 208.80.152.218 [12:02:04] www.wikidata.org [12:02:08] works for :208.80.154.242 [12:02:08] with the www [12:02:19] PING wikidata-lb.esams.wikimedia.org (91.198.174.237): 56 data bytes [12:02:22] 91.198.174.237 [12:02:32] and that does not work for you? [12:02:39] nope [12:02:41] nope [12:02:53] ok, because it works for me [12:02:54] from the us PING wikidata-lb.eqiad.wikimedia.org (208.80.154.242) 56(84) bytes of data. [12:02:54] looking [12:03:02] i can ping, but not access wikidata [= [12:03:04] esams does not work [12:03:29] oh, it works for me because of ipv6, heh [12:03:33] oh [12:03:51] looking [12:03:54] k [12:05:04] huh [12:06:05] is https://gerrit.wikimedia.org/r/#/c/93945/ the one to blame? 
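The debugging at 12:01-12:07 does the right thing by hand: resolve the name, then probe each service address over HTTP rather than trusting ping (ICMP reachability says nothing about an LVS service on port 80). The same check as a loop, using curl's --resolve and the two addresses quoted above; an output of '000' means the HTTP probe timed out:

    for ip in 91.198.174.237 208.80.154.242; do
      printf '%s  ' "$ip"
      curl -s -o /dev/null -m 10 -w '%{http_code}\n' \
           --resolve www.wikidata.org:80:"$ip" http://www.wikidata.org/wiki/
    done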
[12:06:07] paravoid: http only also :P [12:06:12] I know [12:06:14] this is weird [12:07:14] RECOVERY - LVS HTTP IPv4 on wikidata-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.0 301 Moved Permanently - 719 bytes in 0.192 second response time [12:07:18] ok, I switched traffic to the other LVS box [12:07:28] working again [= [12:07:55] !log stopping pybal on amslvs1, wikidata-lb:80 issues, pending further investigation [12:07:56] of course it just redirects us straight to https :P [12:08:10] Logged the message, Master [12:11:45] paravoid: same thing happened to text-lb.esams.wikimedia.org ipv4 http lvs at the same time (not sure if you want to do the same for that too as it still times out (not sure where it is used though..)) [12:11:54] nowhere [12:11:55] it's new [12:12:17] [= [12:18:42] !log started pybal on amslvs1 again; seems to work again, root cause unknown [12:18:47] oh well... [12:18:55] Logged the message, Master [12:18:57] huh [12:19:07] hah! [12:40:46] (03PS2) 10ArielGlenn: decommision db3[29] db4[2-6] db5[1235689] [operations/puppet] - 10https://gerrit.wikimedia.org/r/93052 (owner: 10Springle) [12:41:17] (03PS1) 10Yurik: Made netmapper updates more frequent [operations/puppet] - 10https://gerrit.wikimedia.org/r/93946 [12:42:03] bblack ^ [12:52:14] PROBLEM - Varnish HTCP daemon on cp1046 is CRITICAL: NRPE: Call to popen() failed [12:52:14] PROBLEM - Varnish traffic logger on cp1046 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:52:34] PROBLEM - Varnish HTTP mobile-backend on cp1046 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:54:05] RECOVERY - Varnish traffic logger on cp1046 is OK: PROCS OK: 2 processes with command name varnishncsa [12:54:15] RECOVERY - Varnish HTCP daemon on cp1046 is OK: PROCS OK: 1 process with UID = 111 (vhtcpd), args vhtcpd [12:54:24] RECOVERY - Varnish HTTP mobile-backend on cp1046 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.001 second response time [13:53:29] akosiaris: can you please merge: https://gerrit.wikimedia.org/r/#/c/93798/ [13:54:36] dr0ptp4kt: can you please comment on https://gerrit.wikimedia.org/r/#/c/92288/ [13:55:43] matanya: is the queue drained indeed ? [13:55:55] yes, confirmed by gwicke_away [13:56:04] see: https://gerrit.wikimedia.org/r/#/c/93800/ [13:57:03] (03CR) 10Akosiaris: [C: 032] jobrunner: remove legacy parsoid jobs [operations/puppet] - 10https://gerrit.wikimedia.org/r/93798 (owner: 10Matanya) [13:57:51] done [13:58:08] thanks [14:08:19] (03PS1) 10Ottomata: Updating README with jmxtrans documentation; adding example jmxtrans json file [operations/puppet/kafka] - 10https://gerrit.wikimedia.org/r/93947 [14:09:03] hashar: did all jenkins jobs migrate to their new /srv/ssd/gerrit on gallium? [14:09:06] (03CR) 10Ottomata: [C: 032 V: 032] Updating README with jmxtrans documentation; adding example jmxtrans json file [operations/puppet/kafka] - 10https://gerrit.wikimedia.org/r/93947 (owner: 10Ottomata) [14:09:38] matanya: probably. haven't had time to check it out but I will clean up the old one eventually [14:10:56] (03PS1) 10Ottomata: Fixing kafka-jmxtrans.json.md link in README.md [operations/puppet/kafka] - 10https://gerrit.wikimedia.org/r/93948 [14:11:06] (03CR) 10Ottomata: [C: 032 V: 032] Fixing kafka-jmxtrans.json.md link in README.md [operations/puppet/kafka] - 10https://gerrit.wikimedia.org/r/93948 (owner: 10Ottomata) [14:11:07] hashar: if you wish, i can create a patch (if it will help you) [14:11:19] sure :-] [14:11:45] (03PS1) 10Ottomata: Adding port in ganglia URL in README [operations/puppet/kafka] - 10https://gerrit.wikimedia.org/r/93949 [14:11:54] (03CR) 10Ottomata: [C: 032 V: 032] Adding port in ganglia URL in README [operations/puppet/kafka] - 10https://gerrit.wikimedia.org/r/93949 (owner: 10Ottomata) [14:17:46] (03PS1) 10Matanya: gerrit: remove jenkins refer to old directory [operations/puppet] - 10https://gerrit.wikimedia.org/r/93950 [14:17:51] hashar: ^ [14:25:11] matanya: add me as reviewer [14:25:18] ah no [14:25:20] already done [14:25:33] no that is unneeded [14:25:37] need to clean up first [14:25:41] ok [14:25:45] so just ignore [14:25:52] (03Abandoned) 10Hashar: gerrit: remove jenkins refer to old directory [operations/puppet] - 10https://gerrit.wikimedia.org/r/93950 (owner: 10Matanya) [14:26:00] i might create a new clean and nice one [14:26:38] heya paravoid, you around? I want to finish up the varnishkafka packaging [14:26:43] got some qs [14:26:46] I am [14:29:18] PROBLEM - Frontend Squid HTTP on amssq33 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:29:18] PROBLEM - Frontend Squid HTTP on amssq44 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:29:38] PROBLEM - Backend Squid HTTP on amssq44 is CRITICAL: Connection timed out [14:29:58] PROBLEM - Backend Squid HTTP on amssq33 is CRITICAL: Connection timed out [14:30:10] RECOVERY - Frontend Squid HTTP on amssq44 is OK: HTTP OK: HTTP/1.0 200 OK - 1417 bytes in 0.193 second response time [14:30:21] ah hey so [14:30:21] (03PS1) 10Ottomata: Adding Yuvi Panda on analytics nodes - RT 6103 [operations/puppet] - 10https://gerrit.wikimedia.org/r/93952 [14:30:23] first one [14:30:28] RECOVERY - Backend Squid HTTP on amssq44 is OK: HTTP OK: HTTP/1.0 200 OK - 1425 bytes in 0.191 second response time [14:30:30] what's the best way to generate a manpage for varnishkafka [14:30:42] i can use help2man, or just make a static one [14:30:49] RECOVERY - Backend Squid HTTP on amssq33 is OK: HTTP OK: HTTP/1.0 200 OK - 1423 bytes in 0.457 second response time [14:30:55] but it'd be nice if the makefile or the package did it, right? [14:30:59] automatically [14:31:00] ? [14:31:08] RECOVERY - Frontend Squid HTTP on amssq33 is OK: HTTP OK: HTTP/1.0 200 OK - 1408 bytes in 0.530 second response time [14:31:20] (03CR) 10Ottomata: [C: 032 V: 032] Adding Yuvi Panda on analytics nodes - RT 6103 [operations/puppet] - 10https://gerrit.wikimedia.org/r/93952 (owner: 10Ottomata) [14:31:26] just write groff? [14:31:48] well, i was thinking it'd be nice if the file was just generated from the help message automatically [14:31:58] that way if it changes, the man page will be updated when the package is built [14:32:00] or at compile [14:32:06] not worth it? [14:32:06] manpages should be more verbose than --help output [14:32:12] or else there's no point in having a manpage at all [14:32:17] imho [14:32:29] hmm, ok well I wasn't going to write any more than that [14:32:41] so no manpage then?
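Of the two options weighed at 14:30-14:32, help2man generation versus a hand-written groff page, the help2man route is a one-liner that debian/rules could run at build time, so the page tracks --help automatically; the --name text here is an assumption. It does require the binary to answer --help and --version, and paravoid's objection stands: the result is only ever as verbose as --help itself.

    # generate varnishkafka.1 from the program's own --help/--version output
    help2man --no-info \
      --name 'produce Varnish log data to Apache Kafka' \
      ./varnishkafka > varnishkafka.1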
[14:33:34] !log dist-upgrade && reboot on amssq47..62 [14:33:51] Logged the message, Master [14:38:08] PROBLEM - Varnish HTTP text-backend on amssq47 is CRITICAL: Connection timed out [14:38:58] PROBLEM - Host amssq47 is DOWN: PING CRITICAL - Packet loss = 100% [14:39:58] RECOVERY - Varnish HTTP text-backend on amssq47 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.192 second response time [14:40:08] RECOVERY - Host amssq47 is UP: PING OK - Packet loss = 0%, RTA = 95.64 ms [14:43:51] (03PS1) 10Ottomata: Including analytics::users on namenodes. [operations/puppet] - 10https://gerrit.wikimedia.org/r/93954 [14:44:09] (03CR) 10Ottomata: [C: 032 V: 032] Including analytics::users on namenodes. [operations/puppet] - 10https://gerrit.wikimedia.org/r/93954 (owner: 10Ottomata) [14:49:48] PROBLEM - Varnish HTTP text-backend on amssq51 is CRITICAL: Connection refused [14:51:58] PROBLEM - Varnish HTTP text-backend on amssq54 is CRITICAL: Connection refused [14:57:08] PROBLEM - Varnish HTTP text-backend on amssq57 is CRITICAL: Connection refused [14:59:38] PROBLEM - Host amssq59 is DOWN: PING CRITICAL - Packet loss = 100% [15:00:20] ottomata: hey. how are you ? [15:00:26] hiyaaa [15:00:27] good! [15:00:40] how are you? [15:00:57] nice. Check your email. You will find a present regarding snappy [15:01:06] I am fine thanks [15:01:09] RECOVERY - Host amssq59 is UP: PING OK - Packet loss = 0%, RTA = 96.66 ms [15:01:24] sorry it took so long [15:02:08] PROBLEM - Varnish HTTP text-backend on amssq60 is CRITICAL: Connection refused [15:03:08] PROBLEM - Varnish HTTP text-backend on amssq59 is CRITICAL: Connection refused [15:03:09] PROBLEM - NTP on amssq52 is CRITICAL: NTP CRITICAL: Offset unknown [15:04:57] awessooome, thanks! [15:05:04] no problem, hasn't been an issue yet [15:05:33] so ok, if I remove my manual libsnappyjava.so, remove the libsnappy1 package that is on the kafka brokers now [15:05:35] and then install this one [15:05:37] it should work? [15:05:54] PROBLEM - Varnish HTTP text-frontend on amssq62 is CRITICAL: Connection refused [15:06:00] yes but you can upgrade the package.. no need to remove [15:06:20] and when you give me the thumbs up, in apt it goes [15:06:29] so no more manual stuff. [15:06:55] RECOVERY - Varnish HTTP text-frontend on amssq62 is OK: HTTP OK: HTTP/1.1 200 OK - 199 bytes in 0.194 second response time [15:07:25] ottomata: a question. Do all of these people requesting access to hadoop have some guidelines on how to proceed with whatever they want do ? Some documentation ? Just curious cause I wanna see if I can use that too [15:08:14] RECOVERY - NTP on amssq52 is OK: NTP OK: Offset -0.001010417938 secs [15:08:21] ottomata: can we please cleanup role/analytics.pp? [15:08:22] it's very confusing [15:08:24] not really, they just want to play [15:08:26] sure [15:09:19] first, having both a base class and a common class is confusing [15:09:33] second, the ::dclass class being a role class is also confusing, this isn't a role [15:09:44] and also, why do all analytics nodes need dclass? 
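On the snappy package at 15:06 ("you can upgrade the package.. no need to remove"): dpkg installs a newer .deb straight over the old version in one step, after which the hand-copied library can be retired. Package and file names below are assumptions based on the conversation:

    # install the rebuilt package over the existing one
    sudo dpkg -i libsnappy1_*.deb
    dpkg -l 'libsnappy*'          # confirm the upgraded version

    # dpkg -S reports which package owns a path, which separates the
    # packaged libsnappyjava.so from the manually copied one (unowned)
    dpkg -S libsnappyjava.so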
[15:09:44] RECOVERY - Varnish HTTP text-backend on amssq51 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.191 second response time [15:09:54] RECOVERY - Varnish HTTP text-backend on amssq57 is OK: HTTP OK: HTTP/1.1 200 OK - 190 bytes in 0.192 second response time [15:09:54] RECOVERY - Varnish HTTP text-backend on amssq54 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.193 second response time [15:09:58] they dont' all need it, but the hadoop nodes do [15:10:14] RECOVERY - Varnish HTTP text-backend on amssq59 is OK: HTTP OK: HTTP/1.1 200 OK - 190 bytes in 0.193 second response time [15:10:24] RECOVERY - Varnish HTTP text-backend on amssq60 is OK: HTTP OK: HTTP/1.1 200 OK - 190 bytes in 0.194 second response time [15:12:24] PROBLEM - Host mw72 is DOWN: PING CRITICAL - Packet loss = 100% [15:13:24] RECOVERY - Host mw72 is UP: PING OK - Packet loss = 0%, RTA = 35.34 ms [15:15:43] (03PS1) 10Mark Bergsma: Add records text-lb and login-lb.esams [operations/dns] - 10https://gerrit.wikimedia.org/r/93955 [15:16:09] (sorry meeting) [15:16:20] (03CR) 10Mark Bergsma: [C: 032] Add records text-lb and login-lb.esams [operations/dns] - 10https://gerrit.wikimedia.org/r/93955 (owner: 10Mark Bergsma) [15:20:14] PROBLEM - NTP on amssq62 is CRITICAL: NTP CRITICAL: Offset unknown [15:21:55] (03CR) 10Faidon Liambotis: "The only datacenter that has no private transport is esams (for ulsfo works just fine and that's the expectation for any new DCs that we a" [operations/dns] - 10https://gerrit.wikimedia.org/r/93903 (owner: 10Ottomata) [15:22:01] ottomata: ^ [15:22:29] (03PS1) 10Faidon Liambotis: Kill the java module with fire [operations/puppet] - 10https://gerrit.wikimedia.org/r/93956 [15:24:12] no that's not true [15:24:20] what is not true? [15:24:34] we may not always have transport from each caching dc [15:24:36] (03CR) 10Ottomata: [C: 032] Kill the java module with fire [operations/puppet] - 10https://gerrit.wikimedia.org/r/93956 (owner: 10Faidon Liambotis) [15:25:02] you mean for future caching DCs? [15:25:14] RECOVERY - NTP on amssq62 is OK: NTP OK: Offset 0.003049135208 secs [15:25:25] yeah [15:25:33] iiinnnteresting, IPv6 eh? [15:25:37] that would be cool with me [15:25:46] hm [15:26:22] it would need acl changes [15:28:00] do you think that's worth it? i'm going to have to add firewall rules if we gave these nodes public IPs anyway [15:28:42] so, wait, how would this work? the nodes all have IPv6 addies [15:28:44] it's all on fixed tcp port nrs? [15:28:51] How to run bot from http://tools.wmflabs.org ? [15:28:53] yes [15:28:57] then we can do that [15:29:05] :9092 [15:29:12] not that I think it's a huge deal to give public IPs to those analytics nodes [15:29:19] yeah [15:29:34] but, well, if it's solely for this, really, no point [15:29:57] yeah i'd prefer if they remained internal, we're only giving them public so other DCs can talk to them directly [15:30:06] so if we can do that via IPv6, that's fine [15:30:16] would I then also have to bother with local firewall rules? [15:30:24] or could we just rely on the network acl? [15:30:31] just acl [15:30:40] great, then this is less complex (I think?) 
[15:30:46] it's not complex at all [15:30:51] great [15:30:58] as long as both apache kafka and varnishkafka do ipv6 [15:31:03] I hope they do, it's 2013 :) [15:31:20] (03PS2) 10BBlack: Made netmapper updates more frequent [operations/puppet] - 10https://gerrit.wikimedia.org/r/93946 (owner: 10Yurik) [15:31:56] (03CR) 10BBlack: [C: 032] "Not that it should matter given the human delays in updating the source list, but it's not like it hurts anything either." [operations/puppet] - 10https://gerrit.wikimedia.org/r/93946 (owner: 10Yurik) [15:32:16] OMG FIVE MINUTES MORE!!!111 [15:32:20] OPS IS DELAYING US [15:32:59] and don't you dare leaving that change for review for another few days [15:33:16] :) [15:33:18] paravoid: it usually takes about 15-20 for some reason, and when we are testing things with the partner on the phone, it IS important [15:33:28] but thank you for your good humour [15:35:59] (03PS1) 10Mark Bergsma: Send esams wikivoyage & wikidata traffic to the Varnish cluster [operations/dns] - 10https://gerrit.wikimedia.org/r/93958 [15:37:52] yurik: the max delay with the old numbers is 11:29, with an average of 5:45. With the new numbers it's 5:45 and 2:52. [15:38:01] if you're seeing 15+ minute delays, the holdup is elsewhere... [15:38:39] bblack: that's theory - my practice (tested several times) - i change things a few min before n:n0, and it takes about 15-19 min for it to start working [15:39:04] perhaps, will need to think where else it might get cached [15:39:18] thx for +2ing [15:39:25] yurik: you should validate that your changes appear to wget quickly [15:39:34] PROBLEM - LVS HTTP IPv4 on text-lb.esams.wikimedia.org is CRITICAL: Connection timed out [15:39:42] bblack: thx, will test that shortly [15:40:53] could be something on the other end of the process, too, depending on how you define "start working". Just because netmapper starts using new X-CS codes doesn't mean the result isn't cached elsewhere [15:41:10] like... in varnish [15:41:24] wait, varnish caches? 
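bblack's two caveats at 15:39-15:40, validate with wget/curl that the change is visible quickly, and remember that netmapper picking up new X-CS codes does not help if varnish is still serving a cached response, can be watched in one loop. A sketch, with the URL a placeholder; Age and X-Cache are the usual cache-hit indicators:

    # once a minute: has the change reached the edge, and did the
    # response come from cache?
    watch -n 60 \
      "curl -sI 'http://en.zero.wikipedia.org/wiki/Main_Page' \
         | egrep -i '^(Age|X-Cache|Last-Modified)'"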
[15:41:44] I thought its job was to consume gobs of memory and make my life complicated :) [15:58:18] (03PS1) 10Mark Bergsma: Include role::cache::text instead of role::cache::varnish::text [operations/puppet] - 10https://gerrit.wikimedia.org/r/93961 [15:59:28] (03Abandoned) 10Hashar: Merge upstream 'v0.7.1' into master [operations/debs/jenkins-debian-glue] - 10https://gerrit.wikimedia.org/r/93476 (owner: 10Hashar) [16:00:00] (03PS2) 10Mark Bergsma: Include role::cache::text instead of role::cache::varnish::text [operations/puppet] - 10https://gerrit.wikimedia.org/r/93961 [16:00:28] (03Abandoned) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [operations/debs/latexml] - 10https://gerrit.wikimedia.org/r/90717 (owner: 10Hashar) [16:01:36] (03CR) 10Mark Bergsma: [C: 032] Include role::cache::text instead of role::cache::varnish::text [operations/puppet] - 10https://gerrit.wikimedia.org/r/93961 (owner: 10Mark Bergsma) [16:03:50] (03CR) 10Andrew Bogott: [C: 032] Labs: Rename confusing role name to more accurate role name [operations/puppet] - 10https://gerrit.wikimedia.org/r/93910 (owner: 10Yuvipanda) [16:04:21] (03CR) 10Andrew Bogott: [V: 032] Labs: Rename confusing role name to more accurate role name [operations/puppet] - 10https://gerrit.wikimedia.org/r/93910 (owner: 10Yuvipanda) [16:06:27] RECOVERY - LVS HTTP IPv4 on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 63530 bytes in 1.092 second response time [16:09:33] (03PS1) 10Andrew Bogott: Fix typo in role name [operations/puppet] - 10https://gerrit.wikimedia.org/r/93962 [16:09:37] PROBLEM - LVS HTTP IPv4 on text-lb.esams.wikimedia.org is CRITICAL: Connection timed out [16:10:36] so noone's wondering about LVS pages at all? :) [16:10:52] I'm kind of wondering about it [16:11:05] ok, reason is because service IPs apparently weren't bound yet to the realservers [16:11:07] (03CR) 10Andrew Bogott: [C: 032] Fix typo in role name [operations/puppet] - 10https://gerrit.wikimedia.org/r/93962 (owner: 10Andrew Bogott) [16:11:10] but I'm also passing out on my keyboard, so I thought having a root window open may have been a bad idea [16:11:14] there's not traffic on that cluster yet, but i'm hoping to change that real soon :) [16:11:22] I know that you were working on them and that text-lb isn't live yet [16:11:27] RECOVERY - LVS HTTP IPv4 on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 63532 bytes in 0.489 second response time [16:11:31] i know you know [16:11:37] I was wondering if anyone else knew or cared ;) [16:11:41] :-) [16:11:50] because pretty soon, that page coming in is "oh crap" time [16:12:06] the good news of course, is that we won't have 10 of them anymore, one per project [16:12:19] yay [16:12:28] although it was funny [16:12:34] in the office, yes [16:12:36] having my phone beep once wasn't very alarming [16:12:46] "someone's just texting me" [16:12:53] but hearing ten of them was "oh fuck" [16:12:55] no special ringtone? [16:13:01] not anymore dammit [16:13:12] I don't get senderid anymore :( [16:13:14] just random numbers [16:13:20] sigh [16:13:22] should finish that arduino thing [16:13:33] there were some scamming texts around here lately [16:13:44] so I think vodafone might just blocked free text senderid [16:14:10] I've been watching you chat in here about other things and figured if there was an actual emergency you might be acting differently :-P [16:14:28] (but yes I look in here after every one of these pages.. 
just in case) [16:14:52] :) [16:16:04] oh sigh [16:16:12] varnish debs not finishing postinst [16:16:21] because persistent backend didn't come up [16:16:28] and I started varnish manually, so it's up now [16:16:34] so future postinst retries also fail [16:16:51] vi /var/lib/dpkg/info/varnish.postinst [16:16:56] comment-out the update-rc.d call [16:16:58] apt-get -f install [16:17:02] yeah I do that often [16:17:03] ok mark, just tested, ipv6 works, but not so much by address, because of how librdkafka parses the addy:port, [16:17:05] k [16:17:14] do these have ipv6 hostnames? if not can we add some? [16:17:29] you can add some yes [16:17:31] [2620:...]:80 is the standard format [16:17:35] ok [16:17:37] just like you did with the (ulsfo) varnish servers [16:17:38] before you add hostnames, add stable addresses though [16:18:00] the ones that are there now are dynamic? [16:18:07] yes, based on mac address [16:19:14] (03PS1) 10Akosiaris: Check for administratively disabled puppet [operations/puppet] - 10https://gerrit.wikimedia.org/r/93965 [16:20:04] for i in amssq{47..62}; do echo $i; ssh root@$i.esams.wikimedia.org "apt-get -f install || (/etc/init.d/varnish stop; apt-get -f install)"; done [16:20:07] did the trick just fine [16:21:26] nice ( akosiaris ) [16:21:36] (03CR) 10Hashar: "could someone merge this in please?" [operations/puppet] - 10https://gerrit.wikimedia.org/r/65254 (owner: 10Hashar) [16:22:58] ok [16:23:06] anyone any objections to putting text traffic on varnish in esams? ;) [16:23:14] dooo iiiiiit! [16:24:04] * MaxSem grabs popcorn yelling GO FOR IT [16:24:16] * apergos grabs cheesy poofs [16:24:37] ok actually heading to cafe now, back in a bit [16:24:42] but mah... i wan cheesy poofs [16:25:49] (03CR) 10Jforrester: "Argh. We explicitly do NOT want Parsoid master to be the Parsoid version that Beta Labs talks to…" [operations/puppet] - 10https://gerrit.wikimedia.org/r/93939 (owner: 10Hashar) [16:25:56] (03CR) 10Mark Bergsma: [C: 032] Send esams wikivoyage & wikidata traffic to the Varnish cluster [operations/dns] - 10https://gerrit.wikimedia.org/r/93958 (owner: 10Mark Bergsma) [16:26:16] hashar/ariel: ^^^ [16:26:28] James_F: moaaar crazy dependencies :-] [16:26:42] ah [16:27:32] James_F, that kinda kills the idea of beta cluster [16:27:38] James_F: yesterday (for you) we talked with Roan about it [16:27:58] James_F: the parsoid setup on beta is currently manually updated. My change update another copy which is not used afaik [16:28:06] but I might be wrong [16:28:43] James_F: replying on bug [16:29:23] James_F: since DOM spec updates are not that frequent we should actually be able to test master most of the time [16:29:58] especially with a bit of coordination on when we merge things to master [16:31:36] man that does shit-all ;) [16:35:36] (03CR) 10Hashar: "Follow up on https://bugzilla.wikimedia.org/show_bug.cgi?id=56622" [operations/puppet] - 10https://gerrit.wikimedia.org/r/93939 (owner: 10Hashar) [16:35:43] gwicke: VE and Parsoid have been mutually broken for… 3 of the last 5 weeks? [16:36:00] I don't think so [16:36:24] we depended on your support for new content, which wasn't merged yesterday [16:36:37] what kind of breakage do you encounter? VE relying on a feature that is no more in Parsoid or some API that got slightly adjusted ? [16:36:48] we could have synced our merge with yours if necessary [16:37:04] hashar: "API that got slightly adjusted" == VE can't edit templates any more. Or categories. Or images. 
Or… [16:37:07] we recently did a DOM spec cleanup [16:37:20] basically a few renames [16:37:34] and you don't / can't retain back compatibility can you ? [16:37:36] as we have cached content we always have to support both for a while [16:37:37] hashar: That's not minor breakage. That's massive regressions. [16:37:53] hashar: We can - but then VE master needs to be updated *before* Parsoid master is. [16:38:05] it is always 'add support for the new version too' on both ends [16:38:10] hashar: And until now, Parsoid master has run ahead of VE by a lot at times. [16:38:16] gwicke++ [16:38:55] hashar: Also, this change means that Beta Labs doesn't tell us whether VE master is good to deploy to production. [16:38:57] I am wondering whether parsoid could support X versions in parrale [16:39:06] then VE could do its query saying it expecting Y version [16:39:12] hashar: Parsoid is a delicate, manual deployment done every month or so. [16:39:21] we'd like to deploy more often [16:39:26] hashar: API versioning has been requested. :-) [16:39:29] this time around we were waiting for VE.. [16:39:37] gwicke: Yeah. [16:39:41] ok ok :-] I am not blaming any of you, just trying to understand :D [16:39:44] so [16:39:45] gwicke: You're normally waiting for us. [16:39:57] Parsoid tip of master can be broken [16:40:09] it is deployed once per month while maintaining back compatibility with previous version [16:40:09] minimizing the delta and catching regressions early is a good thing IMO [16:40:14] gwicke: If we let this set-up stand, we'll have to (a) deploy Parsoid every week, and (b) VE has to drop everything, any time you change the API. [16:40:16] James_F: thus my comment on https://bugzilla.wikimedia.org/show_bug.cgi?id=56670. for every other bit of software we have, it is test2wiki/mw.o that tells us whether VE is good to deploy, not beta. [16:40:24] Parsoid master should not be broken [16:40:32] if it is, then our own tests have failed [16:40:37] and need to be fixed [16:40:44] and VE is updated more often and catch up with Parsoid change till a new version of it is deployed [16:40:48] chrismcmahon: What? No? I disagree that that's what Beta Labs is for. [16:41:17] chrismcmahon: MediaWiki.org should never, ever be broken. Every time it is, we've screwed up. [16:41:35] chrismcmahon: The fact that it's intentionally to a small wiki we notice these things on is to rescue us from our screw-up. [16:41:48] chrismcmahon: But the number of "bugs caught at MediaWiki.org" should be 0. [16:42:33] !log Moved esams ssl -> cache traffic from squid to varnish (manually) [16:42:48] Logged the message, Master [16:42:50] James_F: the problem there is that there is no Parsoid pre-prod for test2.mw.o like there is for VE. [16:42:50] that's a bit more traffic ;) [16:42:55] traditionally we have deployed at least once per week [16:43:09] chrismcmahon: See bug 56622 where I explicitly asked for this. :-) [16:43:24] we'd like to keep that pace by ensuring that our master remains stable [16:43:34] gwicke: Can you commit to that? [16:43:51] James_F: sure [16:43:54] gwicke: And lock-step deployments alongside Reedy every Thursday morning? [16:44:00] then you can use beta to ensure both master are in sync [16:44:26] And then we'll just have to hope and pray that VE-prod/Parsoid-master or VE-master/Parsoid-prod work, 'cos we're not going to be testing them. [16:44:37] (03PS1) 10Mark Bergsma: Move the textsvc ip from text to text-varnish [operations/puppet] - 10https://gerrit.wikimedia.org/r/93972 [16:44:59] E.g. 
wmf2 of VE working on Parsoid wmf2 but not wmf3. We don't have het-deploy of Parsoid in prod right now. [16:45:17] we have not had any major bugs because of versioning [16:45:25] the bugs have mostly been a lack of testing [16:45:42] gwicke: We last had an emergency VE deploy to fix a lack of versioning *on Monday*. [16:45:49] James_F: that is Step 2. let us also get a Parsoid host between beta and prod. right now we can do the beta part, let us reason about the next part. [16:45:50] I'd be happy to automatically deploy parsoid along with Ve [16:45:55] VE [16:46:05] gwicke: What you consider "not any major bugs" and what I consider it are different, clearly. [16:46:07] if there are any relevant changes that are not yet deployed before then [16:46:31] chrismcmahon: Or we could just do our jobs and build the two extra test wikis in Labs? [16:46:50] VE is part of the wmf branches whereas Parsoid is deployed with the deployment system so that is not very helpful :/ [16:47:00] hashar: Exactly. [16:47:17] one possibility would be to deploy VE using the deployment system, so we will be able to keep Parsoid and VE in sync [16:47:31] but then we will have an issue with VE and the MediaWiki wmf branches :/ [16:47:39] hashar: Yeah. Eww. [16:47:43] so yeah, Parsoid being hetdeployed would make it [16:47:53] aka Parsoid wmfX and wmfY [16:47:56] hashar: When git-deploy is ready, I'd love for Parsoid to move to being part of the weekly cycle. [16:48:09] hashar: But that's not been fixed for a year, and I don't expect it to suddenly happen. [16:48:21] I'd love it too [16:48:43] then we can work on having it deployed every day and finally on merge :D [16:48:45] we should in general not depend on synced deploys though [16:48:55] gwicke: Agreed. [16:49:31] meanwhile, we can keep beta pointing at a parsoid instance which is updated manually [16:49:35] with caching in betalabs VE will also see both old and new content [16:49:39] gwicke James_F hashar I think we could use some empirical evidence. Can we go ahead and do the automatic deploy of Parsoid + VE to beta labs and see what happens? [16:49:56] and set up yet another beta wiki that would run VE @ master and point to another parsoid instance that runs the tip of master. [16:49:58] if that helps [16:50:09] chrismcmahon: +1 [16:50:35] hashar: No, we also need to test Parsoid-master against VE-prod, not just VE-master against Parsoid-prod. [16:50:41] hashar: So… two more beta wikis. [16:51:04] you could poke Roan to set up a new parsoid instance which will boot Parsoid from /data/project/apache/common-local/php-master/extensions/Parsoid . That path is shared and updated by Jenkins automatically. [16:51:31] the wikis can actually be the same [16:51:35] still have to figure out a way to restart the Parsoid process on that instance though. But that might be doable with a passwordless ssh key [16:51:36] or some other system [16:51:42] only a single config var would differ [16:52:01] which could be passed in with a request parameter too [16:52:44] ?ve=prod&parsoid=master || ?ve=master&parsoid=prod [16:52:48] I still believe that focusing on continuous integration is better than diverting our testing resources [16:52:51] not sure how to retain the parameters between requests though [16:53:17] the bottom line is not to have our only known-good instance of Parsoid to be in production. [16:53:33] hashar: Also, adding stuff into VE to know which Parsoid to talk to (rather than listen to LocalSettings.php) feels wrong.
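The second instance hashar proposes at 16:51, Parsoid booted from the Jenkins-updated master checkout, would look roughly like this. Only the checkout path and the :8001 port (from parsoid.wmflabs.org:8001 above) come from the log; the server entry point and how it picks its port are assumptions, so check the checkout's own README:

    cd /data/project/apache/common-local/php-master/extensions/Parsoid
    npm install          # refresh dependencies, as the merged beta change does
    node api/server.js   # entry-point name assumed; configure it to listen on
                         # :8001 to mirror parsoid.wmflabs.org:8001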
[16:54:00] James_F: it is a VE config var [16:54:09] gwicke: Yes. [16:54:22] gwicke: And this would have us replace a config var with an array for testing? [16:54:34] !log reedy synchronized php-1.23wmf2/extensions/WikimediaMaintenance/ [16:54:51] Logged the message, Master [16:55:45] another thing that would be useful is to get integration tests when submitting patches in Gerrit. We have nothing like that on any of our repository. The idea is that whenever one send a VE patch, it would run against parsoid@master and parsoid@prod. Similarly, a patch send against Parsoid would run the tests of VE@master and VE@prod against that patch. [16:55:45] I need to write that down somewhere [16:56:01] hashar: Yes, that would be excellent. [16:56:15] hashar: Though the browser tests work effectively as smoke tests. [16:56:23] hashar: we have integration tests, but they leave out the HTTP stack [16:56:44] which is what we'll change next [16:57:58] also, don't under-estimate the resources needed to catch the regressions we are looking for [16:58:20] you need to test with a lot of pages [16:58:30] Yeah. Hundreds at the least. [16:58:40] which is why we are using a distributed setup with something like 40 cores [16:58:45] Doing this on every VE merge will be painful. [16:59:12] gwicke: But I think we should do this, and just argue for more boxes if we need them, yes? [16:59:34] doing something like our rt testing on each commit is not realistic [17:00:00] but a few hand-selected pages, sure [17:00:20] we already round-trip Obama on each commit [17:01:17] gwicke: I am not underestimating anything, I barely understand what your teams are coding anyway :-( [17:02:38] annnd Jenkins is scalable to more nodes nowadays. If we need more power we can most probably have it [17:02:38] it is not for the price of a box [17:02:38] they are definitely cheaper than engineering brains :D [17:03:05] I gave a lightning "trolling" talk a year ago during a Green-IT conference [17:03:20] yeah, would be great to expand the round-tripped pages from 1 to hundred or so [17:03:27] explaining why everyone ends up not caring that much about performance cause it is not cost effective. [17:03:31] but it will unlikely be 160k pages on each commit [17:03:58] how much time does it takes to round trip 160k pages? [17:04:10] about 12 hours currently [17:04:14] ah [17:04:26] used to be 3 days [17:04:35] Marc has done great work there [17:04:44] and it is parralelized on 40 cores right? [17:04:54] yes [17:05:26] so hmm [17:05:36] can't be made yet on each commit nor on each merge :-( [17:06:06] I guess the pages are a corpus from production pages aren't they ? [17:06:25] wondering if they could be preprocessed to remove the basic text, but that is probably not the main cause of slowness [17:06:27] yes [17:06:30] hashar: Only 160k from prod. [17:06:46] 10k each from 16 different wikis [17:07:01] hashar: Prod is 200 times larger, of course. :-) [17:07:17] I still worry a little about us missing some edge-ish cases. [17:07:32] in any case, rt testing is used to catch minor parser behavior regressions [17:07:48] and whenever you catch them you would write a test for them right ? [17:08:07] hashar: They already are caught by the tests. [17:08:12] any major issues would be caught before that long run I guess. 
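To put gwicke's numbers at 17:04 in perspective (160k pages in 12 hours on about 40 cores, down from 3 days), the per-page cost, which is what rules out full runs on every commit, falls out directly:

    awk 'BEGIN {
      pages = 160000; hours = 12; cores = 40
      rate = pages / (hours * 3600)     # ~3.7 pages/s overall
      printf "%.1f pages/s, ~%.0f s of CPU per page per core\n", rate, cores / rate
      # the old 3-day runs managed ~0.6 pages/s, so the speed-up is about 6x
    }'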
[17:08:14] ok [17:08:27] something like the middleware bugs we ran into recently would have been caught easily by actually using the HTTP API instead of calling things directly [17:08:30] so the huge round trip can be run asynchronously once per day for you to fix later on [17:08:49] and in Gerrit we would just run the unit tests and some basic round tripping as is the case [17:08:49] hashar: See http://parsoid.wmflabs.org:8001/ [17:08:53] even the single Obama test on each commit would have caught it [17:09:01] gwicke: Yeah. :-( [17:10:03] James_F: are the VisualEditor tests in Gerrit hitting some parsoid instance? [17:10:18] hashar: No. :-( But the browser tests do. [17:10:33] hashar: Adding VE-Parsoid round-trip testing would be a good next step. [17:10:45] I guess so [17:10:57] the browser tests, I am going to get them running for ULS [17:11:04] hashar: It's our dependency, so our responsibility to test our integration with Parsoid, not the other way around. [17:11:06] for VE that requires some parsoid instances to be set up [17:11:13] hashar: Yeah. Eww. :-( [17:11:38] the issue with rt testing has been result classification [17:11:59] and for selser testing, simulating edits efficiently [17:12:11] and then figuring out what the correct change should be [17:12:12] I've got to get Zuul upgraded first anyway. Will do that next week. [17:12:19] doing ULS meanwhile. [17:12:24] Sure. [17:12:48] and then I will look into getting some parsoid instances in labs to test against [17:14:27] hashar: Do you want to summarise what's agreed on https://bugzilla.wikimedia.org/show_bug.cgi?id=56622 ? [17:14:33] hashar: parsoid.wmflabs.org is what we are updating along with rt testing [17:14:34] * James_F is not sure he caught everything. [17:14:43] you could hit that if you want [17:22:57] * subbu just caught up on the discussion by reading the scrollback [17:24:19] gwicke, James_F hashar synced deploy may not work easily in the longer run once Flow and mobile start using Parsoid HTML. so, we have to keep that in mind as well [17:24:53] subbu: Versioned API would work, though? [17:25:49] James_F, I think you are referring to dom spec? [17:25:55] James_F: that is on the roadmap [17:26:02] or the http api? [17:26:14] yes, in terms of DOM spec / content type [17:26:20] subbu: Yes. [17:26:33] James_F: adding something on the bug [17:26:35] subbu: Version 0.0.1a, …b, …c is fine. :-) [17:26:37] hashar: Thanks. [17:26:44] the spec is already versioned, but there is no content negotiation yet [17:27:21] James_F, gwicke, as far as i understand, we don't implement dom-spec versioning because we were manually co-ordinating our deploys thus far. [17:27:42] and because these changes were infrequent. [17:27:55] except for the big one that recently landed. [17:28:41] with support for old/new versions in ve/parsoid, as necessary. [17:29:34] the cached html in varnishes also complicates matters on that front. [17:29:47] it has the version in the head [17:29:49] but, we have plans for implementing versioning there [17:30:04] right, but we don't yet use it for html upgrade, but want to. [17:30:06] subbu: Versioning on those would be great, too, yes. [17:30:11] subbu: -> #mediawiki-parsoid?
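A sketch of what content negotiation for the versioned DOM spec could look like once implemented, using the labs Parsoid instance mentioned above; the Accept profile parameter, the version string, and the URL path are all assumptions, since the log itself says negotiation does not exist yet:

    # Hypothetical: request a specific spec version via an Accept profile.
    curl -si -H 'Accept: text/html; profile="mediawiki.org/specs/html/0.0.1"' \
        'http://parsoid.wmflabs.org:8001/enwiki/Main_Page' | head -n 20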
[17:30:15] yes [17:30:26] (03PS1) 10Mark Bergsma: Add amssq47-62 IPv6 addresses [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93975 [17:31:12] (03PS2) 10Mark Bergsma: Add amssq47-62 IPv6 addresses [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93975 [17:31:51] (03CR) 10Mark Bergsma: [C: 032] Add amssq47-62 IPv6 addresses [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93975 (owner: 10Mark Bergsma) [17:32:59] !log mark synchronized wmf-config/squid.php 'Add amssq47-62 IPv6 addresses' [17:33:02] (03CR) 10Cmcmahon: "adding Bug 56622 for completeness' sake" [operations/puppet] - 10https://gerrit.wikimedia.org/r/93939 (owner: 10Hashar) [17:33:13] Logged the message, Master [17:33:50] James_F: gwicke I have to leave, sorry :/ [17:34:09] (03CR) 10Mark Bergsma: [C: 032] Move the textsvc ip from text to text-varnish [operations/puppet] - 10https://gerrit.wikimedia.org/r/93972 (owner: 10Mark Bergsma) [17:35:25] should ask brandon to integrate netmapper into mediawiki [17:43:08] thanks hashar this is a great start https://gerrit.wikimedia.org/r/#/c/93939/ [17:44:07] chrismcmahon: though that code is unused yet :-] [17:44:27] but at least that should let us start a parsoid instance running master and have one of the beta wiki point to it [17:44:43] roan should follow up today I guess. for now I am off and will be back tomorrow. [17:44:47] *wave* [17:44:48] hashar: it is merged, though, I am happy about that [17:48:24] (03CR) 10Subramanya Sastry: "(1 comment)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/93939 (owner: 10Hashar) [17:54:46] (03PS1) 10MarkTraceur: Add VectorBeta to the beta cluster [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93979 [17:55:12] greg-g, chrismcmahon, mind if I selfmerge this ^^ ? [17:56:10] (03CR) 10Cmcmahon: [C: 031] "fine by me" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93979 (owner: 10MarkTraceur) [17:56:22] marktraceur: I +1ededed [17:56:30] Sounds like enough [17:56:33] (03CR) 10MarkTraceur: [C: 032] Add VectorBeta to the beta cluster [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93979 (owner: 10MarkTraceur) [17:56:43] (03Merged) 10jenkins-bot: Add VectorBeta to the beta cluster [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93979 (owner: 10MarkTraceur) [17:56:59] I also need the code there [17:57:24] chrismcmahon: Right? Or is it auto-pushed somehow? [17:59:11] marktraceur: hmm, there is a list of extensions somewhere that controls what is auto-updated from git master branch. I can find the list if you need it. [17:59:39] Yeah, it looks like it's set [17:59:40] Reedy might know off the top of his head, we had to hack on it to get Flow to beta cluster [17:59:47] ok [17:59:49] The latest code update job says "registered VectorBeta" [17:59:57] So um [17:59:59] that sounds right [18:00:45] greg-g: I'm going forward with the assumption that the extension being deployed on the beta cluster for an hour is _not_ good enough. 
:) [18:00:57] VectorBeta will have to go out next week to test [18:01:03] marktraceur: correct [18:01:10] marktraceur: >= 1 week calendar week [18:01:15] Yup [18:01:19] Next week it shall be [18:01:25] (is my working assumption/recommendation) [18:01:37] I can be persuaded, but I'm a tough cookie [18:01:44] Heh [18:01:55] * Reedy gives marktraceur a cookie hammer [18:01:59] haha [18:02:08] Reedy: Don't really need it to go out right now, at least not that badly [18:02:18] no greg-g cracks needed [18:05:17] Quoth Fabrice, it's painful to have to wait, but quoth Mark, life is pain, princess [18:05:52] * greg-g hugs marktraceur  [18:05:59] Anyway, it's definitely on beta now [18:06:02] Looks mighty sexy too [18:06:12] thanks, I was hearing parts of that conversation, was not looking forward to another walkover [18:06:39] hey marktraceur want to make a post to the QA mail list about a sexy new feature on beta? might get some eyeballs. [18:06:51] chrismcmahon: Sure sure [18:07:03] greg-g: I tried to downplay your role :P [18:07:33] chrismcmahon: http://en.wikipedia.beta.wmflabs.org/wiki/Special:Preferences#mw-prefsection-betafeatures has the preference, once it's enabled ("Typography refresh" is the name) you should see the changes immediately [18:07:33] marktraceur: always welcomes [18:07:42] s/es/ed/ [18:07:48] awesome [18:08:03] greg-g: Honestly even if we'd deployed it at the earliest time it was ready, it wouldn't have been enough for today [18:08:10] * greg-g nods [18:08:20] that sounds like the right decision then, marktraceur, good work [18:08:27] o/ [18:08:34] Now if only I could find Jon on IRC [18:08:52] * greg-g is left hanging [18:08:55] Oh! [18:08:58] I thought it was a wave [18:09:01] * marktraceur hi5s [18:09:03] :) [18:09:20] is beta features working well on beta? [18:09:56] beta on beta :) [18:09:57] ori-l: yo dawg [18:10:01] (i ask for egotistical reasons; am wondering if module storage worked well.) [18:10:03] yeah [18:10:23] * ^d wants Extension:ProductionFeatures so we can write things for prod [18:10:48] ^d: yeah, seems to be our main failure, we never write for production [18:10:54] ori-l: It's working, and so is MultimediaViewer [18:11:09] ori-l: Though we did have some funky caching issues with the CSS for a while, maybe that was related? [18:11:46] you'd have to describe it a bit more precisely for me to say [18:11:48] chrismcmahon: hello [18:15:11] marktraceur: I was also prodded in-person :) [18:15:41] ori-l: it was a joke, I don't think he meant to indicate a need with the "yo dawg" [18:15:58] i was just saying hello [18:16:07] oh, not allowed [18:16:11] hi ori-l [18:16:16] ok, rescinded [18:16:18] :) [18:17:02] greg-g: Sowwy [18:17:08] ori-l: aude: you know, there's an open bugs about renaming BetaFeatures to something less confusing :/ [18:17:34] WMF Labs [18:17:36] oh wait [18:17:43] i don't think it's that confusing [18:17:45] just amusing :) [18:18:11] greg-g: can you poke mark about https://bugzilla.wikimedia.org/show_bug.cgi?id=56681 ? i think he fixed just that issue an hour ago or so on #-tech [18:18:29] (and if it's not fixed yet, it's pretty high-priority) [18:18:39] MatmaRex: I think you just did? :) [18:18:43] (the other mark, not our marktraceur here) [18:18:48] greg-g: well, he's not responding :P [18:18:55] I don't live near him :/ [18:19:17] I think it is fixed, is the problem still manifested? 
[19:19:34] * greg-g can't read german [19:19:47] greg-g: Was "Beta Labs", then "Beta Experiments"...ugh [19:20:02] Don't get me started [18:20:08] But MatmaRex and I came up with cool name ideas! [18:20:35] WikiLabs, WikiNew, NewWiki, NextWiki, Upcoming (heh), TehFuture [18:21:52] greg-g: We liked a vibe centered around "pioneering" [18:22:50] HowTheWikiWasWon? [18:23:10] BonanzaWiki [18:23:18] greg-g: have you looked into the deploy bot thing at all? [18:23:37] aka, quit joking and get back on that task: no, not yet [18:26:05] ori-l: I want so much of the messaging in pushbot to be auto in scap/trebuchet: https://github.com/etsy/PushBot [18:26:44] how could it be? [18:27:13] paravoid: should I be giving the brokers the same AAAA name as the A record? e.g. analytics1021.eqiad.wmnet? [18:27:20] that's how the cps are done [18:27:26] is there more info somewhere about trebuchet? [18:27:36] it sounds like a font, but i know it's not :) [18:27:40] but, if so, how would varnishkafka explicitly ask for the IPv6 addy? [18:28:18] ori-l: the "where I'm at in the process" bits, mostly [18:28:29] brb, going to chat with Mel [18:28:41] ori-l: but obviously not the queue management part [18:28:42] * greg-g goes [18:28:49] Is it wrong that I'm amused by BonanzaWiki [18:29:27] https://wikitech.wikimedia.org/wiki/Trebuchet [18:29:31] * aude answers own question [18:29:32] (03PS1) 10Ottomata: Giving analytics1021 and analytics1022 static IPv6 addresses. [operations/dns] - 10https://gerrit.wikimedia.org/r/93983 [18:29:33] <^d> marktraceur: BonziBuddyWiki [18:31:23] manybubbles: search checkpoint time? [18:32:01] ottomata: connecting now [18:32:08] ^d: you too! [18:34:36] (03CR) 10Ottomata: "I don't know if this is quite the proper thing to do, please let me know." [operations/dns] - 10https://gerrit.wikimedia.org/r/93983 (owner: 10Ottomata) [18:44:08] Wait wait wait [18:44:18] greg-g: Our deploy window seems to not be there anymore [18:51:43] marktraceur: what do you mean? [18:52:08] marktraceur: oh, on the wiki page... yeah, I forgot to add it there (I had removed it before), but assume it is [18:52:10] greg-g: I thought we had a deploy window in ten minutes, it's on my calendar [18:52:11] * greg-g does now [18:52:13] Ah, K [18:52:15] Thanks :) [18:58:36] (03PS15) 10Ottomata: (WIP) Initial Debian version [operations/software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/78782 (owner: 10Faidon Liambotis) [19:10:35] aude: you should now be able to see all the tickets.. see your inbox [19:10:48] mutante: awesome! [19:10:50] thanks [19:11:51] yw [19:18:48] (03CR) 10Edenhill: [C: 031] (WIP) Initial Debian version [operations/software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/78782 (owner: 10Faidon Liambotis) [19:23:22] Scapping now y'all, hang on to your hats [19:23:30] * ^demon hides [19:24:38] Not helpful [19:25:25] !log mholmquist Started syncing Wikimedia installation... : Updates for Multimedia extensions [19:25:41] Logged the message, Master [19:28:46] (03PS1) 10Jdlrobson: Task 1355: Enable the infobox experiment (story 1301) on enwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93988 [19:29:05] ^ sounds scary [19:30:47] latinorum [19:31:40] (03CR) 10Chad: "(1 comment)" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93988 (owner: 10Jdlrobson) [19:32:52] (03CR) 10Jdlrobson: "I must admit I'm cargo cult programming here."
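For the A/AAAA record question above, a quick way to compare what a broker name resolves to (host name taken from the discussion; this needs to run somewhere that can see the internal .wmnet zone):

    # Compare the existing A and AAAA records for the broker in question.
    dig +short A    analytics1021.eqiad.wmnet
    dig +short AAAA analytics1021.eqiad.wmnet
    # Clients that resolve via getaddrinfo() will normally try both address
    # families, so forcing an IPv6 connection usually means a v6-only name
    # or a literal IPv6 address in the client config.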
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93988 (owner: 10Jdlrobson) [19:33:04] MatmaRex: :) [19:33:17] (03PS16) 10Ottomata: (WIP) Initial Debian version [operations/software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/78782 (owner: 10Faidon Liambotis) [19:34:48] (03CR) 10Edenhill: [C: 031] (WIP) Initial Debian version [operations/software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/78782 (owner: 10Faidon Liambotis) [19:36:43] (03CR) 10Ottomata: "(1 comment)" [operations/software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/78782 (owner: 10Faidon Liambotis) [19:37:27] Is it "normal" for dsh to take much longer for a small number of hosts? [19:38:32] I've been watching the process list associated with the current scap and it seems like a specific set of hosts is taking longer than others [19:38:58] Specifically these: "mw1010.eqiad.wmnet mw1070.eqiad.wmnet mw40.pmtpa.wmnet srv270.pmtpa.wmnet mw10.pmtpa.wmnet" [19:39:29] ori-l: ^ [19:41:56] !log mholmquist Finished syncing Wikimedia installation... : Updates for Multimedia extensions [19:42:01] whew [19:42:11] Logged the message, Master [19:42:14] every scap is a new adventure! [19:42:22] Woo [19:42:31] marktraceur: testing? [19:42:37] Fabrice is now [19:42:40] cool [19:42:42] I guess I'll join him :) [19:44:44] PROBLEM - MySQL Processlist on db1045 is CRITICAL: CRIT 0 unauthenticated, 0 locked, 84 copy to table, 14 statistics [19:47:24] PROBLEM - MySQL Processlist on db1021 is CRITICAL: CRIT 0 unauthenticated, 0 locked, 66 copy to table, 7 statistics [19:53:24] RECOVERY - MySQL Processlist on db1021 is OK: OK 0 unauthenticated, 0 locked, 0 copy to table, 1 statistics [19:55:09] (03CR) 10MaxSem: "Heh - you can't configure extension vars in InitialiseSettings if your extension is conditionally enabled because upon inclusion it overwr" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93988 (owner: 10Jdlrobson) [19:58:19] greg-g: We're done now, thanks :) [19:58:19] !log shred'ding star.wm and *.wm SSL keys from kaulen. backing up old non-ASN wildcard cert and new bz certs on tridge [19:58:34] Logged the message, Master [19:58:35] Some bugs but nothing terrible [19:58:51] And it looks like there's nothing notable in the fatals [20:00:59] mutante: errr, you mean SAN? [20:01:32] vs. ASN [20:01:53] subjectAltName [20:01:59] * jeremyb runs away again :) [20:02:06] jeremyb: yes, i do. subject alternate name vs. alternate subject name.. indeed [20:02:29] fixed in wiki :p [20:02:36] marktraceur: there are new ones, http://ur1.ca/fyspx , nothing to worry about? [20:02:44] well, more I suppose [20:02:56] Hrm [20:03:08] I think there were some during the deploys [20:03:11] But they were unrelated [20:03:18] Let me watch the log [20:04:00] Looks like...SVG thumbnail generation and DifferenceEngine [20:04:16] Probably nothing super bad [20:06:34] greg-g: What are the units on the Y axis here? "10 m" can't mean "10 million" [20:07:00] marktraceur: milli [20:07:04] Ah. [20:07:08] millierrors [20:07:15] 10 milli is .1/sec, so 1/10 seconds [20:07:22] er .01/sec [20:08:00] Wooo error on beta. [20:08:03] ? 
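On the gdash axis units above: the "m" is the metric milli prefix, so the self-correction in the log is the right reading; a one-liner to make the arithmetic concrete:

    # "10 m" on the y-axis is 10 milli-events/sec = 0.01/sec.
    awk 'BEGIN { rate = 10e-3; printf "one event every %.0f seconds\n", 1 / rate }'
    # -> one event every 100 seconds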
[20:08:16] Got a 503 error on enwikibeta [20:08:24] On special:version [20:08:33] Fixed now [20:08:39] I am seeing 500s on production, too: https://gdash.wikimedia.org/dashboards/reqerror/deploys [20:14:38] RECOVERY - MySQL Processlist on db1045 is OK: OK 0 unauthenticated, 0 locked, 11 copy to table, 2 statistics [20:16:15] (03PS1) 10Ottomata: Adding ensure parameter to ganglia::view for removing views. [operations/puppet] - 10https://gerrit.wikimedia.org/r/93996 [20:16:49] (03CR) 10Ottomata: [C: 032 V: 032] Adding ensure parameter to ganglia::view for removing views. [operations/puppet] - 10https://gerrit.wikimedia.org/r/93996 (owner: 10Ottomata) [20:23:50] (03PS1) 10Reedy: Add 404.php symlink to wwwportal directories [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93998 [20:23:57] (03CR) 10jenkins-bot: [V: 04-1] Add 404.php symlink to wwwportal directories [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93998 (owner: 10Reedy) [20:24:13] (03PS2) 10Reedy: Add 404.php symlink to wwwportal directories [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93998 [20:24:19] (03CR) 10Reedy: [C: 032] Add 404.php symlink to wwwportal directories [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93998 (owner: 10Reedy) [20:24:26] (03CR) 10jenkins-bot: [V: 04-1] Add 404.php symlink to wwwportal directories [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93998 (owner: 10Reedy) [20:24:39] (03CR) 10Reedy: [V: 032] Add 404.php symlink to wwwportal directories [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93998 (owner: 10Reedy) [20:25:13] ottomata: empty access-request queue is nice to look at, thanks [20:25:19] !log reedy synchronized docroot/wwwportal/w/404.php 'Add 404 symlink' [20:25:36] Logged the message, Master [20:25:53] yup! 
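Roughly what the "Add 404.php symlink" change above amounts to; the relative link target is a guess, since the actual patch (Gerrit change 93998) is not quoted in this log:

    # Hypothetical: give the portal docroot a 404.php pointing at the shared
    # handler; the target path is assumed, only the link path appears above.
    ln -sfn ../../404.php docroot/wwwportal/w/404.php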
i only did one [20:25:57] the other one is an RT access request [20:25:59] i don't know how to solve [20:33:17] ottomata: i solved it that's why it's empty now :) [20:33:33] nice, because i wasn't sure how to solve the one you did [20:34:09] ah nice [20:34:20] well, alex actually did it, I just forwarded a suggestion [20:34:25] i did the analytics hadoop one [20:34:29] woohooo [20:34:36] cool [20:36:46] paravoid: getting lots of "skipped non-stash file" bits [20:36:59] * AaronSchulz wonders if it just skips everything by mistake [20:37:26] of course that would still mean 2 bugs would exist...one being the DB tracking being crap and two being a regex bug in the directory scan [20:37:38] otherwise it wouldn't be leaving almost everything [20:40:00] huh, so there are thumbs in the temp container not just /temp under the thumb container [20:43:09] (03PS1) 10Ottomata: Removing unused analytics udp2log filter files [operations/puppet] - 10https://gerrit.wikimedia.org/r/94003 [20:43:11] (03PS1) 10Ottomata: Replacing tabs with spaces in misc/monitoring.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/94004 [20:48:14] (03PS1) 10Ottomata: Updating kafka ganglia view with metrics from 0.8 [operations/puppet] - 10https://gerrit.wikimedia.org/r/94028 [20:48:23] (03CR) 10Ottomata: [C: 032 V: 032] Removing unused analytics udp2log filter files [operations/puppet] - 10https://gerrit.wikimedia.org/r/94003 (owner: 10Ottomata) [20:48:34] (03CR) 10Ottomata: [C: 032 V: 032] Replacing tabs with spaces in misc/monitoring.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/94004 (owner: 10Ottomata) [20:51:01] (03CR) 10Ottomata: [C: 032 V: 032] Updating kafka ganglia view with metrics from 0.8 [operations/puppet] - 10https://gerrit.wikimedia.org/r/94028 (owner: 10Ottomata) [20:54:40] (03PS1) 10Ottomata: Adding ensure parameters to some misc::monitoring::view classes/defines [operations/puppet] - 10https://gerrit.wikimedia.org/r/94043 [20:54:50] greg-g my love, if we wanted to backport something to wmf2 to fix the embed bug, could we deploy it between now and the LD without issue? [20:55:22] (03PS2) 10Ottomata: Adding ensure parameters to some misc::monitoring::view classes/defines [operations/puppet] - 10https://gerrit.wikimedia.org/r/94043 [20:55:35] (03CR) 10Ottomata: [C: 032 V: 032] Adding ensure parameters to some misc::monitoring::view classes/defines [operations/puppet] - 10https://gerrit.wikimedia.org/r/94043 (owner: 10Ottomata) [20:55:38] greg-g: this is for a critical bugfix....seems like we should be doing it [20:56:15] https://bugzilla.wikimedia.org/show_bug.cgi?id=56405 [20:57:37] marktraceur: I say give him a few minutes to respond, and if he's not around, just do it [20:58:28] (03PS1) 10Ottomata: Backslashes are being rendered doubly in ganglia view JSON output. Not using them. [operations/puppet] - 10https://gerrit.wikimedia.org/r/94044 [20:59:09] (03PS2) 10Ottomata: Backslashes are being rendered doubly in ganglia view JSON output. Not using them. [operations/puppet] - 10https://gerrit.wikimedia.org/r/94044 [20:59:34] (03CR) 10Ottomata: [C: 032 V: 032] Backslashes are being rendered doubly in ganglia view JSON output. Not using them. [operations/puppet] - 10https://gerrit.wikimedia.org/r/94044 (owner: 10Ottomata)
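On the doubled-backslash commit above: a literal backslash inside a JSON string must itself be escaped, so a regex escape like \. comes back as \\. once the view definition is serialized; a quick illustration (the metric name is made up):

    # Serializing a regex with backslashes to JSON doubles each backslash,
    # which is why the commit above drops them from metric regexes.
    python -c 'import json; print(json.dumps(r"kafka\.server\.BrokerTopicMetrics"))'
    # -> "kafka\\.server\\.BrokerTopicMetrics"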
[21:01:10] Woot [21:05:14] (03PS1) 10Ottomata: Using OneMinuteRate instead of FifteenMinuteRate in Kafka view [operations/puppet] - 10https://gerrit.wikimedia.org/r/94046 [21:05:31] (03CR) 10Ottomata: [C: 032 V: 032] Using OneMinuteRate instead of FifteenMinuteRate in Kafka view [operations/puppet] - 10https://gerrit.wikimedia.org/r/94046 (owner: 10Ottomata) [21:06:29] * marktraceur is in ur tin testin ur codez [21:06:42] <^demon|lunch> get yer own tin! [21:08:50] Dis won haz tunas [21:12:13] <^demon|lunch> I want tuna now. [21:19:27] So we're running into issues testing it [21:19:30] But working on it [21:19:38] oh, can i haz sum tunaz? [21:22:08] jeremyb: jenkins-lolcode-lint: did you mean ? HAI \n CAN HAS TUNA? \n KTHXBYE https://en.wikipedia.org/wiki/LOLCODE [21:23:14] whoa, this is a new language for me [21:24:50] Heh [21:24:51] would be fun to see actual usage.. :) IM IN YR LOOP UP VAR!!1 [21:25:02] jeremyb: I wrote a *cough* "IDE" for that language as a college project [21:25:06] hahaa [21:25:40] Also for C++, Python, Prolog, Whitespace, and...Lisp? I can't remember what the functional one was. [21:26:22] OK, scapping again y'all [21:26:46] i doubt i can get "lci" into puppet base packages, or i would try to use it for bugzilla reporter.. hehee [21:27:01] that's the C interpreter for lolcode [21:29:01] !log mholmquist Started syncing Wikimedia installation... : Update TimedMediaHandler to master to fix embedplayer [21:29:17] Logged the message, Master [21:29:32] You better bloody have, morebots [21:30:12] (03PS1) 10Legoktm: Add 'templateeditor' to $wgAvailableRights [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94051 [21:31:07] Coren: ^ [21:32:24] We want to give it to global groups? It's very enwp-specific, though. [21:34:00] Coren: Right now people like hoo who have global editinterface can edit sysop protected pages, but not templateeditor protected ones. [21:34:22] (03CR) 10coren: [C: 032] "This makes sense, and is correct." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94051 (owner: 10Legoktm) [21:34:30] The sense, you are making it. [21:34:49] Wanna push, or should I? [21:34:56] I don't have deployment rights [21:35:04] Ah, right. [21:37:04] !log marc synchronized wmf-config/CommonSettings.php 'Make templateeditor available for global groups' [21:37:17] legoktm: ^^ [21:37:20] Logged the message, Master [21:37:23] woo thanks :) [21:37:28] now just need to find a steward [21:42:30] legoktm: Talked to anyone about *getting* deploy rights? [21:42:47] Well, I don't exactly need them. [21:42:52] PROBLEM - DPKG on cp1059 is CRITICAL: NRPE: Unable to read output [21:44:52] RECOVERY - DPKG on cp1059 is OK: All packages OK [21:45:04] legoktm: Yes, but they can be very useful :) [21:45:34] <^demon|lunch> Like useful for getting pinged to help out when you're in the middle of other things ;-) [21:45:39] Yup! [21:45:49] You can be just as annoyed with the world as ^demon|lunch is [21:46:05] For the low low price of badgering your manager and generating an SSH key [21:46:23] !log mholmquist Finished syncing Wikimedia installation... : Update TimedMediaHandler to master to fix embedplayer [21:46:37] Logged the message, Master [21:48:32] Well crap, it still doesn't look fixed [21:48:42] PROBLEM - RAID on cp1059 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[21:48:59] ...did I fuck up the deploy [21:49:05] Maaargggg [21:49:12] PROBLEM - SSH on cp1059 is CRITICAL: Server answer: [21:49:12] PROBLEM - Varnish traffic logger on cp1059 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:50:42] PROBLEM - RAID on cp1059 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:51:12] RECOVERY - SSH on cp1059 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [21:51:12] RECOVERY - Varnish traffic logger on cp1059 is OK: PROCS OK: 2 processes with command name varnishncsa [21:51:39] So, briefly: I'm a moron, I didn't run submodule update, and I'm going once more [21:51:42] RECOVERY - RAID on cp1059 is OK: OK: no RAID installed [21:51:52] PROBLEM - DPKG on cp1059 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:51:52] PROBLEM - Varnish HTCP daemon on cp1059 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:52:49] No objections, here we gooo [21:55:02] !log mholmquist Started syncing Wikimedia installation... : Actually deploy TimedMediaHandler fix to fix the black-box issue for embedded videos [21:55:12] PROBLEM - SSH on cp1059 is CRITICAL: Server answer: [21:55:12] PROBLEM - Disk space on cp1059 is CRITICAL: DISK CRITICAL - free space: /srv/sda3 7038 MB (2% inode=99%): /srv/sdb3 7474 MB (2% inode=99%): [21:55:19] Logged the message, Master [21:55:42] PROBLEM - RAID on cp1059 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:56:12] RECOVERY - SSH on cp1059 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [21:59:12] PROBLEM - Disk space on cp1059 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:59:42] RECOVERY - RAID on cp1059 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [21:59:48] !log mholmquist Finished syncing Wikimedia installation... : Actually deploy TimedMediaHandler fix to fix the black-box issue for embedded videos [21:59:52] RECOVERY - Varnish HTCP daemon on cp1059 is OK: PROCS OK: 1 process with UID = 111 (vhtcpd), args vhtcpd [21:59:52] RECOVERY - DPKG on cp1059 is OK: All packages OK [22:00:06] Logged the message, Master [22:00:46] !log cp1059, mobile varnish cache, was OOM but recovered by itself [22:01:02] Logged the message, Master [22:01:03] Kay, fix deployed, done and done [22:01:09] Y'all can deploy stuff again [22:01:11] Cheers [22:02:34] !log aaron synchronized php-1.23wmf1/maintenance '2d083797dbf7517575c436653e27855012d44eb4' [22:02:54] Logged the message, Master [22:02:55] marktraceur: :) [22:03:32] !log aaron synchronized php-1.23wmf2/maintenance '2d083797dbf7517575c436653e27855012d44eb4' [22:03:51] Logged the message, Master [22:23:15] (03PS2) 10Aaron Schulz: Fix up multiversion to not require dba_* functions [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93622 (owner: 10Chad) [22:35:41] (03CR) 10Aaron Schulz: "(2 comments)" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93622 (owner: 10Chad) [22:37:05] (03CR) 10Chad: "(2 comments)" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93622 (owner: 10Chad) [22:45:11] (03PS1) 10Catrope: Fix server name for labs parsoid (deployment-parsoid3) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94062 [22:52:23] PROBLEM - Varnish traffic logger on cp1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
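The step that was missed in the scap above; a minimal sketch of the usual pre-sync sequence, assuming the deploy checkout tracks extensions as git submodules the way the wmf branches do:

    # Update the branch, then move submodule pointers to the commits it
    # records; without this, extensions stay at their old checkouts and the
    # sync ships stale code.
    git pull
    git submodule update --init --recursive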
[22:53:13] RECOVERY - Varnish traffic logger on cp1047 is OK: PROCS OK: 2 processes with command name varnishncsa [22:58:23] PROBLEM - Varnish traffic logger on cp1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:58:53] PROBLEM - Varnish HTCP daemon on cp1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:59:03] PROBLEM - Varnish HTTP mobile-backend on cp1047 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:59:13] RECOVERY - Varnish traffic logger on cp1047 is OK: PROCS OK: 2 processes with command name varnishncsa [22:59:43] RECOVERY - Varnish HTCP daemon on cp1047 is OK: PROCS OK: 1 process with UID = 111 (vhtcpd), args vhtcpd [22:59:53] RECOVERY - Varnish HTTP mobile-backend on cp1047 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.001 second response time [23:11:53] (03PS1) 10Catrope: Add custom replication rules for oojs/core and oojs/ui [operations/puppet] - 10https://gerrit.wikimedia.org/r/94063 [23:13:04] (03CR) 10Krinkle: [C: 031] Add custom replication rules for oojs/core and oojs/ui [operations/puppet] - 10https://gerrit.wikimedia.org/r/94063 (owner: 10Catrope) [23:14:32] (03CR) 10Chad: [C: 031] Add custom replication rules for oojs/core and oojs/ui [operations/puppet] - 10https://gerrit.wikimedia.org/r/94063 (owner: 10Catrope) [23:25:53] (03CR) 10Reedy: "This won't cause problems for MediaWiki down the line will it? The AutoLoader just won't load it again as it's already loaded...?" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93622 (owner: 10Chad) [23:26:03] (03PS3) 10Chad: Fix up multiversion to not require dba_* functions [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93622 [23:27:33] (03CR) 10Reedy: "Actually, it's going to hijack the classes isn't it? So MWExceptions won't be thrown inside MediaWiki in the CDB code" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93622 (owner: 10Chad) [23:29:15] (03CR) 10Aaron Schulz: "Yeah, the class collision thing seems like an issue" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93622 (owner: 10Chad) [23:29:36] ^d: NAMESPACES!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! [23:30:15] <^d> dammit. [23:30:17] het deploy would actually be a reasonable use for that [23:30:56] <^d> I could rename the classes to MWHetDeployCDB.... ;-) [23:31:02] but some wmf-config stuff would need updating to totally do that... [23:31:07] there is always that :) [23:31:13] (03CR) 10Reedy: [C: 04-1] Fix up multiversion to not require dba_* functions [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93622 (owner: 10Chad)
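A tiny illustration of the class-collision worry at the end: once multiversion has defined a CDB class, MediaWiki's autoloader will never load its own copy, and a second manual definition is fatal. The class name is taken from the discussion's subject and the exact error text varies by PHP version:

    # Defining the same class twice in one request is a fatal error, which
    # is why wmf-config loading CDB code early "hijacks" the classes.
    php -r 'class CdbReader {} class CdbReader {}' 2>&1 | head -n 1
    # -> Fatal error: Cannot redeclare class CdbReader ...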